Introduction
Currently, remote sensing plays an important role in various fields [1]–[4]. As a critical component of remote sensing, optical satellite remote sensing with high spatial resolution provides clear spatial texture information about the observed targets and captures the essential features of the landscape. However, increasing the pixel density of the sensor significantly increases the hardware cost of acquiring optical remote sensing images. To acquire high-resolution remote sensing images more conveniently and cost-effectively, super-resolution reconstruction techniques, which recover high-resolution images from low-resolution images, have received much attention.
The essential idea behind image super-resolution is to learn prior knowledge from image data and then use it to recover the lost details of low-resolution images. It is worth mentioning that some early methods, such as sparse-reconstruction-based methods [5], degradation-model-based methods [6], and interpolation [7], achieved success in recovering high-resolution images by acquiring a priori knowledge, despite their limited learning ability. In recent years, deep learning methods have achieved great success in various fields, and naturally, super-resolution reconstruction methods based on deep learning, with their powerful learning capabilities, surpass traditional methods in performance. Convolutional neural networks (CNNs) have long been the “standard answer” to image processing tasks. Dong et al. [8] proposed the super-resolution CNN, the first CNN with strong learning ability for super-resolution reconstruction. To further boost the performance of CNN-based super-resolution reconstruction, various new techniques, including residual dense blocks [9], [10], residual learning [11], [12], and recursive blocks [13], have also been introduced.
Moreover, since generative adversarial networks (GANs) were first proposed and received great attention [14], [15], GANs built from convolution modules have also shown impressive performance on the super-resolution reconstruction task. For example, SRGAN [16] first introduced a GAN architecture for super-resolution reconstruction and proposed a perceptual-similarity-based loss function. CDGAN [17] improved the discriminator by feeding both the generated image and its high-resolution ground truth into the discriminator for better discrimination. EEGAN [18] reduces the interference of noise in the super-resolution reconstruction of satellite images by purifying the noise-contaminated components with mask processing. Although CNNs with powerful learning capabilities offer a significant performance improvement over traditional methods, they cannot escape the fundamental limitation of the convolutional layer: because convolution is a local operation, it has difficulty capturing long-range dependencies and can even fail to do so.
To solve the abovementioned problem, the self-attention mechanism derived from the transformer [19]–[21] has been used as an alternative to CNNs; it captures global interactions between contexts and shows excellent performance on several visual tasks. However, networks designed around the transformer block often have more parameters than typical convolutional networks. Moreover, transformers for image restoration typically segment the input image into patches [22]–[24], which can introduce boundary artifacts around each patch.
Recently, the swin transformer [25], a self-attention network that overcomes the abovementioned shortcomings of the transformer, has shown great potential in computer vision. Thanks to its shifted window scheme, it can process large images without dividing them into patches while still learning long-range dependencies like a transformer, and it has a lower computational cost than the original transformer. The swin transformer has reached the state of the art in image classification and semantic segmentation. However, its application to image super-resolution, especially for remote sensing images, is still relatively rare.
Remote sensing images contain more information than natural images, and their pixels exhibit strong mutual correlations. CNNs have difficulty acquiring global information and long-range dependencies between pixels in remote sensing images. Moreover, since real remote sensing images tend to be larger than natural images and the complexity of self-attention networks is high, a pure self-attention network is prone to memory bottlenecks if remote sensing images are used directly as input.
In this article, to adapt to the characteristics of remote sensing images, we first introduce the shifted window self-attention mechanism of the swin transformer into super-resolution research on remote sensing images and propose a GAN that combines the advantages of the swin transformer and convolutional layers, namely, SWCGAN. Specifically, in the proposed SWCGAN,
we employ a convolutional layer for shallow feature extraction that can be adapted to flexible input sizes;
we propose the residual dense swin transformer block (RDSTB) by drawing on the characteristics of DenseNet [9] to build the depth feature extraction module of the generator, which is used to obtain the deep features for upsampling to generate high-resolution images;
we simplify the original swin transformer and use it as a discriminator.
To demonstrate the performance of the proposed SWCGAN, the UCMerced dataset is utilized for training and validation, and the proposed method outperforms other state-of-the-art methods on most metrics. Moreover, the proposed method is applied to a real-world remote sensing image to verify its effectiveness and applicability.
Our contributions are as follows.
We propose a GAN with a hybrid of convolutional and swin transformer layers for the super-resolution reconstruction of remote sensing images, which accounts for their large size, rich information content, and strong interpixel correlation.
We further propose a depth feature extraction block, namely, RDSTB, which can extract deep image features efficiently by stacking multiple blocks. As a feature extraction block of images, the proposed RDSTB can also be used in other image processing tasks in the future.
We evaluate the proposed method using the UCMerced benchmark dataset and real-world remote sensing images from a high-resolution satellite.
The rest of this article is organized as follows. In Section II, the proposed method, including the network architecture, the loss functions, and the shifted window self-attention mechanism, is described in detail. Section III presents the experimental results and performance evaluations, and the proposed method is applied to a real-world remote sensing image. In Section IV, we present a discussion. Finally, Section V concludes this article.
Methods
In this section, we first briefly introduce the essential idea behind the proposed method, i.e., the idea of the SWCGAN for super-resolution. Specifically, we describe the generator network, the discriminator network, and the loss functions of the proposed method. Furthermore, the shifted window self-attention mechanism is summarized.
A. Overview of SWCGAN
In this article, we propose a GAN that combines the advantages of the swin transformer and convolutional layers for super-resolution. The workflow of the proposed SWCGAN is illustrated in Fig. 1. A typical GAN model consists of two parts, a generator G and a discriminator D. As shown in Fig. 1(a), the input of the generator G is the low-resolution image, whose features are extracted by the feature extraction module to obtain feature maps. To obtain the generated high-resolution image, the extracted feature maps are upsampled by the upsampler (in this article, 4x upsampling is used). For the discriminator D, the input is the generated high-resolution image, which also undergoes feature extraction to obtain deep features; unlike in the generator, the sizes of the feature maps gradually decrease. Finally, a linear layer performs the binary classification by outputting 0 or 1. Naturally, in an image super-resolution GAN, G is trained to generate a fake high-resolution image by reducing the difference between the fake and real high-resolution images, and D is trained to distinguish the real high-resolution image from the generated one. G and D compete with each other during training in such a way that the data distribution of the generated images gradually approaches the real distribution [see Fig. 1(b)].
Illustration of GAN for the super-resolution reconstruction task. (a) Network framework. (b) Data distribution.
Previously, convolutional blocks were commonly used to build GANs for super-resolution reconstruction. However, the convolutional layers that compose CNNs limit their performance on the super-resolution reconstruction task due to an inherent problem: the inability to model long-range dependencies. To address this issue, we propose a GAN with a hybrid of convolutional and swin transformer layers, named SWCGAN, for the super-resolution reconstruction task, in which swin transformer layers are combined with convolutional layers through dense connections and a residual structure to form the RDSTB for extracting image features.
B. Generator Network in SWCGAN
As shown in Figs. 1(a) and 2(a), the generator can be further divided into the following three modules:
Architecture of the proposed SWCGAN. (a) Network architecture of generator and discriminator. (b) RDSTB. (c) Swin transformer block.
shallow feature extraction module;
deep feature extraction module;
upsampling module.
In the shallow feature extraction module, a convolution layer is applied to the input low-resolution image to extract the shallow feature $F_{0}$.
After extracting shallow features, the deep feature extraction module, which is built by stacking the proposed RDSTBs [see Fig. 2(b)], is used to obtain the deep feature $F_{D}$ for subsequent upsampling.
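To make the structure concrete, the following is a minimal PyTorch sketch of the RDSTB wiring described above: dense connections that reuse all earlier features, 1 × 1 convolutions that fuse them, and a local residual skip. The STB itself is passed in as a factory; a plain 3 × 3 convolution stands in by default so the sketch runs, and the block count, fusion convolutions, and channel width are illustrative assumptions rather than the exact configuration of SWCGAN.

```python
import torch
import torch.nn as nn

class RDSTB(nn.Module):
    """Residual dense block whose units are swin transformer blocks (STBs).

    make_stb must return a module mapping (B, C, H, W) -> (B, C, H, W); plugging in a
    real shifted-window attention block gives the RDSTB idea, while the default
    convolutional stand-in only keeps the sketch runnable.
    """
    def __init__(self, dim, num_stbs=4, make_stb=None):
        super().__init__()
        make_stb = make_stb or (lambda: nn.Conv2d(dim, dim, 3, padding=1))  # placeholder STB
        self.reduce = nn.ModuleList([nn.Conv2d(dim * (i + 1), dim, 1) for i in range(num_stbs)])
        self.stbs = nn.ModuleList([make_stb() for _ in range(num_stbs)])
        self.fuse = nn.Conv2d(dim * (num_stbs + 1), dim, 1)

    def forward(self, x):
        feats = [x]
        for reduce, stb in zip(self.reduce, self.stbs):
            y = stb(reduce(torch.cat(feats, dim=1)))    # dense connection: reuse all earlier features
            feats.append(y)
        return x + self.fuse(torch.cat(feats, dim=1))   # local residual connection

print(RDSTB(dim=64)(torch.randn(1, 64, 48, 48)).shape)  # torch.Size([1, 64, 48, 48])
```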
As shown in Fig. 3, the nearest neighbor interpolation method [26], [27] combined with a convolutional layer is applied to upsample the feature map after deep feature extraction, increasing its spatial size by a factor of 4 (4x upsampling in this article). The reconstructed high-resolution image is obtained as
\begin{equation*}
I_{\text{HR}}=H_{\text{up}}(F_{0}+F_{D}) \tag{1}
\end{equation*}
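Equation (1) states that the upsampler $H_{\text{up}}$ acts on the sum of the shallow feature $F_{0}$ and the deep feature $F_{D}$. The sketch below wires the three modules together, reusing the RDSTB class from the previous sketch; the two interpolation-plus-convolution stages (for 4x in total), the LeakyReLU activations, and the channel width of 64 are assumptions made for illustration, not the exact layer configuration of the SWCGAN generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NearestConvUpsampler(nn.Module):
    """4x upsampling via two stages of nearest-neighbor interpolation followed by a 3x3 conv."""
    def __init__(self, dim, out_channels=3):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, padding=1)
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1)
        self.to_rgb = nn.Conv2d(dim, out_channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, f):
        f = self.act(self.conv1(F.interpolate(f, scale_factor=2, mode="nearest")))
        f = self.act(self.conv2(F.interpolate(f, scale_factor=2, mode="nearest")))
        return self.to_rgb(f)

class Generator(nn.Module):
    """Shallow conv feature F0, stacked RDSTBs for the deep feature FD, then I_HR = H_up(F0 + FD)."""
    def __init__(self, dim=64, num_rdstb=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)                       # shallow feature extraction
        self.deep = nn.Sequential(*[RDSTB(dim) for _ in range(num_rdstb)])   # deep feature extraction
        self.up = NearestConvUpsampler(dim)                                  # upsampling module

    def forward(self, lr):
        f0 = self.shallow(lr)
        fd = self.deep(f0)
        return self.up(f0 + fd)                                              # Eq. (1)

print(Generator()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 256, 256])
```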
C. Discriminator Network in SWCGAN
For the discriminator, a simplified swin transformer is used for the binary classification task. As shown in Figs. 2(a) and 4, in the original swin transformer, feature extraction consists of four stages, and the dimensions of the input data are transformed from
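Regardless of the exact stage dimensions, the structure just described, feature maps that shrink stage by stage followed by a single linear classification layer, can be sketched as follows. The strided convolutions standing in for patch merging, the placeholder blocks, and the channel widths are assumptions for illustration only, not the paper's simplified swin transformer.

```python
import torch
import torch.nn as nn

class SwinDiscriminator(nn.Module):
    """Hierarchical discriminator sketch: each stage halves the spatial resolution
    (a strided conv stands in for patch merging) and applies a block; a linear
    head then produces the single real/fake score."""
    def __init__(self, in_ch=3, dims=(64, 128, 256, 512), make_block=None):
        super().__init__()
        make_block = make_block or (lambda d: nn.Conv2d(d, d, 3, padding=1))  # stand-in for an STB
        layers, prev = [], in_ch
        for d in dims:
            layers += [nn.Conv2d(prev, d, kernel_size=2, stride=2), make_block(d)]
            prev = d
        self.stages = nn.Sequential(*layers)
        self.head = nn.Linear(dims[-1], 1)

    def forward(self, img):
        f = self.stages(img).mean(dim=(2, 3))   # global average pooling over H and W
        return self.head(f)                     # raw logit; the sigmoid lives inside the loss

print(SwinDiscriminator()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 1])
```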
D. Use of Swin Transformer in SWCGAN
To address the challenges involved in applying CNNs and transformers to the image domain, the swin transformer introduced a shifted window operation combining nonoverlapping local windows with cross-window connections, which restricts the attention computation to each window. This allows the model to enjoy the locality advantages of convolutional operations on the one hand and to save computational effort on the other.
As shown in Fig. 4, the whole model adopts a hierarchical design with a total of four stages, each of which reduces the resolution of the input feature map and expands the receptive field layer by layer, as in a CNN. First, compared with the vision transformer [23], [29], patch partitioning becomes an optional operation thanks to the window self-attention mechanism, which greatly increases the flexibility of the model. Second, before the swin transformer blocks, a patch merging operation performs downsampling; it is used to reduce the resolution, adjust the number of channels, and save computational effort. In patch merging, the features of each group of 2 × 2 neighboring patches are concatenated (halving the spatial resolution), and a linear layer then reduces the concatenated channels from 4C to 2C.
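The patch merging step follows the reference swin transformer design. A short sketch (token-major layout, as in the original implementation; the channel count is arbitrary):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style patch merging: concatenate each 2x2 group of neighboring patches
    (giving 4C channels at half resolution), normalize, and project down to 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                  # x: (B, H*W, C) tokens, H and W even
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, (H // 2) * (W // 2), 4 * C)
        return self.reduction(self.norm(x))                          # (B, H*W/4, 2C)

tokens = torch.randn(1, 8 * 8, 96)              # an 8x8 feature map with 96 channels
print(PatchMerging(96)(tokens, 8, 8).shape)     # torch.Size([1, 16, 192])
```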
To solve the problem of the high computational complexity caused by global attention in the traditional transformer, the swin transformer reduces the complexity of the algorithm through window attention, which limits the computation of attention to each window. The self-attention within a window is computed as follows:
\begin{equation*}
A(Q,K,V)=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}+B\right)V \tag{2}
\end{equation*}
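In (2), Q, K, and V are the query, key, and value matrices of the tokens inside one window, d is the query/key dimension, and B is the learned relative position bias. A minimal sketch of this per-window computation (the tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def window_attention(q, k, v, bias):
    """Eq. (2): Softmax(Q K^T / sqrt(d) + B) V, evaluated independently per window.

    q, k, v: (num_windows * B, num_heads, N, d), where N = window_size ** 2 tokens.
    bias:    (num_heads, N, N) relative position bias B, broadcast over the windows.
    """
    d = q.shape[-1]
    attn = q @ k.transpose(-2, -1) / d ** 0.5 + bias   # (..., N, N) attention logits
    return F.softmax(attn, dim=-1) @ v                 # (..., N, d) attended values

q = k = v = torch.randn(4, 3, 16, 32)   # 4 windows, 3 heads, 4x4 window (16 tokens), head dim 32
bias = torch.zeros(3, 16, 16)           # learned in practice; zeros here just for the shape check
print(window_attention(q, k, v, bias).shape)  # torch.Size([4, 3, 16, 32])
```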
The abovementioned window self-attention is calculated independently for each window. To allow different windows to interact with one another, a unique method called the shifted window is adopted in the swin transformer, as illustrated in Fig. 5. First, a regular window partitioning scheme is adopted in layer $l$, in which the feature map is divided evenly into nonoverlapping windows. Then, in the next layer $l+1$, the window configuration is shifted by half the window size relative to layer $l$, so that the new windows straddle the boundaries of the previous windows and introduce connections between them.
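A compact sketch of how this cyclic shift is typically realized with torch.roll before the windows are partitioned. In the full swin transformer, an attention mask additionally prevents tokens that become adjacent only through the wrap-around from attending to each other, and the feature map is rolled back after attention; the toy sizes below are assumptions.

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into (num_windows * B, ws, ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, C)

def shifted_windows(x, ws, shift):
    """Cyclically shift the map (shift = ws // 2 in the shifted layer, 0 in the regular
    layer) so that the new windows straddle the boundaries of the previous partition."""
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    return window_partition(x, ws)

x = torch.randn(1, 8, 8, 32)                    # toy feature map
regular = shifted_windows(x, ws=4, shift=0)     # regular partition: 2 x 2 = 4 windows
shifted = shifted_windows(x, ws=4, shift=2)     # partition displaced by ws // 2
print(regular.shape, shifted.shape)             # torch.Size([4, 4, 4, 32]) twice
```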
E. Loss Functions in SWCGAN
Our choice is the relativistic average GAN, which differs from the standard GAN that discriminates real images as 1 and fake images as 0: it instead estimates the probability that a real image is relatively more realistic than a fake image (see Fig. 6). The loss function of the discriminator is defined as follows:
\begin{equation*}
L_{D}=-{\mathrm{E}}_{x_{r}\sim p_{\text{data}}\left(x_{r}\right)}\left[{\log (D_{\text{RA}}(x_{r},z_{f})) }\right]\\
-{\mathrm{E}}_{z_{f}\sim p_{z}\left(z_{f}\right)}\left[{\log (1-D_{\text{RA}}(z_{f},x_{r})) }\right] \tag{3}
\end{equation*}
where $x_{r}$ denotes the real high-resolution image, $z_{f}$ denotes the generated image, and $D_{\text{RA}}(a,b)$ estimates the probability that $a$ is relatively more realistic than $b$.
We set the training objective of the generator to minimize the joint loss, which consists of the content loss and the adversarial loss. The loss function of the generator is defined as follows:
\begin{align*}
L_{G}&=\ L_{\text{cont}}+{\lambda L}_{\text{adv}} \tag{4}
\\
L_{\text{cont}}&=\left\Vert z_{f}-x_{r}\right\Vert _{1} \tag{5}
\\
L_{\text{adv}}&=-{\mathrm{E}}_{x_{r}\sim p_{\text{data}}\left(x_{r}\right)}\left[{\log (1-D_{\text{RA}}(x_{r},z_{f})) }\right]\\
&\quad-{\mathrm{E}}_{z_{f}\sim p_{z}\left(z_{f}\right)}\left[{\log (D_{\text{RA}}(z_{f},x_{r})) }\right] \tag{6}
\end{align*}
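A direct transcription of (3)–(6) in PyTorch, assuming c_real and c_fake are the raw (pre-sigmoid) discriminator outputs for the real image x_r and the generated image z_f. The value of the weight λ is not restated here, and a production implementation would use the numerically stable log-sigmoid/BCE-with-logits forms rather than the literal logs below.

```python
import torch
import torch.nn.functional as F

def d_ra(c_this, c_other):
    """Relativistic average discriminator D_RA(a, b) = sigmoid(C(a) - E[C(b)]),
    where C(.) is the raw discriminator output."""
    return torch.sigmoid(c_this - c_other.mean())

def discriminator_loss(c_real, c_fake):
    """Eq. (3): real images should look more realistic than the average fake, and vice versa."""
    return -(torch.log(d_ra(c_real, c_fake)).mean()
             + torch.log(1 - d_ra(c_fake, c_real)).mean())

def generator_loss(sr, hr, c_real, c_fake, lam):
    """Eqs. (4)-(6): L1 content loss plus the lambda-weighted relativistic adversarial loss."""
    l_cont = F.l1_loss(sr, hr)                                 # Eq. (5)
    l_adv = -(torch.log(1 - d_ra(c_real, c_fake)).mean()
              + torch.log(d_ra(c_fake, c_real)).mean())        # Eq. (6)
    return l_cont + lam * l_adv                                # Eq. (4)
```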
Results
In this section, we first describe the experimental environment. Then, to evaluate the proposed method, we present comparisons between the proposed methods and other methods. Finally, the application results of the proposed methods are shown.
A. Experimental Environment
The details of the experimental environment are listed in Table I.
B. Evaluation of the Proposed Methods
1) Experimental Dataset and Model Training
We selected the commonly used remote sensing dataset “UCMerced” [30] as the experimental dataset for the super-resolution reconstruction task. The UCMerced dataset contains 21 classes of images (including forest, buildings, beach, and so on), covering most common remote sensing scenes. Each class has 100 images, and all images are 256 × 256 pixels.
For optimization, the Adam optimizer [33] with an initial learning rate
2) Evaluation Metrics
To comprehensively evaluate the performance of the proposed method, we use two evaluation metrics with different focuses. The first metric is the peak signal-to-noise ratio (PSNR), the most widely used evaluation metric for images. The PSNR is based on the error between corresponding pixel points, and minimizing the mean squared error (MSE) is equivalent to maximizing the PSNR
\begin{equation*}
\text{PSNR}=10{{\log }_{10} ({{\text{Max}}^{2}_{I}}/{\text{MSE}})\ } \tag{7}
\end{equation*}
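For reference, a small NumPy helper implementing (7); whether the PSNR is computed on the RGB channels or on the luminance channel, and the choice of Max_I, depend on the evaluation protocol and are assumptions here.

```python
import numpy as np

def psnr(reference, test, max_i=255.0):
    """Eq. (7): PSNR = 10 * log10(Max_I^2 / MSE) between two images of identical shape."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_i ** 2 / mse)

hr = np.random.randint(0, 256, (256, 256, 3))
print(psnr(hr, np.clip(hr + np.random.randint(-5, 6, hr.shape), 0, 255)))  # roughly 38 dB
```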
However, the PSNR does not consider the visual characteristics of the human eye, so images with a high PSNR are often still perceived as low quality [34], [35]. Therefore, an additional evaluation metric, the learned perceptual image patch similarity (LPIPS) [36], is selected, which is more consistent with human perception than the PSNR. The LPIPS evaluates the perceptual similarity of two images with a deep learning model: deep features $\phi(x)_{l}$ and $\phi(x_{0})_{l}$ are extracted from layer $l$ of a pretrained network, scaled channel-wise by learned weights $\omega_{l}$, and compared as follows:
\begin{equation*}
\text{LPIPS}\left(x,x_{0}\right)=\sum _{l}\frac{1}{N_{l}}\left\Vert \omega_{l}\odot \left({\phi (x)}_{l}-{\phi \left(x_{0}\right)}_{l}\right)\right\Vert ^{2}_{2} \tag{8}
\end{equation*}
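In practice, LPIPS is usually computed with the authors' reference implementation rather than re-implemented from (8). A usage sketch with the lpips package follows; the AlexNet backbone is an assumption, not necessarily the backbone used in this paper's evaluation.

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")            # pretrained feature backbone
img0 = torch.rand(1, 3, 256, 256) * 2 - 1    # stand-in RGB images scaled to [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
print(loss_fn(img0, img1).item())            # lower score = perceptually more similar
```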
3) Comparative Results
To further evaluate the performance of the proposed SWCG and SWCGAN, we compared our methods with some advanced super-resolution models, including RCAN [37], SRGAN [16], EDSR [38], LGCNet [39], CDGAN [17], EEGAN [18], and HSENet [40], on the UCMerced dataset.
Table II lists the evaluation results of the different algorithms on the UCMerced dataset. Because different image categories have different learning difficulties, we calculated the PSNR, reflecting the mean pixel error, and the LPIPS, reflecting the perceptual similarity, of the super-resolution reconstruction methods on the 21 categories of images separately and then obtained the overall average values. On average, the proposed SWCGAN achieves the best score on the LPIPS metric, and the proposed SWCG achieves the second-best score on the PSNR metric. It is worth noting that the algorithms that minimize pixel loss as the training objective, including RCAN, EDSR, LGCNet, HSENet, and SWCG, can achieve a high PSNR, yet their performance on the LPIPS is unsatisfactory. In contrast, GANs that incorporate an adversarial loss tend to perform well on the LPIPS metric and obtain lower scores on the PSNR metric. Among these GANs, SRGAN obtains the worst PSNR, EEGAN obtains the best PSNR, and the proposed SWCGAN has the best LPIPS score and the second-best PSNR score.
On the other hand, the deep-learning-based super-resolution algorithms show clear scene-dependent differences. For example, all algorithms obtain a higher PSNR (>30) on the Baseballdiamond and Beach scenes, and all algorithms obtain a lower PSNR (<24) on the Harbor scene. Moreover, compared with the other algorithms, the proposed SWCGAN obtains the best LPIPS for the Denseresidential and Freeway scenes, and the proposed SWCG obtains the highest PSNR for the Overpass, Parkinglot, River, and Runway scenes. As shown in Fig. 7, across the different scenes, the values of the proposed SWCGAN are more concentrated around the mean, for both the PSNR and LPIPS metrics, with fewer outliers, which means that the performance of the proposed SWCGAN is the most stable compared with the other algorithms. As listed in Table II, the proposed SWCGAN has the smallest standard deviation, which also reflects its superior stability.
PSNR and LPIPS distributions of the different models on all scenes of the UCMerced dataset. (a) PSNR distribution of the different models. (b) LPIPS distribution of the different models. (Each white dot represents the PSNR or LPIPS of the corresponding model on one scene.)
Specifically, as illustrated in Fig. 8, two examples, Baseballdiamond33 and Golfcourse35, are chosen to compare the details produced by the different algorithms. In Fig. 8(a), the proposed SWCGAN, despite the lowest PSNR, has the best LPIPS and shows the most details and the best perceptual quality. In Fig. 8(b), both SWCGAN and CDGAN, with better LPIPS, show better perceptual quality. Furthermore, except for SWCGAN and CDGAN, the other algorithms exhibit a similar image style in the super-resolution reconstruction task, i.e., object borders are not sufficiently sharp and the transitions between pixels are overly smooth, causing the images to appear blurry.
Detailed comparison of the outputs with different methods. (a) Baseballdiamond. (b) Golfcourse.
C. Ablation Studies
According to the comparative results, the proposed SWCGAN obtained the best score on the LPIPS metric, and the proposed SWCG obtained the second-best score on the PSNR metric. In this section, to further verify the effectiveness of the swin transformer block (STB) in the proposed RDSTB, ablation experiments are conducted in which the STBs are replaced with convolutional blocks. Moreover, to demonstrate the impact of pretraining the generator (i.e., SWCG), an experiment without pretraining is added. Finally, to verify the effectiveness of the simplified swin transformer as a discriminator, an experiment using the complete swin transformer as the discriminator is added.
1) ConvGAN: The STBs in the generator of the SWCGAN are replaced by convolution blocks, and the simplified swin transformer is used in the discriminator.
2) ConvG: The STBs in the generator of the SWCG are replaced by convolution blocks, and the discriminator is removed.
3) SWCGAN (N): The proposed SWCGAN without pretraining.
4) SWCGAN (all): The complete swin transformer is used as the discriminator in the SWCGAN.
As listed in Table III, compared with SWCGAN and SWCG, the corresponding models (ConvGAN and ConvG) perform worse on both the PSNR and LPIPS metrics when the STBs are replaced with convolutional blocks. Compared with the standard mean-square-error-based super-resolution model ConvG, ConvGAN, owing to the adversarial loss, exhibits the same characteristics as the other GANs, i.e., a lower score on the PSNR metric and better performance on the human-perception-oriented LPIPS metric.
On the other hand, SWCGAN (N) performs excellently on the LPIPS metric compared to SWCGAN, yet performs poorly (
Furthermore, SWCGAN (all) with higher complexity only achieves a slightly better performance than SWCGAN. When the input size is
D. Application of the Proposed Methods
To verify that the proposed algorithms can be applied to real satellite remote sensing images, we test the super-resolution performance of the proposed SWCGAN and SWCG on a real-world multispectral image from WorldView-4, which acquires satellite images with a panchromatic resolution of 0.3 m and a multispectral resolution of 1.24 m. Because real high-resolution ground truth is unavailable, we choose NIQE [41], a no-reference evaluation metric that differs from the PSNR and LPIPS, as the evaluation metric; a lower score means a better super-resolution output.
As shown in Fig. 9, before super-resolution reconstruction, the NIQE of the original low-resolution image is 17.049. After super-resolution reconstruction, the quality of the image is significantly improved, and the proposed SWCGAN and SWCG obtain the second-best and best scores, respectively, compared with the other models. When the real satellite remote sensing image (
Application of the proposed methods to a real-world multispectral image for super-resolution reconstruction. (Red indicates the best score and blue the second-best.)
Discussion
A. Advantages of the Proposed SWCGAN
The advantage of the proposed method is that it overcomes the inherent drawbacks of the convolutional layer by introducing the swin transformer based on shifted window self-attention; the proposed SWCGAN achieves the best score on the LPIPS metric and performs better than the other GANs on all metrics. Across the different scenes, the proposed method has the smallest standard deviation, which means that it has the best stability. Specifically, because the introduction of the swin transformer overcomes the shortcomings of convolutional layers, the generator can better learn the relationships between different regions of the image, which leads to generated images with clear object boundaries, unlike the other algorithms, whose overly smooth transitions between objects cause the images to appear blurry. Moreover, the swin transformer discriminator is essential for distinguishing generated images from real images based on fine features, which steers the training objective so that the style of the generated images converges to that of the real images rather than only reducing the pixel-wise error.
On the other hand, the proposed RDSTB that constitutes the deep feature extractor performs so well that the proposed SWCG (i.e., the generator of SWCGAN trained using only the pixel loss) reaches a level close to the state of the art with only four RDSTBs. As listed in Table IV, among the standard mean-square-error-based super-resolution models, the proposed SWCG, with significantly fewer parameters and FLOPs than RCAN, achieves super-resolution performance close to that of RCAN and outperforms EDSR, which has exceptionally many parameters and FLOPs.
B. Shortcomings of the Proposed SWCGAN
The shortcoming of the proposed method is that its training time is longer than that of general CNNs due to the introduction of the swin transformer. Several investigations [23] have shown that networks based on self-attention mechanisms require more data and training time than general CNNs for image processing tasks because they lack the convolutional inductive bias, i.e., the built-in assumptions that convolution imposes on the problem. In addition, the use of the swin transformer causes a significant increase in complexity. As listed in Table IV, even though a lightweight architecture was chosen, the complexity of SWCG, although moderate, is still much higher than that of the lightweight network LGCNet.
Moreover, although the proposed SWCGAN performs best among the GANs, its performance according to the PSNR metric is still unsatisfactory compared with the standard mean-square-error-based super-resolution models due to the effect of the adversarial loss on training. Therefore, for super-resolution reconstruction tasks requiring a high PSNR, this issue can be addressed by adjusting the hyperparameter $\lambda$ in (4) to reduce the weight of the adversarial loss.
C. Outlook and Future Work
In the future, we will continue to use the proposed RDSTB to build a large-scale model to further improve the super-resolution performance of SWCGAN. We believe that the proposed RDSTB can form a deeper and larger network to obtain better super-resolution performance due to the dense connectivity and residual structure. Furthermore, as a feature extraction block of images, the proposed RDSTB will be applied to remote sensing image processing tasks, including classification, recognition, and semantic segmentation. Finally, for the problem of sparse remote sensing image data, we will consider training deep-learning-based models using self-supervised [42] or unsupervised methods [43], [44] in the future.
On the other hand, the superior performance of pure swin-transformer-based models in the field of computational vision has been proven. Thus, we will develop a GAN-based super-resolution network composed of pure swin transformer blocks.
Conclusion
In this article, we proposed a GAN by combining the advantages of the swin transformer and convolutional layers for super-resolution reconstruction, i.e., SWCGAN. The essential idea behind the proposed method is to generate high-resolution images by a generator network with a hybrid of convolutional and swin transformer layers and then use a pure swin transformer discriminator network for adversarial training. In this proposed SWCGAN,
we used a convolutional layer for shallow feature extraction;
we proposed the RDSTB to extract deep features of the image for upsampling to generate high-resolution images;
we used a simplified swin transformer as the discriminator for adversarial training.
To demonstrate the performance of the proposed methods, we designed experiments on the UCMerced dataset. The results indicate that the proposed SWCGAN outperforms other state-of-the-art methods on most metrics and performs best on the LPIPS metric compared with other GANs. The ablation experiments confirm the effectiveness of the STB in the proposed RDSTB. Finally, the proposed SWCGAN is applied to a real satellite remote sensing image, and the image quality is improved more than with the other models.