Introduction
Currently, remote sensing plays an important role in various fields [1]–[4]. As a critical component of remote sensing, optical satellite remote sensing with high spatial resolution provides clear spatial texture information about the observed targets and captures the essential features of the landscape. However, increasing the pixel density of the sensor significantly increases the hardware cost of acquiring optical remote sensing images. To acquire high-resolution remote sensing images more conveniently and cost-effectively, super-resolution reconstruction techniques, which recover high-resolution images from low-resolution images, have received much attention.
The essential idea behind image super-resolution is to learn prior knowledge from image data and then use it to recover the lost details of low-resolution images. It is worth mentioning that some early methods, such as sparse-reconstruction-based methods [5], degradation-model-based methods [6], and interpolation [7], achieved success in recovering high-resolution images by acquiring a priori knowledge, despite their limited learning ability. In recent years, deep learning methods have achieved great success in various fields, and naturally, super-resolution reconstruction methods based on deep learning, with their powerful learning capabilities, surpass traditional methods in performance. Convolutional neural networks (CNNs) have long been the “standard answer” to image processing tasks. Dong et al. [8] proposed the super-resolution CNN, the first CNN with strong learning ability for super-resolution reconstruction. To further boost the performance of CNN-based super-resolution reconstruction, various new techniques, including residual dense blocks [9], [10], residual learning [11], [12], and recursive blocks [13], have also been introduced.
Moreover, since generative adversarial networks (GANs) were first proposed and received great attention [14], [15], GANs built from convolution modules have also shown impressive performance on the super-resolution reconstruction task. For example, SRGAN [16] first introduced a GAN architecture for super-resolution reconstruction and proposed a perceptual-similarity-based loss function. CDGAN [17] improved the discriminator by feeding both the generated image and its high-resolution ground truth into the discriminator for better discrimination. EEGAN [18] reduces the interference of noise in the super-resolution reconstruction of satellite images by purifying the noise-contaminated components with mask processing. Although CNNs with powerful learning capabilities offer a significant performance improvement over traditional methods, they cannot escape the fundamental limitation of the convolutional layer: because convolution is a local operation, it has difficulty capturing long-range dependencies and can even fail to do so.
To solve the abovementioned problem, the self-attention mechanism derived from the transformer [19]–[21] has been used as an alternative to CNNs; it captures global interactions between contexts and shows excellent performance on several visual tasks. However, networks designed around the transformer block often have more parameters than typical convolutional networks. Moreover, transformers for image restoration typically segment the input image into patches [22]–[24], which can introduce boundary artifacts around each patch.
Recently, the swin transformer [25], a self-attention network that overcomes the abovementioned shortcomings of the transformer, has shown great potential in computer vision. Thanks to its shifted window scheme, it can process large images without dividing them into patches while still learning long-range dependencies like a transformer, and it has a lower computational cost than the original transformer. The swin transformer has reached the state of the art in image classification and semantic segmentation. However, its application to image super-resolution, especially for remote sensing images, is still relatively rare.
Remote sensing images contain more information than natural images, and their pixels exhibit strong mutual correlations. CNNs have difficulty acquiring global information and long-range dependencies between pixels in remote sensing images. Moreover, since real remote sensing images tend to be larger than natural images and the complexity of self-attention networks is high, a pure self-attention network is prone to memory bottlenecks if remote sensing images are used directly as input.
In this article, to adapt to the characteristics of remote sensing images, we first introduce the shifted window self-attention mechanism of the swin transformer into super-resolution research on remote sensing images and propose a GAN that combines the advantages of the swin transformer and convolutional layers, namely, SWCGAN. Specifically, in the proposed SWCGAN,
we employ a convolutional layer for shallow feature extraction that can be adapted to flexible input sizes;
we propose the residual dense swin transformer block (RDSTB) by drawing on the characteristics of DenseNet [9] to build the depth feature extraction module of the generator, which is used to obtain the deep features for upsampling to generate high-resolution images;
we simplify the original swin transformer and use it as a discriminator.
To demonstrate the performance of the proposed SWCGAN, the UCMerced dataset is utilized for training and validation, and the proposed method outperforms other state-of-the-art methods on most metrics. Moreover, the proposed method is applied to a real-world remote sensing image to verify its effectiveness and applicability.
Our contributions are as follows.
We propose a GAN with a hybrid of convolutional and swin transformer layers for the super-resolution reconstruction of remote sensing images, which accounts for their large size, rich information content, and strong interpixel correlation.
We further propose a depth feature extraction block, namely, RDSTB, which can extract deep image features efficiently by stacking multiple blocks. As a feature extraction block of images, the proposed RDSTB can also be used in other image processing tasks in the future.
We evaluate the proposed method using the UCMerced benchmark dataset and real-world remote sensing images from a high-resolution satellite.
The rest of this article is organized as follows. In Section II, the proposed method, including the network architecture, the loss functions, and the shifted window self-attention mechanism, is described in detail. Section III presents the experimental results and performance evaluations, and the proposed method is applied to a real-world remote sensing image. In Section IV, we present a discussion. Finally, Section V concludes this article.
Methods
In this section, we first briefly introduce the essential idea behind the proposed method, i.e., the idea of the SWCGAN for super-resolution. Specifically, we describe the generator network, the discriminator network, and the loss functions of the proposed method. Furthermore, the shifted window self-attention mechanism is summarized.
A. Overview of SWCGAN
In this article, we propose a GAN that combines the advantages of the swin transformer and convolutional layers for super-resolution. The workflow of the proposed SWCGAN is illustrated in Fig. 1. A typical GAN model consists of two parts, a generator G and a discriminator D. As shown in Fig. 1(a), the input of the generator G is the low-resolution image, whose features are extracted by the feature extraction module to obtain feature maps. To obtain the generated high-resolution image, the extracted feature maps are upsampled by the upsampler (in this article, 4x upsampling is used). For the discriminator D, the input is the generated high-resolution image, which also undergoes feature extraction to obtain deep features; unlike in the generator, the sizes of the feature maps gradually decrease. Finally, a linear layer performs the binary classification by outputting 0 or 1. Naturally, in an image super-resolution GAN, G is trained to generate a fake high-resolution image by reducing the difference between the fake and real high-resolution images, and D is trained to distinguish the real high-resolution image from the generated one. G and D compete with each other during training in such a way that the data distribution of the generated images gradually approaches the real distribution [see Fig. 1(b)].
Illustration of GAN for the super-resolution reconstruction task. (a) Network framework. (b) Data distribution.
Previously, convolutional blocks were commonly used to build GANs for super-resolution reconstruction. However, the convolutional layers that compose CNNs limit their performance on the super-resolution reconstruction task due to an inherent problem: the inability to model long-range dependencies. To address this issue, we propose a GAN with a hybrid of convolutional and swin transformer layers, named SWCGAN, for the super-resolution reconstruction task, in which swin transformer layers are combined with convolutional layers through dense connections and a residual structure to form the RDSTB for extracting image features.
B. Generator Network in SWCGAN
As shown in Figs. 1(a) and 2(a), the generator can be further divided into the following three modules:
Architecture of the proposed SWCGAN. (a) Network architecture of generator and discriminator. (b) RDSTB. (c) Swin transformer block.
shallow feature extraction module;
deep feature extraction module;
upsampling module.
In the shallow feature extraction module, a convolution layer is applied to the input low-resolution image to extract the shallow feature $F_{0}$.
After extracting shallow features, the deep feature extraction module, which is built by stacking the proposed RDSTBs [see Fig. 2(b)], is used to obtain the deep feature $F_{D}$ for subsequent upsampling.
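To make the structure concrete, the following is a minimal PyTorch sketch of the RDSTB wiring described above: dense connections that reuse all earlier features, 1 × 1 convolutions that fuse them, and a local residual skip. The STB itself is passed in as a factory; a plain 3 × 3 convolution stands in by default so the sketch runs, and the block count, fusion convolutions, and channel width are illustrative assumptions rather than the exact configuration of SWCGAN.

```python
import torch
import torch.nn as nn

class RDSTB(nn.Module):
    """Residual dense block whose units are swin transformer blocks (STBs).

    make_stb must return a module mapping (B, C, H, W) -> (B, C, H, W); plugging in a
    real shifted-window attention block gives the RDSTB idea, while the default
    convolutional stand-in only keeps the sketch runnable.
    """
    def __init__(self, dim, num_stbs=4, make_stb=None):
        super().__init__()
        make_stb = make_stb or (lambda: nn.Conv2d(dim, dim, 3, padding=1))  # placeholder STB
        self.reduce = nn.ModuleList([nn.Conv2d(dim * (i + 1), dim, 1) for i in range(num_stbs)])
        self.stbs = nn.ModuleList([make_stb() for _ in range(num_stbs)])
        self.fuse = nn.Conv2d(dim * (num_stbs + 1), dim, 1)

    def forward(self, x):
        feats = [x]
        for reduce, stb in zip(self.reduce, self.stbs):
            y = stb(reduce(torch.cat(feats, dim=1)))    # dense connection: reuse all earlier features
            feats.append(y)
        return x + self.fuse(torch.cat(feats, dim=1))   # local residual connection

print(RDSTB(dim=64)(torch.randn(1, 64, 48, 48)).shape)  # torch.Size([1, 64, 48, 48])
```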
As shown in Fig. 3, the nearest neighbor interpolation method [26], [27] combined with a convolutional layer is applied to upsample the feature map after deep feature extraction, increasing its spatial size by a factor of 4 (4x upsampling in this article). The reconstructed high-resolution image is obtained as
\begin{equation*}
I_{\text{HR}}=H_{\text{up}}(F_{0}+F_{D}) \tag{1}
\end{equation*}
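Equation (1) states that the upsampler $H_{\text{up}}$ acts on the sum of the shallow feature $F_{0}$ and the deep feature $F_{D}$. The sketch below wires the three modules together, reusing the RDSTB class from the previous sketch; the two interpolation-plus-convolution stages (for 4x in total), the LeakyReLU activations, and the channel width of 64 are assumptions made for illustration, not the exact layer configuration of the SWCGAN generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NearestConvUpsampler(nn.Module):
    """4x upsampling via two stages of nearest-neighbor interpolation followed by a 3x3 conv."""
    def __init__(self, dim, out_channels=3):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, padding=1)
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1)
        self.to_rgb = nn.Conv2d(dim, out_channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, f):
        f = self.act(self.conv1(F.interpolate(f, scale_factor=2, mode="nearest")))
        f = self.act(self.conv2(F.interpolate(f, scale_factor=2, mode="nearest")))
        return self.to_rgb(f)

class Generator(nn.Module):
    """Shallow conv feature F0, stacked RDSTBs for the deep feature FD, then I_HR = H_up(F0 + FD)."""
    def __init__(self, dim=64, num_rdstb=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)                       # shallow feature extraction
        self.deep = nn.Sequential(*[RDSTB(dim) for _ in range(num_rdstb)])   # deep feature extraction
        self.up = NearestConvUpsampler(dim)                                  # upsampling module

    def forward(self, lr):
        f0 = self.shallow(lr)
        fd = self.deep(f0)
        return self.up(f0 + fd)                                              # Eq. (1)

print(Generator()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 256, 256])
```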
C. Discriminator Network in SWCGAN
For the discriminator, a simplified swin transformer is used for the binary classification task. As shown in Figs. 2(a) and 4, in the original swin transformer, feature extraction consists of four stages, and the dimensions of the input data are transformed from
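Regardless of the exact stage dimensions, the structure just described, feature maps that shrink stage by stage followed by a single linear classification layer, can be sketched as follows. The strided convolutions standing in for patch merging, the placeholder blocks, and the channel widths are assumptions for illustration only, not the paper's simplified swin transformer.

```python
import torch
import torch.nn as nn

class SwinDiscriminator(nn.Module):
    """Hierarchical discriminator sketch: each stage halves the spatial resolution
    (a strided conv stands in for patch merging) and applies a block; a linear
    head then produces the single real/fake score."""
    def __init__(self, in_ch=3, dims=(64, 128, 256, 512), make_block=None):
        super().__init__()
        make_block = make_block or (lambda d: nn.Conv2d(d, d, 3, padding=1))  # stand-in for an STB
        layers, prev = [], in_ch
        for d in dims:
            layers += [nn.Conv2d(prev, d, kernel_size=2, stride=2), make_block(d)]
            prev = d
        self.stages = nn.Sequential(*layers)
        self.head = nn.Linear(dims[-1], 1)

    def forward(self, img):
        f = self.stages(img).mean(dim=(2, 3))   # global average pooling over H and W
        return self.head(f)                     # raw logit; the sigmoid lives inside the loss

print(SwinDiscriminator()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 1])
```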
D. Use of Swin Transformer in SWCGAN
To address the challenges involved in applying CNNs and transformers to the image domain, the swin transformer introduced a shifted window operation combining nonoverlapping local windows with cross-window connections, which restricts the attention computation to each window. This allows the model to enjoy the locality advantages of convolutional operations on the one hand and to save computational effort on the other.
As shown in Fig. 4, the whole model adopts a hierarchical design with a total of four stages, each of which reduces the resolution of the input feature map and expands the receptive field layer by layer, as in a CNN. First, compared with the vision transformer [23], [29], patch partitioning becomes an optional operation thanks to the window self-attention mechanism, which greatly increases the flexibility of the model. Second, before the swin transformer blocks, a patch merging operation performs downsampling; it is used to reduce the resolution, adjust the number of channels, and save computational effort. In patch merging, the features of each group of 2 × 2 neighboring patches are concatenated (halving the spatial resolution), and a linear layer then reduces the concatenated channels from 4C to 2C.
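The patch merging step follows the reference swin transformer design. A short sketch (token-major layout, as in the original implementation; the channel count is arbitrary):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style patch merging: concatenate each 2x2 group of neighboring patches
    (giving 4C channels at half resolution), normalize, and project down to 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                  # x: (B, H*W, C) tokens, H and W even
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, (H // 2) * (W // 2), 4 * C)
        return self.reduction(self.norm(x))                          # (B, H*W/4, 2C)

tokens = torch.randn(1, 8 * 8, 96)              # an 8x8 feature map with 96 channels
print(PatchMerging(96)(tokens, 8, 8).shape)     # torch.Size([1, 16, 192])
```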
To solve the problem of the high computational complexity caused by global attention in the traditional transformer, the swin transformer reduces the complexity of the algorithm through window attention, which limits the computation of attention to each window. The self-attention within a window is computed as follows:
\begin{equation*}
A(Q,K,V)=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}+B\right)V \tag{2}
\end{equation*}
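In (2), Q, K, and V are the query, key, and value matrices of the tokens inside one window, d is the query/key dimension, and B is the learned relative position bias. A minimal sketch of this per-window computation (the tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def window_attention(q, k, v, bias):
    """Eq. (2): Softmax(Q K^T / sqrt(d) + B) V, evaluated independently per window.

    q, k, v: (num_windows * B, num_heads, N, d), where N = window_size ** 2 tokens.
    bias:    (num_heads, N, N) relative position bias B, broadcast over the windows.
    """
    d = q.shape[-1]
    attn = q @ k.transpose(-2, -1) / d ** 0.5 + bias   # (..., N, N) attention logits
    return F.softmax(attn, dim=-1) @ v                 # (..., N, d) attended values

q = k = v = torch.randn(4, 3, 16, 32)   # 4 windows, 3 heads, 4x4 window (16 tokens), head dim 32
bias = torch.zeros(3, 16, 16)           # learned in practice; zeros here just for the shape check
print(window_attention(q, k, v, bias).shape)  # torch.Size([4, 3, 16, 32])
```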
The abovementioned window self-attention is calculated independently for each window. To allow different windows to interact with one another, a unique method called the shifted window is adopted in the swin transformer, as illustrated in Fig. 5. First, a regular window partitioning scheme is adopted in layer $l$, in which the feature map is divided evenly into nonoverlapping windows. Then, in the next layer $l+1$, the window configuration is shifted by half the window size relative to layer $l$, so that the new windows straddle the boundaries of the previous windows and introduce connections between them.
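A compact sketch of how this cyclic shift is typically realized with torch.roll before the windows are partitioned. In the full swin transformer, an attention mask additionally prevents tokens that become adjacent only through the wrap-around from attending to each other, and the feature map is rolled back after attention; the toy sizes below are assumptions.

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into (num_windows * B, ws, ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, C)

def shifted_windows(x, ws, shift):
    """Cyclically shift the map (shift = ws // 2 in the shifted layer, 0 in the regular
    layer) so that the new windows straddle the boundaries of the previous partition."""
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    return window_partition(x, ws)

x = torch.randn(1, 8, 8, 32)                    # toy feature map
regular = shifted_windows(x, ws=4, shift=0)     # regular partition: 2 x 2 = 4 windows
shifted = shifted_windows(x, ws=4, shift=2)     # partition displaced by ws // 2
print(regular.shape, shifted.shape)             # torch.Size([4, 4, 4, 32]) twice
```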
E. Loss Functions in SWCGAN
Our choice is the relativistic average GAN, which differs from the standard GAN that discriminates real images as 1 and fake images as 0: it instead estimates the probability that a real image is relatively more realistic than a fake image (see Fig. 6). The loss function of the discriminator is defined as follows:
\begin{equation*}
L_{D}=-{\mathrm{E}}_{x_{r}\sim p_{\text{data}}\left(x_{r}\right)}\left[{\log (D_{\text{RA}}(x_{r},z_{f})) }\right]\\
-{\mathrm{E}}_{z_{f}\sim p_{z}\left(z_{f}\right)}\left[{\log (1-D_{\text{RA}}(z_{f},x_{r})) }\right] \tag{3}
\end{equation*}
where $x_{r}$ denotes the real high-resolution image, $z_{f}$ denotes the generated image, and $D_{\text{RA}}(a,b)$ estimates the probability that $a$ is relatively more realistic than $b$.
We set the training objective of the generator to minimize the joint loss, which consists of the content loss and the adversarial loss. The loss function of the generator is defined as follows:
\begin{align*}
L_{G}&=\ L_{\text{cont}}+{\lambda L}_{\text{adv}} \tag{4}
\\
L_{\text{cont}}&=\left\Vert z_{f}-x_{r}\right\Vert _{1} \tag{5}
\\
L_{\text{adv}}&=-{\mathrm{E}}_{x_{r}\sim p_{\text{data}}\left(x_{r}\right)}\left[{\log (1-D_{\text{RA}}(x_{r},z_{f})) }\right]\\
&\quad-{\mathrm{E}}_{z_{f}\sim p_{z}\left(z_{f}\right)}\left[{\log (D_{\text{RA}}(z_{f},x_{r})) }\right] \tag{6}
\end{align*}
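A direct transcription of (3)–(6) in PyTorch, assuming c_real and c_fake are the raw (pre-sigmoid) discriminator outputs for the real image x_r and the generated image z_f. The value of the weight λ is not restated here, and a production implementation would use the numerically stable log-sigmoid/BCE-with-logits forms rather than the literal logs below.

```python
import torch
import torch.nn.functional as F

def d_ra(c_this, c_other):
    """Relativistic average discriminator D_RA(a, b) = sigmoid(C(a) - E[C(b)]),
    where C(.) is the raw discriminator output."""
    return torch.sigmoid(c_this - c_other.mean())

def discriminator_loss(c_real, c_fake):
    """Eq. (3): real images should look more realistic than the average fake, and vice versa."""
    return -(torch.log(d_ra(c_real, c_fake)).mean()
             + torch.log(1 - d_ra(c_fake, c_real)).mean())

def generator_loss(sr, hr, c_real, c_fake, lam):
    """Eqs. (4)-(6): L1 content loss plus the lambda-weighted relativistic adversarial loss."""
    l_cont = F.l1_loss(sr, hr)                                 # Eq. (5)
    l_adv = -(torch.log(1 - d_ra(c_real, c_fake)).mean()
              + torch.log(d_ra(c_fake, c_real)).mean())        # Eq. (6)
    return l_cont + lam * l_adv                                # Eq. (4)
```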
Results
In this section, we first describe the experimental environment. Then, to evaluate the proposed method, we present comparisons between the proposed methods and other methods. Finally, the application results of the proposed methods are shown.
A. Experimental Environment
The details of the experimental environment are listed in Table I.
B. Evaluation of the Proposed Methods
1) Experimental Dataset and Model Training
We selected the commonly used remote sensing dataset “UCMerced” [30] as the experimental dataset for the super-resolution reconstruction task. The UCMerced dataset contains 21 classes of images (including forest, buildings, beach, and so on), covering most common remote sensing scenes. Each class has 100 images, and all images are 256 × 256 pixels.
For optimization, the Adam optimizer [33] with an initial learning rate
2) Evaluation Metrics
To comprehensively evaluate the performance of the proposed method, we use two evaluation metrics with different focuses. The first metric is the peak signal-to-noise ratio (PSNR), the most widely used evaluation metric for images. The PSNR is based on the error between corresponding pixel points, and minimizing the mean squared error (MSE) is equivalent to maximizing the PSNR
\begin{equation*}
\text{PSNR}=10{{\log }_{10} ({{\text{Max}}^{2}_{I}}/{\text{MSE}})\ } \tag{7}
\end{equation*}
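For reference, a small NumPy helper implementing (7); whether the PSNR is computed on the RGB channels or on the luminance channel, and the choice of Max_I, depend on the evaluation protocol and are assumptions here.

```python
import numpy as np

def psnr(reference, test, max_i=255.0):
    """Eq. (7): PSNR = 10 * log10(Max_I^2 / MSE) between two images of identical shape."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_i ** 2 / mse)

hr = np.random.randint(0, 256, (256, 256, 3))
print(psnr(hr, np.clip(hr + np.random.randint(-5, 6, hr.shape), 0, 255)))  # roughly 38 dB
```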
However, the PSNR does not consider the visual characteristics of the human eye, so images with a high PSNR are often still perceived as low quality [34], [35]. Therefore, an additional evaluation metric, the learned perceptual image patch similarity (LPIPS) [36], is selected, which is more consistent with human perception than the PSNR. The LPIPS evaluates the perceptual similarity of two images with a deep learning model: deep features $\phi(x)_{l}$ and $\phi(x_{0})_{l}$ are extracted from layer $l$ of a pretrained network, scaled channel-wise by learned weights $\omega_{l}$, and compared as follows:
\begin{equation*}
\text{LPIPS}\left(x,x_{0}\right)=\sum _{l}\frac{1}{N_{l}}\left\Vert \omega_{l}\odot \left({\phi (x)}_{l}-{\phi \left(x_{0}\right)}_{l}\right)\right\Vert ^{2}_{2} \tag{8}
\end{equation*}
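In practice, LPIPS is usually computed with the authors' reference implementation rather than re-implemented from (8). A usage sketch with the lpips package follows; the AlexNet backbone is an assumption, not necessarily the backbone used in this paper's evaluation.

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")            # pretrained feature backbone
img0 = torch.rand(1, 3, 256, 256) * 2 - 1    # stand-in RGB images scaled to [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
print(loss_fn(img0, img1).item())            # lower score = perceptually more similar
```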
3) Comparative Results
To further evaluate the performance of the proposed SWCG and SWCGAN, we compared our methods with some advanced super-resolution models, including RCAN [37], SRGAN [16], EDSR [38], LGCNet [39], CDGAN [17], EEGAN [18], and HSENet [40], on the UCMerced dataset.
Table II lists the evaluation results of the different algorithms on the UCMerced dataset. Because different image categories have different learning difficulties, we calculated the PSNR, reflecting the mean pixel error, and the LPIPS, reflecting the perceptual similarity, of the super-resolution reconstruction methods on the 21 categories of images separately and then obtained the overall average values. On average, the proposed SWCGAN achieves the best score on the LPIPS metric, and the proposed SWCG achieves the second-best score on the PSNR metric. It is worth noting that the algorithms that minimize pixel loss as the training objective, including RCAN, EDSR, LGCNet, HSENet, and SWCG, can achieve a high PSNR, yet their performance on the LPIPS is unsatisfactory. In contrast, GANs that incorporate an adversarial loss tend to perform well on the LPIPS metric and obtain lower scores on the PSNR metric. Among these GANs, SRGAN obtains the worst PSNR, EEGAN obtains the best PSNR, and the proposed SWCGAN has the best LPIPS score and the second-best PSNR score.
On the other hand, the deep-learning-based super-resolution algorithms show clear scene-dependent differences. For example, all algorithms obtain a higher PSNR (>30) on the Baseballdiamond and Beach scenes, and all algorithms obtain a lower PSNR (<24) on the Harbor scene. Moreover, compared with the other algorithms, the proposed SWCGAN obtains the best LPIPS for the Denseresidential and Freeway scenes, and the proposed SWCG obtains the highest PSNR for the Overpass, Parkinglot, River, and Runway scenes. As shown in Fig. 7, across the different scenes, the values of the proposed SWCGAN are more concentrated around the mean, for both the PSNR and LPIPS metrics, with fewer outliers, which means that the performance of the proposed SWCGAN is the most stable compared with the other algorithms. As listed in Table II, the proposed SWCGAN has the smallest standard deviation, which also reflects its superior stability.
PSNR and LPIPS distributions of the different models on all scenes of the UCMerced dataset. (a) PSNR distribution of the different models. (b) LPIPS distribution of the different models. (Each white dot represents the PSNR or LPIPS of the corresponding model on one scene.)
Specifically, as illustrated in Fig. 8, two examples, Baseballdiamond33 and Golfcourse35, are chosen to compare the details produced by the different algorithms. In Fig. 8(a), the proposed SWCGAN, despite the lowest PSNR, has the best LPIPS and shows the most details and the best perceptual quality. In Fig. 8(b), both SWCGAN and CDGAN, with better LPIPS, show better perceptual quality. Furthermore, except for SWCGAN and CDGAN, the other algorithms exhibit a similar image style in the super-resolution reconstruction task, i.e., object borders are not sufficiently sharp and the transitions between pixels are overly smooth, causing the images to appear blurry.
Detailed comparison of the outputs with different methods. (a) Baseballdiamond. (b) Golfcourse.
C. Ablation Studies
According to the comparative results, the proposed SWCGAN obtained the best score on the LPIPS metric, and the proposed SWCG obtained the second-best score on the PSNR metric. In this section, to further verify the effectiveness of the swin transformer block (STB) in the proposed RDSTB, ablation experiments are conducted in which the STBs are replaced with convolutional blocks. Moreover, to demonstrate the impact of pretraining the generator (i.e., SWCG), an experiment without pretraining is added. Finally, to verify the effectiveness of the simplified swin transformer as a discriminator, an experiment using the complete swin transformer as the discriminator is added.
1) ConvGAN: The STBs in the generator of the SWCGAN are replaced by convolution blocks, and the simplified swin transformer is used in the discriminator.
2) ConvG: The STBs in the generator of the SWCG are replaced by convolution blocks, and the discriminator is removed.
3) SWCGAN (N): The proposed SWCGAN without pretraining.
4) SWCGAN (all): The complete swin transformer is used as the discriminator in the SWCGAN.
As listed in Table III, compared with SWCGAN and SWCG, the corresponding models (ConvGAN and ConvG) perform worse on both the PSNR and LPIPS metrics when the STBs are replaced with convolutional blocks. Compared with the standard mean-square-error-based super-resolution model ConvG, ConvGAN, owing to the adversarial loss, exhibits the same characteristics as the other GANs, i.e., a lower score on the PSNR metric and better performance on the human-perception-oriented LPIPS metric.
On the other hand, SWCGAN (N) performs excellently on the LPIPS metric compared to SWCGAN, yet performs poorly (
Furthermore, SWCGAN (all) with higher complexity only achieves a slightly better performance than SWCGAN. When the input size is
D. Application of the Proposed Methods
To verify that the proposed algorithms can be applied to real satellite remote sensing images, we test the super-resolution performance of the proposed SWCGAN and SWCG on a real-world multispectral image from WorldView-4, which acquires satellite images with a panchromatic resolution of 0.3 m and a multispectral resolution of 1.24 m. Because real high-resolution ground truth is unavailable, we choose NIQE [41], a no-reference evaluation metric that differs from the PSNR and LPIPS, as the evaluation metric; a lower score means a better super-resolution output.
As shown in Fig. 9, before super-resolution reconstruction, the NIQE of the original low-resolution image is 17.049. After super-resolution reconstruction, the quality of the image is significantly improved, and the proposed SWCGAN and SWCG obtain the second-best and best scores, respectively, compared with the other models. When the real satellite remote sensing image (
Application of the proposed methods to a real-world multispectral image for super-resolution reconstruction. (Red indicates the best score and blue the second-best.)
Discussion
A. Advantages of the Proposed SWCGAN
The advantage of the proposed method is that it overcomes the inherent drawbacks of the convolutional layer by introducing the swin transformer based on shifted window self-attention; the proposed SWCGAN achieves the best score on the LPIPS metric and performs better than the other GANs on all metrics. Across the different scenes, the proposed method has the smallest standard deviation, which means that it has the best stability. Specifically, because the introduction of the swin transformer overcomes the shortcomings of convolutional layers, the generator can better learn the relationships between different regions of the image, which leads to generated images with clear object boundaries, unlike the other algorithms, whose overly smooth transitions between objects cause the images to appear blurry. Moreover, the swin transformer discriminator is essential for distinguishing generated images from real images based on fine features, which steers the training objective so that the style of the generated images converges to that of the real images rather than only reducing the pixel-wise error.
On the other hand, the proposed RDSTB that constitutes the deep feature extractor performs so well that the proposed SWCG (i.e., the generator of SWCGAN trained using only the pixel loss) reaches a level close to the state of the art with only four RDSTBs. As listed in Table IV, among the standard mean-square-error-based super-resolution models, the proposed SWCG, with significantly fewer parameters and FLOPs than RCAN, achieves super-resolution performance close to that of RCAN and outperforms EDSR, which has exceptionally many parameters and FLOPs.
B. Shortcomings of the Proposed SWCGAN
The shortcoming of the proposed method is that its training time is longer than that of general CNNs due to the introduction of the swin transformer. Several investigations [23] have shown that networks based on self-attention mechanisms require more data and training time than general CNNs for image processing tasks because they lack the convolutional inductive bias, i.e., the built-in assumptions that convolution imposes on the problem. In addition, the use of the swin transformer causes a significant increase in complexity. As listed in Table IV, even though a lightweight architecture was chosen, the complexity of SWCG, although moderate, is still much higher than that of the lightweight network LGCNet.
Moreover, although the proposed SWCGAN performs best among the GANs, its performance according to the PSNR metric is still unsatisfactory compared with the standard mean-square-error-based super-resolution models due to the effect of the adversarial loss on training. Therefore, for super-resolution reconstruction tasks requiring a high PSNR, this issue can be addressed by adjusting the hyperparameter $\lambda$ in (4) to reduce the weight of the adversarial loss.
C. Outlook and Future Work
In the future, we will continue to use the proposed RDSTB to build a large-scale model to further improve the super-resolution performance of SWCGAN. We believe that the proposed RDSTB can form a deeper and larger network to obtain better super-resolution performance due to the dense connectivity and residual structure. Furthermore, as a feature extraction block of images, the proposed RDSTB will be applied to remote sensing image processing tasks, including classification, recognition, and semantic segmentation. Finally, for the problem of sparse remote sensing image data, we will consider training deep-learning-based models using self-supervised [42] or unsupervised methods [43], [44] in the future.
On the other hand, the superior performance of pure swin-transformer-based models in the field of computational vision has been proven. Thus, we will develop a GAN-based super-resolution network composed of pure swin transformer blocks.
Conclusion
In this article, we proposed a GAN by combining the advantages of the swin transformer and convolutional layers for super-resolution reconstruction, i.e., SWCGAN. The essential idea behind the proposed method is to generate high-resolution images by a generator network with a hybrid of convolutional and swin transformer layers and then use a pure swin transformer discriminator network for adversarial training. In this proposed SWCGAN,
we used a convolutional layer for shallow feature extraction;
we proposed the RDSTB to extract deep features of the image for upsampling to generate high-resolution images;
we used a simplified swin transformer as the discriminator for adversarial training.
To demonstrate the performance of the proposed methods, we designed experiments on the UCMerced dataset. The results indicate that the proposed SWCGAN outperforms other state-of-the-art methods on most metrics and performs best on the LPIPS metric compared with other GANs. The ablation experiments confirm the effectiveness of the STB in the proposed RDSTB. Finally, the proposed SWCGAN is applied to a real satellite remote sensing image, and the image quality is improved more than with the other models.