
Enhancing the Spatial Resolution of Sentinel-2 Images Through Super-Resolution Using Transformer-Based Deep-Learning Models



Abstract:

Satellite imagery plays a pivotal role in environmental monitoring, urban planning, and national security. However, spatial resolution limitations of current satellite sensors restrict the clarity and usability of captured images. This study introduces a novel transformer-based deep-learning model to enhance the spatial resolution of Sentinel-2 images. The proposed architecture leverages multihead attention and integrated spatial and channel attention mechanisms to effectively extract and reconstruct fine details from low-resolution inputs. The model's performance was evaluated on the Sentinel-2 dataset, along with benchmark datasets (AID and UC-Merced), and compared against state-of-the-art methods, including ResNet, Swin Transformer, and ViT. Experimental results demonstrate superior performance, achieving a peak signal-to-noise ratio (PSNR) of 33.52 dB, a structural similarity index (SSIM) of 0.862, and a signal-to-reconstruction error ratio (SRE) of 36.7 dB on the Sentinel-2 RGB bands. Across all three datasets, the proposed method outperforms these state-of-the-art approaches in terms of PSNR, SSIM, and SRE, highlighting its effectiveness in revealing finer spatial details and improving image quality for practical remote sensing applications.
Topic: Recent Advances in Remote Sensing Image Super-Resolution for Earth Observation
Page(s): 4805 - 4820
Date of Publication: 06 January 2025

License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). IEEE is not the copyright holder of this material.
SECTION I.

Introduction

Satellite imagery has become an essential tool in various fields, including environmental monitoring, urban planning, agriculture, and national security. However, the quality of satellite images is often affected by the spatial resolution limitations of sensors, resulting in blurred representations in the recorded images [1].

In the context of satellite imagery, super-resolution refers to the process of increasing the spatial resolution of low-resolution images to reveal finer details and improve the overall image quality [2]. This process involves advanced computational techniques that use information from multiple low-resolution images or complex algorithms to produce high-resolution images. Enhancing the details and resolution of satellite images allows researchers and analysts to better observe, identify, and monitor elements, such as land cover changes, vegetation, building details, and roads, which aids in urban planning, improved disaster management, and infrastructure development projects [3]. The improved spatial resolution achieved through super-resolution methods can significantly enhance applications, such as disaster management, where clearer satellite imagery facilitates damage assessment, emergency response, and resource allocation during natural disasters, such as floods, earthquakes, and wildfires [4].

Nowadays, satellite images are widely used in applications, such as maritime monitoring, agricultural land monitoring, and urban planning. However, even with the highest available resolution, analysts often face challenges in interpretation, and higher resolution images with more details greatly aid in improving the image analysis process for these individuals [5], [6]. Training neural networks for higher accuracy requires precise input data. Traditional interpolation methods, such as linear and bicubic interpolation, often lead to pixelation and information loss [7]. In contrast, deep-learning-based super-resolution techniques, such as convolutional neural networks (CNNs), enhance image quality by learning spatial features and reducing information loss [8]. These enhanced images serve as high-quality inputs, improving the performance of tasks, such as object detection and image segmentation [9], [10].

The objective of this study is to achieve superior visual quality and enhanced resolution of satellite images by leveraging deep-learning models based on transformers. This research aims to explore how transformers, as opposed to CNN-based models, can reveal more detailed spatial information in satellite imagery [11], [12]. Current super-resolution techniques, particularly CNNs, perform well in extracting local features but are limited in capturing long-range dependencies. This shortcoming is especially pronounced in Sentinel-2 satellite imagery, where multispectral data and varying spatial resolutions complicate the recovery of fine spatial details. Traditional interpolation methods, such as linear or bicubic interpolation, exacerbate this issue by introducing pixelation and losing high-frequency information, which reduces their effectiveness for complex remote sensing tasks [13], [14].

The proposed innovations include employing a multihead cross-attention layer instead of the traditional self-attention to extract more comprehensive features and improve the model's learning capabilities. In addition, the model integrates a spatial attention block to capture spatial characteristics and a channel attention block in both encoder and decoder sections of the network architecture to enhance color channel feature comprehension. For data analysis, metrics, such as peak signal-to-noise ratio (PSNR) [12], structural similarity index (SSIM) [15], and signal-to-reconstruction error ratio (SRE) [16], will be used to evaluate the performance of the super-resolution model. The study relies on Sentinel-2 satellite images acquired from Google Earth Engine, with variables, including the number of training data samples and the number of spectral bands in the input images. This approach not only seeks to improve satellite image quality but also provides enhanced input data for neural networks, which can subsequently improve the accuracy in applications, such as object detection and image segmentation.

The novelty of this study lies in the introduction of a transformer-based super-resolution architecture that incorporates a multihead cross-attention mechanism, diverging from the traditional self-attention paradigm used in the existing transformer models. This innovation enables the network to extract more comprehensive spatial features by considering relationships across multiple input bands [17]. In addition, the integration of spatial attention and channel attention blocks further enhances the network's ability to focus on fine spatial details and spectral characteristics during both the encoding and decoding stages. Unlike ResNet and other CNN-based models that often struggle with capturing long-range dependencies, the proposed method achieves a significant improvement in spatial feature reconstruction. The model's effectiveness is validated on Sentinel-2, AID, and UC-Merced datasets, demonstrating superior performance in PSNR, SSIM, and SRE metrics [18]. The primary objective of this study is to propose and evaluate a transformer-based deep-learning model for enhancing the spatial resolution of Sentinel-2 images. We hypothesize that the integration of multihead cross attention, combined with spatial and channel attention mechanisms, will improve super-resolution performance compared with state-of-the-art CNN-based methods. This approach is expected to address challenges related to long-range dependency modeling and multispectral feature integration.

The rest of this article is organized as follows. Section II covers the theoretical foundations of the research, discussing super-resolution, image processing, artificial neural networks, machine learning, deep learning, and related studies on applying super-resolution with neural networks to satellite images. Section III examines materials, such as image data, methods for data preparation, the structure and blocks used in this study, and the evaluation methodology. Section IV focuses on the results, network training, and evaluation of the results. Finally, Section V concludes this article.

SECTION II.

Related Works—State-of-the-Art

In this section, the theoretical foundations of the research on super-resolution and the relevant machine learning and deep-learning methods are first examined. Previous studies on applying super-resolution to Sentinel-2 images using deep learning are then reviewed.

Super-resolution is a powerful technique and tool that goes beyond simple interpolation methods. It enhances image resolution in various fields, including satellite imagery, medical imaging, surveillance systems, and digital photography [19]. Increasing the level of detail and resolution in these types of images provides a better understanding of details and facilitates decision-making processes through more accurate image analysis. High-resolution and better quality images are generated using various methods, such as interpolation and deep-learning-based approaches with low-resolution inputs. These methods use statistical features and contextual information to estimate missing details and improve image quality [3].

Super-resolution can be performed in two ways: single-image super-resolution and multi-image super-resolution. The first method uses a single image to produce a higher quality and higher resolution image, whereas the second method uses multiple images of the same area or object as inputs to achieve this. The latter method is commonly used to improve the quality of low-resolution videos or for video processing [20].

Various algorithms have been implemented to achieve this, each with its own advantages and disadvantages. Interpolation algorithms, such as linear or bicubic interpolation, are among the most commonly used methods but can lead to pixelation and loss of high-frequency information [21], [22]. In recent years, the use of machine learning models, especially deep-learning models, such as CNNs and generative adversarial networks (GANs), has significantly increased in this field. These networks reconstruct higher quality images by learning the image space and its features [7].

Yang et al. [23] introduced a model called PanNet, based on deep neural networks, for super-resolution using pan-sharpening methods. In this research, CNNs and the ResNet network were used to train the model. In addition to passing through the ResNet block, the network combines this block's output with the input passed through the upsample block to retain information and prevent data loss, ultimately producing the output image. This network was trained on all bands of WorldView-3 data. The SAM and ERGAS evaluation metrics in this study showed the lowest values and best performance compared with other networks, such as ImageNet and ResNet alone. However, the network's performance on the test dataset declined more than that of the ImageNet network.

Lanaras et al. [3] proposed a deep-learning model using the ResNet network for super-resolution on bands with 60- and 20-m resolution, based on the data from the 10-m bands. In this study, two CNNs were used to perform super-resolution on 60-m to 10-m and 20-m to 10-m resolution bands, respectively. Each of the networks includes a large number of residual blocks, each consisting of two CNNs and a ReLU activation function in the residual part. The output of this block includes the input and the scaled output from the second CNN. The trained network demonstrated acceptable and superior performance compared with other networks in terms of RMSE and SRE evaluation metrics.

Zhu et al. [24] proposed a deep-learning-based model called DCARN for super-resolution. The network architecture is inspired by the study conducted by Lanaras et al., and data from 60-m resolution bands were excluded. The model leverages a channel attention mechanism and uses ResNet as the backbone to extract image features. The channel attention mechanism used in this study includes a CNN layer along with the summation of previous input pixels, followed by passing through a sigmoid function as the activation function. Initially, the 10-m and 20-m resolution bands are placed in two separate matrices. Then, upsampling is performed on the 20-m resolution matrix, and after combining it with the 10-m resolution matrix data, it is fed as input to the network. BatchNorm blocks were not used in the residual blocks of this network due to the reduction in pixel reconstruction accuracy. This model was also trained with other models that had slight variations from the proposed model. However, the proposed model showed better performance in terms of PSNR and SSIM metrics compared with the other models, and a 20% reduction in PSNR compared with the work done by Lanaras et al.

In another study, Galar et al. [7] proposed a supervised model for super-resolution using a CNN. Sentinel-2 images in RGB format from a specific area were fed as input to the network, which began learning by comparing with images of the same area taken by the PlanetScope satellite. In the initial layers, the network was inspired by the ResNet architecture and trained using shortcut blocks. A pixel shuffle block was added after the activation layer, utilizing the style loss function to learn the pixel distribution. In addition, for better network learning, images were filtered using a blur filter after passing through the pixel shuffle block. The authors compared the network output with the result of bicubic interpolation, achieving up to a 1.2% improvement in the PSNR metric. To improve image quality at resolutions of 5 and 2.5 m, pixel shuffle layers were used.

Kawulok et al. [20] utilized multiple images of shared scenes from a specific area to gather information and perform super-resolution. To use multiple images, the authors needed images taken from a specific point under nearly identical conditions. In this study, multiple CNN layers were each used to learn the distribution of one image, with the input images passed through a bicubic interpolation block before being fed into the CNNs. Finally, another CNN was used to learn and adjust its weights based on the layers in the previous networks, thereby completing its learning process. The output images of this study were provided in both multiband and RGB formats.

In previous research, CNNs or pretrained networks, such as ResNet, have been used as the main block to extract image features for converting 60-m and 20-m band data to 10-m data. Methods based on pan-sharpening techniques also show acceptable performance, but with changes in the dataset, their performance decline is greater compared with deep-learning methods. Since the network's training data consist of Sentinel-2 images, the network was trained using the proposed architecture to convert 60-m and 20-m data to 10-m resolution. While previous studies have achieved notable success in super-resolution using CNN-based methods, such as ResNet, or attention-based networks, such as Swin Transformer, these approaches exhibit certain limitations. CNN-based models, such as ResNet, often struggle to capture long-range dependencies due to the local receptive fields of convolutional layers. Similarly, Swin Transformers, although effective, rely heavily on hierarchical attention mechanisms, which may lose fine spatial details in downsampling processes. In addition, many existing models lack sufficient integration of spatial and spectral information, limiting their performance on multiband satellite imagery. The proposed transformer-based model addresses these challenges by incorporating multihead cross attention to capture global dependencies effectively, while spatial and channel attention blocks ensure preservation of fine-grained spatial and spectral details during super-resolution.

SECTION III.

Material and Methods

A. Satellite Data Observation

Data collection has been conducted from the Google Earth Engine system at resolutions of 60, 20, and 10 m during the summer season from the Sentinel-2 satellite. Due to the large size of the image areas obtained from Google Earth Engine, 10-m resolution images were cropped to dimensions of 180 × 180 with a 120-pixel offset, 20-m resolution images to dimensions of 90 × 90 with a 60-pixel offset, and 60-m resolution images to dimensions of 30 × 30 with a 20-pixel offset from the previous window using a sliding window technique (Fig. 1).
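As an illustration of this cropping step, a minimal sketch is shown below; the array layout (bands last) and the function name are our own assumptions, not the authors' code:

```python
import numpy as np

def sliding_window_crops(image: np.ndarray, window: int, offset: int) -> list:
    """Crop fixed-size windows from an (H, W, bands) array using a fixed stride (offset)."""
    patches = []
    height, width = image.shape[:2]
    for top in range(0, height - window + 1, offset):
        for left in range(0, width - window + 1, offset):
            patches.append(image[top:top + window, left:left + window])
    return patches

# Window/offset pairs from the text: 180/120 for 10-m, 90/60 for 20-m, and 30/20 for 60-m bands,
# e.g., patches_10m = sliding_window_crops(scene_10m, window=180, offset=120)
```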

For validation of the trained data, a Gaussian filter was applied before downscaling the 10-m data to 20 m, the 20-m data to 40 m, and the 20-m data to 120 m. The downscaling was performed using the block_reduce function from the scikit-image library [25]. In this downsampling process, the Gaussian filter smooths the 10-m and 20-m resolution images before reducing their sizes; its parameters were set to a kernel size of 3 × 3 and a standard deviation σ = 1.0. These values were chosen to balance noise reduction and preservation of spatial details, ensuring a fair comparison between the original and downsampled images. The reason for downscaling from 20 to 40 m or from 10 to 20 m is to enable comparison and error calculation of the model's performance: the 10-m product contains four bands while the 20-m product contains six, so directly comparing bands across the two products would not yield an accurate error estimate. In addition to increasing the number of training samples with the sliding window technique, gamma correction was used to further expand the dataset and improve network reliability. This method increases or decreases image brightness by adjusting γ in the following equation [26], [27]: \begin{equation*} I^{\prime} = 255 \cdot {\left( \frac{I}{255} \right)}^{\frac{1}{\gamma}}. \tag{1} \end{equation*}

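The downscaling described above (Gaussian smoothing followed by block averaging with block_reduce from scikit-image) can be sketched as follows; the choice of np.mean as the reduction function is an assumption on our part:

```python
import numpy as np
from skimage.filters import gaussian
from skimage.measure import block_reduce

def downscale_band(band: np.ndarray, factor: int) -> np.ndarray:
    """Smooth a single band with a Gaussian filter (sigma = 1.0), then shrink it by
    averaging non-overlapping factor x factor blocks."""
    smoothed = gaussian(band.astype(np.float64), sigma=1.0)
    return block_reduce(smoothed, block_size=(factor, factor), func=np.mean)

# e.g., simulate a 40-m version of a 20-m band for validation:
# band_40m = downscale_band(band_20m, factor=2)
```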

In this study, 10-m resolution images with an average brightness per channel between 128 and 142 or between 200 and 210 were used to generate higher and lower brightness versions of the same scene using gamma values of 2 and 0.6, respectively. These images were used separately in the main model for converting 60-m data to 10 m and 20-m data to 10 m. For data preprocessing, gamma correction was applied to simulate varying image brightness levels, with γ = 2.0 producing brighter images and γ = 0.6 producing darker ones. These values were selected to simulate varying lighting conditions in the dataset: a γ value of 0.6 darkens the image to simulate underexposure, while a γ value of 2.0 brightens the overall image to simulate overexposure. This augmentation strategy improves the model's robustness to variations in illumination, ensuring better generalization to real-world satellite imagery with diverse lighting conditions. Combined with the sliding-window cropping described above (180 × 180 pixels with a 120-pixel offset at 10 m, 90 × 90 with a 60-pixel offset at 20 m, and 30 × 30 with a 20-pixel offset at 60 m), these steps ensured uniform data preparation and enhanced the diversity of the training dataset.
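A minimal sketch of this gamma-correction augmentation, implementing (1) with the exponent 1/γ so that γ = 2.0 brightens and γ = 0.6 darkens an 8-bit patch (the function name and clipping are our assumptions):

```python
import numpy as np

def gamma_correct(patch: np.ndarray, gamma: float) -> np.ndarray:
    """Apply I' = 255 * (I / 255)^(1 / gamma) to an 8-bit image patch, as in (1)."""
    normalized = patch.astype(np.float64) / 255.0
    corrected = 255.0 * np.power(normalized, 1.0 / gamma)
    return np.clip(corrected, 0, 255).astype(np.uint8)

# brighter = gamma_correct(patch, gamma=2.0)   # simulates overexposure
# darker   = gamma_correct(patch, gamma=0.6)   # simulates underexposure
```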

B. AID Dataset

This large dataset includes 200-400 images per class across 30 scene classes and has been collected and preprocessed for use in scene classification and semantic segmentation tasks. The preprocessing conducted to improve quality was compared with the original aerial images, and the authors noted the absence of any differences, even at the pixel level, between the preprocessed images and the aerial images [28]. All images in this dataset are 600 × 600 pixels in size. To use it for training the network implemented in this study, the images were downscaled to 300 × 300 pixels using the same method applied to the Sentinel-2 dataset. The performance variation across classes in the AID dataset reflects differences in scene complexity: uniform classes, such as Desert, achieve higher PSNR scores, while complex scenes, such as sparse residential, exhibit lower scores due to the intricate spatial details that require accurate reconstruction (Fig. 2).

C. UC-Merced Dataset

The UC-Merced dataset includes 100 images per class across 21 scene classes, each image with dimensions of 256 × 256 pixels and a resolution of 30 cm. It has been collected and preprocessed for scene classification and semantic segmentation tasks, comprising a total of 2100 images [29]. The resizing process applied to the AID dataset was also applied here for network training, producing 128 × 128 pixel images using the downscaling method applied to the Sentinel-2 dataset (Fig. 3).

D. Proposed Network Architecture for Super-Resolution

According to Fig. 4, the low-quality image is provided as input to the network, and feature extraction begins in three stages using the encoder block, whose architecture is shown in Figs. 5 and 6. This process starts by extracting spatial and channel-related features in smaller dimensions of the input image. After each encoder block, a multilayer perceptron (MLP) is used for dimensionality reduction, thereby narrowing the network's focus and enhancing attention to details. This idea is inspired by the Swin Transformer architecture [30].

Fig. 1. Amir Kabir Dam at different resolutions.

Fig. 2. Sample images from the AID dataset.

Fig. 3. Sample images from the UC-Merced dataset.

Fig. 4. Illustration of the network architecture, showing the multistage feature extraction process with encoder blocks, MLPs for dimensionality adjustments, and an MHA module. The decoder blocks reconstruct the high-resolution image using the processed features.

Fig. 5. Detailed architecture of the encoder block, featuring spatial and channel attention modules along with convolutional and normalization layers. This design enables the effective extraction of spatial and color distribution features, optimizing feature representation for downstream tasks.

Fig. 6. Decoder block architecture focusing on channel attention. This design utilizes convolutional layers to generate QKV values and applies channel attention to emphasize channel-specific features, omitting spatial attention to prevent overfitting during the reconstruction of spatial elements, such as water surfaces and angles.

The results from the second and third blocks are upsampled through an MLP before being fed into the multihead attention (MHA) block. They are then layered with results from other blocks, where feature extraction begins via the MHA block. Since each output from the encoder blocks is processed independently and simultaneously through the MHA block, the number of outputs from this block is equal to the number of inputs. The first and second inputs of the decoder block on the right are initially normalized through a layer normalization block and then downsampled using an MLP. Subsequently, the network starts reconstructing a higher quality image.
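The exact wiring of the MHA block follows Figs. 4-6; as a rough PyTorch sketch of the cross-attention idea (queries taken from one feature stream, keys and values from another), assuming the encoder outputs have been flattened to token sequences of shape (batch, tokens, channels):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal multihead cross-attention sketch: queries come from one encoder stage,
    keys/values from another, so features from different stages can interact."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(queries, context, context)
        return self.norm(queries + attended)  # residual connection followed by normalization

# fused = CrossAttentionFusion(dim=64)(tokens_stage3, tokens_stage2)
```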

The network was trained separately twice using the Adam and SGD optimizers, along with the OneCycleLR scheduler. The OneCycleLR scheduler was adopted to optimize the learning rate dynamically during training. This scheduler gradually increases the learning rate to a maximum value midtraining and then decreases it to a minimal value toward the end, promoting faster convergence and improved generalization. Compared with alternative schedulers, such as StepLR and cosine annealing, the OneCycleLR approach demonstrated superior performance. Specifically, it reduced the convergence time to 175 epochs while achieving a PSNR improvement of 0.6 dB and an SSIM increase of 0.009 over StepLR. These results confirm the effectiveness of OneCycleLR in achieving optimal weight updates, avoiding local minima, and enhancing the model's performance.
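A minimal training-loop sketch with Adam and OneCycleLR is given below; the placeholder model, data, learning rates, and epoch count are illustrative assumptions, not the paper's settings (which use 200 epochs and batches of 48 on Sentinel-2):

```python
import torch
import torch.nn as nn

# Placeholder network and data standing in for the proposed model and Sentinel-2 patches.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
batches = [(torch.rand(4, 3, 48, 48), torch.rand(4, 3, 48, 48)) for _ in range(10)]
epochs = 5

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=len(batches)
)

for epoch in range(epochs):
    for low_res, high_res in batches:
        optimizer.zero_grad()
        loss = nn.functional.l1_loss(model(low_res), high_res)  # L1 reconstruction loss, cf. (3)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR raises then lowers the learning rate across training
```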

The Adam optimizer had a more positive effect on network training and improved the accuracy of the trained model. The Adam algorithm is an optimization algorithm that can be used as an alternative to the classic stochastic gradient descent (SGD) method for updating network weights based on iterations over the training data. Adam can be considered a combination of RMSprop and SGD with momentum. The Adam optimizer has several notable advantages, including easy implementation, lower computational cost compared with gradient descent, independence from diagonal rescaling of the gradients, low memory usage, and intuitive interpretation of hyperparameters. The Adam optimization algorithm incorporates the benefits of both the AdaGrad and RMSProp algorithms. In Adam, the parameter learning rates are adjusted not only based on the first moment (mean) but also the second moment (variance) of the gradients. Overall, the Adam optimization algorithm performs well in practice and shows favorable results compared with other stochastic optimization methods [31].

On the other hand, the SGD algorithm has been introduced to address the computational complexity present in each iteration of gradient descent for large-scale data. The update rule for this method is given as follows [32], [33]: \begin{equation*} \theta = \theta - \eta \cdot \nabla_{\theta} J\left( \theta; x, y \right). \tag{2} \end{equation*}


Computing the gradient of the loss with respect to the parameters and recursively propagating it backward to update them is known as backpropagation. In SGD, a single sample is used to update θ rather than computing the gradient over the full dataset. This approach provides an unbiased estimate of the true gradient and removes a certain amount of redundancy.

The encoder layer attempts to extract features based on the rotation angle of objects, their position in the image, and their scaling ratio using the spatial attention block, and to understand the color distribution of each channel using the channel attention block. The encoder–decoder architecture was specifically designed to address challenges unique to Sentinel-2 images, such as varying spatial resolutions and subtle spectral variations across bands. The encoder employs both spatial attention and channel attention mechanisms to extract critical spatial structures (e.g., edges and patterns) and spectral features across multiple bands, ensuring efficient feature representation. By capturing both local and global dependencies, the encoder compensates for the resolution disparities inherent to Sentinel-2 imagery. The decoder, on the other hand, focuses on reconstructing high-resolution images by enhancing channel-specific details through channel attention, which emphasizes the spectral importance of each band while avoiding redundancy. This architecture ensures that fine spatial features and spectral relationships are preserved, improving the super-resolution performance of Sentinel-2 images.
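The paper's exact attention blocks are specified in Figs. 5 and 6; the sketch below shows one common way such channel and spatial attention modules can be realized in PyTorch (a CBAM-style formulation, which is our assumption rather than the authors' exact design):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Re-weights each band/feature channel using its global average response."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.mlp(x.mean(dim=(2, 3)))           # (B, C) global channel descriptors
        return x * weights.unsqueeze(-1).unsqueeze(-1)   # broadcast the weights over H and W

class SpatialAttention(nn.Module):
    """Highlights informative spatial locations (edges, patterns) with a single-channel mask."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))
```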

A convolutional layer is used to generate the query, key, and value (QKV) tensors, and the input is combined with the outputs from the spatial attention and channel attention blocks. Since these data are not normalized and may not lie between 0 and 1, a layer normalization block is applied again, and its output is added to the previous output. The decoder block functions similarly to the encoder block but uses only the channel attention block to emphasize features on a per-channel basis. The spatial attention block is omitted here to prevent model overfitting during the reconstruction of elements, such as water surfaces and their angles. An ablation study comparing the inclusion and exclusion of spatial attention in the decoder revealed that the PSNR and SSIM remained nearly unchanged, with a slight computational overhead when spatial attention was added. This indicates that the channel attention block alone is sufficient for reconstructing channel-specific details in the decoding process.

The loss function used for network training is the L1 loss, calculated according to (3) between the reconstructed (super-resolved) image and the high-resolution reference. This loss exposes the error in pixel reconstruction, allowing the network to gain a better understanding of the brightness of image pixels [34], [35] \begin{equation*} \text{Loss} = \frac{1}{N} \sum_{i = 1}^{N} \left\| I_{\text{HR}}^{(i)} - I_{\text{SR}}^{(i)} \right\|_1. \tag{3} \end{equation*}

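A direct NumPy reading of (3), averaging the per-image sum of absolute pixel errors over a batch of N image pairs (the array shapes are assumed):

```python
import numpy as np

def l1_reconstruction_loss(high_res: np.ndarray, super_res: np.ndarray) -> float:
    """Loss of (3): mean over the batch of the per-image sum of absolute pixel errors.
    Both arrays are assumed to have shape (N, H, W, bands)."""
    per_image = np.abs(high_res - super_res).reshape(high_res.shape[0], -1).sum(axis=1)
    return float(per_image.mean())
```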

E. Proposed Evaluation Method

In this study, to evaluate the performance of the network in super-resolution, the PSNR, SSIM, and SRE metrics have been used. Each of these metrics is defined in the following equations [36], [37], [38]: \begin{align*}\text{MSE} &= \frac{1}{mn} \sum_{i = 0}^{m - 1} \sum_{j = 0}^{n - 1} \left( X_{ij} - X^{\prime}_{ij} \right)^{2} \tag{4}\\ \text{PSNR} &= 20 \cdot \log_{10}\left( \frac{255}{\sqrt{\text{MSE}}} \right). \tag{5} \end{align*}


The SSIM is another method for measuring image similarity, calculated using (6). This metric evaluates the structural content of images better than methods, such as MSE and PSNR, which focus on pixel-by-pixel comparison [36], [37] \begin{align*} \text{SSIM}\left( I_{1}, I_{2} \right) =& \left( \frac{2\mu_{1}\mu_{2} + C_{1}}{\mu_{1}^{2} + \mu_{2}^{2} + C_{1}} \right) \cdot \left( \frac{2\sigma_{1}\sigma_{2} + C_{2}}{\sigma_{1}^{2} + \sigma_{2}^{2} + C_{2}} \right) \\ &\cdot \left( \frac{\sigma_{12} + C_{3}}{\sigma_{1}\sigma_{2} + C_{3}} \right). \tag{6} \end{align*}


This metric calculates the similarity between two images using their means, standard deviations, and cross correlation. The SRE measures the error relative to the signal strength and is used to quantify the network's error in capturing brightness during image reconstruction. Like PSNR, this metric is expressed in decibels and is calculated using the following equation [39]: \begin{equation*} \text{SRE} = 10 \cdot \log_{10}\left( \frac{n \cdot \mu_{x}^{2}}{\left\| \hat{x} - x \right\|^{2}} \right). \tag{7} \end{equation*}

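For reference, (5) and (7) can be computed directly, and SSIM is available in scikit-image; the sketch below assumes single-band 8-bit arrays:

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference: np.ndarray, reconstructed: np.ndarray) -> float:
    """PSNR of (4)-(5) for 8-bit data (peak value 255), in decibels."""
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return 20.0 * np.log10(255.0 / np.sqrt(mse))

def sre(reference: np.ndarray, reconstructed: np.ndarray) -> float:
    """SRE of (7): error relative to the mean signal strength, in decibels."""
    n = reference.size
    error = np.sum((reconstructed.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return 10.0 * np.log10(n * np.mean(reference.astype(np.float64)) ** 2 / error)

def ssim(reference: np.ndarray, reconstructed: np.ndarray) -> float:
    """SSIM of (6), delegated to scikit-image."""
    return structural_similarity(reference, reconstructed, data_range=255)
```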

SECTION IV.

Results and Discussion

This section discusses the results obtained from training the proposed network and evaluates the performance of the trained model on the collected Sentinel-2 dataset and the AID and UC-Merced datasets. The training, validation, and test data were divided into 75%, 10%, and 15% of each dataset, respectively. This split was chosen based on standard practices in deep learning to balance model optimization and unbiased performance evaluation. To minimize potential biases, we ensured that the data across all splits were randomly sampled while maintaining a representative distribution of spatial features and spectral diversity. This approach reduces the risk of overfitting and ensures that the validation and test sets reflect the real-world performance of the model. Any potential bias from spatial redundancy in remote sensing images is mitigated by strict nonoverlapping cropping of image patches across splits.

The proposed model was trained separately for 200 epochs on the Sentinel-2 dataset images and for 100 epochs on the AID and UC-Merced datasets. The training was conducted on an Ubuntu operating system with an AMD Ryzen 7 6800H 3.20-GHz processor, 16 GB of RAM, and an Nvidia RTX 3070 GPU.

The superior performance of the proposed model can be attributed to the MHA mechanism, which effectively captures both local and global dependencies across spatial and spectral dimensions. In datasets, such as Sentinel-2, where bands, such as B8a, exhibit low contrast and subtle spatial features, MHA allows the model to focus on critical regions and integrate information across multiple bands, leading to improved reconstruction of fine details. On high-resolution datasets, such as AID and UC-Merced, this mechanism enables the model to maintain sharp edges and textures by attending to long-range spatial dependencies. Compared with conventional CNN-based models, which are limited by local receptive fields, the MHA ensures that spatial features across large areas are better represented, thereby enhancing the PSNR and SSIM metrics across all datasets.

As noted in Section III, for fair validation and calculation of the PSNR, SSIM, and SRE metrics, the 20-m data were downscaled to 40 m so that the network's 40-m to 20-m reconstruction could be compared against the original 20-m bands, serving as a proxy for the 20-m to 10-m conversion; likewise, the 60-m data were downscaled to 360 m to assess the 360-m to 60-m reconstruction as a proxy for the 60-m to 10-m conversion. Table I lists the number of images used for training, validation, and testing of the trained model.

An ablation analysis was performed to examine the effect of different batch sizes (32, 48, and 64) on model performance. A batch size of 48 was found to offer the best tradeoff between model accuracy and computational efficiency. Reducing the batch size to 32 resulted in slightly better convergence but increased training time, while increasing the batch size to 64 led to faster training but caused a minor degradation in performance due to less frequent weight updates. Based on these results, we selected 48 as the optimal batch size, balancing performance and training efficiency. The choice of three encoder blocks was likewise made to balance performance and computational efficiency: increasing the number of encoder blocks beyond three led to marginal improvements while significantly increasing computational costs, whereas reducing the number of blocks to two caused a noticeable decline in the model's ability to extract and reconstruct spatial details. Therefore, after careful experimentation, three encoder blocks were selected as the optimal configuration to ensure robust feature extraction without excessive computational overhead.
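A simple way to realize this 75/10/15 split over the cropped patches (a sketch; the seed and index-based bookkeeping are our assumptions):

```python
import random

def split_indices(num_patches: int, seed: int = 0):
    """Randomly assign patch indices to train/validation/test with a 75/10/15 split."""
    indices = list(range(num_patches))
    random.Random(seed).shuffle(indices)
    n_train = int(0.75 * num_patches)
    n_val = int(0.10 * num_patches)
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]

# train_idx, val_idx, test_idx = split_indices(len(all_patches))
```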

TABLE I Data Frequency of Training, Validation, and Test Stages in the Sentinel-2 Dataset

During network training, the loss function value is calculated for each batch of 48 samples, and these values are summed at the end of each training epoch. Fig. 7 shows the training and evaluation loss values of the model. Based on the decreasing trend of the network's loss function, the best weights were achieved at the 161st training epoch.

Fig. 7. Results of the loss function during network training.

The PSNR, SSIM, and SRE metrics were also calculated during and after training, and their average values for each band are given in Tables II and III, respectively. The metric values calculated during training are shown in Figs. 8–10.

TABLE II Calculated Metrics for Transforming 40-m Data to 20-m Resolution for Each Band
TABLE III Calculated Metrics for Transforming 360-m Data to 60-m Resolution for Each Band
Fig. 8. PSNR metric results during network training.

Fig. 9. SSIM metric results during network training.

Fig. 10. SRE metric results during network training.

The jumps in Figs. 8–10 are due to the OneCycleLR scheduler, which adjusts the learning rate during network training to prevent vanishing gradients.

In addition to the proposed network, the Swin and ViT transformer models, the ResNet network, and the bicubic algorithm were also trained to compare the performance of the networks for each band in transforming 20-m images to 10-m and 60-m images to 10-m resolutions. The super-resolved versions of these images for the Malard region are shown in Figs. 11 and 12, respectively.

Fig. 11. Comparison of the proposed network's performance with other networks for transforming 20-m images to 10-m resolution for each band.

Fig. 12. Comparison of the proposed network's performance with other methods for transforming 60-m images to 10-m resolution for each band.

Fig. 13. Comparison of the proposed network's performance with other methods for transforming 20-m images to 10-m resolution for the B5 band.

Fig. 14. Comparison of the proposed network's performance with other methods for transforming 20-m images to 10-m resolution for the B5 band.

According to Fig. 12 and the comparison with the bicubic method, it can be seen that the bicubic algorithm performs super-resolution based only on local features, whereas the learned networks that convert images to 10-m resolution also take into account the mean and standard deviation of the pixel distribution at 10-m resolution during learning. Although this also holds for the 20-m to 10-m images in Fig. 11, the color changes and per-band distribution shifts are greater for the 60-m to 10-m images.

Since comparing the super-resolved 20-m to 10-m bands with the 10-m bands is not a fair and accurate comparison, we use the transformation of 40-m images to 20-m resolution to evaluate the network's performance.

The network has shown less accuracy in learning the intensity values for the B8a band compared with other bands, which is why the SRE and PSNR metrics have lower values for this band. It can be said that, based on all three calculated metrics, the network has achieved better learning for the B12 band at 20-m resolution compared with the other bands.

According to Fig. 13, the network demonstrates good learning capability for super-resolution of the B5 band compared with the other networks. The network's performance on RGB-band images is compared later in this section. According to the table, the network learns the features and details of the B1 band better than those of the B9 band when converting from 60-m to 10-m resolution, providing more detail for this band in the reconstructed image. The lower performance observed for certain bands, such as B8a, can be attributed to their spectral characteristics and the nature of the captured information. Band B8a, a near-infrared band, often exhibits lower spatial detail and contrast due to its sensitivity to vegetation and subtle surface variations, which are harder to reconstruct compared with high-contrast bands, such as RGB. In addition, the limited spatial resolution of the input data for this band increases the difficulty for the model to learn and preserve finer details during the super-resolution process. To address this limitation, future work could involve incorporating multisensor data fusion or spectral attention mechanisms to improve learning performance for low-contrast bands (Fig. 14).

The SSIM score of 0.7841 indicates strong preservation of structural details, which is critical for tasks, such as land cover classification, urban planning, and disaster management. Minor discrepancies are mainly observed in low-contrast regions, which can be addressed in future work. The figure below shows the super-resolved B5 and B9 bands for transforming 20-m images to 10-m and 60-m images to 10-m resolution, respectively.

Finally, the average PSNR, SSIM, and SRE metrics for each network for the 40-m to 20-m and 360-m to 60-m cases can be seen in Tables IV and V.

TABLE IV Calculated Metrics for Transforming 40-m Data to 20-m Resolution for Different Models
TABLE V Calculated Metrics for Transforming 360-m Data to 60-m Resolution for Different Models

According to Tables IV and V, the overall performance of the proposed network for the calculated metrics is better than other networks, such as the Swin and ViT networks. Only in the SRE metric in Table IV does the Swin network perform slightly better than the proposed network, with a minimal difference.

The model integrates MHA with spatial and channel attention mechanisms to optimize spatial and spectral feature extraction. Evaluation on the Sentinel-2, AID, and UC-Merced datasets demonstrates superior performance, achieving a PSNR of 33.52 dB and an SRE of 36.7 dB for Sentinel-2 images. Compared with networks, such as ResNet, Swin Transformer, and ViT, the proposed approach consistently reconstructs finer spatial details, making it a promising tool for remote sensing super-resolution tasks.

Section III mentioned that, for fair validation, the 10-m data are downscaled to 20 m so that the network can learn image details from the converted 20-m data, focusing on the RGB bands. The number of images used matches Table I, with the same images. During network training, the cost function value is calculated for each batch of 64 samples and summed at the end of each training epoch; the training and evaluation cost function values are shown in Fig. 15. According to the decreasing trend of the network's cost function, the optimal weights occurred at the 165th training epoch. In addition, this figure indicates that no overfitting occurred during model training.

Fig. 15. Results of the cost function during network training.

The model was initially trained for 300 epochs, but after 230 epochs, no change occurred in the training and evaluation cost values. During training, the PSNR, SSIM, and SRE metrics were also calculated; their values are shown in Figs. 16, 18, and 20, respectively. Each calculated metric represents the average value per channel of the input image data.

Fig. 16. PSNR metric results during network training.

As mentioned in Section III, the PSNR metric is used to measure the ratio of the maximum power of the input data to its noise. Fig. 16 shows the calculated values for the PSNR metric at each stage during training, and according to Fig. 17, the performance of the proposed model for the PSNR metric is better than other models.

Fig. 17. Comparison of the PSNR metric of the proposed network with other models.

Fig. 18 shows the calculated values for the SSIM metric of the proposed network at each training stage, and Fig. 19 shows the calculated values of this metric for evaluating different trained models.

Fig. 18. SSIM metric results during network training.

Fig. 19. Comparison of the SSIM metric of the proposed network with other models.

According to Fig. 19, the calculated value of this metric for the proposed network is negligibly lower than the results obtained from the Swin transformers network.

The SRE metric was also calculated for the proposed network at each stage during training, and the calculated values are shown in Fig. 20. According to Fig. 21, the performance of the proposed model for the SRE metric is better than that of the other models.

Fig. 20. SRE metric results during network training.

Fig. 21. Comparison of the SRE metric of the proposed network with other models.

The reason for the jumps in the graphs of Figs. 16, 18, and 20 is the use of the OneCycleLR scheduler to adjust the learning rate during network training to prevent gradient vanishing.

To evaluate and compare the spatial resolution capability of the proposed network relative to the ResNet architecture on the collected dataset, the PSNR and SSIM evaluation metrics were recalculated on 64 × 64 squares. The average differences between the super-resolved images generated by the proposed network and by ResNet are 4.37 dB for PSNR and 0.163 for SSIM.
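This per-tile evaluation can be sketched as follows, averaging PSNR and SSIM over non-overlapping 64 × 64 squares of a single band (scikit-image metric functions are used for brevity):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def tiled_metrics(reference: np.ndarray, reconstructed: np.ndarray, tile: int = 64):
    """Average PSNR/SSIM over non-overlapping tile x tile squares of a single-band image."""
    psnr_scores, ssim_scores = [], []
    height, width = reference.shape
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            ref = reference[top:top + tile, left:left + tile]
            rec = reconstructed[top:top + tile, left:left + tile]
            psnr_scores.append(peak_signal_noise_ratio(ref, rec, data_range=255))
            ssim_scores.append(structural_similarity(ref, rec, data_range=255))
    return float(np.mean(psnr_scores)), float(np.mean(ssim_scores))
```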

The super-resolved versions of the low-quality inputs produced by the proposed network and the other networks are compared in Fig. 22. The proposed model and the Swin Transformer network pay more attention to image texture than the other models, which is also validated on the AID and UC-Merced datasets. Since it is not fair to evaluate super-resolution from 10-m to 5-m data due to the lack of 5-m reference data, the calculated metrics are based on the conversion from 20-m to 10-m data. To further evaluate the network's attention to image details, the proposed network was separately trained for 100 epochs on the AID and UC-Merced datasets, which contain high-resolution images. Since these datasets are primarily used for tasks, such as scene classification, their images are already categorized by class. The training results of the network for each class are presented in Tables VI and VII, respectively.

Fig. 22. Comparison of the performance of the proposed network with other networks and methods.

TABLE VI Calculated Metrics for Each Class of the AID Dataset
TABLE VII Calculated Metrics for Each Class of the UC-Merced Dataset

Since the AID and UC-Merced datasets do not contain low-quality data, the method used for downscaling the Sentinel-2 dataset was applied to create low-resolution data for training and testing the model. A comparison of the calculated values in Tables VI and VII shows that the trained model performed better on the AID dataset than on the UC-Merced dataset. This is attributed to the AID dataset containing between 200 and 400 images per class, compared with the UC-Merced dataset, which contains only 100 images per class.

Table VIII lists the calculated metrics for the trained model on the AID and UC-Merced datasets, and the performance of the proposed model is compared with other models for these two datasets. The values in Table VIII are the averages computed over all classes. The superior performance of the proposed model, as demonstrated through higher PSNR, SSIM, and SRE values, holds significant implications for practical applications. In disaster management, the enhanced spatial resolution of Sentinel-2 images enables clearer identification of affected areas, such as detecting damaged infrastructure, mapping flood extents, or assessing wildfire impact, thereby improving the speed and accuracy of emergency response. For urban planning, the model's ability to reconstruct fine spatial details allows for better delineation of buildings, roads, and land-use patterns, facilitating more precise infrastructure monitoring and resource allocation. These results highlight the potential of the proposed model to bridge the gap between satellite imagery's resolution limitations and the needs of critical applications requiring detailed spatial information.

TABLE VIII Calculated Metrics for Each Dataset

While the proposed transformer-based model achieves superior performance in terms of PSNR, SSIM, and SRE, it introduces certain tradeoffs. The computational cost of transformers, especially due to the self-attention mechanism, is higher compared with traditional CNN-based approaches. This can lead to increased training time and memory usage, particularly when processing larger datasets or high-resolution images. In addition, scalability to very large datasets remains a challenge, as transformers require substantial computational resources to model long-range dependencies. To address these limitations, future work will explore lightweight transformer variants and optimization techniques, such as mixed-precision training or model pruning, to improve computational efficiency without compromising performance.

The AID and UC-Merced images were also used to train the ResNet, Swin Transformer, and ViT models and to evaluate the bicubic baseline in addition to the proposed network. Generally, the Swin model and the proposed network pay more attention to image texture during super-resolution than the other two models. To evaluate the computational efficiency of the proposed model, we compared its floating-point operations (FLOPs) and training time with the Swin Transformer and ViT architectures under identical conditions. The proposed model requires significantly fewer FLOPs (2.5 G versus 3.1 G for Swin and 3.6 G for ViT) while maintaining competitive performance. Furthermore, the training time for 100 epochs was 5.5 h for the proposed model, compared with 6.8 h for the Swin Transformer and 7.4 h for ViT. These results demonstrate that our architecture offers a more computationally efficient solution without compromising accuracy, making it well suited for large-scale remote sensing applications.

To compare the spatial resolution capability of the networks trained on the AID dataset, a small region of the image was selected, and the PSNR and SSIM metrics were calculated for it. Fig. 23 shows the original, nonmagnified input image, and Fig. 24 shows the magnified region used to compare the trained models.

Fig. 23. Super-resolution results of trained models on the AID dataset.

Fig. 24. Comparison of the spatial resolution capability of trained models.

The line distribution on the road surface is blurrier in the bicubic method's output compared with the other methods. The ViT method pays more attention to image texture than the ResNet method; however, its attention and precision in implementation do not reach the outputs of the Swin network and the proposed model. In addition, the diagonal line on the road surface is only visible in the outputs of the proposed network and the Swin network. The SSIM values for the outputs of the proposed network and the Swin network are 0.9456 and 0.9327, respectively.

Finally, the models trained on the UC-Merced dataset were also evaluated for spatial resolution capability. According to Fig. 25, which is a magnified area from Fig. 26, the output from the ResNet network does not replicate the tennis court as accurately as the proposed network's output. The ResNet network also fails to capture the positioning angles of the cream-colored blocks, although it generally performs better than the bicubic method's output.

Fig. 25. Comparison of the spatial resolution capability of trained models.

Fig. 26. Super-resolution results of trained models on the UC-Merced dataset.

Since the network extracts and learns spatial features, such as edges, rotation angles, and thickness, based on the dimensions of the target object using the spatial attention block, the following paragraphs examine the performance of this block in aiding the network's learning process.

To better assess the network's attention to geometric and radiometric features, images from the AID dataset were used due to their higher quality compared with the collected dataset and the UC-Merced dataset. Applying learned geometric features to smaller objects in the image is more challenging for the network than for larger objects. The second row of Fig. 27 shows a cropped area from the top of the image. This part of the super-resolved image closely resembles the original; still, certain elements, such as noise in the image, are treated as part of the block by the network, resulting in a thicker representation in the super-resolved image compared with the original.

Fig. 27. Network attention to geometric features.

Fig. 28. Network focus on geometric characteristics.

In Fig. 27, comparing the cropped super-resolved image with the original in the second row shows that certain details, such as the empty space between two blocks, are not accurately captured by the network; these empty spaces are mistakenly filled based on surrounding values. To improve the network's accuracy in capturing such finer details, additional encoder layers could be used, although this would require more computational resources.

To evaluate whether the network becomes biased toward color features through the channel attention block, which is responsible for extracting the features and color distribution of each band, the network's attention to radiometric features is examined. Each pixel value in the images, initially in 8-bit format (allowing a range from 0 to 255), was reduced to 6 and 4 bits using the Pillow package and then provided to the network for super-resolution. Images from the AID dataset were used for this assessment due to their higher quality compared with the collected dataset and the UC-Merced dataset.
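The bit-depth reduction can be sketched as below; quantizing by dropping the least significant bits is our assumption about how the 6-bit and 4-bit versions were produced, and the filename is hypothetical:

```python
import numpy as np
from PIL import Image

def reduce_bit_depth(path: str, bits: int) -> Image.Image:
    """Quantize an 8-bit RGB image to the given bit depth (e.g., 6 or 4 bits per channel)."""
    array = np.asarray(Image.open(path).convert("RGB"))
    shift = 8 - bits
    quantized = (array >> shift) << shift   # drop the least significant bits of each channel
    return Image.fromarray(quantized.astype(np.uint8))

# low_bit = reduce_bit_depth("aid_sample.png", bits=4)   # "aid_sample.png" is a placeholder name
```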

According to Fig. 28, it is evident that the network has not become biased in learning the distribution of colors and image details, and the 4-bit and 6-bit images have been super-resolved based on the provided color distribution. This behavior is also observable in Fig. 29, which has been cropped for better analysis.

The observed variability in performance between the AID and UC-Merced datasets can be attributed to inherent differences in their image characteristics. The AID dataset comprises larger, high-resolution images with diverse and complex scenes, such as residential areas and industrial regions, which pose greater challenges for reconstructing fine spatial details. This complexity leads to a wider performance variation across classes. In contrast, the UC-Merced dataset contains smaller images with more homogeneous content, such as agricultural fields and structured urban layouts, which are easier to reconstruct with higher consistency. These differences explain why the proposed model achieves higher PSNR and SSIM scores on UC-Merced compared with AID. Future work could explore dataset-specific optimizations to further improve performance on more complex scenes. The proposed method also relies on large-scale, high-quality datasets for effective training, which may limit adoption in resource-constrained settings, and preprocessing and labeling Sentinel-2 data remain resource intensive.

Fig. 29. Comparison of network attention with radiometric features.

SECTION V.

Conclusion

Satellite images have become essential tools in various fields, including environmental monitoring, urban planning, agriculture, and national security. However, the quality of satellite images is often affected by the spatial resolution limitations of sensors, resulting in blurred representations in the captured images. Super-resolution enables the enhancement of spatial resolution in low-resolution images to reveal finer details and improve the overall image quality. The proposed model's ability to enhance Sentinel-2 satellite images has wide-reaching implications for real-world applications. For disaster management, super-resolved images can provide clearer insights into affected areas, enabling faster and more precise assessments of damage and aiding emergency response teams. In agriculture, improved resolution supports better crop monitoring and early detection of pest infestations or drought conditions. In addition, for urban planning, the model enables more accurate identification of infrastructure, roads, and land use, thereby facilitating smarter development and resource allocation. By improving the quality and usability of satellite imagery, this research paves the way for advancements in decision-making processes across various sectors reliant on high-resolution spatial data.

Machine learning and deep learning are branches of artificial intelligence that teach computers to make decisions like humans. Machine learning focuses on feature engineering and model training, whereas deep learning eliminates the need for feature engineering, allowing the model to handle this through hidden layers. In this study, a transformer-based network is used, which aims to learn and reconstruct a clearer and higher quality version of the input image using encoder and decoder blocks. The reason for using transformer models is to compare their ability to resolve spatial details in Sentinel-2 satellite images compared with other neural network architectures, such as ResNet. In this study, the transformer models showed better performance than the ResNet architecture for the Sentinel-2, AID, and UC-Merced image sets. The trained model generally outperformed other models for the calculated metrics.

The proposed model demonstrates better spatial resolution capabilities than the ResNet network, which uses only CNNs in its architecture. Due to the unavailability of 5-m resolution data, super-resolution from 10 to 5 m was approximated by training the network to perform super-resolution from 60-m and 20-m data to 10 m. Since the network cannot be evaluated for converting 10-m data to 5 m without 5-m reference data, the proposed network was also trained on the AID and UC-Merced datasets to fairly evaluate its attention to image details and its performance on the calculated metrics. The performance of the proposed network was compared with other models, such as Swin Transformer, ViT, and ResNet, for the conversion of 60-m and 20-m bands to 10 m and of the RGB bands from 20 to 10 m. For the conversion of the 20-m RGB bands to 10 m, the proposed model achieved better performance than the other models in the PSNR and SRE metrics, with values of 33.52 dB and 36.7 dB, respectively. In addition, the average evaluation metrics for converting images from 60 and 20 m to 10 m per band indicate that the proposed network outperforms the other methods.

A significant limitation in training these types of models is the lack of training data and the computational resources required for training deep-learning models. To improve the model's accuracy, instead of using 20-m to 10-m data as a proxy for 10-m to 5-m conversion, it would be beneficial to use 5-m data of the same scenes from other satellites. Using multiple images as input could also help, but this approach requires many images of the same location at 10-m resolution and substantial computational resources, since multiple inputs increase the number of parameters; generating outputs for new inputs would also require aligning those inputs with the quality of the training data. GANs can also be trained for super-resolution; however, they require a large amount of high-quality training data, as they must learn from high-quality images to generate high-quality outputs. This study highlights the practical significance of the proposed method for applications, such as disaster management and urban planning, where enhanced spatial resolution enables accurate mapping and improved decision making. Future advancements in lightweight transformers and scalable approaches could further extend its real-world applicability.
