
Enhancing the Spatial Resolution of Sentinel-2 Images Through Super-Resolution Using Transformer-Based Deep-Learning Models



Abstract:

Satellite imagery plays a pivotal role in environmental monitoring, urban planning, and national security. However, spatial resolution limitations of current satellite sensors restrict the clarity and usability of captured images. This study introduces a novel transformer-based deep-learning model to enhance the spatial resolution of Sentinel-2 images. The proposed architecture leverages multihead attention and integrated spatial and channel attention mechanisms to effectively extract and reconstruct fine details from low-resolution inputs. The model's performance was evaluated on the Sentinel-2 dataset, along with benchmark datasets (AID and UC-Merced), and compared against state-of-the-art methods, including ResNet, Swin Transformer, and ViT. Experimental results demonstrate superior performance, achieving a peak signal-to-noise ratio (PSNR) of 33.52 dB, a structural similarity index (SSIM) of 0.862, and a signal-to-reconstruction error ratio (SRE) of 36.7 dB on the Sentinel-2 RGB bands. Across all three datasets, the proposed method outperforms these state-of-the-art approaches in terms of PSNR, SSIM, and SRE, highlighting its effectiveness in revealing finer spatial details and improving image quality for practical remote sensing applications.
Topic: Recent Advances in Remote Sensing Image Super-Resolution for Earth Observation
Page(s): 4805 - 4820
Date of Publication: 06 January 2025

License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). IEEE is not the copyright holder of this material.
SECTION I.

Introduction

Satellite imagery has become an essential tool in various fields, including environmental monitoring, urban planning, agriculture, and national security. However, the quality of satellite images is often affected by the spatial resolution limitations of sensors, resulting in blurred representations in the recorded images [1].

In the context of satellite imagery, super-resolution refers to the process of increasing the spatial resolution of low-resolution images to reveal finer details and improve the overall image quality [2]. This process involves advanced computational techniques that use information from multiple low-resolution images or complex algorithms to produce high-resolution images. Enhancing the details and resolution of satellite images allows researchers and analysts to better observe, identify, and monitor elements, such as land cover changes, vegetation, building details, and roads, which aids in urban planning, improved disaster management, and infrastructure development projects [3]. The improved spatial resolution achieved through super-resolution methods can significantly enhance applications, such as disaster management, where clearer satellite imagery facilitates damage assessment, emergency response, and resource allocation during natural disasters, such as floods, earthquakes, and wildfires [4].

Nowadays, satellite images are widely used in applications, such as maritime monitoring, agricultural land monitoring, and urban planning. However, even with the highest available resolution, analysts often face challenges in interpretation, and higher resolution images with more details greatly aid in improving the image analysis process for these individuals [5], [6]. Training neural networks for higher accuracy requires precise input data. Traditional interpolation methods, such as linear and bicubic interpolation, often lead to pixelation and information loss [7]. In contrast, deep-learning-based super-resolution techniques, such as convolutional neural networks (CNNs), enhance image quality by learning spatial features and reducing information loss [8]. These enhanced images serve as high-quality inputs, improving the performance of tasks, such as object detection and image segmentation [9], [10].

The objective of this study is to achieve superior visual quality and enhanced resolution of satellite images by leveraging deep-learning models based on transformers. This research aims to explore how transformers, as opposed to CNN-based models, can reveal more detailed spatial information in satellite imagery [11], [12]. Current super-resolution techniques, particularly CNNs, perform well in extracting local features but are limited in capturing long-range dependencies. This shortcoming is especially pronounced in Sentinel-2 satellite imagery, where multispectral data and varying spatial resolutions complicate the recovery of fine spatial details. Traditional interpolation methods, such as linear or bicubic interpolation, exacerbate this issue by introducing pixelation and losing high-frequency information, which reduces their effectiveness for complex remote sensing tasks [13], [14].

The proposed innovations include employing a multihead cross-attention layer instead of the traditional self-attention to extract more comprehensive features and improve the model's learning capabilities. In addition, the model integrates a spatial attention block to capture spatial characteristics and a channel attention block in both encoder and decoder sections of the network architecture to enhance color channel feature comprehension. For data analysis, metrics, such as peak signal-to-noise ratio (PSNR) [12], structural similarity index (SSIM) [15], and signal-to-reconstruction error ratio (SRE) [16], will be used to evaluate the performance of the super-resolution model. The study relies on Sentinel-2 satellite images acquired from Google Earth Engine, with variables, including the number of training data samples and the number of spectral bands in the input images. This approach not only seeks to improve satellite image quality but also provides enhanced input data for neural networks, which can subsequently improve the accuracy in applications, such as object detection and image segmentation.

The novelty of this study lies in the introduction of a transformer-based super-resolution architecture that incorporates a multihead cross-attention mechanism, diverging from the traditional self-attention paradigm used in the existing transformer models. This innovation enables the network to extract more comprehensive spatial features by considering relationships across multiple input bands [17]. In addition, the integration of spatial attention and channel attention blocks further enhances the network's ability to focus on fine spatial details and spectral characteristics during both the encoding and decoding stages. Unlike ResNet and other CNN-based models that often struggle with capturing long-range dependencies, the proposed method achieves a significant improvement in spatial feature reconstruction. The model's effectiveness is validated on Sentinel-2, AID, and UC-Merced datasets, demonstrating superior performance in PSNR, SSIM, and SRE metrics [18]. The primary objective of this study is to propose and evaluate a transformer-based deep-learning model for enhancing the spatial resolution of Sentinel-2 images. We hypothesize that the integration of multihead cross attention, combined with spatial and channel attention mechanisms, will improve super-resolution performance compared with state-of-the-art CNN-based methods. This approach is expected to address challenges related to long-range dependency modeling and multispectral feature integration.

The rest of this article is organized as follows. Section II covers the theoretical foundations of the research, discussing super-resolution, image processing, artificial neural networks, machine learning, deep learning, and related studies on applying super-resolution with neural networks to satellite images. Section III examines materials, such as image data, methods for data preparation, the structure and blocks used in this study, and the evaluation methodology. Section IV focuses on the results, network training, and evaluation of the results. Finally, Section V concludes this article.

SECTION II.

Related Works—State-of-the-Art

In this section, the theoretical foundations of the research on super-resolution and the relevant machine learning and deep-learning methods are first examined. Previous studies on applying super-resolution to Sentinel-2 images using deep learning are then reviewed.

Super-resolution is a powerful technique and tool that goes beyond simple interpolation methods. It enhances image resolution in various fields, including satellite imagery, medical imaging, surveillance systems, and digital photography [19]. Increasing the level of detail and resolution in these types of images provides a better understanding of details and facilitates decision-making processes through more accurate image analysis. High-resolution and better quality images are generated using various methods, such as interpolation and deep-learning-based approaches with low-resolution inputs. These methods use statistical features and contextual information to estimate missing details and improve image quality [3].

Super-resolution can be performed in two ways: single-image super-resolution and multi-image super-resolution. The first method uses a single image to produce a higher quality and higher resolution image, whereas the second method uses multiple images of the same area or object as inputs to achieve this. The latter method is commonly used to improve the quality of low-resolution videos or for video processing [20].

Various algorithms have been implemented to achieve this, each with its own advantages and disadvantages. Interpolation algorithms, such as linear or bicubic interpolation, are among the most commonly used methods but can lead to pixelation and loss of high-frequency information [21], [22]. In recent years, the use of machine learning models, especially deep-learning models, such as CNNs and generative adversarial networks (GANs), has significantly increased in this field. These networks reconstruct higher quality images by learning the image space and its features [7].

Yang et al. [23] introduced a model called PanNet, based on deep neural networks, for super-resolution using pan-sharpening methods. In this research, CNNs and the ResNet network were used to train the model. In addition to passing through the ResNet block, the network combines this block's output with the input passed through the upsample block to retain information and prevent data loss, ultimately producing the output image. This network was trained on all bands of WorldView-3 data. The SAM and ERGAS evaluation metrics in this study showed the lowest values and best performance compared with other networks, such as ImageNet and ResNet alone. However, the network's performance on the test dataset declined more than that of the ImageNet network.

Lanaras et al. [3] proposed a deep-learning model using the ResNet network for super-resolution on bands with 60- and 20-m resolution, based on the data from the 10-m bands. In this study, two CNNs were used to perform super-resolution on 60-m to 10-m and 20-m to 10-m resolution bands, respectively. Each of the networks includes a large number of residual blocks, each consisting of two CNNs and a ReLU activation function in the residual part. The output of this block includes the input and the scaled output from the second CNN. The trained network demonstrated acceptable and superior performance compared with other networks in terms of RMSE and SRE evaluation metrics.

Zhu et al. [24] proposed a deep-learning-based model called DCARN for super-resolution. The network architecture is inspired by the study conducted by Lanaras et al., and data from 60-m resolution bands were excluded. The model leverages a channel attention mechanism and uses ResNet as the backbone to extract image features. The channel attention mechanism used in this study includes a CNN layer along with the summation of previous input pixels, followed by passing through a sigmoid function as the activation function. Initially, the 10-m and 20-m resolution bands are placed in two separate matrices. Then, upsampling is performed on the 20-m resolution matrix, and after combining it with the 10-m resolution matrix data, it is fed as input to the network. BatchNorm blocks were not used in the residual blocks of this network due to the reduction in pixel reconstruction accuracy. This model was also trained with other models that had slight variations from the proposed model. However, the proposed model showed better performance in terms of PSNR and SSIM metrics compared with the other models, and a 20% reduction in PSNR compared with the work done by Lanaras et al.

In another study, Galar et al. [7] proposed a supervised model for super-resolution using a CNN. Sentinel-2 images in RGB format from a specific area were fed as input to the network, which began learning by comparing with images of the same area taken by the PlanetScope satellite. In the initial layers, the network was inspired by the ResNet architecture and trained using shortcut blocks. A pixel shuffle block was added after the activation layer, utilizing the style loss function to learn the pixel distribution. In addition, for better network learning, images were filtered using a blur filter after passing through the pixel shuffle block. The authors compared the network output with the result of bicubic interpolation, achieving up to a 1.2% improvement in the PSNR metric. To improve image quality at resolutions of 5 and 2.5 m, pixel shuffle layers were used.

Kawulok et al. [20] utilized multiple images of shared scenes from a specific area to gather information and perform super-resolution. To use multiple images, the authors needed images taken from a specific point under nearly identical conditions. In this study, multiple CNN layers were each used to learn the distribution of one image, with the input images passed through a bicubic interpolation block before being fed into the CNNs. Finally, another CNN was used to learn and adjust its weights based on the layers in the previous networks, thereby completing its learning process. The output images of this study were provided in both multiband and RGB formats.

In previous research, CNNs or pretrained networks, such as ResNet, have been used as the main block to extract image features for converting 60-m and 20-m band data to 10-m data. Methods based on pan-sharpening techniques also show acceptable performance, but with changes in the dataset, their performance decline is greater compared with deep-learning methods. Since the network's training data consist of Sentinel-2 images, the network was trained using the proposed architecture to convert 60-m and 20-m data to 10-m resolution. While previous studies have achieved notable success in super-resolution using CNN-based methods, such as ResNet, or attention-based networks, such as Swin Transformer, these approaches exhibit certain limitations. CNN-based models, such as ResNet, often struggle to capture long-range dependencies due to the local receptive fields of convolutional layers. Similarly, Swin Transformers, although effective, rely heavily on hierarchical attention mechanisms, which may lose fine spatial details in downsampling processes. In addition, many existing models lack sufficient integration of spatial and spectral information, limiting their performance on multiband satellite imagery. The proposed transformer-based model addresses these challenges by incorporating multihead cross attention to capture global dependencies effectively, while spatial and channel attention blocks ensure preservation of fine-grained spatial and spectral details during super-resolution.

SECTION III.

Material and Methods

A. Satellite Data Observation

Data collection has been conducted from the Google Earth Engine system at resolutions of 60, 20, and 10 m during the summer season from the Sentinel-2 satellite. Due to the large size of the image areas obtained from Google Earth Engine, 10-m resolution images were cropped to dimensions of 180 × 180 with a 120-pixel offset, 20-m resolution images to dimensions of 90 × 90 with a 60-pixel offset, and 60-m resolution images to dimensions of 30 × 30 with a 20-pixel offset from the previous window using a sliding window technique (Fig. 1).
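As an illustration of this cropping step, a minimal sketch is shown below; the array layout (bands last) and the function name are our own assumptions, not the authors' code:

```python
import numpy as np

def sliding_window_crops(image: np.ndarray, window: int, offset: int) -> list:
    """Crop fixed-size windows from an (H, W, bands) array using a fixed stride (offset)."""
    patches = []
    height, width = image.shape[:2]
    for top in range(0, height - window + 1, offset):
        for left in range(0, width - window + 1, offset):
            patches.append(image[top:top + window, left:left + window])
    return patches

# Window/offset pairs from the text: 180/120 for 10-m, 90/60 for 20-m, and 30/20 for 60-m bands,
# e.g., patches_10m = sliding_window_crops(scene_10m, window=180, offset=120)
```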

For validation of the trained data, a Gaussian filter was applied before downscaling the 10-m data to 20 m, the 20-m data to 40 m, and the 20-m data to 120 m. The downscaling was performed using the block_reduce function from the scikit-image library [25]. In this downsampling process, the Gaussian filter smooths the 10-m and 20-m resolution images before reducing their sizes; its parameters were set to a kernel size of 3 × 3 and a standard deviation σ = 1.0. These values were chosen to balance noise reduction and preservation of spatial details, ensuring a fair comparison between the original and downsampled images. The reason for downscaling from 20 to 40 m or from 10 to 20 m is to enable comparison and error calculation of the model's performance: the 10-m product contains four bands while the 20-m product contains six, so directly comparing bands across the two products would not yield an accurate error estimate. In addition to increasing the number of training samples with the sliding window technique, gamma correction was used to further expand the dataset and improve network reliability. This method increases or decreases image brightness by adjusting γ in the following equation [26], [27]: \begin{equation*} I^{\prime} = 255 \cdot {\left( \frac{I}{255} \right)}^{\frac{1}{\gamma}}. \tag{1} \end{equation*}

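The downscaling described above (Gaussian smoothing followed by block averaging with block_reduce from scikit-image) can be sketched as follows; the choice of np.mean as the reduction function is an assumption on our part:

```python
import numpy as np
from skimage.filters import gaussian
from skimage.measure import block_reduce

def downscale_band(band: np.ndarray, factor: int) -> np.ndarray:
    """Smooth a single band with a Gaussian filter (sigma = 1.0), then shrink it by
    averaging non-overlapping factor x factor blocks."""
    smoothed = gaussian(band.astype(np.float64), sigma=1.0)
    return block_reduce(smoothed, block_size=(factor, factor), func=np.mean)

# e.g., simulate a 40-m version of a 20-m band for validation:
# band_40m = downscale_band(band_20m, factor=2)
```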

In this study, 10-m resolution images with an average brightness per channel between 128 and 142 or between 200 and 210 were used to generate higher and lower brightness versions of the same scene using gamma values of 2 and 0.6, respectively. These images were used separately in the main model for converting 60-m data to 10 m and 20-m data to 10 m. For data preprocessing, gamma correction was applied to simulate varying image brightness levels, with γ = 2.0 producing brighter images and γ = 0.6 producing darker ones. These values were selected to simulate varying lighting conditions in the dataset: a γ value of 0.6 darkens the image to simulate underexposure, while a γ value of 2.0 brightens the overall image to simulate overexposure. This augmentation strategy improves the model's robustness to variations in illumination, ensuring better generalization to real-world satellite imagery with diverse lighting conditions. Combined with the sliding-window cropping described above (180 × 180 pixels with a 120-pixel offset at 10 m, 90 × 90 with a 60-pixel offset at 20 m, and 30 × 30 with a 20-pixel offset at 60 m), these steps ensured uniform data preparation and enhanced the diversity of the training dataset.
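A minimal sketch of this gamma-correction augmentation, implementing (1) with the exponent 1/γ so that γ = 2.0 brightens and γ = 0.6 darkens an 8-bit patch (the function name and clipping are our assumptions):

```python
import numpy as np

def gamma_correct(patch: np.ndarray, gamma: float) -> np.ndarray:
    """Apply I' = 255 * (I / 255)^(1 / gamma) to an 8-bit image patch, as in (1)."""
    normalized = patch.astype(np.float64) / 255.0
    corrected = 255.0 * np.power(normalized, 1.0 / gamma)
    return np.clip(corrected, 0, 255).astype(np.uint8)

# brighter = gamma_correct(patch, gamma=2.0)   # simulates overexposure
# darker   = gamma_correct(patch, gamma=0.6)   # simulates underexposure
```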

B. AID Dataset

This large dataset includes 200-400 images per class across 30 scene classes and has been collected and preprocessed for use in scene classification and semantic segmentation tasks. The preprocessing conducted to improve quality was compared with the original aerial images, and the authors noted the absence of any differences, even at the pixel level, between the preprocessed images and the aerial images [28]. All images in this dataset are 600 × 600 pixels in size. To use it for training the network implemented in this study, the images were downscaled to 300 × 300 pixels using the same method applied to the Sentinel-2 dataset. The performance variation across classes in the AID dataset reflects differences in scene complexity: uniform classes, such as Desert, achieve higher PSNR scores, while complex scenes, such as sparse residential, exhibit lower scores due to the intricate spatial details that require accurate reconstruction (Fig. 2).

C. UC-Merced Dataset

The UC-Merced dataset includes 100 images per class across 21 scene classes, each image with dimensions of 256 × 256 pixels and a resolution of 30 cm. It has been collected and preprocessed for scene classification and semantic segmentation tasks, comprising a total of 2100 images [29]. The resizing process applied to the AID dataset was also applied here for network training, producing 128 × 128 pixel images using the downscaling method applied to the Sentinel-2 dataset (Fig. 3).

D. Proposed Network Architecture for Super-Resolution

According to Fig. 4, the low-quality image is provided as input to the network, and feature extraction begins in three stages using the encoder block, whose architecture is shown in Figs. 5 and 6. This process starts by extracting spatial and channel-related features in smaller dimensions of the input image. After each encoder block, a multilayer perceptron (MLP) is used for dimensionality reduction, thereby narrowing the network's focus and enhancing attention to details. This idea is inspired by the Swin Transformer architecture [30].

Fig. 1. Amir Kabir Dam at different resolutions.

Fig. 2. Sample images from the AID dataset.

Fig. 3. Sample images from the UC-Merced dataset.

Fig. 4. Illustration of the network architecture, showing the multistage feature extraction process with encoder blocks, MLPs for dimensionality adjustments, and an MHA module. The decoder blocks reconstruct the high-resolution image using the processed features.

Fig. 5. Detailed architecture of the encoder block, featuring spatial and channel attention modules along with convolutional and normalization layers. This design enables the effective extraction of spatial and color distribution features, optimizing feature representation for downstream tasks.

Fig. 6. Decoder block architecture focusing on channel attention. This design utilizes convolutional layers to generate QKV values and applies channel attention to emphasize channel-specific features, omitting spatial attention to prevent overfitting during the reconstruction of spatial elements, such as water surfaces and angles.

The results from the second and third blocks are upsampled through an MLP before being fed into the multihead attention (MHA) block. They are then layered with results from other blocks, where feature extraction begins via the MHA block. Since each output from the encoder blocks is processed independently and simultaneously through the MHA block, the number of outputs from this block is equal to the number of inputs. The first and second inputs of the decoder block on the right are initially normalized through a layer normalization block and then downsampled using an MLP. Subsequently, the network starts reconstructing a higher quality image.
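The exact wiring of the MHA block follows Figs. 4-6; as a rough PyTorch sketch of the cross-attention idea (queries taken from one feature stream, keys and values from another), assuming the encoder outputs have been flattened to token sequences of shape (batch, tokens, channels):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal multihead cross-attention sketch: queries come from one encoder stage,
    keys/values from another, so features from different stages can interact."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(queries, context, context)
        return self.norm(queries + attended)  # residual connection followed by normalization

# fused = CrossAttentionFusion(dim=64)(tokens_stage3, tokens_stage2)
```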

The network was trained separately twice using the Adam and SGD optimizers, along with the OneCycleLR scheduler. The OneCycleLR scheduler was adopted to optimize the learning rate dynamically during training. This scheduler gradually increases the learning rate to a maximum value midtraining and then decreases it to a minimal value toward the end, promoting faster convergence and improved generalization. Compared with alternative schedulers, such as StepLR and cosine annealing, the OneCycleLR approach demonstrated superior performance. Specifically, it reduced the convergence time to 175 epochs while achieving a PSNR improvement of 0.6 dB and an SSIM increase of 0.009 over StepLR. These results confirm the effectiveness of OneCycleLR in achieving optimal weight updates, avoiding local minima, and enhancing the model's performance.
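A minimal training-loop sketch with Adam and OneCycleLR is given below; the placeholder model, data, learning rates, and epoch count are illustrative assumptions, not the paper's settings (which use 200 epochs and batches of 48 on Sentinel-2):

```python
import torch
import torch.nn as nn

# Placeholder network and data standing in for the proposed model and Sentinel-2 patches.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
batches = [(torch.rand(4, 3, 48, 48), torch.rand(4, 3, 48, 48)) for _ in range(10)]
epochs = 5

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=len(batches)
)

for epoch in range(epochs):
    for low_res, high_res in batches:
        optimizer.zero_grad()
        loss = nn.functional.l1_loss(model(low_res), high_res)  # L1 reconstruction loss, cf. (3)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR raises then lowers the learning rate across training
```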

The Adam optimizer had a more positive effect on network training and improved the accuracy of the trained model. The Adam algorithm is an optimization algorithm that can be used as an alternative to the classic stochastic gradient descent (SGD) method for updating network weights based on iterations over the training data. Adam can be considered a combination of RMSprop and SGD with momentum. The Adam optimizer has several notable advantages, including easy implementation, lower computational cost compared with gradient descent, independence from diagonal rescaling of the gradients, low memory usage, and intuitive interpretation of hyperparameters. The Adam optimization algorithm incorporates the benefits of both the AdaGrad and RMSProp algorithms. In Adam, the parameter learning rates are adjusted not only based on the first moment (mean) but also the second moment (variance) of the gradients. Overall, the Adam optimization algorithm performs well in practice and shows favorable results compared with other stochastic optimization methods [31].

On the other hand, the SGD algorithm has been introduced to address the computational complexity present in each iteration of gradient descent for large-scale data. The update rule for this method is given as follows [32], [33]: \begin{equation*} \theta = \theta - \eta \cdot \nabla_{\theta} J\left( \theta; x, y \right). \tag{2} \end{equation*}


Computing the gradient of the loss with respect to the parameters and recursively propagating it backward to update them is known as backpropagation. In SGD, a single sample is used to update θ rather than computing the gradient over the full dataset. This approach provides an unbiased estimate of the true gradient and removes a certain amount of redundancy.

The encoder layer attempts to extract features based on the rotation angle of objects, their position in the image, and their scaling ratio using the spatial attention block, and to understand the color distribution of each channel using the channel attention block. The encoder–decoder architecture was specifically designed to address challenges unique to Sentinel-2 images, such as varying spatial resolutions and subtle spectral variations across bands. The encoder employs both spatial attention and channel attention mechanisms to extract critical spatial structures (e.g., edges and patterns) and spectral features across multiple bands, ensuring efficient feature representation. By capturing both local and global dependencies, the encoder compensates for the resolution disparities inherent to Sentinel-2 imagery. The decoder, on the other hand, focuses on reconstructing high-resolution images by enhancing channel-specific details through channel attention, which emphasizes the spectral importance of each band while avoiding redundancy. This architecture ensures that fine spatial features and spectral relationships are preserved, improving the super-resolution performance of Sentinel-2 images.
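The paper's exact attention blocks are specified in Figs. 5 and 6; the sketch below shows one common way such channel and spatial attention modules can be realized in PyTorch (a CBAM-style formulation, which is our assumption rather than the authors' exact design):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Re-weights each band/feature channel using its global average response."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.mlp(x.mean(dim=(2, 3)))           # (B, C) global channel descriptors
        return x * weights.unsqueeze(-1).unsqueeze(-1)   # broadcast the weights over H and W

class SpatialAttention(nn.Module):
    """Highlights informative spatial locations (edges, patterns) with a single-channel mask."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))
```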

A convolutional layer is used to generate the query, key, and value (QKV) tensors, and the input is combined with the outputs from the spatial attention and channel attention blocks. Since these data are not normalized and may not lie between 0 and 1, a layer normalization block is applied again, and its output is added to the previous output. The decoder block functions similarly to the encoder block but uses only the channel attention block to emphasize features on a per-channel basis. The spatial attention block is omitted here to prevent model overfitting during the reconstruction of elements, such as water surfaces and their angles. An ablation study comparing the inclusion and exclusion of spatial attention in the decoder revealed that the PSNR and SSIM remained nearly unchanged, with a slight computational overhead when spatial attention was added. This indicates that the channel attention block alone is sufficient for reconstructing channel-specific details in the decoding process.

The loss function used for network training is the L1 loss, calculated according to (3) between the reconstructed (super-resolved) image and the high-resolution reference. This loss exposes the error in pixel reconstruction, allowing the network to gain a better understanding of the brightness of image pixels [34], [35] \begin{equation*} \text{Loss} = \frac{1}{N} \sum_{i = 1}^{N} \left\| I_{\text{HR}}^{(i)} - I_{\text{SR}}^{(i)} \right\|_1. \tag{3} \end{equation*}

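A direct NumPy reading of (3), averaging the per-image sum of absolute pixel errors over a batch of N image pairs (the array shapes are assumed):

```python
import numpy as np

def l1_reconstruction_loss(high_res: np.ndarray, super_res: np.ndarray) -> float:
    """Loss of (3): mean over the batch of the per-image sum of absolute pixel errors.
    Both arrays are assumed to have shape (N, H, W, bands)."""
    per_image = np.abs(high_res - super_res).reshape(high_res.shape[0], -1).sum(axis=1)
    return float(per_image.mean())
```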

E. Proposed Evaluation Method

In this study, to evaluate the performance of the network in super-resolution, the PSNR, SSIM, and SRE metrics have been used. Each of these metrics is defined in the following equations [36], [37], [38]: \begin{align*}\text{MSE} &= \frac{1}{mn} \sum_{i = 0}^{m - 1} \sum_{j = 0}^{n - 1} \left( X_{ij} - X^{\prime}_{ij} \right)^{2} \tag{4}\\ \text{PSNR} &= 20 \cdot \log_{10}\left( \frac{255}{\sqrt{\text{MSE}}} \right). \tag{5} \end{align*}


The SSIM is another method for measuring image similarity, calculated using (6). This metric evaluates the structural content of images better than methods, such as MSE and PSNR, which focus on pixel-by-pixel comparison [36], [37] \begin{align*} \text{SSIM}\left( I_{1}, I_{2} \right) =& \left( \frac{2\mu_{1}\mu_{2} + C_{1}}{\mu_{1}^{2} + \mu_{2}^{2} + C_{1}} \right) \cdot \left( \frac{2\sigma_{1}\sigma_{2} + C_{2}}{\sigma_{1}^{2} + \sigma_{2}^{2} + C_{2}} \right) \\ &\cdot \left( \frac{\sigma_{12} + C_{3}}{\sigma_{1}\sigma_{2} + C_{3}} \right). \tag{6} \end{align*}


This metric calculates the similarity between two images using their means, standard deviations, and cross correlation. The SRE measures the error relative to the signal strength and is used to quantify the network's error in capturing brightness during image reconstruction. Like PSNR, this metric is expressed in decibels and is calculated using the following equation [39]: \begin{equation*} \text{SRE} = 10 \cdot \log_{10}\left( \frac{n \cdot \mu_{x}^{2}}{\left\| \hat{x} - x \right\|^{2}} \right). \tag{7} \end{equation*}

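For reference, (5) and (7) can be computed directly, and SSIM is available in scikit-image; the sketch below assumes single-band 8-bit arrays:

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference: np.ndarray, reconstructed: np.ndarray) -> float:
    """PSNR of (4)-(5) for 8-bit data (peak value 255), in decibels."""
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return 20.0 * np.log10(255.0 / np.sqrt(mse))

def sre(reference: np.ndarray, reconstructed: np.ndarray) -> float:
    """SRE of (7): error relative to the mean signal strength, in decibels."""
    n = reference.size
    error = np.sum((reconstructed.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return 10.0 * np.log10(n * np.mean(reference.astype(np.float64)) ** 2 / error)

def ssim(reference: np.ndarray, reconstructed: np.ndarray) -> float:
    """SSIM of (6), delegated to scikit-image."""
    return structural_similarity(reference, reconstructed, data_range=255)
```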

SECTION IV.

Results and Discussion

This section discusses the results obtained from training the proposed network and evaluates the performance of the trained model on the collected Sentinel-2 dataset and the AID and UC-Merced datasets. The training, validation, and test data were divided into 75%, 10%, and 15% of each dataset, respectively. This split was chosen based on standard practices in deep learning to balance model optimization and unbiased performance evaluation. To minimize potential biases, we ensured that the data across all splits were randomly sampled while maintaining a representative distribution of spatial features and spectral diversity. This approach reduces the risk of overfitting and ensures that the validation and test sets reflect the real-world performance of the model. Any potential bias from spatial redundancy in remote sensing images is mitigated by strict nonoverlapping cropping of image patches across splits.

The proposed model was trained separately for 200 epochs on the Sentinel-2 dataset images and for 100 epochs on the AID and UC-Merced datasets. The training was conducted on an Ubuntu operating system with an AMD Ryzen 7 6800H 3.20-GHz processor, 16 GB of RAM, and an Nvidia RTX 3070 GPU.

The superior performance of the proposed model can be attributed to the MHA mechanism, which effectively captures both local and global dependencies across spatial and spectral dimensions. In datasets, such as Sentinel-2, where bands, such as B8a, exhibit low contrast and subtle spatial features, MHA allows the model to focus on critical regions and integrate information across multiple bands, leading to improved reconstruction of fine details. On high-resolution datasets, such as AID and UC-Merced, this mechanism enables the model to maintain sharp edges and textures by attending to long-range spatial dependencies. Compared with conventional CNN-based models, which are limited by local receptive fields, the MHA ensures that spatial features across large areas are better represented, thereby enhancing the PSNR and SSIM metrics across all datasets.

As noted in Section III, for fair validation and calculation of the PSNR, SSIM, and SRE metrics, the 20-m data were downscaled to 40 m so that the network's 40-m to 20-m reconstruction could be compared against the original 20-m bands, serving as a proxy for the 20-m to 10-m conversion; likewise, the 60-m data were downscaled to 360 m to assess the 360-m to 60-m reconstruction as a proxy for the 60-m to 10-m conversion. Table I lists the number of images used for training, validation, and testing of the trained model.

An ablation analysis was performed to examine the effect of different batch sizes (32, 48, and 64) on model performance. A batch size of 48 was found to offer the best tradeoff between model accuracy and computational efficiency. Reducing the batch size to 32 resulted in slightly better convergence but increased training time, while increasing the batch size to 64 led to faster training but caused a minor degradation in performance due to less frequent weight updates. Based on these results, we selected 48 as the optimal batch size, balancing performance and training efficiency. The choice of three encoder blocks was likewise made to balance performance and computational efficiency: increasing the number of encoder blocks beyond three led to marginal improvements while significantly increasing computational costs, whereas reducing the number of blocks to two caused a noticeable decline in the model's ability to extract and reconstruct spatial details. Therefore, after careful experimentation, three encoder blocks were selected as the optimal configuration to ensure robust feature extraction without excessive computational overhead.
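A simple way to realize this 75/10/15 split over the cropped patches (a sketch; the seed and index-based bookkeeping are our assumptions):

```python
import random

def split_indices(num_patches: int, seed: int = 0):
    """Randomly assign patch indices to train/validation/test with a 75/10/15 split."""
    indices = list(range(num_patches))
    random.Random(seed).shuffle(indices)
    n_train = int(0.75 * num_patches)
    n_val = int(0.10 * num_patches)
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]

# train_idx, val_idx, test_idx = split_indices(len(all_patches))
```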

TABLE I Data Frequency of Training, Validation, and Test Stages in the Sentinel-2 Dataset

During network training, the loss function value is calculated for each batch of 48 samples, and these values are summed at the end of each training epoch. Fig. 7 shows the training and evaluation loss values of the model. Based on the decreasing trend of the network's loss function, the best weights were achieved at the 161st training epoch.

Fig. 7. Results of the loss function during network training.

The PSNR, SSIM, and SRE metrics were also calculated during and after training, and their average values for each band are given in Tables II and III, respectively. The metric values calculated during training are shown in Figs. 8–10.

TABLE II Calculated Metrics for Transforming 40-m Data to 20-m Resolution for Each Band
TABLE III Calculated Metrics for Transforming 360-m Data to 60-m Resolution for Each Band
Fig. 8. PSNR metric results during network training.

Fig. 9. SSIM metric results during network training.

Fig. 10. SRE metric results during network training.

The jumps in Figs. 8–10 are due to the OneCycleLR scheduler, which adjusts the learning rate during network training to prevent vanishing gradients.

In addition to the proposed network, the Swin and ViT transformer models, the ResNet network, and the bicubic algorithm were also trained to compare the performance of the networks for each band in transforming 20-m images to 10-m and 60-m images to 10-m resolutions. The super-resolved versions of these images for the Malard region are shown in Figs. 11 and 12, respectively.

Fig. 11. Comparison of the proposed network's performance with other networks for transforming 20-m images to 10-m resolution for each band.

Fig. 12. Comparison of the proposed network's performance with other methods for transforming 60-m images to 10-m resolution for each band.

Fig. 13. Comparison of the proposed network's performance with other methods for transforming 20-m images to 10-m resolution for the B5 band.

Fig. 14. Comparison of the proposed network's performance with other methods for transforming 20-m images to 10-m resolution for the B5 band.

According to Fig. 12 and the comparison with the bicubic method, it can be seen that the bicubic algorithm performs super-resolution based only on local features, whereas the learned networks that convert images to 10-m resolution also take into account the mean and standard deviation of the pixel distribution at 10-m resolution during learning. Although this also holds for the 20-m to 10-m images in Fig. 11, the color changes and per-band distribution shifts are greater for the 60-m to 10-m images.

Since comparing the super-resolved 20-m to 10-m bands with the 10-m bands is not a fair and accurate comparison, we use the transformation of 40-m images to 20-m resolution to evaluate the network's performance.

The network has shown less accuracy in learning the intensity values for the B8a band compared with other bands, which is why the SRE and PSNR metrics have lower values for this band. It can be said that, based on all three calculated metrics, the network has achieved better learning for the B12 band at 20-m resolution compared with the other bands.

According to Fig. 13, the network demonstrates good learning capability for super-resolution of the B5 band compared with the other networks. The network's performance on RGB-band images is compared later in this section. According to the table, the network learns the features and details of the B1 band better than those of the B9 band when converting from 60-m to 10-m resolution, providing more detail for this band in the reconstructed image. The lower performance observed for certain bands, such as B8a, can be attributed to their spectral characteristics and the nature of the captured information. Band B8a, a near-infrared band, often exhibits lower spatial detail and contrast due to its sensitivity to vegetation and subtle surface variations, which are harder to reconstruct compared with high-contrast bands, such as RGB. In addition, the limited spatial resolution of the input data for this band increases the difficulty for the model to learn and preserve finer details during the super-resolution process. To address this limitation, future work could involve incorporating multisensor data fusion or spectral attention mechanisms to improve learning performance for low-contrast bands (Fig. 14).

The SSIM score of 0.7841 indicates strong preservation of structural details, which is critical for tasks, such as land cover classification, urban planning, and disaster management. Minor discrepancies are mainly observed in low-contrast regions, which can be addressed in future work. The figure below shows the super-resolved B5 and B9 bands for transforming 20-m images to 10-m and 60-m images to 10-m resolution, respectively.

Finally, the average PSNR, SSIM, and SRE metrics for each network for the 40-m to 20-m and 360-m to 60-m cases can be seen in Tables IV and V.

TABLE IV Calculated Metrics for Transforming 40-m Data to 20-m Resolution for Different Models
TABLE V Calculated Metrics for Transforming 360-m Data to 60-m Resolution for Different Models

According to Tables IV and V, the overall performance of the proposed network for the calculated metrics is better than other networks, such as the Swin and ViT networks. Only in the SRE metric in Table IV does the Swin network perform slightly better than the proposed network, with a minimal difference.

The model integrates MHA with spatial and channel attention mechanisms to optimize spatial and spectral feature extraction. Evaluation on the Sentinel-2, AID, and UC-Merced datasets demonstrates superior performance, achieving a PSNR of 33.52 dB and an SRE of 36.7 dB for Sentinel-2 images. Compared with networks, such as ResNet, Swin Transformer, and ViT, the proposed approach consistently reconstructs finer spatial details, making it a promising tool for remote sensing super-resolution tasks.

Section III mentioned that, for fair validation, the 10-m data are downscaled to 20 m so that the network can learn image details from the converted 20-m data, focusing on the RGB bands. The number of images used matches Table I, with the same images. During network training, the cost function value is calculated for each batch of 64 samples and summed at the end of each training epoch; the training and evaluation cost function values are shown in Fig. 15. According to the decreasing trend of the network's cost function, the optimal weights occurred at the 165th training epoch. In addition, this figure indicates that no overfitting occurred during model training.

Fig. 15. Results of the cost function during network training.

The model was initially trained for 300 epochs, but after 230 epochs, no change occurred in the training and evaluation cost values. During training, the PSNR, SSIM, and SRE metrics were also calculated; their values are shown in Figs. 16, 18, and 20, respectively. Each calculated metric represents the average value per channel of the input image data.

Fig. 16. PSNR metric results during network training.

As mentioned in Section III, the PSNR metric is used to measure the ratio of the maximum power of the input data to its noise. Fig. 16 shows the calculated values for the PSNR metric at each stage during training, and according to Fig. 17, the performance of the proposed model for the PSNR metric is better than other models.

Fig. 17. Comparison of the PSNR metric of the proposed network with other models.

Fig. 18 shows the calculated values for the SSIM metric of the proposed network at each training stage, and Fig. 19 shows the calculated values of this metric for evaluating different trained models.

Fig. 18. SSIM metric results during network training.

Fig. 19. Comparison of the SSIM metric of the proposed network with other models.

According to Fig. 19, the calculated value of this metric for the proposed network is negligibly lower than the results obtained from the Swin transformers network.

The SRE metric was also calculated for the proposed network at each stage during training, and the calculated values are shown in Fig. 20. According to Fig. 21, the performance of the proposed model for the SRE metric is better than that of the other models.

Fig. 20. SRE metric results during network training.

Fig. 21. Comparison of the SRE metric of the proposed network with other models.

The reason for the jumps in the graphs of Figs. 16, 18, and 20 is the use of the OneCycleLR scheduler to adjust the learning rate during network training to prevent gradient vanishing.

To evaluate and compare the spatial resolution capability of the proposed network relative to the ResNet architecture on the collected dataset, the PSNR and SSIM evaluation metrics were recalculated on 64 × 64 squares. The average differences between the super-resolved images generated by the proposed network and by ResNet are 4.37 dB for PSNR and 0.163 for SSIM.
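This per-tile evaluation can be sketched as follows, averaging PSNR and SSIM over non-overlapping 64 × 64 squares of a single band (scikit-image metric functions are used for brevity):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def tiled_metrics(reference: np.ndarray, reconstructed: np.ndarray, tile: int = 64):
    """Average PSNR/SSIM over non-overlapping tile x tile squares of a single-band image."""
    psnr_scores, ssim_scores = [], []
    height, width = reference.shape
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            ref = reference[top:top + tile, left:left + tile]
            rec = reconstructed[top:top + tile, left:left + tile]
            psnr_scores.append(peak_signal_noise_ratio(ref, rec, data_range=255))
            ssim_scores.append(structural_similarity(ref, rec, data_range=255))
    return float(np.mean(psnr_scores)), float(np.mean(ssim_scores))
```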

The super-resolved versions of the low-quality inputs produced by the proposed network and the other networks are compared in Fig. 22. The proposed model and the Swin Transformer network pay more attention to image texture than the other models, which is also validated on the AID and UC-Merced datasets. Since it is not fair to evaluate super-resolution from 10-m to 5-m data due to the lack of 5-m reference data, the calculated metrics are based on the conversion from 20-m to 10-m data. To further evaluate the network's attention to image details, the proposed network was separately trained for 100 epochs on the AID and UC-Merced datasets, which contain high-resolution images. Since these datasets are primarily used for tasks, such as scene classification, their images are already categorized by class. The training results of the network for each class are presented in Tables VI and VII, respectively.

Fig. 22. Comparison of the performance of the proposed network with other networks and methods.

TABLE VI Calculated Metrics for Each Class of the AID Dataset
TABLE VII Calculated Metrics for Each Class of the UC-Merced Dataset

Since the AID and UC-Merced datasets do not contain low-quality data, the method used for downscaling the Sentinel-2 dataset was applied to create low-resolution data for training and testing the model. A comparison of the calculated values in Tables VI and VII shows that the trained model performed better on the AID dataset than on the UC-Merced dataset. This is attributed to the AID dataset containing between 200 and 400 images per class, compared with the UC-Merced dataset, which contains only 100 images per class.

Table VIII lists the calculated metrics for the trained model on the AID and UC-Merced datasets, and the performance of the proposed model is compared with other models for these two datasets. The values in Table VIII are the averages computed over all classes. The superior performance of the proposed model, as demonstrated through higher PSNR, SSIM, and SRE values, holds significant implications for practical applications. In disaster management, the enhanced spatial resolution of Sentinel-2 images enables clearer identification of affected areas, such as detecting damaged infrastructure, mapping flood extents, or assessing wildfire impact, thereby improving the speed and accuracy of emergency response. For urban planning, the model's ability to reconstruct fine spatial details allows for better delineation of buildings, roads, and land-use patterns, facilitating more precise infrastructure monitoring and resource allocation. These results highlight the potential of the proposed model to bridge the gap between satellite imagery's resolution limitations and the needs of critical applications requiring detailed spatial information.

TABLE VIII Calculated Metrics for Each Dataset

While the proposed transformer-based model achieves superior performance in terms of PSNR, SSIM, and SRE, it introduces certain tradeoffs. The computational cost of transformers, especially due to the self-attention mechanism, is higher compared with traditional CNN-based approaches. This can lead to increased training time and memory usage, particularly when processing larger datasets or high-resolution images. In addition, scalability to very large datasets remains a challenge, as transformers require substantial computational resources to model long-range dependencies. To address these limitations, future work will explore lightweight transformer variants and optimization techniques, such as mixed-precision training or model pruning, to improve computational efficiency without compromising performance.

The AID and UC-Merced images were also used to train the ResNet, Swin Transformer, and ViT models and to evaluate the bicubic baseline in addition to the proposed network. Generally, the Swin model and the proposed network pay more attention to image texture during super-resolution than the other two models. To evaluate the computational efficiency of the proposed model, we compared its floating-point operations (FLOPs) and training time with the Swin Transformer and ViT architectures under identical conditions. The proposed model requires significantly fewer FLOPs (2.5 G versus 3.1 G for Swin and 3.6 G for ViT) while maintaining competitive performance. Furthermore, the training time for 100 epochs was 5.5 h for the proposed model, compared with 6.8 h for the Swin Transformer and 7.4 h for ViT. These results demonstrate that our architecture offers a more computationally efficient solution without compromising accuracy, making it well suited for large-scale remote sensing applications.

To compare the spatial resolution capability of the networks trained on the AID dataset, a small region of the image was selected, and the PSNR and SSIM metrics were calculated for it. Fig. 23 shows the original, nonmagnified input image, and Fig. 24 shows the magnified region used to compare the trained models.

Fig. 23. Super-resolution results of trained models on the AID dataset.

Fig. 24. Comparison of the spatial resolution capability of trained models.

The line distribution on the road surface is blurrier in the bicubic method's output compared with the other methods. The ViT method pays more attention to image texture than the ResNet method; however, its attention and precision in implementation do not reach the outputs of the Swin network and the proposed model. In addition, the diagonal line on the road surface is only visible in the outputs of the proposed network and the Swin network. The SSIM values for the outputs of the proposed network and the Swin network are 0.9456 and 0.9327, respectively.

Finally, the models trained on the UC-Merced dataset were also evaluated for spatial resolution capability. According to Fig. 25, which is a magnified area from Fig. 26, the output from the ResNet network does not replicate the tennis court as accurately as the proposed network's output. The ResNet network also fails to capture the positioning angles of the cream-colored blocks, although it generally performs better than the bicubic method's output.

Fig. 25. Comparison of the spatial resolution capability of trained models.

Fig. 26. Super-resolution results of trained models on the UC-Merced dataset.

Since the network extracts and learns spatial features, such as edges, rotation angles, and thickness, based on the dimensions of the target object using the spatial attention block, the following paragraphs examine the performance of this block in aiding the network's learning process.

To better assess the network's attention to geometric and radiometric features, images from the AID dataset were used due to their higher quality compared with the collected dataset and the UC-Merced dataset. Applying learned geometric features to smaller objects in the image is more challenging for the network than for larger objects. The second row of Fig. 27 shows a cropped area from the top of the image. This part of the super-resolved image closely resembles the original; still, certain elements, such as noise in the image, are treated as part of the block by the network, resulting in a thicker representation in the super-resolved image compared with the original.

Fig. 27. Network attention to geometric features.

Fig. 28. Network focus on geometric characteristics.

In Fig. 27, comparing the cropped super-resolved image with the original in the second row shows that certain details, such as the empty space between two blocks, are not accurately captured by the network; these empty spaces are mistakenly filled based on surrounding values. To improve the network's accuracy in capturing such finer details, additional encoder layers could be used, although this would require more computational resources.

To evaluate whether the network becomes biased toward color features through the channel attention block, which is responsible for extracting the features and color distribution of each band, the network's attention to radiometric features is examined. Each pixel value in the images, initially in 8-bit format (allowing a range from 0 to 255), was reduced to 6 and 4 bits using the Pillow package and then provided to the network for super-resolution. Images from the AID dataset were used for this assessment due to their higher quality compared with the collected dataset and the UC-Merced dataset.
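The bit-depth reduction can be sketched as below; quantizing by dropping the least significant bits is our assumption about how the 6-bit and 4-bit versions were produced, and the filename is hypothetical:

```python
import numpy as np
from PIL import Image

def reduce_bit_depth(path: str, bits: int) -> Image.Image:
    """Quantize an 8-bit RGB image to the given bit depth (e.g., 6 or 4 bits per channel)."""
    array = np.asarray(Image.open(path).convert("RGB"))
    shift = 8 - bits
    quantized = (array >> shift) << shift   # drop the least significant bits of each channel
    return Image.fromarray(quantized.astype(np.uint8))

# low_bit = reduce_bit_depth("aid_sample.png", bits=4)   # "aid_sample.png" is a placeholder name
```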

According to Fig. 28, it is evident that the network has not become biased in learning the distribution of colors and image details, and the 4-bit and 6-bit images have been super-resolved based on the provided color distribution. This behavior is also observable in Fig. 29, which has been cropped for better analysis.

The observed variability in performance between the AID and UC-Merced datasets can be attributed to inherent differences in their image characteristics. The AID dataset comprises larger, high-resolution images with diverse and complex scenes, such as residential areas and industrial regions, which pose greater challenges for reconstructing fine spatial details. This complexity leads to a wider performance variation across classes. In contrast, the UC-Merced dataset contains smaller images with more homogeneous content, such as agricultural fields and structured urban layouts, which are easier to reconstruct with higher consistency. These differences explain why the proposed model achieves higher PSNR and SSIM scores on UC-Merced compared with AID. Future work could explore dataset-specific optimizations to further improve performance on more complex scenes. The proposed method also relies on large-scale, high-quality datasets for effective training, which may limit adoption in resource-constrained settings, and preprocessing and labeling Sentinel-2 data remain resource intensive.

Fig. 29. Comparison of network attention with radiometric features.

SECTION V.

Conclusion

Satellite images have become essential tools in various fields, including environmental monitoring, urban planning, agriculture, and national security. However, the quality of satellite images is often affected by the spatial resolution limitations of sensors, resulting in blurred representations in the captured images. Super-resolution enables the enhancement of spatial resolution in low-resolution images to reveal finer details and improve the overall image quality. The proposed model's ability to enhance Sentinel-2 satellite images has wide-reaching implications for real-world applications. For disaster management, super-resolved images can provide clearer insights into affected areas, enabling faster and more precise assessments of damage and aiding emergency response teams. In agriculture, improved resolution supports better crop monitoring and early detection of pest infestations or drought conditions. In addition, for urban planning, the model enables more accurate identification of infrastructure, roads, and land use, thereby facilitating smarter development and resource allocation. By improving the quality and usability of satellite imagery, this research paves the way for advancements in decision-making processes across various sectors reliant on high-resolution spatial data.

Machine learning and deep learning are branches of artificial intelligence that teach computers to make decisions like humans. Machine learning focuses on feature engineering and model training, whereas deep learning eliminates the need for feature engineering, allowing the model to handle this through hidden layers. In this study, a transformer-based network is used, which aims to learn and reconstruct a clearer and higher quality version of the input image using encoder and decoder blocks. The reason for using transformer models is to compare their ability to resolve spatial details in Sentinel-2 satellite images compared with other neural network architectures, such as ResNet. In this study, the transformer models showed better performance than the ResNet architecture for the Sentinel-2, AID, and UC-Merced image sets. The trained model generally outperformed other models for the calculated metrics.

The proposed model demonstrates better spatial resolution capabilities than the ResNet network, which uses only CNNs in its architecture. Due to the unavailability of 5-m resolution data, super-resolution from 10 to 5 m was approximated by training the network to perform super-resolution from 60-m and 20-m data to 10 m. Since the network cannot be evaluated for converting 10-m data to 5 m without 5-m reference data, the proposed network was also trained on the AID and UC-Merced datasets to fairly evaluate its attention to image details and its performance on the calculated metrics. The performance of the proposed network was compared with other models, such as Swin Transformer, ViT, and ResNet, for the conversion of 60-m and 20-m bands to 10 m and of the RGB bands from 20 to 10 m. For the conversion of the 20-m RGB bands to 10 m, the proposed model achieved better performance than the other models in the PSNR and SRE metrics, with values of 33.52 dB and 36.7 dB, respectively. In addition, the average evaluation metrics for converting images from 60 and 20 m to 10 m per band indicate that the proposed network outperforms the other methods.

A significant limitation in training these types of models is the lack of training data and the computational resources required for training deep-learning models. To improve the model's accuracy, instead of using 20-m to 10-m data as a proxy for 10-m to 5-m conversion, it would be beneficial to use 5-m data of the same scenes from other satellites. Using multiple images as input could also help, but this approach requires many images of the same location at 10-m resolution and substantial computational resources, since multiple inputs increase the number of parameters; generating outputs for new inputs would also require aligning those inputs with the quality of the training data. GANs can also be trained for super-resolution; however, they require a large amount of high-quality training data, as they must learn from high-quality images to generate high-quality outputs. This study highlights the practical significance of the proposed method for applications, such as disaster management and urban planning, where enhanced spatial resolution enables accurate mapping and improved decision making. Future advancements in lightweight transformers and scalable approaches could further extend its real-world applicability.
