Introduction
Hyperspectral imaging systems can concurrently capture surface information across hundreds of contiguous bands, yielding a set of spectral images depicting the same scene [1], [2]. The primary advantage of hyperspectral images (HSIs) over traditional natural or multispectral images lies in their richer spectral information, which enables accurate distinction and identification of objects within the image scene, making them widely applicable in fields such as image classification [3], [4], target detection [5], [6], and change detection [7]. However, due to inherent limitations in incident energy and imaging hardware, HSIs often suffer from low spatial resolution (such images are termed LR-HSIs), which significantly restricts their practical applications [8]. In contrast to hardware upgrades, hyperspectral image super-resolution reconstruction (HSI-SR) is a postprocessing technique that obtains high-spatial-resolution hyperspectral images (HR-HSIs) algorithmically. Owing to its low cost and high efficiency, HSI-SR has become a necessary and promising research direction.
Generally, HSI-SR techniques can be divided into two categories based on whether auxiliary information is required: single HSI super-resolution and fusion-based HSI super-resolution. The former is self-contained and easy to implement, but since HSI-SR is an ill-posed problem, relying solely on a single LR-HSI can leave the reconstructed images lacking detail. The latter introduces high-resolution multispectral images (HR-MSIs) of the same scene as auxiliary data, leveraging the advantages of different data sources to achieve better reconstruction results [9].
Over the past few decades, the HSI-SR domain has rapidly developed, with many fusion-based methods emerging. These methods can be broadly categorized into model-based and deep learning (DL)-based approaches. Model-based methods typically involve manually constructing various priors (e.g., self-similarity [10], sparsity [11], [12], and low rank [13], [14]) as regularizers to achieve reconstruction. While these methods show commendable performance, they suffer from issues such as being time-consuming and having limited representational capacity due to their reliance on manually crafted priors. With the rapid development of DL techniques, especially convolutional neural networks (CNNs), DL-based methods have demonstrated impressive performance [15], [16], [17], [18], [19]. These methods are inherently data-driven, allowing networks to autonomously learn priors from the characteristics of the dataset itself, thus offering greater flexibility [20]. However, the fixed receptive field of CNNs due to convolution kernel size makes them inefficient in modeling long-range dependencies, which somewhat limits their fusion performance [21].
To address these issues, the self-attention mechanism from the natural language processing (NLP) field has garnered increasing attention for its outstanding global feature extraction capabilities. In particular, the vision transformer (ViT) [22] introduced self-attention into the computer vision (CV) field for the first time by partitioning images into patches and adding positional embeddings. Its excellent global context modeling capability effectively addresses edge effects in HSI-SR tasks. However, the global self-attention mechanism of ViT exhibits quadratic computational complexity with respect to the input image size, leading to increased GPU memory demands. To mitigate these limitations, the Swin Transformer [23] uses a hierarchical structure and a shifted window mechanism to reduce the length of the input sequence, making the self-attention mechanism a more versatile backbone. Nonetheless, it still faces challenges such as fixed window sizes and high computational and memory demands when processing high-resolution images. The Performer [24] approximates the softmax function using orthogonal random feature mapping, thereby altering the order of matrix computations in self-attention to achieve linear complexity. The sparse transformer [25] restricts each element to interact only with a subset of elements in the sequence, thus sparsifying the attention score matrix.
Motivated by these developments, especially the sparse transformer and the channel attention mechanism [26], this article constructs the spectral-enhanced sparse transformer (SEST), a novel network architecture for HSI super-resolution reconstruction. Specifically, in the spatial domain, sparse self-attention is used to model long-range dependencies, complemented by a local enhanced feed-forward network (LeFF) that retains complex local details, thereby achieving more efficient local-global feature learning. In the spectral domain, a spectral enhancement module is designed to explore the correlations between adjacent bands of HSIs and integrate them into the transformer structure, effectively preserving the original spectral information while promoting the interaction between spatial and spectral information, thus enhancing the network's ability to learn critical features. Building on these two designs, a multiwindow residual block is employed to learn multiscale features from the input image, and long and short skip connections are added to the network, improving the flexibility of information flow and the robustness of the network. The primary contributions of this article are as follows.
A novel SEST-based super-resolution reconstruction network is proposed, effectively leveraging nonlocal spatial similarity and the spectral low-rankness inherent in HSIs.
A multiwindow residual block is specifically designed to extract features at different levels of granularity. The incorporation of a weighted linear combination facilitates the fusion of these features, contributing to an enhancement in the quality of the reconstructed image.
A spectral enhancement module is implemented in the self-attention calculation stage to boost the network's capability of spectral information extraction, facilitating the recalibration of the self-attention map so that more pixels are activated. In addition, LeFF is incorporated in place of the standard feed-forward network (FFN), further enhancing the network's capacity to exploit local contextual details.
The rest of this article is organized as follows. Section II reviews related work in the domain of HSI-SR, Section III clarifies the proposed SEST architecture along with the loss function, Section IV presents the experimental validation of the proposed method, and Section V concludes this article.
Related Works
This section reviews some of the most notable recent advancements in HSI-SR techniques, categorizing them into three types: model-based methods, CNN-based methods, and ViT-based methods. Additionally, the limitations of existing ViT-based methods are analyzed in detail.
A. Model-Based Methods
Model-based methods, a classical approach to HSI-SR, can be divided into three primary categories. The first category includes techniques based on panchromatic sharpening, such as component substitution (CS) and multiresolution analysis. A widely employed method within this category is the adaptive Gram–Schmidt algorithm proposed by Aiazzi et al. [27], which incorporates the spectral response function into CS. Another notable method, proposed by Selva et al. [28], involves a super-resolution framework utilizing linear regression to represent each hyperspectral band as a linear combination of multispectral bands. While computationally efficient, these methods often yield reconstructed images of unreliable quality. The second category consists of methods relying on prior information analysis of HSI, including sparse representation-based and Bayesian-based methods. Specifically, Akhtar et al. [29] proposed a Bayesian framework based on sparse representation, deriving the probability distribution of the spectral bases and computing the sparse coding of the high-resolution image. Wei et al. [30] integrated the explicit solution of the Sylvester equation into the Bayesian-based method, named "fast fusion based on Sylvester equation" (FUSE), significantly reducing algorithm complexity while ensuring the quality of the reconstructed image. Simões et al. [31] introduced total variation regularization for effective edge preservation in a convex optimization of subspace coefficients. The third category involves decomposition-based methods, which have attracted considerable attention for being explainable and understandable; the most representative among them are matrix factorization-based methods. For instance, Yokoya et al. [32] proposed a coupled nonnegative matrix factorization (CNMF) method, employing CNMF to alternately estimate endmembers and abundances. However, such methods cannot effectively preserve the spectral structure of HSIs. To address this, tensor decomposition-based methods have been explored, such as the nonlocal sparse tensor decomposition-based method proposed by Dian et al. [33], which estimates a sparse core tensor and dictionaries for HSI-SR, showcasing potential in preserving spectral information for high-quality reconstruction.
B. CNN-Based Methods
The burgeoning interest in DL has led to the rapid development of CNN-based methods for HSI-SR. For example, Li et al. [34] proposed an X-shaped interactive autoencoder network, integrating the concept of matrix factorization into DL to facilitate cross-modal learning between hyperspectral and multispectral data. To inject more texture details into HSI, the IFMSR [35] integrates an RGB-induced detail enhancement module and a deep cross-modal feature modulation module. While demonstrating efficacy in various scenarios, these models are primarily data-driven and lack interpretability. In response, Xie et al. [16] proposed a model-based DL method using a deep unfolding network inspired by the traditional alternating iterative algorithm, employing CNNs to learn the proximal operator and model parameters. The aforementioned methods are all based on 2D convolution and thus fail to effectively model the 3D spectral structure of HSIs. Mei et al. [36] introduced 3D fully convolutional networks into HSI-SR tasks, allowing the network to better learn both spatial and spectral information. However, due to the high spectral resolution of HSI, methods based on 3D CNNs suffer from large parameter sizes and high computational complexity. To mitigate these issues, Li et al. [37] exploited the spectral similarity between bands to group them, thereby reducing the computational cost of the network. Li et al. [38], on the other hand, addressed computational efficiency from a network structure perspective by designing separable 3D convolutions, mitigating the computational burden while preserving spatial and spectral separability. Furthermore, the ill-posed nature of the HSI-SR task poses significant challenges for single-stage learning. To address this, Li et al. [39] designed a coarse-to-fine dual-stage learning framework: in the coarse stage, a symmetric feature propagation model is utilized for broader feature extraction; in the fine stage, a back-projection refinement network is introduced to learn image-specific features.
C. Visual Transformer-Based Methods
The transformer network, first proposed by Vaswani et al. [40], quickly gained widespread attention due to its core component, the self-attention mechanism, which offers powerful global context modeling capabilities. Dosovitskiy et al. [22] were the first to introduce the transformer into the CV field, dividing the input image into nonoverlapping patches to generate sequence elements, which are then fed into the transformer model for image recognition. Following the success of this work, transformers have been widely applied in various advanced visual tasks, such as object detection [41], [42], image classification [43], and image segmentation [44]. Among these, the Swin Transformer proposed by Liu et al. [23] is particularly noteworthy. It restricts attention computation to local windows and enhances the network's ability to capture contextual information through shift operations and a hierarchical architecture, significantly reducing the computational cost of self-attention. Inspired by this, numerous transformer-based image reconstruction methods have emerged. For instance, SwinIR [45] employs Swin Transformer-based residual blocks to extract deep features from images, showcasing the immense potential of transformers in image reconstruction. In the HSI-SR domain, Fusformer [46] made the first attempt to use ViT encoders as the main body of the network to explore the spatial and spectral information of HSIs, achieving promising results. However, utilizing a single module to simultaneously model spatial and spectral features increases the complexity of network learning. To address this issue, Long et al. [47] employed the Swin Transformer to design spatial and spectral self-attention blocks, cascading them to obtain global spatial features and spectral sequence information, contributing to a more efficient and effective HSI super-resolution reconstruction process. Although the shifted-window design can reduce computational costs, it also weakens the interaction of image boundary information. To address this, Deng et al. [48] proposed a pyramid shuffle-and-reshuffle transformer (PSRT) method, which employs shuffle techniques to achieve long-range interaction between patches. Despite the remarkable results achieved by these methods, their approaches to acquiring global information remain inefficient, particularly when dealing with highly redundant hyperspectral data. Additionally, existing methods often overlook the interaction between spatial and spectral information, and relying solely on single-type feature modeling is not conducive to the fine reconstruction of images.
Proposed Method
Drawing inspiration from the sparse transformer and channel attention mechanisms, this section proposes a novel SEST network specifically tailored for HSI-SR tasks. The overall network architecture of the proposed SEST, along with its hierarchical structure, is presented. Additionally, a comprehensive explanation of the key component of SEST, the spectral-enhanced sparse transformer residual layer (SSRL), is provided.
A. Network Architecture
In HSI-SR tasks, the attainment of a larger receptive field is often crucial for achieving superior reconstruction results. However, conventional CNN architectures, constrained by the inherent limitations of convolutional operations, tend to exhibit deficiencies in modeling long-range dependencies effectively. Recognizing the potential of transformers in addressing this limitation, a SEST network is proposed, specifically designed to simultaneously explore nonlocal spatial similarities and spectral low-rank characteristics inherent in HSIs. The network structure of the proposed SEST method is depicted in Fig. 1.
The process begins with two images of the same observed scene: the LR-HSI, which has high spectral but low spatial resolution, and the HR-MSI, which has high spatial but low spectral resolution.
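To make the data flow concrete, the following is a minimal PyTorch sketch of the overall pipeline in Fig. 1. The bicubic upsampling of the LR-HSI, the concatenation-based fusion front end, the module names, and the identity stand-ins for the SSRL stack are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEST(nn.Module):
    """Sketch of the top-level SEST pipeline (assumptions noted above)."""
    def __init__(self, hsi_bands=31, msi_bands=3, channels=48, num_layers=3):
        super().__init__()
        # Shallow feature extraction from the fused inputs
        self.shallow = nn.Conv2d(hsi_bands + msi_bands, channels, 3, padding=1)
        # Stand-ins for the SSRL stack (sketched in Section III-B)
        self.body = nn.ModuleList([nn.Identity() for _ in range(num_layers)])
        # Reconstruction head mapping features back to HSI bands
        self.head = nn.Conv2d(channels, hsi_bands, 3, padding=1)

    def forward(self, lr_hsi, hr_msi):
        # Assumed front end: upsample the LR-HSI to the HR-MSI grid and fuse
        up = F.interpolate(lr_hsi, size=hr_msi.shape[-2:], mode='bicubic',
                           align_corners=False)
        x = feat = self.shallow(torch.cat([up, hr_msi], dim=1))
        for layer in self.body:  # spectral-enhanced transformer layers
            x = layer(x)
        # Long skip from shallow features; the upsampled LR-HSI supplies
        # the spectral base of the reconstruction
        return self.head(x + feat) + up

sr = SEST()(torch.rand(1, 31, 16, 16), torch.rand(1, 3, 64, 64))  # (1, 31, 64, 64)
```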
B. SE Sparse Transformer Residual Layer
Applying transformers to HSI-SR tasks faces three primary challenges. First, unlike RGB images, for which spatial processing often suffices, HSI-SR tasks require careful handling of the rich spectral information crucial for applications such as classification and object detection. Ensuring that the reconstructed spectrum remains undistorted while promoting effective interaction between spatial and spectral information is therefore quite challenging; this aspect is often overlooked by existing transformer networks, which predominantly focus on spatial attributes. Second, the foundational mechanism of the Vanilla Transformer computes global self-attention across all tokens, facilitating the modeling of long-distance dependencies, but it also leads to quadratic growth in complexity relative to the number of tokens, making it impractical for super-resolution reconstruction tasks involving high image resolutions. Finally, local contextual information is valuable for capturing image details and textures, providing more semantic information for a better understanding of objects and structures in the image. While such information is essential for image super-resolution tasks, previous work has demonstrated limitations in the transformer's ability to capture local dependencies.
To address these challenges, an SSRL is specifically designed, as depicted in Fig. 2, which leverages the advantages of sparse transformer to model long-distance dependencies with a lower computational cost, while at the same time, depth-wise convolutional operators and SE modules are integrated to capture useful local contextual information and spectral features, respectively.
The process can be described as follows:
\begin{align*}
{{F}_l} &= {{H}_{SS{{B}_l}}}\left( {{{F}_{l - 1}}} \right),\quad l = 1,2,\ldots,L \tag{1}\\
{{F}_{out}} &= {{H}_{SS{{B}_L}}}\left( {{{F}_L}} \right) + {{F}_s} \tag{2}
\end{align*}
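Under these definitions, the layer reduces to a few lines of PyTorch. In the sketch below the SSB blocks are supplied externally (a single SSB is sketched in the next subsection), and the final mapping of (2) is folded into the last block for brevity; identity stand-ins keep the snippet runnable.

```python
import torch
import torch.nn as nn

class SSRL(nn.Module):
    """Sketch of Eqs. (1)-(2): a cascade of SSBs plus a short skip connection."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, f_s):
        f = f_s
        for blk in self.blocks:  # Eq. (1): F_l = H_SSB_l(F_{l-1})
            f = blk(f)
        return f + f_s           # Eq. (2): output plus the short skip F_s

layer = SSRL([nn.Identity() for _ in range(6)])  # 6 blocks, as in Sec. IV-A3
out = layer(torch.rand(2, 64, 48))
```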
For a single spectral-enhanced sparse transformer block (SSB), as illustrated in Fig. 3, three core designs are adopted as follows: SE module, sparse multihead self-attention (SMSA), and LeFF.
Fig. 3. Modules of the proposed SSB, with (a) the SE module, (b) the SMSA module, and (c) LeFF.
The computation of a single SSB can be described as follows:
\begin{align*}
{{F}_N} &= LN\left( F \right) \tag{3}\\
{{F}_M} &= SMSA\left( {{{F}_N}} \right) + \alpha SE\left( {{{F}_N}} \right) + F \tag{4}\\
F^{\prime} &= LeFF\left( {LN\left( {{{F}_M}} \right)} \right) + {{F}_M} \tag{5}
\end{align*}
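Equations (3)–(5) translate directly into the block below. The SMSA, SE, and LeFF submodules are injected (each is sketched in the following subsections), and treating the weighting factor $\alpha$ as a learnable scalar is an assumption.

```python
import torch
import torch.nn as nn

class SSB(nn.Module):
    """Sketch of Eqs. (3)-(5) operating on a (B, N, C) token sequence."""
    def __init__(self, dim, smsa, se, leff):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.smsa, self.se, self.leff = smsa, se, leff
        self.alpha = nn.Parameter(torch.tensor(1.0))  # assumed learnable

    def forward(self, f):
        f_n = self.norm1(f)                                   # Eq. (3)
        f_m = self.smsa(f_n) + self.alpha * self.se(f_n) + f  # Eq. (4)
        return self.leff(self.norm2(f_m)) + f_m               # Eq. (5)

# Runnable with identity stand-ins for the three submodules:
blk = SSB(48, nn.Identity(), nn.Identity(), nn.Identity())
out = blk(torch.rand(2, 64, 48))
```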
1) Spectral Enhancement
The spectra of HSIs usually exhibit low-rank properties owing to the high correlation among different spectral bands, and these properties have proven useful as guidance in HSI tasks such as denoising, compressive sensing, and unmixing [50]; this module aims to leverage them for efficient spectral handling. Since a major bottleneck in the HSI super-resolution task lies in designing an appropriate regularization method to map the low-resolution HSI into the proper subspace, and inspired by the channel attention mechanism, an SE module is introduced into the self-attention calculation of the Vanilla Transformer to facilitate the automatic learning of appropriate representations in that subspace.
Specifically, as depicted in Fig. 3(a), the input feature is first partitioned into data cubes $F_p$; each cube is squeezed by spatial average pooling and then projected by a linear layer $W_c$ as follows:
\begin{align*}
{{F}_c} &= Avgpool\left( {{{F}_p}} \right) \tag{6}\\
{{F}_s} &= {{W}_c}{{F}_c} \tag{7}
\end{align*}
It is noteworthy that these operations all act on the internal information of each data cube, thereby primarily focusing on learning spectral statistical information between adjacent pixels. Given that $F_s$ contains rich spectral statistical information, a linear layer is utilized to scale the obtained low-rank vector to match the dimensions of the input spectral vector. These vectors are then fed into the sigmoid function and converted into weight coefficients $F_z$, which serve as guidance to recalibrate the input data cube $F_p$, thus enhancing spatial-spectral correlation and promoting the super-resolution process. This process can be described as follows:
\begin{equation*}
{{\hat{F}}_p} = {{F}_p} \cdot {{F}_z} = {{F}_p} \cdot Sigmoid\left( {{{W}_z}{{F}_s}} \right) \tag{8}
\end{equation*}
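A minimal PyTorch sketch of (6)–(8) could read as follows; the reduction ratio r of the low-rank projection is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class SpectralEnhance(nn.Module):
    """Sketch of Eqs. (6)-(8) on the (B, N, C) tokens of one data cube."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.w_c = nn.Linear(channels, channels // r)  # low-rank projection W_c
        self.w_z = nn.Linear(channels // r, channels)  # rescale to input dim, W_z

    def forward(self, f_p):
        f_c = f_p.mean(dim=1)               # Eq. (6): spatial average pooling
        f_s = self.w_c(f_c)                 # Eq. (7): low-rank spectral statistics
        f_z = torch.sigmoid(self.w_z(f_s))  # weight coefficients F_z
        return f_p * f_z.unsqueeze(1)       # Eq. (8): recalibrate the cube F_p

out = SpectralEnhance(48)(torch.rand(2, 64, 48))
```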
2) Sparse Multihead Self-Attention
The inherent quadratic computational complexity of the Vanilla Transformer poses substantial challenges for practical applications. To address this issue, innovations in the field of NLP have led to two improved self-attention structures: linear self-attention and sparse self-attention. Leveraging the established observation that the attention matrix is naturally sparse, sparse self-attention strategies can effectively reduce computational cost and enhance model efficiency by pruning or distilling the attention matrix. Inspired by this, as shown in Fig. 3(b), the input HSI data is first divided into individual pixel vectors, drawing upon the high spectral resolution of HSI. Then, exploiting the local spatial similarity of HSI, several nonoverlapping windows of the same size are employed to partition these pixel vectors, obtaining the following window partition within which self-attention is computed per head:
\begin{align*}
F &= \left\{ {F_p^1,F_p^2,\ldots,F_p^N} \right\},\quad N = HW/{{P}^2} \tag{9}\\
SA_k^i &= Attention\left( {F_p^iW_k^Q,F_p^iW_k^K,F_p^iW_k^V} \right),\quad i = 1,2,\ldots,N \tag{10}\\
{{\hat{F}}_k} &= \left\{ {SA_k^1,SA_k^2,\ldots,SA_k^N} \right\} \tag{11}
\end{align*}
Although this approach shares similarities with the nonoverlapping window-based multihead self-attention mechanism employed in ViT, the purposes are significantly different: here, the partition is primarily employed to encourage the model to learn a sparser attention matrix. In contrast to global self-attention, this strategy reduces the computational complexity for an input feature map of size $H \times W \times C$ from $\mathcal{O}\left((HW)^2C\right)$ to $\mathcal{O}\left(P^2HWC\right)$, which is linear in the spatial size.
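A compact sketch of the window-based attention of (9)–(11) follows. Reusing nn.MultiheadAttention for the per-window computation (with projection weights shared across windows) is an implementation assumption, and the spatial size is assumed divisible by the window size P.

```python
import torch
import torch.nn as nn

class SMSA(nn.Module):
    """Sketch of Eqs. (9)-(11): self-attention inside nonoverlapping windows."""
    def __init__(self, channels, window=8, heads=4):
        super().__init__()
        self.p = window
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        p = self.p
        # Eq. (9): partition into N = HW / P^2 windows of P^2 pixel vectors
        win = (x.reshape(b, c, h // p, p, w // p, p)
                .permute(0, 2, 4, 3, 5, 1)
                .reshape(-1, p * p, c))
        # Eq. (10): multihead self-attention within each window
        sa, _ = self.attn(win, win, win)
        # Eq. (11): merge the window outputs back into a feature map
        return (sa.reshape(b, h // p, w // p, p, p, c)
                  .permute(0, 5, 1, 3, 2, 4)
                  .reshape(b, c, h, w))

out = SMSA(48)(torch.rand(2, 48, 32, 32))
```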
3) Local Enhanced Feed-Forward Network
The Vanilla Transformer architecture comprises a multihead self-attention (MSA) module and a feed-forward network (FFN). While the MSA module calculates correlations between tokens and performs linear fusion to achieve global modeling, the FFN, consisting of a simple multilayer perceptron, performs nonlinear transformations on features to enhance their representation capability. However, the conventional FFN designed in most transformer models often neglects crucial neighboring spatial information for images [51]. To overcome this challenge, the designed LeFF modifies the traditional FFN by incorporating a depth-wise convolution block. Specifically, as shown in Fig. 3(c), a linear projection layer is first applied to each token to increase its feature dimension. The projected tokens are then spatially reshaped according to their original positions and a 3 × 3 depth-wise convolution is performed on each channel of the reshaped features to better capture local spatial contextual information. Finally, the features are restored to tokens, and the channel dimension is matched with the input through another linear projection, which serves as the final output.
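A minimal sketch of LeFF under this description is given below; the expansion ratio of 4 and the GELU activations are assumptions.

```python
import torch
import torch.nn as nn

class LeFF(nn.Module):
    """Sketch of the local enhanced feed-forward network on (B, N, C) tokens."""
    def __init__(self, dim, expand=4):
        super().__init__()
        hidden = dim * expand
        self.up = nn.Linear(dim, hidden)         # expand the feature dimension
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1,
                            groups=hidden)       # 3x3 depth-wise convolution
        self.down = nn.Linear(hidden, dim)       # restore the channel dimension
        self.act = nn.GELU()

    def forward(self, x, hw):                    # tokens plus their (H, W) layout
        h, w = hw
        b, n, _ = x.shape
        x = self.act(self.up(x))
        # Reshape tokens back to their original spatial positions
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.act(self.dw(x))
        # Flatten back to tokens and project to the input dimension
        x = x.reshape(b, -1, n).transpose(1, 2)
        return self.down(x)

out = LeFF(48)(torch.rand(2, 64, 48), (8, 8))
```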
C. Loss Function
In the reconstruction process, the most crucial aspect is restoring the high-frequency details, which encapsulate critical spatial information that is normally lost during lower-resolution image acquisition. To effectively restore these details, the mean absolute error (MAE) is employed in this article as the primary loss function. The choice of MAE is driven by its sensitivity to minor discrepancies and its favorable convergence behavior. By minimizing the MAE between reconstructed images and the ground truth, the network learns to accurately reconstruct the spatial information, thereby enhancing the overall fidelity of the reconstructed HR-HSI. The MAE is defined as follows:
\begin{equation*}
{{L}_{MAE}}\left( \theta \right) = \frac{1}{M}\sum\limits_{m = 1}^M {{{{\left\| {{{O}^m} - {{X}^m}} \right\|}}_1}} \tag{12}
\end{equation*}
To address the critical challenge of spectral distortion in HSI super-resolution, a spatial-spectral total variation (SSTV) loss, as initially proposed in [52], is introduced as another loss. The SSTV loss is particularly designed to minimize artifacts and ensure fidelity in both spatial and spectral domains, which is crucial for maintaining the essential characteristics of the original scene. The mathematical representation of SSTV loss is expressed as follows:
\begin{equation*}
{{L}_{SSTV}}\left( \theta \right) = \frac{1}{M}\sum\limits_{m = 1}^M {\left( {{{{\left\| {{{\nabla }_h}{{O}^m}} \right\|}}_1} + {{{\left\| {{{\nabla }_w}{{O}^m}} \right\|}}_1} + {{{\left\| {{{\nabla }_l}{{O}^m}} \right\|}}_1}} \right)} \tag{13}
\end{equation*}
The final loss function is a composite loss function, taking into account both MAE and SSTV, expressed as follows:
\begin{equation*}
L\left( \theta \right) = {{L}_{MAE}}\left( \theta \right) + \beta {{L}_{SSTV}}\left( \theta \right) \tag{14}
\end{equation*}
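The composite objective of (12)–(14) translates directly into PyTorch. The sketch below uses mean reductions, which absorb the 1/M normalization up to a constant factor, and a placeholder value for β.

```python
import torch

def sest_loss(o, x, beta=1e-3):
    """Sketch of Eqs. (12)-(14) for reconstructions o and ground truths x,
    both of shape (B, C, H, W); beta is a placeholder weight."""
    l_mae = (o - x).abs().mean()                               # Eq. (12)
    grad_h = (o[:, :, 1:, :] - o[:, :, :-1, :]).abs().mean()   # nabla_h O
    grad_w = (o[:, :, :, 1:] - o[:, :, :, :-1]).abs().mean()   # nabla_w O
    grad_l = (o[:, 1:, :, :] - o[:, :-1, :, :]).abs().mean()   # nabla_l O
    l_sstv = grad_h + grad_w + grad_l                          # Eq. (13)
    return l_mae + beta * l_sstv                               # Eq. (14)

loss = sest_loss(torch.rand(4, 31, 64, 64), torch.rand(4, 31, 64, 64))
```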
Experimental Results
This section presents the experimental results to demonstrate the effectiveness of the proposed method. Initially, the experimental configurations are introduced, encompassing descriptions of the utilized datasets, data simulation procedures, and implementation details. Following this, the reconstruction performance on three public datasets is illustrated and compared against the state-of-the-art algorithms, supplemented by a concise analysis. Finally, an ablation study is provided to validate the effectiveness of the proposed method.
A. Experimental Configurations
1) Datasets
Experiments are conducted on three publicly available hyperspectral datasets: the CAVE dataset [53], the Harvard dataset [54], and the Washington DC Mall (WDC) dataset [32]. The CAVE dataset was captured by a cooled charge-coupled device camera and consists of 32 different indoor scenes, with each HSI presenting a spatial resolution of 512 × 512 pixels and encompassing 31 spectral bands at 10 nm intervals in the range of 400–700 nm. The Harvard dataset, captured by a Nuance FX camera (CRi Inc.), contains 77 real indoor and outdoor scenes, with each HSI presenting a spatial resolution of 1024 × 1392 pixels and encompassing 31 spectral bands at 10 nm intervals in the range of 420–720 nm. The WDC dataset was captured by the HYDICE sensor and contains one HSI of a large urban area, which presents a spatial resolution of 1280 × 307 pixels and encompasses 191 spectral bands in the range of 400–2400 nm.
2) Data Simulation
In this article, considering that both the CAVE and Harvard datasets have 31 bands and similar coverage ranges, 20 images from the CAVE dataset are selected for the training set. The remaining 11 images, along with 9 randomly selected images from the Harvard dataset, are set aside as the test set to evaluate the network's generalization ability. For the WDC dataset, four 128 × 128 images are cropped from the original image for testing network performance, while the rest are used for training. Due to the limited number of training samples, the training images from the CAVE dataset are segmented into 4275 overlapping image patches of size 64 × 64 × 31. These overlapping patches are then downsampled to a spatial resolution of 16 × 16 × 31 using a Gaussian filter with a kernel size of 3 × 3 and a standard deviation of 0.5 to generate LR-HSI. Additionally, HR-MSI patches are generated using the spectral response function of the Nikon D700 camera. For the training images in the WDC dataset, 921 overlapping image patches of size 64 × 64 × 191 are obtained and downsampled to generate LR-HSI using the same method. The spectral response function from blue to SWIR2 of the Landsat 8 is selected to generate HR-MSI.
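For reproducibility, the simulation protocol can be sketched as follows. The function signature is hypothetical, and the spectral response function (SRF) matrix must be supplied externally, e.g., sampled from the Nikon D700 or Landsat 8 response curves.

```python
import torch
import torch.nn.functional as F

def simulate_pair(hr_hsi, srf, scale=4, ksize=3, sigma=0.5):
    """Blur with a 3x3 Gaussian (std 0.5), decimate by `scale` to get the
    LR-HSI, and project spectra through `srf` (msi_bands x hsi_bands) to
    get the HR-MSI. hr_hsi: (B, C, H, W)."""
    coords = torch.arange(ksize, dtype=torch.float32) - ksize // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel2d = torch.outer(g, g)
    kernel2d = kernel2d / kernel2d.sum()
    c = hr_hsi.shape[1]
    kernel = kernel2d[None, None].repeat(c, 1, 1, 1)  # one kernel per band
    blurred = F.conv2d(hr_hsi, kernel, padding=ksize // 2, groups=c)
    lr_hsi = blurred[:, :, ::scale, ::scale]          # spatial decimation
    hr_msi = torch.einsum('bchw,mc->bmhw', hr_hsi, srf)
    return lr_hsi, hr_msi

lr, msi = simulate_pair(torch.rand(1, 31, 64, 64), torch.rand(3, 31))
```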
3) Implementation Details
The proposed SEST model is implemented using PyTorch 1.13.1 and Python 3.7.16 on the Windows operating system with an NVIDIA GeForce RTX 4080 GPU. Regarding the hyperparameters in the network, the channel feature mapping C is set to 48 in the shallow feature extraction process. The number of transformer blocks in the SEST residual block is set to 6, and the window sizes in the multiwindow residual block are set to 8, 16, and 32, respectively. To train the proposed network, the Adam optimizer is employed with
B. Performance Evaluation
To verify the performance of the proposed method, a comparative analysis is conducted against six state-of-the-art HSI-SR methods, including three traditional methods, namely CNMF [32], FUSE [30], and HySure [31], along with three DL-based methods, namely SSRNet [55], Fusformer [46], and PSRT [48]. To ensure a fair and consistent comparison, all DL models are trained with the same input, and the hyperparameter settings are aligned with the specifications outlined in their respective original papers. Moreover, to provide a more intuitive and quantitative comparison of the performance of the aforementioned methods, four widely used HSI-SR quality indices (QIs) are employed: peak signal-to-noise ratio (PSNR), spectral angle mapping (SAM), structural similarity (SSIM), and erreur relative globale adimensionnelle de synthèse (ERGAS). Among these QIs, superior reconstruction performance is indicated by higher PSNR and SSIM values, alongside lower SAM and ERGAS values.
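For reference, SAM and ERGAS can be computed as sketched below; these are common formulations of the two indices and may differ in minor details from the implementation used in the experiments.

```python
import torch

def sam_degrees(o, x, eps=1e-8):
    """Mean spectral angle in degrees between (C, H, W) images."""
    o_f, x_f = o.reshape(o.shape[0], -1), x.reshape(x.shape[0], -1)
    cos = (o_f * x_f).sum(0) / (o_f.norm(dim=0) * x_f.norm(dim=0) + eps)
    return torch.rad2deg(torch.acos(cos.clamp(-1, 1))).mean()

def ergas(o, x, scale=4, eps=1e-8):
    """ERGAS for (C, H, W) images; `scale` is the LR-to-HR resolution ratio."""
    rmse = ((o - x) ** 2).mean(dim=(1, 2)).sqrt()   # per-band RMSE
    mean = x.mean(dim=(1, 2))                       # per-band reference mean
    return (100.0 / scale) * ((rmse / (mean + eps)) ** 2).mean().sqrt()

o, x = torch.rand(31, 64, 64), torch.rand(31, 64, 64)
print(sam_degrees(o, x).item(), ergas(o, x).item())
```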
1) Experimental Results on the CAVE Dataset
Table I presents a comprehensive quantitative evaluation of the reconstruction performance achieved by the proposed method along with a comparison to state-of-the-art methods on the CAVE dataset, with the best results highlighted in bold and the second-best results underlined. Notably, the proposed SEST method consistently outperforms comparable methods across all four QIs on the CAVE dataset, establishing its consistently superior efficacy.
To substantiate the quantitative findings visually, two representative images, specifically balloon and feather, are selected from the CAVE dataset for in-depth visualization. Fig. 4 shows the reconstructed images and their corresponding residual images produced by different methods, with highlighted regions (within the red boxes) for detailed comparison. The visual inspection reveals the superior performance of the SEST method, particularly evident in the residual images (generated from a randomly selected band), underscoring the ability of SEST to achieve reconstructions that closely align with the ground truth (GT) and thereby demonstrating better recovery of spatial details.
Fig. 4. First column: the GTs and the corresponding LR-HSI images (in pseudo-colors) for the balloon (first and second rows) and the feather (third and fourth rows) test cases from the CAVE dataset. The second to eighth columns: the visual results and the residuals (generated by a randomly selected band) between the GT and the fused products for all the compared approaches. A zoomed area has been added to aid the visual inspection.
To better compare the spectral fidelity of different methods, two pixels from balloon and feather are randomly selected to compare spectral differences against the GT, as shown in Fig. 5. The spectral vectors generated by the proposed SEST closely resemble the GT, indicating a significant reduction in spectral distortion and an enhanced preservation of spectral fidelity of the proposed SEST method.
Fig. 5. Spectral vector analysis of the GT and results of the compared approaches for (a) the balloon located at (60, 270) and (b) the feather located at (50, 420).
2) Experimental Results on Harvard Dataset
To evaluate the generalization capability of the proposed SEST method, the network is trained exclusively on the CAVE dataset and then tested on the Harvard dataset. The quantitative results, presented in Table I, show that the proposed SEST outperforms all compared methods in terms of PSNR, SAM, and SSIM, while achieving a less favorable ERGAS score due to the more complex noise types present in real-world scenarios. For visual assessment, similar experiments are conducted, with the results shown in Figs. 6 and 7.
Fig. 6. First column: the GTs and the corresponding LR-HSI images (in pseudo-colors) for the imga3 (first and second rows) and the imgc9 (third and fourth rows) test cases from the Harvard dataset. The second to eighth columns: the visual results and the residuals (generated by a randomly selected band) between the GT and the fused products for all the compared approaches. A zoomed area has been added to aid the visual inspection.
Fig. 7. Spectral vector analysis of the GT and results of the compared approaches for (a) imga3 located at (200, 400) and (b) imgc9 located at (200, 250).
As shown in Fig. 6, two test images from the Harvard dataset are selected for analysis, displaying pseudo-color images and residual images of the reconstructed results. Similarly, Fig. 7 illustrates the spectral vector diagrams of the reconstructed results by different methods, facilitating a comparison of spectral fidelity. The SEST consistently delivers superior results in both spatial and spectral domains, aligning with the quantitative analysis.
3) Experimental Results on the WDC Dataset
To evaluate the robustness of the proposed SEST method on real-world data, the majority of the WDC dataset is used for training the network, while the remaining portion is used for testing its performance. Quantitative results in Table I demonstrate that the proposed SEST method outperforms all comparison methods in terms of PSNR and SAM, and ranks second in SSIM. The ERGAS results follow a trend similar to that observed on the Harvard dataset. For visual assessment, similar experiments are conducted, with the results shown in Figs. 8 and 9.
Fig. 8. First column: the GTs and the corresponding LR-HSI images (in pseudo-colors) for the img1 (first and second rows) and the img2 (third and fourth rows) test cases from the WDC dataset. The second to eighth columns: the visual results and the residuals (generated by a randomly selected band) between the GT and the fused products for all the compared approaches. A zoomed area has been added to aid the visual inspection.
Fig. 9. Spectral vector analysis of the GT and results of the compared approaches for (a) img1 located at (40, 70) and (b) img2 located at (45, 80).
As shown in Fig. 8, two test images from the WDC dataset are selected for analysis, displaying pseudo-color images and residual images of the reconstructed results. Similarly, Fig. 9 illustrates the spectral vector diagrams of the reconstructed results by different methods, facilitating a comparison of spectral fidelity. The SEST consistently delivers superior results in both spatial and spectral domains, aligning with the quantitative analysis.
4) Discussion and Analysis
The experimental results on the CAVE, Harvard, and WDC datasets demonstrate that the proposed method exhibits superior reconstruction performance across the quantitative metrics of PSNR, SAM, and SSIM. However, despite these positive outcomes indicating the method's overall effectiveness in reconstructing spectral information, it performs suboptimally in the ERGAS metric. Further analysis reveals that although the reconstructed spectral curves closely match the ground truth in general, the presence of extreme outliers in certain bands results in significant deviations from the ground truth, which substantially increases the ERGAS value.
A deeper investigation suggests that these outliers may be attributed to two inherent limitations of the proposed method: 1) To reduce the computational cost of the self-attention mechanism and enhance the network's ability to handle high-dimensional data, the proposed method restricts sparse attention to fixed windows. While this design effectively reduces computational overhead, it also compromises the transformer model's ability to capture long-range dependencies, thereby affecting the precise reconstruction of spectral information and leading to large errors in specific bands or pixels. 2) Although the multiwindow residual block design reduces reliance on manual parameter tuning, its inherent structural limitations may lead to inconsistent performance when dealing with hyperspectral data of varying characteristics. In certain cases, these limitations can result in increased errors in specific bands, thus causing larger deviations.
C. Ablation Study
This section evaluates the contributions of individual components within the proposed SEST method through a series of ablation experiments, mainly focusing on four critical aspects: the integration strategy of key modules, the sizes of the residual windows, the location of spectral enhancement, and the influence of loss functions. To be concise without loss of generality, the ablation experiments are conducted exclusively on the CAVE dataset, allowing for a focused analysis of how each factor of the proposed SEST model influences reconstruction performance.
1) Integration Strategy of Key Modules
As described above, the SEST framework incorporates three pivotal modules: the multiwindow residual block, spectral enhancement, and LeFF. While the multiwindow residual block is comprehensively analyzed in terms of window sizes separately (discussed in the subsequent subsection), this subsection assesses the significance of the remaining SE and LeFF modules. The quantitative reconstruction metrics in Table II indicate that integrating either the LeFF or the SE module individually improves reconstruction quality over a baseline model devoid of both, confirming the necessity of each module. Notably, when both modules are combined, as shown in the final column of Table II, there is a marked improvement in performance, surpassing the individual contributions of each module. This synergistic effect underscores the complementary nature of the two modules, enhancing the overall efficacy of the HSI-SR process.
2) Size of Residual Windows
Experimental results concerning various window sizes are detailed in Table III, illustrating the effect of different window sizes on reconstruction quality, where rows 2–9 delineate the impact of the single-window method with window sizes varying from 4 to 32. Interestingly, the reconstruction quality does not improve monotonically with increasing window size. This observation underscores the significance of selecting an appropriate window size and the critical importance of the multiscale window design.
Moreover, Table III also provides a comparative analysis of training time, reporting the average time to train one epoch for each method. Notably, window-based methods require significantly less training time than Fusformer. While the multiscale window method requires more training time than a single window, each window operates independently within the network, allowing for parallel training on multiple GPUs and thereby facilitating a reduction in overall training time.
3) Location of Spectral Enhancement
In an effort to ascertain the optimal configuration of the spectral enhancement module within the SEST architecture, two variants of the SSB are tested, denoted as SSBv1 and SSBv2. SSBv1 places the SE module externally to the transformer block, whereas SSBv2 integrates the SE module directly within the self-attention calculation to fine-tune the self-attention map, as depicted in Fig. 10.
The comparative results, presented in Table IV, demonstrate that the SSBv2 configuration proves to be more conducive to image reconstruction in terms of reconstruction quality. This insightful analysis sheds light on the impact of the location of the spectral enhancement module within the sparse transformer block, emphasizing its significance in improving HSI super-resolution reconstruction efficacy.
4) Influence of Loss Functions
To balance the recovery of high-frequency spatial details and spectral fidelity, this article introduces the use of MAE and SSTV as loss functions, namely $L_{MAE}$ and $L_{SSTV}$. $L_{MAE}$ aims to reduce pixel-level errors between the reconstructed and true images, thereby enhancing overall image quality and detail. $L_{SSTV}$, on the other hand, constrains spatial and spectral variations to reduce noise and maintain spectral consistency and fidelity.
To determine the impact of these loss functions on super-resolution reconstruction results, this article tests three different combinations: "w $L_{MAE}$ and w/o $L_{SSTV}$," "w/o $L_{MAE}$ and w $L_{SSTV}$," and "w $L_{MAE}$ and w $L_{SSTV}$." Table V presents the experimental results for these combinations. The comparison shows that using either $L_{MAE}$ or $L_{SSTV}$ alone results in comparable reconstruction quality. However, using both together significantly enhances the quality of the reconstructed images. This finding indicates that a reasonable combination of different loss functions can fully exploit their respective advantages, leading to higher-quality image reconstruction.
Conclusion
This article presents a novel SEST network tailored for HSI-SR tasks. In the spatial domain, the model incorporates sparse self-attention and a local enhanced feed-forward network to capture global features while preserving local details. To address spectral distortion in HSI-SR tasks, an SE module is specially designed within the self-attention calculation process, forming a powerful hybrid attention mechanism. Furthermore, to exploit the multiscale information inherent in the image, a well-crafted multiwindow residual block is devised, contributing significantly to the overall improvement in reconstruction quality. Comprehensive experiments conducted on the CAVE, Harvard, and WDC datasets convincingly validate the superior performance of the proposed approach, underscoring its significance in the field of HSI-SR. In light of the limitations identified in the experimental analysis, future work could explore the integration of dynamic convolutions to enable more flexible window size adjustments and diversified attention patterns based on the characteristics of the input data, thereby mitigating the occurrence of outliers.