Introduction
Hyperspectral imaging systems can concurrently capture surface information across hundreds of contiguous bands, yielding a set of spectral images depicting the same scene [1], [2]. The primary advantage of hyperspectral images (HSIs) over traditional natural or multispectral images lies in their richer spectral information, which enables accurate distinction and identification of objects within the image scene, making them widely applicable in fields such as image classification [3], [4], target detection [5], [6], and change detection [7]. However, due to inherent limitations in incident energy and imaging hardware, HSIs often suffer from low spatial resolution (such images are termed LR-HSIs), which significantly restricts their practical applications [8]. In contrast to hardware upgrades, hyperspectral image super-resolution reconstruction (HSI-SR) is a postprocessing technique that obtains high-spatial-resolution hyperspectral images (HR-HSIs) algorithmically. Owing to its low cost and high efficiency, HSI-SR has become a necessary and promising research direction.
Generally, HSI-SR techniques can be divided into two categories based on whether auxiliary information is required: single HSI super-resolution and fusion-based HSI super-resolution. The former is self-contained and easy to implement, but since HSI-SR is an ill-posed problem, relying solely on a single LR-HSI can leave the reconstructed images lacking detail. The latter introduces high-resolution multispectral images (HR-MSIs) of the same scene as auxiliary data, leveraging the advantages of different data sources to achieve better reconstruction results [9].
Over the past few decades, the HSI-SR domain has rapidly developed, with many fusion-based methods emerging. These methods can be broadly categorized into model-based and deep learning (DL)-based approaches. Model-based methods typically involve manually constructing various priors (e.g., self-similarity [10], sparsity [11], [12], and low rank [13], [14]) as regularizers to achieve reconstruction. While these methods show commendable performance, they suffer from issues such as being time-consuming and having limited representational capacity due to their reliance on manually crafted priors. With the rapid development of DL techniques, especially convolutional neural networks (CNNs), DL-based methods have demonstrated impressive performance [15], [16], [17], [18], [19]. These methods are inherently data-driven, allowing networks to autonomously learn priors from the characteristics of the dataset itself, thus offering greater flexibility [20]. However, the fixed receptive field of CNNs due to convolution kernel size makes them inefficient in modeling long-range dependencies, which somewhat limits their fusion performance [21].
To address these issues, the self-attention mechanism from the natural language processing (NLP) field has garnered increasing attention for its outstanding global feature extraction capabilities. In particular, the vision transformer (ViT) [22] introduced self-attention into the computer vision (CV) field for the first time by partitioning images into patches and adding positional embeddings. Its excellent global context modeling capability effectively addresses edge effects in HSI-SR tasks. However, the global self-attention mechanism of ViT exhibits quadratic computational complexity with respect to the input image size, leading to increased GPU memory demands. To mitigate these limitations, the Swin Transformer [23] uses a hierarchical structure and a shifted window mechanism to reduce the length of the input sequence, making the self-attention mechanism a more versatile backbone. Nonetheless, it still faces challenges such as fixed window sizes and high computational and memory demands when processing high-resolution images. The Performer [24] approximates the softmax function using orthogonal random feature mapping, thereby altering the order of matrix computations in self-attention to achieve linear complexity. The sparse transformer [25] restricts each element to interact only with a subset of elements in the sequence, thus sparsifying the attention score matrix.
Motivated by these developments, especially the sparse transformer and the channel attention mechanism [26], this article constructs the spectral-enhanced sparse transformer (SEST), a novel network architecture for HSI super-resolution reconstruction. Specifically, in the spatial domain, sparse self-attention is used to model long-range dependencies, complemented by a local enhanced feed-forward network (LeFF) that retains complex local details, thereby achieving more efficient local-global feature learning. In the spectral domain, a spectral enhancement module is designed to explore the correlations between adjacent bands of HSIs and integrate them into the transformer structure, effectively preserving the original spectral information while promoting the interaction between spatial and spectral information, thus enhancing the network's ability to learn critical features. Building on these two designs, a multiwindow residual block is employed to learn multiscale features from the input image, and long and short skip connections are added to the network, improving the flexibility of information flow and the robustness of the network. The primary contributions of this article are as follows.
A novel SEST-based super-resolution reconstruction network is proposed, effectively leveraging nonlocal spatial similarity and the spectral low-rankness inherent in HSIs.
A multiwindow residual block is specifically designed to extract features at different levels of granularity. The incorporation of a weighted linear combination facilitates the fusion of these features, contributing to an enhancement in the quality of the reconstructed image.
A spectral enhancement module is implemented in the self-attention calculation stage to boost the network's capability of spectral information extraction, facilitating the recalibration of the self-attention map so that more pixels are activated. In addition, LeFF is incorporated in place of the standard feed-forward network (FFN), further enhancing the network's capacity to exploit local contextual details.
The rest of this article is organized as follows. Section II reviews related work in the domain of HSI-SR, Section III clarifies the proposed SEST architecture along with the loss function, Section IV presents the experimental validation of the proposed method, and Section V concludes this article.
Related Works
This section reviews some of the most notable recent advancements in HSI-SR techniques, categorizing them into three types: model-based methods, CNN-based methods, and ViT-based methods. Additionally, the limitations of existing ViT-based methods are analyzed in detail.
A. Model-Based Methods
Model-based methods, a classical approach to HSI-SR, can be divided into three primary categories. The first category includes techniques based on panchromatic sharpening, such as component substitution (CS) and multiresolution analysis. A widely employed method within this category is the adaptive Gram–Schmidt algorithm proposed by Aiazzi et al. [27], which incorporates the spectral response function into CS. Another notable method, proposed by Selva et al. [28], involves a super-resolution framework utilizing linear regression to represent each hyperspectral band as a linear combination of multispectral bands. While computationally efficient, these methods often yield reconstructed images of unreliable quality. The second category consists of methods relying on prior information analysis of HSI, including sparse representation-based and Bayesian-based methods. Specifically, Akhtar et al. [29] proposed a Bayesian framework based on sparse representation, deriving the probability distribution of the spectral bases and computing the sparse coding of the high-resolution image. Wei et al. [30] integrated the explicit solution of the Sylvester equation into the Bayesian-based method, named "fast fusion based on Sylvester equation" (FUSE), significantly reducing algorithm complexity while ensuring the quality of the reconstructed image. Simões et al. [31] introduced total variation regularization for effective edge preservation in a convex optimization of subspace coefficients. The third category involves decomposition-based methods, which have attracted considerable attention for being explainable and understandable; the most representative among them are matrix factorization-based methods. For instance, Yokoya et al. [32] proposed a coupled nonnegative matrix factorization (CNMF) method, employing CNMF to alternately estimate endmembers and abundances. However, such methods cannot effectively preserve the spectral structure of HSIs. To address this, tensor decomposition-based methods have been explored, such as the nonlocal sparse tensor decomposition-based method proposed by Dian et al. [33], which estimates a sparse core tensor and dictionaries for HSI-SR, showcasing potential in preserving spectral information for high-quality reconstruction.
B. CNN-Based Methods
The burgeoning interest in DL has led to the rapid development of CNN-based methods for HSI-SR. For example, Li et al. [34] proposed an X-shaped interactive autoencoder network, integrating the concept of matrix factorization into DL to facilitate cross-modal learning between hyperspectral and multispectral data. To inject more texture details into HSI, the IFMSR [35] integrates an RGB-induced detail enhancement module and a deep cross-modal feature modulation module. While demonstrating efficacy in various scenarios, these models are primarily data-driven and lack interpretability. In response, Xie et al. [16] proposed a model-based DL method using a deep unfolding network inspired by the traditional alternating iterative algorithm, employing CNNs to learn the proximal operator and model parameters. The aforementioned methods are all based on 2D convolution and thus fail to effectively model the 3D spectral structure of HSIs. Mei et al. [36] introduced 3D fully convolutional networks into HSI-SR tasks, allowing the network to better learn both spatial and spectral information. However, due to the high spectral resolution of HSI, methods based on 3D CNNs suffer from large parameter sizes and high computational complexity. To mitigate these issues, Li et al. [37] exploited the spectral similarity between bands to group them, thereby reducing the computational cost of the network. Li et al. [38], on the other hand, addressed computational efficiency from a network structure perspective by designing separable 3D convolutions, mitigating the computational burden while preserving spatial and spectral separability. Furthermore, the ill-posed nature of the HSI-SR task poses significant challenges for single-stage learning. To address this, Li et al. [39] designed a coarse-to-fine dual-stage learning framework: in the coarse stage, a symmetric feature propagation model is utilized for broader feature extraction; in the fine stage, a back-projection refinement network is introduced to learn image-specific features.
C. Visual Transformer-Based Methods
The transformer network, first proposed by Vaswani et al. [40], quickly gained widespread attention due to its core component, the self-attention mechanism, which offers powerful global context modeling capabilities. Dosovitskiy et al. [22] were the first to introduce the transformer into the CV field, dividing the input image into nonoverlapping patches to generate sequence elements, which are then fed into the transformer model for image recognition. Following the success of this work, transformers have been widely applied in various advanced visual tasks, such as object detection [41], [42], image classification [43], and image segmentation [44]. Among these, the Swin Transformer proposed by Liu et al. [23] is particularly noteworthy. It restricts attention computation to local windows and enhances the network's ability to capture contextual information through shift operations and a hierarchical architecture, significantly reducing the computational cost of self-attention. Inspired by this, numerous transformer-based image reconstruction methods have emerged. For instance, SwinIR [45] employs Swin Transformer-based residual blocks to extract deep features from images, showcasing the immense potential of transformers in image reconstruction. In the HSI-SR domain, Fusformer [46] made the first attempt to use ViT encoders as the main body of the network to explore the spatial and spectral information of HSIs, achieving promising results. However, utilizing a single module to simultaneously model spatial and spectral features increases the complexity of network learning. To address this issue, Long et al. [47] employed the Swin Transformer to design spatial and spectral self-attention blocks, cascading them to obtain global spatial features and spectral sequence information, contributing to a more efficient and effective HSI super-resolution reconstruction process. Although the shifted-window design can reduce computational costs, it also weakens the interaction of image boundary information. To address this, Deng et al. [48] proposed a pyramid shuffle-and-reshuffle transformer (PSRT) method, which employs shuffle techniques to achieve long-range interaction between patches. Despite the remarkable results achieved by these methods, their approaches to acquiring global information remain inefficient, particularly when dealing with highly redundant hyperspectral data. Additionally, existing methods often overlook the interaction between spatial and spectral information, and relying solely on single-type feature modeling is not conducive to the fine reconstruction of images.
Proposed Method
Drawing inspiration from the sparse transformer and channel attention mechanisms, this section proposes a novel SEST network specifically tailored for HSI-SR tasks. The overall network architecture of the proposed SEST, along with its hierarchical structure, is presented. Additionally, a comprehensive explanation of the key component of SEST, the spectral-enhanced sparse transformer residual layer (SSRL), is provided.
A. Network Architecture
In HSI-SR tasks, the attainment of a larger receptive field is often crucial for achieving superior reconstruction results. However, conventional CNN architectures, constrained by the inherent limitations of convolutional operations, tend to exhibit deficiencies in modeling long-range dependencies effectively. Recognizing the potential of transformers in addressing this limitation, a SEST network is proposed, specifically designed to simultaneously explore nonlocal spatial similarities and spectral low-rank characteristics inherent in HSIs. The network structure of the proposed SEST method is depicted in Fig. 1.
The process begins with two images of the same observed scene: the LR-HSI, which has high spectral but low spatial resolution, and the HR-MSI, which has high spatial but low spectral resolution.
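To make the data flow concrete, the following is a minimal PyTorch sketch of the overall pipeline in Fig. 1. The bicubic upsampling of the LR-HSI, the concatenation-based fusion front end, the module names, and the identity stand-ins for the SSRL stack are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEST(nn.Module):
    """Sketch of the top-level SEST pipeline (assumptions noted above)."""
    def __init__(self, hsi_bands=31, msi_bands=3, channels=48, num_layers=3):
        super().__init__()
        # Shallow feature extraction from the fused inputs
        self.shallow = nn.Conv2d(hsi_bands + msi_bands, channels, 3, padding=1)
        # Stand-ins for the SSRL stack (sketched in Section III-B)
        self.body = nn.ModuleList([nn.Identity() for _ in range(num_layers)])
        # Reconstruction head mapping features back to HSI bands
        self.head = nn.Conv2d(channels, hsi_bands, 3, padding=1)

    def forward(self, lr_hsi, hr_msi):
        # Assumed front end: upsample the LR-HSI to the HR-MSI grid and fuse
        up = F.interpolate(lr_hsi, size=hr_msi.shape[-2:], mode='bicubic',
                           align_corners=False)
        x = feat = self.shallow(torch.cat([up, hr_msi], dim=1))
        for layer in self.body:  # spectral-enhanced transformer layers
            x = layer(x)
        # Long skip from shallow features; the upsampled LR-HSI supplies
        # the spectral base of the reconstruction
        return self.head(x + feat) + up

sr = SEST()(torch.rand(1, 31, 16, 16), torch.rand(1, 3, 64, 64))  # (1, 31, 64, 64)
```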
B. SE Sparse Transformer Residual Layer
Applying transformers to HSI-SR tasks faces three primary challenges. First, unlike RGB images, for which spatial processing often suffices, HSI-SR tasks require careful handling of the rich spectral information crucial for applications such as classification and object detection. Ensuring that the reconstructed spectrum remains undistorted while promoting effective interaction between spatial and spectral information is therefore quite challenging; this aspect is often overlooked by existing transformer networks, which predominantly focus on spatial attributes. Second, the foundational mechanism of the Vanilla Transformer computes global self-attention across all tokens, facilitating the modeling of long-distance dependencies, but it also leads to quadratic growth in complexity relative to the number of tokens, making it impractical for super-resolution reconstruction tasks involving high image resolutions. Finally, local contextual information is valuable for capturing image details and textures, providing more semantic information for a better understanding of objects and structures in the image. While such information is essential for image super-resolution tasks, previous work has demonstrated limitations in the transformer's ability to capture local dependencies.
To address these challenges, an SSRL is specifically designed, as depicted in Fig. 2, which leverages the advantages of sparse transformer to model long-distance dependencies with a lower computational cost, while at the same time, depth-wise convolutional operators and SE modules are integrated to capture useful local contextual information and spectral features, respectively.
The process can be described as follows:
\begin{align*}
{{F}_l} &= {{H}_{SS{{B}_l}}}\left( {{{F}_{l - 1}}} \right),\quad l = 1,2,\ldots,L \tag{1}\\
{{F}_{out}} &= {{H}_{SS{{B}_L}}}\left( {{{F}_L}} \right) + {{F}_s} \tag{2}
\end{align*}
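Under these definitions, the layer reduces to a few lines of PyTorch. In the sketch below the SSB blocks are supplied externally (a single SSB is sketched in the next subsection), and the final mapping of (2) is folded into the last block for brevity; identity stand-ins keep the snippet runnable.

```python
import torch
import torch.nn as nn

class SSRL(nn.Module):
    """Sketch of Eqs. (1)-(2): a cascade of SSBs plus a short skip connection."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, f_s):
        f = f_s
        for blk in self.blocks:  # Eq. (1): F_l = H_SSB_l(F_{l-1})
            f = blk(f)
        return f + f_s           # Eq. (2): output plus the short skip F_s

layer = SSRL([nn.Identity() for _ in range(6)])  # 6 blocks, as in Sec. IV-A3
out = layer(torch.rand(2, 64, 48))
```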
For a single spectral-enhanced sparse transformer block (SSB), as illustrated in Fig. 3, three core designs are adopted as follows: SE module, sparse multihead self-attention (SMSA), and LeFF.
Fig. 3. Modules of the proposed SSB, with (a) the SE module, (b) the SMSA module, and (c) LeFF.
The computation of a single SSB can be described as follows:
\begin{align*}
{{F}_N} &= LN\left( F \right) \tag{3}\\
{{F}_M} &= SMSA\left( {{{F}_N}} \right) + \alpha SE\left( {{{F}_N}} \right) + F \tag{4}\\
F^{\prime} &= LeFF\left( {LN\left( {{{F}_M}} \right)} \right) + {{F}_M} \tag{5}
\end{align*}
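Equations (3)–(5) translate directly into the block below. The SMSA, SE, and LeFF submodules are injected (each is sketched in the following subsections), and treating the weighting factor $\alpha$ as a learnable scalar is an assumption.

```python
import torch
import torch.nn as nn

class SSB(nn.Module):
    """Sketch of Eqs. (3)-(5) operating on a (B, N, C) token sequence."""
    def __init__(self, dim, smsa, se, leff):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.smsa, self.se, self.leff = smsa, se, leff
        self.alpha = nn.Parameter(torch.tensor(1.0))  # assumed learnable

    def forward(self, f):
        f_n = self.norm1(f)                                   # Eq. (3)
        f_m = self.smsa(f_n) + self.alpha * self.se(f_n) + f  # Eq. (4)
        return self.leff(self.norm2(f_m)) + f_m               # Eq. (5)

# Runnable with identity stand-ins for the three submodules:
blk = SSB(48, nn.Identity(), nn.Identity(), nn.Identity())
out = blk(torch.rand(2, 64, 48))
```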
1) Spectral Enhancement
The spectra of HSIs usually exhibit low-rank properties owing to the high correlation among different spectral bands, and these properties have proven useful as guidance in HSI tasks such as denoising, compressive sensing, and unmixing [50]; this module aims to leverage them for efficient spectral handling. Since a major bottleneck in the HSI super-resolution task lies in designing an appropriate regularization method to map the low-resolution HSI into the proper subspace, and inspired by the channel attention mechanism, an SE module is introduced into the self-attention calculation of the Vanilla Transformer to facilitate the automatic learning of appropriate representations in that subspace.
Specifically, as depicted in Fig. 3(a), the input feature is first partitioned into data cubes $F_p$; each cube is squeezed by spatial average pooling and then projected by a linear layer $W_c$ as follows:
\begin{align*}
{{F}_c} &= Avgpool\left( {{{F}_p}} \right) \tag{6}\\
{{F}_s} &= {{W}_c}{{F}_c} \tag{7}
\end{align*}
It is noteworthy that these operations all act on the internal information of each data cube, thereby primarily focusing on learning spectral statistical information between adjacent pixels. Given that $F_s$ contains rich spectral statistical information, a linear layer is utilized to scale the obtained low-rank vector to match the dimensions of the input spectral vector. These vectors are then fed into the sigmoid function and converted into weight coefficients $F_z$, which serve as guidance to recalibrate the input data cube $F_p$, thus enhancing spatial-spectral correlation and promoting the super-resolution process. This process can be described as follows:
\begin{equation*}
{{\hat{F}}_p} = {{F}_p} \cdot {{F}_z} = {{F}_p} \cdot Sigmoid\left( {{{W}_z}{{F}_s}} \right) \tag{8}
\end{equation*}
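A minimal PyTorch sketch of (6)–(8) could read as follows; the reduction ratio r of the low-rank projection is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class SpectralEnhance(nn.Module):
    """Sketch of Eqs. (6)-(8) on the (B, N, C) tokens of one data cube."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.w_c = nn.Linear(channels, channels // r)  # low-rank projection W_c
        self.w_z = nn.Linear(channels // r, channels)  # rescale to input dim, W_z

    def forward(self, f_p):
        f_c = f_p.mean(dim=1)               # Eq. (6): spatial average pooling
        f_s = self.w_c(f_c)                 # Eq. (7): low-rank spectral statistics
        f_z = torch.sigmoid(self.w_z(f_s))  # weight coefficients F_z
        return f_p * f_z.unsqueeze(1)       # Eq. (8): recalibrate the cube F_p

out = SpectralEnhance(48)(torch.rand(2, 64, 48))
```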
2) Sparse Multihead Self-Attention
The inherent quadratic computational complexity of the Vanilla Transformer poses substantial challenges for practical applications. To address this issue, innovations in the field of NLP have led to two improved self-attention structures: linear self-attention and sparse self-attention. Leveraging the established observation that the attention matrix is naturally sparse, sparse self-attention strategies can effectively reduce computational cost and enhance model efficiency by pruning or distilling the attention matrix. Inspired by this, as shown in Fig. 3(b), the input HSI data is first divided into individual pixel vectors, drawing upon the high spectral resolution of HSI. Then, exploiting the local spatial similarity of HSI, several nonoverlapping windows of the same size are employed to partition these pixel vectors, obtaining the following window partition within which self-attention is computed per head:
\begin{align*}
F &= \left\{ {F_p^1,F_p^2,\ldots,F_p^N} \right\},\quad N = HW/{{P}^2} \tag{9}\\
SA_k^i &= Attention\left( {F_p^iW_k^Q,F_p^iW_k^K,F_p^iW_k^V} \right),\quad i = 1,2,\ldots,N \tag{10}\\
{{\hat{F}}_k} &= \left\{ {SA_k^1,SA_k^2,\ldots,SA_k^N} \right\} \tag{11}
\end{align*}
Although this approach shares similarities with the nonoverlapping window-based multihead self-attention mechanism employed in ViT, the purposes are significantly different: here, the partition is primarily employed to encourage the model to learn a sparser attention matrix. In contrast to global self-attention, this strategy reduces the computational complexity for an input feature map of size $H \times W \times C$ from $\mathcal{O}\left((HW)^2C\right)$ to $\mathcal{O}\left(P^2HWC\right)$, which is linear in the spatial size.
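A compact sketch of the window-based attention of (9)–(11) follows. Reusing nn.MultiheadAttention for the per-window computation (with projection weights shared across windows) is an implementation assumption, and the spatial size is assumed divisible by the window size P.

```python
import torch
import torch.nn as nn

class SMSA(nn.Module):
    """Sketch of Eqs. (9)-(11): self-attention inside nonoverlapping windows."""
    def __init__(self, channels, window=8, heads=4):
        super().__init__()
        self.p = window
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        p = self.p
        # Eq. (9): partition into N = HW / P^2 windows of P^2 pixel vectors
        win = (x.reshape(b, c, h // p, p, w // p, p)
                .permute(0, 2, 4, 3, 5, 1)
                .reshape(-1, p * p, c))
        # Eq. (10): multihead self-attention within each window
        sa, _ = self.attn(win, win, win)
        # Eq. (11): merge the window outputs back into a feature map
        return (sa.reshape(b, h // p, w // p, p, p, c)
                  .permute(0, 5, 1, 3, 2, 4)
                  .reshape(b, c, h, w))

out = SMSA(48)(torch.rand(2, 48, 32, 32))
```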
3) Local Enhanced Feed-Forward Network
The Vanilla Transformer architecture comprises a multihead self-attention (MSA) module and a feed-forward network (FFN). While the MSA module calculates correlations between tokens and performs linear fusion to achieve global modeling, the FFN, consisting of a simple multilayer perceptron, performs nonlinear transformations on features to enhance their representation capability. However, the conventional FFN designed in most transformer models often neglects crucial neighboring spatial information for images [51]. To overcome this challenge, the designed LeFF modifies the traditional FFN by incorporating a depth-wise convolution block. Specifically, as shown in Fig. 3(c), a linear projection layer is first applied to each token to increase its feature dimension. The projected tokens are then spatially reshaped according to their original positions and a 3 × 3 depth-wise convolution is performed on each channel of the reshaped features to better capture local spatial contextual information. Finally, the features are restored to tokens, and the channel dimension is matched with the input through another linear projection, which serves as the final output.
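A minimal sketch of LeFF under this description is given below; the expansion ratio of 4 and the GELU activations are assumptions.

```python
import torch
import torch.nn as nn

class LeFF(nn.Module):
    """Sketch of the local enhanced feed-forward network on (B, N, C) tokens."""
    def __init__(self, dim, expand=4):
        super().__init__()
        hidden = dim * expand
        self.up = nn.Linear(dim, hidden)         # expand the feature dimension
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1,
                            groups=hidden)       # 3x3 depth-wise convolution
        self.down = nn.Linear(hidden, dim)       # restore the channel dimension
        self.act = nn.GELU()

    def forward(self, x, hw):                    # tokens plus their (H, W) layout
        h, w = hw
        b, n, _ = x.shape
        x = self.act(self.up(x))
        # Reshape tokens back to their original spatial positions
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.act(self.dw(x))
        # Flatten back to tokens and project to the input dimension
        x = x.reshape(b, -1, n).transpose(1, 2)
        return self.down(x)

out = LeFF(48)(torch.rand(2, 64, 48), (8, 8))
```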
C. Loss Function
In the reconstruction process, the most crucial aspect is restoring the high-frequency details, which encapsulate critical spatial information that is normally lost during lower-resolution image acquisition. To effectively restore these details, the mean absolute error (MAE) is employed in this article as the primary loss function. The choice of MAE is driven by its sensitivity to minor discrepancies and its favorable convergence behavior. By minimizing the MAE between reconstructed images and the ground truth, the network learns to accurately reconstruct the spatial information, thereby enhancing the overall fidelity of the reconstructed HR-HSI. The MAE is defined as follows:
\begin{equation*}
{{L}_{MAE}}\left( \theta \right) = \frac{1}{M}\sum\limits_{m = 1}^M {{{{\left\| {{{O}^m} - {{X}^m}} \right\|}}_1}} \tag{12}
\end{equation*}
To address the critical challenge of spectral distortion in HSI super-resolution, a spatial-spectral total variation (SSTV) loss, as initially proposed in [52], is introduced as another loss. The SSTV loss is particularly designed to minimize artifacts and ensure fidelity in both spatial and spectral domains, which is crucial for maintaining the essential characteristics of the original scene. The mathematical representation of SSTV loss is expressed as follows:
\begin{equation*}
{{L}_{SSTV}}\left( \theta \right) = \frac{1}{M}\sum\limits_{m = 1}^M {\left( {{{{\left\| {{{\nabla }_h}{{O}^m}} \right\|}}_1} + {{{\left\| {{{\nabla }_w}{{O}^m}} \right\|}}_1} + {{{\left\| {{{\nabla }_l}{{O}^m}} \right\|}}_1}} \right)} \tag{13}
\end{equation*}
The final loss function is a composite loss function, taking into account both MAE and SSTV, expressed as follows:
\begin{equation*}
L\left( \theta \right) = {{L}_{MAE}}\left( \theta \right) + \beta {{L}_{SSTV}}\left( \theta \right) \tag{14}
\end{equation*}
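The composite objective of (12)–(14) translates directly into PyTorch. The sketch below uses mean reductions, which absorb the 1/M normalization up to a constant factor, and a placeholder value for β.

```python
import torch

def sest_loss(o, x, beta=1e-3):
    """Sketch of Eqs. (12)-(14) for reconstructions o and ground truths x,
    both of shape (B, C, H, W); beta is a placeholder weight."""
    l_mae = (o - x).abs().mean()                               # Eq. (12)
    grad_h = (o[:, :, 1:, :] - o[:, :, :-1, :]).abs().mean()   # nabla_h O
    grad_w = (o[:, :, :, 1:] - o[:, :, :, :-1]).abs().mean()   # nabla_w O
    grad_l = (o[:, 1:, :, :] - o[:, :-1, :, :]).abs().mean()   # nabla_l O
    l_sstv = grad_h + grad_w + grad_l                          # Eq. (13)
    return l_mae + beta * l_sstv                               # Eq. (14)

loss = sest_loss(torch.rand(4, 31, 64, 64), torch.rand(4, 31, 64, 64))
```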
Experimental Results
This section presents the experimental results to demonstrate the effectiveness of the proposed method. Initially, the experimental configurations are introduced, encompassing descriptions of the utilized datasets, data simulation procedures, and implementation details. Following this, the reconstruction performance on three public datasets is illustrated and compared against the state-of-the-art algorithms, supplemented by a concise analysis. Finally, an ablation study is provided to validate the effectiveness of the proposed method.
A. Experimental Configurations
1) Datasets
Experiments are conducted on three publicly available hyperspectral datasets: the CAVE dataset [53], the Harvard dataset [54], and the Washington DC Mall (WDC) dataset [32]. The CAVE dataset was captured by a cooled charge-coupled device camera and consists of 32 different indoor scenes, with each HSI presenting a spatial resolution of 512 × 512 pixels and encompassing 31 spectral bands at 10 nm intervals in the range of 400–700 nm. The Harvard dataset, captured by a Nuance FX camera (CRi Inc.), contains 77 real indoor and outdoor scenes, with each HSI presenting a spatial resolution of 1024 × 1392 pixels and encompassing 31 spectral bands at 10 nm intervals in the range of 420–720 nm. The WDC dataset was captured by the HYDICE sensor and contains one HSI of a large urban area, which presents a spatial resolution of 1280 × 307 pixels and encompasses 191 spectral bands in the range of 400–2400 nm.
2) Data Simulation
In this article, considering that both the CAVE and Harvard datasets have 31 bands and similar coverage ranges, 20 images from the CAVE dataset are selected for the training set. The remaining 11 images, along with 9 randomly selected images from the Harvard dataset, are set aside as the test set to evaluate the network's generalization ability. For the WDC dataset, four 128 × 128 images are cropped from the original image for testing network performance, while the rest are used for training. Due to the limited number of training samples, the training images from the CAVE dataset are segmented into 4275 overlapping image patches of size 64 × 64 × 31. These overlapping patches are then downsampled to a spatial resolution of 16 × 16 × 31 using a Gaussian filter with a kernel size of 3 × 3 and a standard deviation of 0.5 to generate LR-HSI. Additionally, HR-MSI patches are generated using the spectral response function of the Nikon D700 camera. For the training images in the WDC dataset, 921 overlapping image patches of size 64 × 64 × 191 are obtained and downsampled to generate LR-HSI using the same method. The spectral response function from blue to SWIR2 of the Landsat 8 is selected to generate HR-MSI.
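For reproducibility, the simulation protocol can be sketched as follows. The function signature is hypothetical, and the spectral response function (SRF) matrix must be supplied externally, e.g., sampled from the Nikon D700 or Landsat 8 response curves.

```python
import torch
import torch.nn.functional as F

def simulate_pair(hr_hsi, srf, scale=4, ksize=3, sigma=0.5):
    """Blur with a 3x3 Gaussian (std 0.5), decimate by `scale` to get the
    LR-HSI, and project spectra through `srf` (msi_bands x hsi_bands) to
    get the HR-MSI. hr_hsi: (B, C, H, W)."""
    coords = torch.arange(ksize, dtype=torch.float32) - ksize // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel2d = torch.outer(g, g)
    kernel2d = kernel2d / kernel2d.sum()
    c = hr_hsi.shape[1]
    kernel = kernel2d[None, None].repeat(c, 1, 1, 1)  # one kernel per band
    blurred = F.conv2d(hr_hsi, kernel, padding=ksize // 2, groups=c)
    lr_hsi = blurred[:, :, ::scale, ::scale]          # spatial decimation
    hr_msi = torch.einsum('bchw,mc->bmhw', hr_hsi, srf)
    return lr_hsi, hr_msi

lr, msi = simulate_pair(torch.rand(1, 31, 64, 64), torch.rand(3, 31))
```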
3) Implementation Details
The proposed SEST model is implemented using PyTorch 1.13.1 and Python 3.7.16 on the Windows operating system with an NVIDIA GeForce RTX 4080 GPU. Regarding the hyperparameters in the network, the channel feature mapping C is set to 48 in the shallow feature extraction process. The number of transformer blocks in the SEST residual block is set to 6, and the window sizes in the multiwindow residual block are set to 8, 16, and 32, respectively. To train the proposed network, the Adam optimizer is employed with
B. Performance Evaluation
To verify the performance of the proposed method, a comparative analysis is conducted against six state-of-the-art HSI-SR methods, including three traditional methods, namely CNMF [32], FUSE [30], and HySure [31], along with three DL-based methods, namely SSRNet [55], Fusformer [46], and PSRT [48]. To ensure a fair and consistent comparison, all DL models are trained with the same input, and the hyperparameter settings are aligned with the specifications outlined in their respective original papers. Moreover, to provide a more intuitive and quantitative comparison of the performance of the aforementioned methods, four widely used HSI-SR quality indices (QIs) are employed: peak signal-to-noise ratio (PSNR), spectral angle mapping (SAM), structural similarity (SSIM), and erreur relative globale adimensionnelle de synthèse (ERGAS). Among these QIs, superior reconstruction performance is indicated by higher PSNR and SSIM values, alongside lower SAM and ERGAS values.
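For reference, SAM and ERGAS can be computed as sketched below; these are common formulations of the two indices and may differ in minor details from the implementation used in the experiments.

```python
import torch

def sam_degrees(o, x, eps=1e-8):
    """Mean spectral angle in degrees between (C, H, W) images."""
    o_f, x_f = o.reshape(o.shape[0], -1), x.reshape(x.shape[0], -1)
    cos = (o_f * x_f).sum(0) / (o_f.norm(dim=0) * x_f.norm(dim=0) + eps)
    return torch.rad2deg(torch.acos(cos.clamp(-1, 1))).mean()

def ergas(o, x, scale=4, eps=1e-8):
    """ERGAS for (C, H, W) images; `scale` is the LR-to-HR resolution ratio."""
    rmse = ((o - x) ** 2).mean(dim=(1, 2)).sqrt()   # per-band RMSE
    mean = x.mean(dim=(1, 2))                       # per-band reference mean
    return (100.0 / scale) * ((rmse / (mean + eps)) ** 2).mean().sqrt()

o, x = torch.rand(31, 64, 64), torch.rand(31, 64, 64)
print(sam_degrees(o, x).item(), ergas(o, x).item())
```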
1) Experimental Results on the CAVE Dataset
Table I presents a comprehensive quantitative evaluation of the reconstruction performance achieved by the proposed method along with a comparison to state-of-the-art methods on the CAVE dataset, with the best results highlighted in bold and the second-best results underlined. Notably, the proposed SEST method consistently outperforms comparable methods across all four QIs on the CAVE dataset, establishing its consistently superior efficacy.
To substantiate the quantitative findings visually, two representative images, specifically balloon and feather, are selected from the CAVE dataset for in-depth visualization. Fig. 4 shows the reconstructed images and their corresponding residual images produced by different methods, with highlighted regions (within the red boxes) for detailed comparison. The visual inspection reveals the superior performance of the SEST method, particularly evident in the residual images (generated from a randomly selected band), underscoring the ability of SEST to achieve reconstructions that closely align with the ground truth (GT) and thereby demonstrating better recovery of spatial details.
Fig. 4. First column: the GTs and the corresponding LR-HSI images (in pseudo-colors) for the balloon (first and second rows) and the feather (third and fourth rows) test cases from the CAVE dataset. The second to eighth columns: the visual results and the residuals (generated by a randomly selected band) between the GT and the fused products for all the compared approaches. A zoomed area has been added to aid the visual inspection.
To better compare the spectral fidelity of different methods, two pixels from balloon and feather are randomly selected to compare spectral differences against the GT, as shown in Fig. 5. The spectral vectors generated by the proposed SEST closely resemble the GT, indicating a significant reduction in spectral distortion and an enhanced preservation of spectral fidelity of the proposed SEST method.
Fig. 5. Spectral vector analysis of the GT and results of the compared approaches for (a) the balloon located at (60, 270) and (b) the feather located at (50, 420).
2) Experimental Results on Harvard Dataset
To evaluate the generalization capability of the proposed SEST method, the network is trained exclusively on the CAVE dataset and then tested on the Harvard dataset. The quantitative results, presented in Table I, show that the proposed SEST outperforms all compared methods in terms of PSNR, SAM, and SSIM, while achieving a less favorable ERGAS score due to the more complex noise types present in real-world scenarios. For visual assessment, similar experiments are conducted, with the results shown in Figs. 6 and 7.
Fig. 6. First column: the GTs and the corresponding LR-HSI images (in pseudo-colors) for the imga3 (first and second rows) and the imgc9 (third and fourth rows) test cases from the Harvard dataset. The second to eighth columns: the visual results and the residuals (generated by a randomly selected band) between the GT and the fused products for all the compared approaches. A zoomed area has been added to aid the visual inspection.
Fig. 7. Spectral vector analysis of the GT and results of the compared approaches for (a) imga3 located at (200, 400) and (b) imgc9 located at (200, 250).
As shown in Fig. 6, two test images from the Harvard dataset are selected for analysis, displaying pseudo-color images and residual images of the reconstructed results. Similarly, Fig. 7 illustrates the spectral vector diagrams of the reconstructed results by different methods, facilitating a comparison of spectral fidelity. The SEST consistently delivers superior results in both spatial and spectral domains, aligning with the quantitative analysis.
3) Experimental Results on the WDC Dataset
To evaluate the robustness of the proposed SEST method on real-world data, the majority of the WDC dataset is used for training the network, while the remaining portion is used for testing its performance. Quantitative results in Table I demonstrate that the proposed SEST method outperforms all comparison methods in terms of PSNR and SAM, and ranks second in SSIM. The ERGAS results follow a trend similar to that observed on the Harvard dataset. For visual assessment, similar experiments are conducted, with the results shown in Figs. 8 and 9.
Fig. 8. First column: the GTs and the corresponding LR-HSI images (in pseudo-colors) for the img1 (first and second rows) and the img2 (third and fourth rows) test cases from the WDC dataset. The second to eighth columns: the visual results and the residuals (generated by a randomly selected band) between the GT and the fused products for all the compared approaches. A zoomed area has been added to aid the visual inspection.
Fig. 9. Spectral vector analysis of the GT and results of the compared approaches for (a) img1 located at (40, 70) and (b) img2 located at (45, 80).
As shown in Fig. 8, two test images from the WDC dataset are selected for analysis, displaying pseudo-color images and residual images of the reconstructed results. Similarly, Fig. 9 illustrates the spectral vector diagrams of the reconstructed results by different methods, facilitating a comparison of spectral fidelity. The SEST consistently delivers superior results in both spatial and spectral domains, aligning with the quantitative analysis.
4) Discussion and Analysis
The experimental results on the CAVE, Harvard, and WDC datasets demonstrate that the proposed method exhibits superior reconstruction performance across the quantitative metrics of PSNR, SAM, and SSIM. However, despite these positive outcomes indicating the method's overall effectiveness in reconstructing spectral information, it performs suboptimally in the ERGAS metric. Further analysis reveals that although the reconstructed spectral curves closely match the ground truth in general, the presence of extreme outliers in certain bands results in significant deviations from the ground truth, which substantially increases the ERGAS value.
A deeper investigation suggests that these outliers may be attributed to two inherent limitations of the proposed method: 1) To reduce the computational cost of the self-attention mechanism and enhance the network's ability to handle high-dimensional data, the proposed method restricts sparse attention to fixed windows. While this design effectively reduces computational overhead, it also compromises the transformer model's ability to capture long-range dependencies, thereby affecting the precise reconstruction of spectral information and leading to large errors in specific bands or pixels. 2) Although the multiwindow residual block design reduces reliance on manual parameter tuning, its inherent structural limitations may lead to inconsistent performance when dealing with hyperspectral data of varying characteristics. In certain cases, these limitations can result in increased errors in specific bands, thus causing larger deviations.
C. Ablation Study
This section evaluates the contributions of individual components within the proposed SEST method through a series of ablation experiments, mainly focusing on four critical aspects: the integration strategy of key modules, the sizes of the residual windows, the location of spectral enhancement, and the influence of loss functions. To be concise without loss of generality, the ablation experiments are conducted exclusively on the CAVE dataset, allowing for a focused analysis of how each factor of the proposed SEST model influences reconstruction performance.
1) Integration Strategy of Key Modules
As described above, the SEST framework incorporates three pivotal modules: the multiwindow residual block, spectral enhancement, and LeFF. While the multiwindow residual block is comprehensively analyzed in terms of window sizes separately (discussed in the subsequent subsection), this subsection assesses the significance of the remaining SE and LeFF modules. The quantitative reconstruction metrics in Table II indicate that integrating either the LeFF or the SE module individually improves reconstruction quality over a baseline model devoid of both, confirming the necessity of each module. Notably, when both modules are combined, as shown in the final column of Table II, there is a marked improvement in performance, surpassing the individual contributions of each module. This synergistic effect underscores the complementary nature of the two modules, enhancing the overall efficacy of the HSI-SR process.
2) Size of Residual Windows
Experimental results concerning various window sizes are detailed in Table III, illustrating the effect of different window sizes on reconstruction quality, where rows 2–9 delineate the impact of the single-window method with window sizes varying from 4 to 32. Interestingly, the reconstruction quality does not improve monotonically with increasing window size. This observation underscores the significance of selecting an appropriate window size and the critical importance of the multiscale window design.
Moreover, Table III also provides a comparative analysis of training time, reporting the average time to train one epoch for each method. Notably, window-based methods require significantly less training time than Fusformer. While the multiscale window method requires more training time than a single window, each window operates independently within the network, allowing for parallel training on multiple GPUs and thereby facilitating a reduction in overall training time.
3) Location of Spectral Enhancement
In an effort to ascertain the optimal configuration of the spectral enhancement module within the SEST architecture, two variants of the SSB are tested, denoted as SSBv1 and SSBv2. SSBv1 places the SE module externally to the transformer block, whereas SSBv2 integrates the SE module directly within the self-attention calculation to fine-tune the self-attention map, as depicted in Fig. 10.
The comparative results, presented in Table IV, demonstrate that the SSBv2 configuration proves to be more conducive to image reconstruction in terms of reconstruction quality. This insightful analysis sheds light on the impact of the location of the spectral enhancement module within the sparse transformer block, emphasizing its significance in improving HSI super-resolution reconstruction efficacy.
4) Influence of Loss Functions
To balance the recovery of high-frequency spatial details and spectral fidelity, this article introduces the use of MAE and SSTV as loss functions, namely $L_{MAE}$ and $L_{SSTV}$. $L_{MAE}$ aims to reduce pixel-level errors between the reconstructed and true images, thereby enhancing overall image quality and detail. $L_{SSTV}$, on the other hand, constrains spatial and spectral variations to reduce noise and maintain spectral consistency and fidelity.
To determine the impact of these loss functions on super-resolution reconstruction results, this article tests three different combinations: "w $L_{MAE}$ and w/o $L_{SSTV}$," "w/o $L_{MAE}$ and w $L_{SSTV}$," and "w $L_{MAE}$ and w $L_{SSTV}$." Table V presents the experimental results for these combinations. The comparison shows that using either $L_{MAE}$ or $L_{SSTV}$ alone results in comparable reconstruction quality. However, using both together significantly enhances the quality of the reconstructed images. This finding indicates that a reasonable combination of different loss functions can fully exploit their respective advantages, leading to higher-quality image reconstruction.
Conclusion
This article presents a novel SEST network tailored for HSI-SR tasks. In the spatial domain, the model incorporates sparse self-attention and a local enhanced feed-forward network to capture global features while preserving local details. To address spectral distortion in HSI-SR tasks, an SE module is specially designed within the self-attention calculation process, forming a powerful hybrid attention mechanism. Furthermore, to exploit the multiscale information inherent in the image, a well-crafted multiwindow residual block is devised, contributing significantly to the overall improvement in reconstruction quality. Comprehensive experiments conducted on the CAVE, Harvard, and WDC datasets convincingly validate the superior performance of the proposed approach, underscoring its significance in the field of HSI-SR. In light of the limitations identified in the experimental analysis, future work could explore the integration of dynamic convolutions to enable more flexible window size adjustments and diversified attention patterns based on the characteristics of the input data, thereby mitigating the occurrence of outliers.