Journals & Magazines >IEEE Journal of Selected Topi... >Volume: 17

MIMFormer: Multiscale Inception Mixer Transformer for Hyperspectral and Multispectral Image Fusion

Abstract:

The fusion of low-spatial-resolution hyperspectral image and high-spatial-resolution multispectral image provides an effective method to obtain high-spatial-resolution hy...Show More

Metadata

Abstract:

The fusion of low-spatial-resolution hyperspectral image and high-spatial-resolution multispectral image provides an effective method to obtain high-spatial-resolution hyperspectral image. However, existing hybrid fusion architectures combining convolutional neural networks (CNNs) and transformers face significant challenges. Sequential approaches struggle with simultaneous local and global modeling, while parallel approaches often result in information redundancy. In this article, to meet diverse information demands at different layers, we propose a novel multiscale inception mixer transformer network (MIMFormer), a multiscale hybrid network based on the Inception structure integrating CNN and transformer. The core of this network is the multiscale spatial transformer (MST) structure, which enhances the detail richness of fused images by integrating local and global information at various scales. The inception spatial–spectral mixer (ISSM) module within the MST leverages an Inception architecture and employs a spectral splitting mechanism to regulate spectral channel counts across different branches. This design allows the ISSM module to efficiently extract local spatial–spectral features through convolution and max pooling, while global features are captured using a self-attention mechanism, ensuring comprehensive feature fusion across spectral groups. Experimental results on three benchmark datasets and one real remote sensing dataset demonstrate that MIMFormer outperforms ten advanced fusion methods.

Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing ( Volume: 17)

Page(s): 15122 - 15135

Date of Publication: 22 August 2024

ISSN Information:

DOI: 10.1109/JSTARS.2024.3447648

Funding Agency:

Contents

SECTION I.

Introduction

Compared to multispectral images, hyperspectral images possess exceptional capabilities for distinguishing key features in illuminated scenes, a fact that has been confirmed by numerous studies [1]. However, due to the physical limitations of imaging sensors, there is an inherent tradeoff between spatial resolution and spectral resolution in imaging, resulting in existing hyperspectral satellites often having lower spatial resolutions, such as EO1-30 m [2] and ZY1E-30 m [3]. Consequently, some hyperspectral applications experience significant performance degradation due to insufficient spatial resolution, including soil composition estimation [4], vegetation classification [5], and urban change detection [6]. Unlike breakthroughs in imaging hardware, fusing low-spatial-resolution hyperspectral image (LR-HSI) with high-spatial-resolution multispectral image (HR-MSI) offers an economically viable method to acquire high-spatial-resolution hyperspectral image (HR-HSI).

Pansharpening methods [7] are used for fusing LR-HSI with the panchromatic band extracted from HR-MSI to generate HR-HSI. Model-based approaches [8] utilize matrix or tensor decomposition of LR-HSI, followed by the recovery of HR-HSI using predefined priors. However, these methods often overly rely on handcrafted prior assumptions about the unknown HR-HSI, leading to spatial distortions and spectral inaccuracies in the fused results [9], [10].

In recent years, deep learning methods, such as convolutional neural networks (CNNs) [11] and transformers [12], [13], have been widely applied to address this challenge. While deep learning methods automatically extract image features, eliminating the limitations of handcrafted features, they each have their own limitations. Transformers emphasize global feature extraction, whereas CNNs excel at local feature extraction. During the fusion process, both global and local features play crucial roles in accurately reconstructing spatial details and preserving spectral fidelity. Global features primarily capture overall semantic and structural context, aiding in maintaining visual consistency and identifying major patterns and trends within the image. Simultaneously, local features focus on small-scale structures, such as edges, lines, and fine textures, which are essential elements in the fusion process.

In order to better combine global and local information, many studies combine CNN and transformer to form a CNN-transformer hybrid architecture [14] to make better use of global and local information, thereby significantly improving the fusion performance. At present, there are two ways to mix this architecture, sequential and parallel, and the combination of sequential results in each layer modeling only one aspect, such as local modeling in the convolutional layer and global modeling in the transformer layer, which is difficult to achieve both in this way. In parallel combination, one branch handles local information and the other branch handles global information. However, this approach can lead to information redundancy if all channels are processed. Study [15] has shown that the lower layers of the transformer require more local information and the higher layers require more global information. Therefore, the simple parallel processing method cannot fully meet the information requirements at all levels, and a more flexible structure must be introduced to optimize the processing and transmission of information. The inception structure [16] is a good solution to the problem of parallel joining, as long as the appropriate channel is divided before entering the branch. At the same time, due to the different requirements for high and low-frequency information in different depths, the number of channels can be divided by control to meet the needs of different depths.

Therefore, we propose a novel network, the multiscale inception mixer transformer (MIMFormer), which integrates CNN and transformer architectures through an inception-based multiscale hybrid approach. Central to MIMFormer is the multiscale spatial transformer (MST) structure, which incorporates an inception spatial–spectral mixer (ISSM). The ISSM regulates the number of spectral channels in various Inception branches via a spectral splitting mechanism, effectively combining CNN and transformer advantages to capture both spectral and spatial information across bands. The main contributions of this article are as follows.

We introduce MIMFormer, a multiscale hybrid network based on the inception structure that combines CNN and transformer for fusing LR-HSI and HR-MSI. This architecture capture spectral and spatial information across various bands and scales, enhancing the fused images' quality and accuracy.
We develop the ISSM module, constructed upon the inception framework, which ensures image precision through meticulous processing of localized regions and maintains consistency with global sharpening across the entire image. By simultaneously integrating global and local information, it enhances fusion quality while preserving the integrity and authenticity of the image content.
We design a spectral splitting mechanism, which regulates the number of spectral channels across different Inception branches. This mechanism reduces feature redundancy and promotes a comprehensive integration of global and local features, thereby further improving the performance of the fusion algorithm.

The rest of this article is organized as follows. Section II reviews related work, including pansharpening, model-based approaches, and deep learning methods. Section III describes the proposed network architectures, MIMFormer, and ISSM. Section IV presents experimental results on benchmark and real datasets. Section V presents the discussion. Finally, Section VI concludes this article.

SECTION II.

Related Work

In recent years, researchers have developed numerous innovative approaches to the fusion of LR-HSI and HR-MSI from diverse perspectives [17], [18]. These methods can generally be categorized into pansharpening [7], [19], model-based approaches [20], [21], and deep learning methods [22].

A. Pansharpening

Pansharpening is one of the earliest developed methods for fusing LR-HSI and HR-MSI. It involves merging the LR-HSI with the panchromatic band extracted from the HR-MSI, transforming it into HR-HSI [23]. Component substitution (CS) and multiresolution analysis (MRA) are common pansharpening fusion techniques. CS enhances the spatial resolution of hyperspectral images by separating and replacing the spatial components of a multispectral image. Representative methods include principal component analysis [24], [25], intensity hue saturation [26], and Gram–Schmidt spectral sharpening methods [27], [28]. CS-based methods are computationally inexpensive and can recover the main spatial features similar to the original image. However, such approaches often lead to a degradation of spectral information [29]. MRA uses multiresolution decomposition techniques to extract high-frequency spatial details from multispectral images and fuse them into LR-HSI to enhance its spatial resolution. Typical methods based on MRA include high-pass filters [30] and wavelet transforms [31]. While these techniques are computationally straightforward and efficient, significant discrepancies in spatial resolution between multispectral and hyperspectral images may lead to noticeable distortions in the fused results.

B. Model-Based Approaches

Model-based approaches primarily include matrix decomposition, tensor decomposition, and Bayesian-based methods [21], [32]. Yokoya et al. [33] introduced coupled nonnegative matrix factorization (CNMF) for the fusion of LR-HSI and HR-MSI, exploring its impact on HSI classification. Their method demixed the sources of the two images to identify the characteristics and abundances of endmembers. Although this method showed performance improvements, it is computationally intensive and sensitive to parameter selection. Tensor decomposition approaches [34] utilize multidimensional tensors to represent multispectral and hyperspectral images, achieving fusion through tensor decomposition or operations. Tucker decomposition [35], a commonly used method, decomposes high-dimensional tensors into a core tensor and dictionaries for each dimension, extracting and representing information across dimensions. The first known Bayesian fusion approach was designed by Zhang et al. [36]. This method assumes an additive noise imaging model for LR-HSI and uses interpolation as a prior, circumventing the need to estimate the spatial degradation operator and performing super-resolution in a blind manner. Overall, these methods frame the fusion of LR-HSI with HR-MSI as an optimization problem constrained by various handcrafted priors, which may not adequately represent the required HR-HSI, thus limiting their fusion accuracy.

C. Deep Learning Methods

Recent advancements in deep learning have significantly impacted image processing [37], [38], [39], inspiring research in LR-HSI and HR-MSI fusion. Dian et al. [40] introduced a deep hyperspectral image enhancement model that integrates image priors into a CNN fusion framework, outperforming traditional methods. Han et al. [41] developed the MS-SSFNet network, which uses a multilevel loss function to mitigate gradient vanishing in fusing LR-HSI and HR-MSI. Zhang et al. [22] employed CNNs to regularize spatial and spectral degradation and used generative networks to model HR-HSI. Li et al. [42] proposed the cross spectral-scale and shift-window-based cross spatial-scale nonlocal attention network (CSSNet) to explicitly learn spectral and spatial correlations between two input images. To further enhance CNN-based fusion algorithms, researchers introduced a multitask, multiobjective evolutionary network [43], [44] to address spectral distortion caused by LR-HSI upsampling. These CNN-based methods have significantly advanced fusion algorithms, providing robust solutions to the limitations of prior methods and achieving satisfactory results.

Although CNN-based methods significantly improve over traditional approaches, their limited receptive fields and lack of remote modeling ability prevent the full extraction of global image features, reducing fusion quality [45]. To address this, vision transformer (ViT), which excels in modeling global dependencies, has recently been applied to LR-HSI and HR-MSI fusion with notable success [46]. For instance, Hu et al. [47] introduced Fusformer, the first ViT-based solution for fusion, while Jia et al. [48] developed the multiscale spatial–spectral transformer network (MSST-Net) to enhance network performance and generalization. Fang et al. [49] integrated spatiotemporal frequency information of LR-HSI and HR-MSI. However, focusing too much on global information can neglect local feature extraction.

Combining CNN's local feature extraction with transformer's global feature modeling [50], [51] aims to address these issues and has shown promising results. This hybrid approach leverages the strengths of both architectures, allowing for a more comprehensive extraction of image features. However, challenges remain. Ma et al. [52] used Swin transformer's window attention with 3D-CNN to learn LR-HSI's implicit priors. While this outperforms pure transformer networks, the sequential mode struggles to balance global and local information extraction. Interactformer [53] combines Swin transformer and 3D-CNN in parallel to improve spatial resolution and preserve spectral information, but this can lead to feature redundancy and weakened performance. Lower transformer layers need more local information, while higher layers require global information [15]. Therefore, a simple parallel method cannot fully satisfy the information needs at all levels.

Inspired by the Inception structure [16], we propose a novel multiscale hybrid network, MIMFormer, which optimizes information processing through a spectral splitting mechanism. By adjusting the number of channels, MIMFormer meets different depth requirements, providing a flexible structure to enhance fusion. This innovative architecture effectively balances the extraction of global and local features, addressing the limitations of previous methods and achieving superior fusion quality.

SECTION III.

Methodology

A. MIMFormer Architecture

The proposed MIMFormer fusion network architecture, depicted in Fig. 1, comprises three primary modules: a shallow feature extraction module, a deep feature extraction module, and an image reconstruction module. The shallow feature extraction module utilizes two residual network modules to extract preliminary features. The deep feature extraction module consists of three MST branches, each functioning at a distinct scale. Lastly, the image reconstruction module includes two convolutional layers paired with a LeakyReLU activation function to reconstruct the final image.

Fig. 1.

Overall architecture diagram of the proposed MIMFormer network.

Show All

Let $Z \in \mathbb {R}^{h \times w \times S}$ represent the observed LR-HSI, where $h$ , $w$ , and $S$ denote the number of rows, columns, and spectral bands of the LR-HSI, respectively. Let $Y \in \mathbb {R}^{H \times W \times s}$ represent the observed HR-MSI, where $H$ , $W$ , and $s$ denote the number of rows, columns, and spectral bands of the HR-MSI, respectively. Our objective is to fuse $Y$ and $Z$ to obtain an image $\hat{X} \in \mathbb {R}^{H \times W \times S}$ that possesses both high spatial and spectral resolutions. Initially, this article employs a bilinear interpolation method to upsample LR-HSI to obtain $Z_{\text{up}} \in \mathbb {R}^{H \times W \times S}$ , facilitating channel-wise concatenation with HR-MSI. This process can be represented by the following equation:

$\begin{equation*} Z_{\text{up}} = \text{Up}(Z) \tag{1} \end{equation*}$ View Source

where

$\text{Up}(\cdot)$

denotes the bilinear interpolation upsampling function. Subsequently,

$Y$

and

$Z_{\text{up}}$

are aggregated along the channel dimension to form

$D_{\text{cat}} \in \mathbb {R}^{H \times W \times (S+s)}$

, which can be expressed as

$\begin{equation*} D_{\text{cat}} = \text{Concat}\left(Y, Z_{\text{up}}\right) \tag{2} \end{equation*}$

View Source

where

$\text{Concat}(\cdot)$

represents concatenation along the channel dimension. Given that convolution is a simple and effective method for mapping images to a higher dimensional feature space, this article uses 2-D convolution with a kernel size of 3, 180 channels, and a stride of 1, constructing a residual network with two blocks to extract shallow features

$F_{s} \in \mathbb {R}^{H \times W \times 180}$

, represented as

$\begin{equation*} F_{s} = \text{SF}\left(D_{\text{cat}}\right) \tag{3} \end{equation*}$

View Source

where

$\text{SF}(\cdot)$

denotes the shallow feature extraction module.

In the deep feature extraction module, we designed three MST to extract features at different scales. For the simulated dataset, the patch size values in MST were set to 8, 12, and 16, whereas for the real dataset, the patch sizes were set to 3, 15, and 6. Each MST module includes a patch embed layer, three ISSMs, and a patch unembed layer. Initially, the extracted shallow features $F_{s}$ are fed into the patch embed layer, whose goal is to divide the shallow feature map into a series of equally sized image blocks, each mapped to a higher dimensional feature representation. This process is described by the following equation:

$\begin{equation*} I^{i} = {\text{GELU}}\left({\text{Norm}}\left({\text{Conv}}\left(F_{s}\right)\right)\right). \tag{4} \end{equation*}$ View Source

The feature maps, represented as $I^{i} \in \mathbb {R}^{H/P \times W/P \times C}$ and $i = 1, 2, 3$ , are processed through a convolutional layer after passing through the patch embed layer. The convolutional layer uses a kernel size and stride of $p$ (with $p$ values being 8, 12, 16, or 3, 15, 6), a channel count of $C = 120$ , and a padding of 1.

B. ISSM and Spectral Splitting Mechanism

The architecture of the ISSM is shown in Fig. 2. After partitioning the input features across the spectral dimensions, local and global mixers are employed to learn features across different frequency ranges. The local mixer includes a MaxPool path, consisting of a maximum pooling operation and a linear layer, and a parallel convolutional path, consisting of a linear layer and a DwConv layer. The global mixer includes an attention path, which consists of average pooling, a multihead self-attention mechanism (MSA), and an upsample layer. The rationale is as follows.

Fig. 2.

Architecture diagram of the proposed ISSM.

Show All

Local mixer: Given the sharpness sensitivity of maximum pooling and the detail perception capability of convolution operations, we propose two local paths to leverage the sharpness sensitivity of maximum pooling and the detail perception capability of convolution layers to learn local features. Initially, the input $I_{l}^{i}$ (where $i = 1, 2, 3$ ) is divided into $I_{l1}^{i} \in \mathbb {R}^{H/P \times W/P \times C_{l}/2}$ and $I_{l2}^{i} \in \mathbb {R}^{H/P \times W/P \times C_{l}/2}$ . In our experiments, to reduce feature redundancy, enhance feature extraction efficiency, and ensure that both pathways receive spectral information representing global characteristics while retaining sufficient detail, we designed a spectral splitting mechanism. This mechanism sets $C_{l}$ as $C - (i \times {\dim })$ , ${\dim }$ as 40. $I_{l1}^{i}$ is routed to the MaxPool path, and $I_{l2}^{i}$ is routed to the parallel convolution path. The outputs of the local mixer, $F_{l1}^{i}$ and $F_{l2}^{i}$ , can be represented by the following equation:

$\begin{align*} F_{l1}^{i} &= \text{Linear}\left(\text{MaxPool} \left(I_{l1}^{i}\right)\right) \tag{5}\\ F_{l2}^{i} &= \text{DwConv}\left(\text{Linear} \left(I_{l2}^{i}\right)\right). \tag{6} \end{align*}$ View Source

Global mixer: Considering the powerful capability of attention mechanisms in learning global representations, we use an MSA to establish long-distance dependencies to learn global information. Before applying the attention operation, we use an average pooling operation to reduce the scale of $I_{g}^{i} \in \mathbb {R}^{H/P \times W/P \times C_{g}}$ to decrease computational complexity. The kernel size and stride for the average pooling in this experiment are both 2. Then, a multihead self-attention mechanism is applied to calculate the attention for redundant bands

$\begin{equation*} {F}_{\text{ap}}^{i} = \text{AvgPool}\left(I_{g}^{i}\right)) \tag{7} \end{equation*}$ View Source

where

${F}_{\text{ap}}^{i} \in \mathbb {R}^{H/P/2 \times W/P/2 \times C_{g}}$

is the output of average pooling,

$C_{g}$

is set as

$i \times {\dim }$

Multihead self-attention, as shown in Fig. 3, captures the correlation among spectral bands by computing self-attention. Initially, we project the input $F_{\text{ap}}^{i}$ through trainable linear projections to obtain the query matrix $Q \in \mathbb {R}^{H/P/2 \times W/P/2 \times C_{g}}$ , key matrix $K \in \mathbb {R}^{H/P/2 \times W/P/2 \times C_{g}}$ , and value matrix $V \in \mathbb {R}^{H/P/2 \times W/P/2 \times C_{g}}$ , represented as

$\begin{equation*} Q^{i} = F_{\text{ap}}^{i}W_{Q}^{i}, \quad K^{i} = F_{\text{ap}}^{i}W_{K}^{i}, \quad V^{i} = F_{\text{ap}}^{i}W_{V}^{i} \tag{8} \end{equation*}$ View Source

where

$W_{Q}^{i}, W_{K}^{i}, W_{V}^{i} \in \mathbb {R}^{C_{g} \times C_{g}}$

are learnable projection matrices. The scaled dot-product attention function, using the queries, keys, and values, is defined as

$\begin{equation*} {\text{Atten}}(Q^{i}, K^{i}, V^{i}) = V^{i} \left({\text{softmax}}\left(\frac{{K^{i}}^{T}Q^{i}}{\sqrt{C_{g}}}\right)\right) . \tag{9} \end{equation*}$

View Source

Fig. 3.

Structure MSA.

Show All

$\text{Atten}(\cdot)$ is the scaled dot-product attention function. The MSA, similar to that used in ViT, enhances the network's feature extraction capability. The function form of the multihead self-attention is as follows:

$\begin{align*} {\text{head}}_{h}^{i} &= {\text{Atten}}\left(Q_{h}^{i}, K_{h}^{i}, V_{h}^{i}\right), \quad h = 1, 2, 3 \tag{10}\\ F_{\text{msa}}^{i} &= {\text{view}}\left({\text{Concat}}\left({\text{head}}_{h}^{i}\right)\right) \tag{11} \end{align*}$ View Source

where

${\text{head}}_{h}$

is the output of the

$h$

th head, computed through the Atten function. Concat denotes concatenation of all head outputs. Finally, the spectral multihead self-attention output is reshaped.

C. Future Fusion

Ultimately, we employ an upsampling layer to restore the original scale, consistent with a ratio of 2 used in max pooling

$\begin{equation*} F_{g}^{i} = {\text{UpSample}}\left(F_{\text{msa}}^{i}\right). \tag{12} \end{equation*}$ View Source

The outputs

$F_{l1}^{i}, F_{l2}^{i},$

and

$F_{g}^{i}$

are concatenated along the channel dimension and fused to form

$F_{\text{fu}}^{i}$

$\begin{equation*} F_{fu}^{i} = {\text{Fusion}}\left({\text{Concat}}\left(F_{l1}^{i}, F_{l2}^{i}, F_{g}^{i}\right)\right). \tag{13} \end{equation*}$

View Source

To ensure the feature dimensions are consistent before reconstructing the image, the fused feature map is passed through a patch unembed layer and a transposed convolution layer to restore the features to the original number of hyperspectral bands $S$ , resulting in a single-scale feature $F^{i} \in \mathbb {R}^{H \times W \times S}$ . Subsequently, the extracted multiscale features are aggregated using learnable weights to form the deep features $F_{d} \in \mathbb {R}^{H \times W \times S}$ . Finally, $F_{d}$ and the shallow features $F_{s}$ are input into an image reconstruction module composed of two convolutional layers and a LeakyReLU activation function, yielding the estimated high-resolution hyperspectral image $\hat{X} \in \mathbb {R}^{H \times W \times S}$

$\begin{equation*} \hat{X} = \text{Conv}\left(\text{LReLU}\left(\text{Conv}(F)\right)\right). \tag{14} \end{equation*}$ View Source

Ultimately, the network's parameters are optimized by minimizing the

$L1$

pixel loss

$\begin{equation*} l_{1} = \Vert \hat{X} - X \Vert \tag{15} \end{equation*}$

View Source

where

$X \in \mathbb {R}^{H \times W \times S}$

is the true HR-HSI.

SECTION IV.

Experiments and Analyzes

To effectively evaluate the performance of the proposed methods, this study selected ten state-of-the-art fusion technologies for comparative analysis. These techniques include two traditional methods: HySure [54] and CNMF [33]; four CNN-based approaches: MSDCNN [55], TFNet [56], MHF-Net [57], and CSSNet [42]; along with four novel transformer-based technologies: Fusformer [47] PSRT [58], MSST-Net [59], and 3DT-Net [52]. The parameters for each method were set according to the original authors' code or literature recommendations. Traditional methods were tested on a Windows 10 system equipped with an Intel Core i9 processor and 32GB RAM, using MATLAB R2014a. Deep learning methods were primarily implemented using Python 3.8 and PyTorch 1.7, with GPU acceleration provided by NVIDIA RTX 4060TI. Data preprocessing and analysis were conducted using MATLAB R2014a and Python's NumPy and Pandas libraries.

To comprehensively evaluate the performance of image fusion algorithms, a variety of quantitative metrics are commonly employed for comparative analysis [10], [60]. These metrics include the spectral angle mapper (SAM), peak signal-to-noise ratio (PSNR), erreur relative globale adimensionnelle de synthèse (ERGAS), structural similarity index metric (SSIM), root mean squared error (RMSE), and the quality with no reference (QNR) index. SAM assesses spectral quality, with lower values indicating minimal loss of spectral information. PSNR evaluates spatial effects, where higher values denote lesser loss of spatial details. SSIM is used to appraise structural correlation, with higher values suggesting superior fusion outcomes. RMSE measures the similarity between images, where lower values denote a more effective fusion algorithm. ERGAS serves as a comprehensive metric, with lower values indicating higher fusion quality. Particularly, QNR is apt for evaluating the fusion quality of no-reference imagery, such as the ZY1E real remote sensing dataset, encompassing all aspects of distortion including both spectral and spatial distortions. Higher QNR values signify optimal fusion quality with more complete information preservation. Collectively, these metrics reflect the efficacy of fusion algorithms in retaining both spatial and spectral information.

A. Datasets

The experiments in this article utilized three mainstream hyperspectral image benchmark datasets: CAVE [61], Washington DC Mall (WDCM) [62], Pavia University (PU) [63], and a real remote sensing dataset ZY1E.

The CAVE dataset contains 32 indoor hyperspectral images, each with dimensions of 512 × 512 pixels, covering the wavelength range of 400–700 nm with 31 bands. Experiments followed the Wald protocol [64], using the spectral response function of a Nikon D700 camera to generate HR-MSI. The original hyperspectral images of CAVE served as reference HR-HSI. They were filtered with a Gaussian kernel of size 8 × 8 and standard deviation of 2, then subsampled by a factor of eight in both horizontal and vertical directions to generate LR-HSI. A total of 20 image pairs were randomly selected from the dataset for training, and the remaining 12 pairs for testing. During training, patches of 64 × 64 were randomly extracted from each 512 × 512 image, making the dimensions of HR-HSI, HR-MSI, and LR-HSI during training 64 × 64 × 31, 64 × 64 × 3, and 8 × 8 × 31, respectively, and during testing, 512 $\times \text {512}\times$ 31, 512 $\times \text {512}\times$ 3, and 64 × 64 × 31.

The WDCM dataset, captured by the Hydice sensor in 1995, consists of 191 bands covering the wavelength range of 400–2400 nm. Each band has a resolution of 1280 × 307 pixels, with a spatial resolution of 2.5 m. Two subimages of 128 × 128 pixels were cropped for testing, with the remainder used for training. The setup was the same as that used for MSST-Net [59], with HR-MSI generated using the Sentinel-2A spectral response matrix and LR-HSI produced in the same manner as the CAVE dataset. The dimensions of HR-HSI, HR-MSI, and LR-HSI during training were 64 × 64 × 191, 64 × 64 × 10, and 8 × 8 × 191, respectively, and for testing, 128 × 128 × 191, 128 × 128 × 10, and 16 × 16 × 191.

The PU dataset was collected by the ROSIS sensor in 2003, originally measuring 610 × 340 pixels in dimensions, covering a wavelength range of 430–860 nm. After removing 22 water vapor absorption bands, 93 bands remained. Consistent with the WDCM dataset, after Gaussian filtering, the images were subsampled by a factor of eight to generate LR-HSI. Two subimages of 128 × 128 pixels were cropped for testing, with the remainder used for training. HR-MSI was generated using a spectral response function similar to IKONOS. The dimensions of HR-HSI, HR-MSI, and LR-HSI during training were 64 × 64 × 93, 64 × 64 × 4, and 8 × 8 × 93, respectively, and for testing, 128 × 128 × 93, 128 × 128 × 4, and 16 × 16 × 93.

The ZY1E dataset consists of hyperspectral and multispectral data acquired from the ZY1E satellite, specifically from the ZY1E satellite, equipped with both visible-near infrared and hyperspectral cameras. This study utilized an image captured on 19 April 2023, over the Pinggu district of Beijing, consisting of AHSI hyperspectral and VNIC multispectral imagery. In the ENVI software, a series of preprocessing steps were applied to the acquired hyperspectral and multispectral data, including radiometric calibration, atmospheric correction, orthorectification, and cropping, followed by registration of the hyperspectral and multispectral images. The processed data comprised 166 bands of LR-HSI and 8 bands of HR-MSI, with spatial resolutions of 30 m and 10 m, and spatial dimensions of 4986 × 4581 and 1662 × 1527 pixels, respectively. Notably, lacking HR-HSI as a reference image for training, we followed the Wald protocol [64], using a 9 × 9 Gaussian kernel to perform a threefold spatial downsampling of the original LR-HSI and HR-MSI to generate the training dataset. Subsequently, the original LR-HSI was considered as HR-HSI. Given the low signal-to-noise ratio in some bands of the ZY-1E data, we selected the first 76 bands of the LR-HSI for the fusion experiments. During the training phase, image pairs were randomly cropped from the training dataset (HR-HSI: 60 $\times \text{60}\times$ 76, HR-MSI: 60 $\times \text{60}\times$ 8, LR-HSI: 20 $\times \text{20}\times$ 76) for training purposes. In the testing phase, image pairs (HR-MSI: 540 $\times \text{540}\times$ 8, LR-HSI: 180 $\times \text{180}\times$ 76) were cropped for testing.

B. Ablation Experiments

To better understand MIMFormer, a series of ablation experiments were conducted. All models were trained on the WDCM dataset for 100 epochs, with training configurations consistent with those previously described in the document.

In terms of multiscale feature extraction: To adequately extract features from LR-HSI and HR-MSI, this study employed a multiscale approach for feature extraction. To evaluate the effectiveness of multiscale features, we conducted single-scale feature extraction experiments by modifying the original three-branch feature structure into a single-branch structure. “M-scale” denotes the removal of the multiscale architecture, utilizing only a single MST structure. The quantitative evaluation metrics are presented in the first column of Table I. In addition, we plotted the decline in the loss function and the increase in PSNR values on the test set with the single MST structure as training epochs progressed, as shown by the orange lines in Fig. 4. The results demonstrate that multiscale feature extraction effectively captures features at various scales, enabling the model to identify patterns and details that are challenging to detect at a single scale, thereby achieving superior fusion performance.

TABLE I Ablation Study of the ISSM and Multiscale Structure on the WDCM Dataset

Fig. 4.

Ablation results on the WDCM dataset, showing evaluation loss on the test set, PSNR on the test set, and training loss on the train set, for different models (BASE, M-scale, NAttention, NDwConv, NMaxPool).

Show All

Regarding the ISSM: To integrate the local feature extraction capabilities of CNN with the strengths of transformer, we introduces the ISSM, aimed at enhancing the transformer's perceptual ability in the spectral dimension. To evaluate the effectiveness of each component within the inception mixer, we progressively removed each branch from the complete model and recorded the results. As shown in Table I, combining the attention mechanism with convolution and max-pooling yields higher accuracy compared to using only the mixer. This validates the efficacy of the ISSM. As illustrated in Fig. 4, the blue line labeled “BASE” represents our proposed MIMFormer architecture; the green, red, and purple lines correspond to the removal of the attention branch, the convolution branch, and the max-pooling branch from the ISM's three-branch structure, respectively. The results indicate that MIMFormer exhibits the fastest and most stable loss function decline, both in the test and validation sets, and achieves the most significant improvement in PSNR values on the test set.

Regarding the spectral splitting mechanism: This mechanism can reduce feature redundancy and improve feature extraction efficiency, but improper allocation may lead to insufficient extraction of both global and local information. Therefore, the key is to balance the splitting mechanism to ensure comprehensive information extraction through the full integration of global and local features. In designing the spectral splitting mechanism, we ensure that both paths receive spectral information that represents global features while containing sufficient detail. By setting $C_{l} = C - (i\cdot \dim)$ and gradually increasing the value of $i$ , we can dynamically adjust the information received by each path. As shown in Table II, ac represents an average allocation of channels (both global and local paths receiving $C/2$ channels), BASE represents our MIMFormer method using the spectral splitting mechanism, and NSSM represents the absence of the spectral splitting mechanism (both global and local paths receiving $C$ channels). BASE outperforms in all metrics. This design ensures that the spectral splitting mechanism effectively enhances network performance by fully leveraging and extracting both global and local information.

TABLE II Ablation Study of Spectral Splitting Mechanism on the WDCM Dataset

C. Experiments With Benchmark Data

Results on CAVE dataset: We evaluated 12 test images on the CAVE dataset and presented the average evaluation metrics for different methods in Table III, with the best results highlighted in bold. It is observable that the proposed MIMFormer method significantly outperforms the comparison methods in terms of performance. Notably, the PSNR values for MIMFormer are substantially higher than those of other methods. This aligns with previous analyses of the network architecture, suggesting that the proposed ISSM effectively preserves the spectral characteristics of the scene.

TABLE III Comparison of Different Fusion Methods on the CAVE Dataset

To ease the burden on readers, we only display the fusion results for the “watercolors” test image. Fig. 5 showcases the enlarged local images using bicubic interpolation, and the synthesized false-color images of bands 29, 19, and 9. The second and fourth rows display the error images, where the error values are the average of all band errors. Compared to the methods assessed, the proposed MIMFormer method reconstructs high-resolution details more effectively, significantly reducing errors in the error images, especially in regions with prominent edge information.

Fig. 5.

Fusion results of the “watercolors” image from the CAVE dataset. The first row presents a false-color image synthesized from the 29th, 19th, and 9th spectral bands. The second row depicts the error images between the fused and the ground truth.

Show All

Results on the WDCM dataset: Table IV presents the objective results of various comparative algorithms on the WDCM dataset, with the best metrics highlighted in bold. Deep learning methods based on CNNs, such as MSDCNN and TFNet, exhibit superior performance compared to traditional approaches, with PSNR values reaching 45.90 and 47.67, respectively. MHF-Net and CSSNet excel further in detail preservation, achieving PSNR values of 48.22 and 48.82. Transformer-based methods also demonstrate commendable efficacy; for instance, MSST-Net and 3DT-Net showcase exceptional integration capabilities. Notably, 3DT-Net attains an SSIM of 0.9993 on the WDCM dataset, with its metrics being on par with our proposed MIMFormer, attributable to its utilization of 3-D CNNs to accommodate the characteristics of hyperspectral data cubes. However, it is worth noting that our proposed MIMFormer shows slightly lower SSIM performance on the WDCM dataset compared to other deep learning networks. This may be due to the inherent challenges MIMFormer faces when handling extremely fine details. Nevertheless, this 0.001 discrepancy is negligible in terms of overall performance impact. Overall, the proposed MIMFormer outperforms other methods across most critical performance metrics.

TABLE IV Comparison of Different Fusion Methods on the WDCM and PU Datasets

This confirms that employing multiscale feature extraction combined with the ISSM structure can significantly improve the performance of hyperspectral and multispectral image fusion networks. Compared to methods that use CNNs alone for fusion, increasing network depth limits the receptive field, leading to the loss of many detail features. In our method, by incorporating the Inception structure and using a three-branch structure to extract high-frequency and low-frequency information, along with the long-distance dependencies of the self-attention mechanism and local feature extraction of deep convolution and max pooling layers, our network can learn more valuable information, effectively addressing this issue. Thus, MIMFormer surpasses other comparative methods in performance. Fig. 6 provides a visual appreciation of the superiority of our network's results over other comparative methods in terms of color and edge granularity. Compared to CNN-based methods, our network achieves more ideal results in color and brightness performance, and shows more notable improvements in edge detail and clarity compared to 3DT-Net and MSST-Net.

Fig. 6.

Fusion results of WDCM test set. The first row presents a false-color image synthesized from the 56th, 21th, and 5th spectral bands. The second row depicts the error comparison between the fused and the ground truth.

Show All

Results on the PU dataset: The ground sampling distance of the PU dataset is 1.3 m, with each pixel containing only one or a few types of land cover, making the spectral characteristics relatively simple. The quantitative results of MIMFormer and other comparative methods on the PU dataset are shown in Table IV. It is evident from the table that traditional methods, such as CNMF, still perform poorly on this dataset. In contrast, MIMFormer and MSST-Net outperform Fusformer in key performance metrics such as PSNR, SAM, ERGAS, SSIM, and RMSE, thanks to their multiscale architectures that enable the extraction of deep spectral features at different scales.

Regarding the fusion results of the PU dataset test set, the false-color images and error maps are shown in Fig. 7. MIMFormer, combining CNN and transformer technologies, achieved satisfactory visual results on this dataset. As shown in Fig. 7, the fused image mainly includes land cover types, such as asphalt roads, grasslands, trees, self-adhesive bricks, and buildings. In the error map, the areas marked by red rectangles are magnified to show artificial buildings and grasslands. The density of grasslands varies in different locations, and the materials used in different buildings also vary, leading to spectral signal differences even among the same type of land cover, greatly increasing the difficulty of fusing hyperspectral and multispectral images. Despite these challenges, MIMFormer still demonstrates superior image reconstruction quality in the magnified areas compared to MSST-Net and 3DT-Net, showcasing its exceptional fusion capabilities. To analyze the computational burden, the last two columns of Table IV list the floating-point operations (FLOPs) and the parameters of different fusion methods on the PU dataset. As shown, MSST-Net and Fusformer have higher FLOPs, at 188.72G and 456.327G, respectively. Fusformer causes GPU memory overload due to its use of original transformer layers. MSST-Net likely uses large-scale transposed convolutions, resulting in a high parameter count. Other methods like CSSNet, PSRT, and 3DT-Net have more moderate FLOPs and parameter counts. For instance, 3DT-Net has FLOPs and parameter counts of 66.122G and 3.455M, respectively. Compared to MIMFormer, 3DT-Net has higher FLOPs but fewer parameters, indicating different optimization strategies in computational burden and parameter usage. Overall, MIMFormer strikes a balance between high performance and reasonable computational burden and parameter count, demonstrating efficiency and resource utilization in practical applications.

Fig. 7.

Fusion results of PU test set. The first row presents a false-color image synthesized from the 29th, 19th, and 9th spectral bands. The second row depicts the error comparison between the fused and the ground truth.

Show All

In addition, to further evaluate the performance of each method across individual spectral bands of images, we plotted the PSNR and SAM values for different methods across each spectral band on the benchmark datasets CAVE and PU. As shown in Fig. 8, from a spatial perspective, our proposed MIMFormer displays the highest PSNR values in certain bands, indicating optimal performance in terms of spatial information loss; in terms of spectral quality, MIMFormer achieves the lowest SAM values across all bands, indicating minimal spectral information loss. These results demonstrate that our proposed fusion method can generate higher quality fused images.

Fig. 8.

PSNR and SAM on CAVE and PU datasets of all bands.

Show All

D. Experiments With Real Data

To validate the robustness of our method on real data, we created a real paired LR-HSI and HR-MSI dataset, named ZY1E. Fig. 9 displays the fused images from the ZY1E dataset, which include diverse scenes, such as water bodies, buildings, roads, and farmland, generated by different methods. We selected a 100 × 100 region within the red box and enlarged it in the bottom left corner of the fused image. A clear comparison reveals several key observations: the traditional HySure method exhibits severe distortion and spatial aliasing; CSSNet produces images with varying degrees of color distortion; and while MSST-Net and 3DT-Net show some improvement in spatial details, they still fall short. In contrast, MIMFormer not only achieves a color accuracy closer to LR-HSI but also significantly enhances spatial details, demonstrating superior performance overall.

Fig. 9.

Fusion results of various methods applied to the ZY1E dataset.

Show All

Fig. 10 illustrates the spectral curves of different objects after fusion by various methods from the ZY1E dataset, compared to the original LR-HSI:

bright roof building;
highway;
playground;
cultivated land;
lake.

Fig. 10.

Spectral contrast of diffferent objects in the ZY1E dataset. (a) Bright roof building. (b) Highway. (c) Playground. (d) Cultivated land. (e) Lake.

Show All

In these plots, our proposed MIMFormer is represented in yellow, while the reference LR-HSI curves are in green. It is evident that the spectral curves of MIMFormer, despite minor discrepancies, closely match those of all objects in LR-HSI, achieving the smallest error. The traditional method HySure overall performs the worst, particularly showing significant deviations from the LR-HSI spectral curves in the highway and cultivated land scenarios. MSST-Net and 3DT-Net exhibit smaller spectral curve errors in the bright roof building and lake scenario. Severe deviations in the highway and lake scenarios. In addition, we employed the QNR image quality assessment metric to evaluate the fusion results of all methods. As shown in Table V, the proposed method achieved the highest score, indicating that MIMFormer outperformed its competitors in terms of image quality. Overall, our method effectively addresses feature redundancy through the spectral splitting mechanism, ensuring superior spectral quality in the fused images. Furthermore, by integrating the strengths of CNNs and transformers, we were able to extract comprehensive spatial and spectral information, resulting in significantly enhanced image quality.

TABLE V No-Reference Indexes for the Fusion Results of Each Method on the ZY1E Dataset

SECTION V.

Discussion

The key advantage of MIMFormer lies in the introduction of the spectral splitting mechanism, which enhances the control and utilization of different spectral channel characteristics, effectively modeling and extracting spatial–spectral information dispersed across various bands. However, the dim setting within the spectral splitting mechanism primarily depends on the network's C value (number of feature maps) and is based on empirical settings, presenting a limitation of our network. In the future, we plan to adopt methods, such as genetic algorithms and particle swarm optimization to automatically select or explore other approaches for optimal band selection, thereby further optimizing network performance.

In addition, our current network primarily targets well-aligned images. Moving forward, we aim to optimize the MIMFormer network and investigate advanced image alignment techniques to enhance its performance under varying alignment conditions, such as changes in posture and illumination. This will improve its applicability and generalization in practical applications. Finally, we plan to explore cross-dataset generalization strategies to address the generalization issues arising from differences in the number of bands and image features. Through these improvements, our goal is to promote the widespread application of fusion networks, enabling them to demonstrate outstanding performance across various hyperspectral datasets.

SECTION VI.

Conclusion

This article presents a multiscale hybrid network (MIMFormer) based on the Inception structure, combining CNN and transformer, effectively addressing the shortcomings of existing hybrid architectures in modeling local and global features, as well as the lack of flexibility in information processing. By designing the MST and ISSM modules, MIMFormer can capture and fuse local and global information of LR-HSIs and HR-MSIs at different scales, significantly enhancing the spatial and spectral details of the fused images. In addition, the introduction of the spectral splitting mechanism allows MIMFormer to more effectively control and utilize the characteristics of different spectral channels, modeling, and extracting spatial–spectral information dispersed across different bands. Experimental results demonstrate that MIMFormer performs excellently on both benchmark and real-world datasets, maintaining the integrity of spatial and spectral information while being efficient and accurate in processing spectral and spatial information. In the future, we will continue to optimize this network and explore more advanced image alignment and cross-dataset generalization strategies to further enhance its performance in various application scenarios.

References is not available for this document.

MIMFormer: Multiscale Inception Mixer Transformer for Hyperspectral and Multispectral Image Fusion

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

Introduction