Introduction
Remote sensing images are mainly acquired by various types of optical sensors carried by satellites or other space platforms. Owing to the different imaging strategies of the different sensors, panchromatic (PAN), multispectral (MS), and hyperspectral (HS) images can be obtained. PAN images provide excellent spatial resolution but only one band of spectral information. HS and MS images are rich in spectral information but have poor spatial resolution. It is impractical to capture images with both high spectral and spatial resolutions owing to the limitations of imaging equipment. Therefore, image fusion techniques [1], [2], [3], [4] have been developed. For example, fusing low-resolution HS (LRHS) images with high-resolution PAN images can produce high-resolution HS (HRHS) images. This technique, called HS pansharpening, makes the best use of spatial and spectral information and improves the performance of subsequent tasks such as classification [5], [6], [7], detection [8], and tracking [9].
In the past few decades, researchers have proposed a variety of classical methods to enhance HS pansharpening techniques. They can be roughly divided into four categories: component substitution (CS) [10], [11], [12], [13], multiresolution analysis (MRA) [14], [15], model-based [16], [17], [18], [19], [20], [21], and deep-learning-based [22], [23], [24], [25], [26], [27] methods. The goal of CS methods is to replace part or all of the spatial information of HS images with that of PAN images. Examples include methods exploiting intensity–hue–saturation [10], principal component analysis [11], Gram–Schmidt (GS) [12], and its enhanced version called GS adaptive (GSA) [13]. These methods are computationally fast, but the fusion results suffer from spectral distortion. MRA methods inject the spatial detail information of PAN images into interpolated HS images, and include algorithms such as smoothing-filter-based intensity modulation [14] and wavelet transforms [15].
Model-based methods treat the fusion process as an ill-posed inverse optimization problem. They usually rely on prior knowledge of remote sensing images to formulate assumptions and to construct spectral-fidelity and structural-fidelity terms. Examples are Laplace priors [16], variational priors [17], [18], [19], nonlocal priors [20], and low-rank priors [21]. These models are usually solved with iterative optimization algorithms, including the gradient descent algorithm [28], [29], the split Bregman iteration algorithm [30], the half-quadratic splitting (HQS) algorithm [31], and the alternating direction method of multipliers (ADMM) algorithm [32]. The final output of the optimization algorithm is the fused image. Model-based methods retain most of the spectral and spatial information in the fused image and are therefore widely applied to image fusion.
Deep-learning-based approaches have been applied to pansharpening, showing satisfactory performance due to their remarkable feature-representation ability. These methods usually build an end-to-end convolutional neural network (CNN) to learn the nonlinear relationship between inputs and outputs in an unsupervised [33] or supervised [22], [23], [24], [25], [26], [27] manner. Recently, transformer-based methods have attracted the attention of researchers because the multihead self-attention mechanism in the transformer differs from the structure of CNNs. Transformer-based methods have been applied to image classification [34], detection [35], segmentation [36], and super-resolution [37], achieving outstanding results.
There are still problems in the existing deep-learning-based methods for HS pansharpening. On the one hand, CNN-based approaches have been intensively studied, but they have shortcomings: the limitations of the convolutional kernel size and stride make it impossible for CNN-based methods to fully extract the global features of an image. On the other hand, there are few studies based on the transformer, and the corresponding methods usually lack physical interpretation. Directly applying the original transformer to pansharpening may degrade its performance, as redundant information in the images is easily taken into account during the calculation of self-attention. In addition, owing to the large spatial and spectral size of HS images, the transformer faces a heavy computational burden in HS pansharpening.
To address the above problems, we propose a model-inspired approach with transformers for HS pansharpening (MTNet). First, we build an optimization model for the PAN and LRHS images, including a fidelity term and a regularization term. Second, we iteratively solve the optimization model using the HQS algorithm, where the fidelity term is handled by a gradient descent step and the regularization term is replaced by a proximal operation. Finally, we design a transformer network to learn the iterative steps of the algorithm and obtain the final result. Specifically, the solution of the fidelity term is implemented as a CNN structure, and the proximal operation is learned by a gradual sampling transformer network that reduces computational costs. This design endows MTNet with the physical interpretability of HS pansharpening. In summary, the contributions of MTNet are described as follows.
1) We propose a novel HS pansharpening method that combines the model-driven deep technique and the transformer to ensure the interpretability of network design and robustness of network performance.
2) We add a linear mapping layer and a residual connection to the encoder module of the original transformer so that the gradient explosion and vanishing problems can be addressed, resulting in stable performance.
3) The MTNet adopts a gradual sampling strategy to address the nontrivial computational burden, which can not only capture global information but also reduce the computational complexity of the self-attention mechanism in the transformer.
The rest of this article is organized as follows. Section II briefly introduces the work related to HS pansharpening. Section III details the proposed optimization model, solution algorithm, and network structure. Section IV presents experiments on different data to validate the superiority of our proposed method. Finally, Section V concludes this article.
Related Works
In this section, we provide a brief overview of HS pansharpening methods regarding model-based and deep-learning-based approaches. The application of the transformer module in computer vision is also introduced.
A. Model-Based Approaches
Model-based methods can be roughly divided into nonfactorization-based and factorization-based methods. Nonfactorization-based approaches recover the target image from a holistic perspective, using relevant a priori knowledge to achieve the desired result. For example, Ballester et al. [17] achieved the fusion of low-resolution MS and PAN images by a linear degradation of high-resolution MS (HRMS) images along the spatial and spectral dimensions. Based on this, Fu et al. [38] proposed a local gradient constraint to improve the fusion performance. Wei et al. [39] proposed a fast fusion method based on the Sylvester equation (FUSE), which decreased the computational complexity. Factorization-based methods focus on splitting the target image into two parts and regenerating the fused image through recovery techniques. They are roughly classified into matrix-based and tensor-based methods. For instance, Li and Yang [40] applied sparse representation to pansharpening, assuming that the signals of a remote sensing image are sparse in a basis set. In addition, some improved versions [41], [42] were suggested to produce better fusion results and to make the model more practical. Yokoya et al. [43] completed the fusion of LRHS and HRMS images using the matrix-factorization-based technique [44], [45], [46], considering that each spectral signature of an HRHS image can be mathematically represented as a linear combination of several endmembers, i.e., an HRHS image can be expressed as an endmember matrix multiplied by an abundance matrix. Lanaras et al. [47] obtained the two matrices by employing a projected gradient method. Because an HS image is essentially a 3-D tensor, tensor-based methods are able to express hidden spatial structure and spectral information. For example, Dian et al. [48] represented an HS image as a tensor and obtained the HRHS image by a sparse core tensor and a dictionary of three modes. Since then, some improved versions have emerged [49], [50], [51]. Xu et al. [52] presented a nonlocal patch tensor sparse representation method to achieve the fusion of LRHS images with HRMS images. Besides, a new tensor factorization called tensor ring decomposition was suggested in [53]. In general, model-based methods have the ability to retain both spectral and spatial information, but some of them require complex and computationally intensive procedures to solve the underlying optimization problems.
B. Deep-Learning-Based Approaches
Recently, deep-learning-based approaches, especially CNNs, have been developed to explore the nonlinear relationships between features and have been applied in the field of image processing with significant success. Typical instances are super-resolution [54], [55], [56], target detection [57], image segmentation [58], and image classification [59], [60]. Dong et al. [61] first proposed the super-resolution CNN (SRCNN) model, a milestone for CNNs in image super-resolution. Inspired by the SRCNN, Masi et al. [23] proposed an SRCNN-based pansharpening method that takes preinterpolated LRMS images and PAN images as inputs and learns the mapping relationship between the input images and the output HRMS images. Following the success of CNNs in pansharpening, a variety of techniques have been invented to improve the performance of CNN-based methods, for instance, residual techniques [22], [26], [27], attention mechanisms [25], [62], detail injection [63], [64], etc. Although CNN-based methods have excellent feature extraction capabilities, they lack a physical interpretation of pansharpening.
Subsequently, many researchers proposed model-based deep learning methods to improve the interpretability of pure deep networks. These approaches combine a traditional optimization model with a deep-learning-based approach. Specifically, an optimization model uses a task-specific prior as the regularization term, an iterative algorithm is applied to solve it, and a deep network is designed to simulate the iterative solution process. Such methods are used in many areas, such as image deraining [65], image deblurring [66], and image super-resolution [67], [68], [69], [70], [71]. For example, Xie et al. [69] proposed a deep unfolding network to achieve the fusion of LRHS and HRMS images. Dong et al. [70] proposed a model-guided deep convolutional network that takes the observation matrix of HS images into account in the end-to-end optimization process, achieving excellent results. Zheng et al. [72] developed an edge-conditioned feature transform network for the fusion of LRHS and HRMS images, which uses an edge mapping prior to learn features of images adaptively. Shen et al. [68] presented a model for the fusion of LRHS and HRMS images solved by the ADMM algorithm, whose solution is unfolded using a deep network. Zheng et al. [71] used an unsupervised deep network to implement a matrix decomposition model. In the field of pansharpening, however, the above methods do not guarantee equally outstanding results. To tackle this issue, Xu et al. [73] introduced a model-based deep approach for pansharpening, building two optimization problems whose solutions were unfolded by two network blocks, which were stacked alternately to form a deep network. However, these methods do not fully exploit the global information of HS images, resulting in limited improvements.
C. Transformer Module
Before the emergence of transformers, most sequence models were based on CNNs or recurrent neural networks (RNNs). Two problems with such sequence models are as follows: 1) they can only be computed sequentially, which limits parallel computation and causes information loss during the computation; and 2) they cannot solve the long-term dependence problem. The structure of a transformer consists of a self-attention module and a feedforward neural network. This structure breaks the limitations of CNNs and RNNs, using the self-attention mechanism to learn the relationship between inputs and outputs from a global perspective and to enable parallel computation. Based on the above, the transformer has been successfully applied to natural language processing (NLP) [74].
The transformer is growing at a remarkable rate and has become mainstream in the NLP field, gradually progressing into the field of computer vision [34], [35], [36], [75], [76], [77], [78], [79]. For example, the Vision Transformer [34] was first applied to image classification with excellent results. Based on this, a number of methods were proposed to tackle the problem of image super-resolution. A texture transformer network for image super-resolution was proposed by Yang et al. [37]. The model efficiently retrieves and migrates high-definition texture information to maximize the use of the reference image information, and an appropriate migration of high-definition textures into the generated super-resolution results eliminates texture blurring and distortion. However, this model cannot be made very deep because of its heavy computational cost and high GPU memory occupation. To address this problem, Lu et al. [80] presented an efficient super-resolution transformer for fast and effective image super-resolution.
Proposed Method
A. Optimization Model and Algorithm
HS pansharpening is the process of fusing an LRHS image with a PAN image. For convenience, each image is written as a matrix whose rows correspond to spectral bands and whose columns correspond to pixels. Denoting the target HRHS image by X, the LRHS image Y and the PAN image P can be modeled as degraded versions of X
\begin{align*}
\mathbf{Y} = \mathbf{XB} + \mathbf{N}_{Y} \tag{1}\\
\mathbf{P} = \mathbf{RX} + \mathbf{N}_{P} \tag{2}
\end{align*}
where $\mathbf{B}$ denotes the spatial degradation (blurring and downsampling) matrix, $\mathbf{R}$ denotes the spectral response matrix, and $\mathbf{N}_{Y}$ and $\mathbf{N}_{P}$ represent noise. Based on the observation models (1) and (2), the fusion problem can be formulated as
\begin{align*}
\min \limits _{\mathbf{X}}\frac{1}{2}\left\Vert \mathbf{Y}-\mathbf{XB} \right\Vert _{F}^{2} + \frac{\lambda }{2}\left\Vert \mathbf{P}-\mathbf{RX} \right\Vert _{F}^{2} \tag{3}
\end{align*}
where $\lambda$ is a trade-off parameter. Replacing the second term in (3) with a more general regularization term $\varphi (\mathbf{X},\mathbf{P})$ yields
\begin{align*}
\min \limits _{\mathbf{X}}\frac{1}{2}\left\Vert { \mathbf{Y}}-\mathbf{XB} \right\Vert _{F}^{2} + \frac{\lambda }{2}\varphi (\mathbf{X},\mathbf{P}). \tag{4}
\end{align*}
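As a point of reference, the observation models (1) and (2) and the objective (3) can be simulated with explicit matrix operators. The following NumPy sketch is purely illustrative: the image size, the block-averaging matrix standing in for $\mathbf{B}$, the uniform spectral response standing in for $\mathbf{R}$, and the noise level are hypothetical placeholders rather than the operators used in our experiments.
\begin{verbatim}
# Minimal NumPy sketch of the observation models (1)-(2) and the objective (3).
# The sizes, the block-averaging matrix standing in for B, and the uniform
# spectral response standing in for R are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
bands, M, N, ratio = 31, 32, 32, 8        # toy HRHS size and sampling factor
m, n = M // ratio, N // ratio             # LRHS spatial size

X = rng.random((bands, M * N))            # HRHS image, one row per band

# B: spatial degradation of size (MN, mn); here simple block averaging.
B = np.zeros((M * N, m * n))
for i in range(M):
    for j in range(N):
        B[i * N + j, (i // ratio) * n + (j // ratio)] = 1.0 / ratio**2

# R: spectral response of size (1, bands); here a uniform average of all bands.
R = np.full((1, bands), 1.0 / bands)

Y = X @ B + 0.01 * rng.standard_normal((bands, m * n))   # LRHS image, Eq. (1)
P = R @ X + 0.01 * rng.standard_normal((1, M * N))       # PAN image, Eq. (2)

lam = 0.1                                 # trade-off parameter lambda
objective = 0.5 * np.linalg.norm(Y - X @ B, "fro") ** 2 \
    + 0.5 * lam * np.linalg.norm(P - R @ X, "fro") ** 2  # Eq. (3) at the true X
print(Y.shape, P.shape, objective)
\end{verbatim}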
To facilitate the use of the HQS algorithm to solve problem (4), an auxiliary variable H is introduced. Then, the optimization problem (4) can be rewritten as follows:
\begin{align*}
\min \limits _{\mathbf{X}}\frac{1}{2}\left\Vert \mathbf{Y}-\mathbf{XB} \right\Vert _{F}^{2} + \frac{\lambda }{2}\varphi (\mathbf{H},\mathbf{P}) \qquad \text{s.t.}\quad \mathbf{H} = \mathbf{X}. \tag{5}
\end{align*}
The constrained problem (5) is transformed into an unconstrained optimization problem as follows:
\begin{align*}
L_{\mu }(\mathbf{X},\mathbf{H}) = \frac{1}{2}\left\Vert \mathbf{Y}-\mathbf{XB} \right\Vert _{F}^{2} + \frac{\lambda }{2}\varphi (\mathbf{H},\mathbf{P}) + \frac{\mu }{2}\left\Vert \mathbf{X} - \mathbf{H} \right\Vert _{F}^{2} \tag{6}
\end{align*}
where $\mu$ is a penalty parameter. The HQS algorithm minimizes (6) by alternately solving the following two subproblems:
\begin{align*}
\mathbf{X}^{(k)} &=\arg \min \limits _{\mathbf{X}}L_{\mu }(\mathbf{X},\mathbf{H}^{(k-1)}) \\
&= \arg \min \limits _{\mathbf{X}}\frac{1}{2}\left\Vert \mathbf{Y}-\mathbf{XB} \right\Vert _{F}^{2}+\frac{\mu }{2}\left\Vert \mathbf{X} - \mathbf{H}^{(k-1)} \right\Vert _{F}^{2} \tag{7}\\
\mathbf{H}^{(k)} &=\arg \min \limits _{\mathbf{H}}L_{\mu }(\mathbf{X}^{(k-1)},\mathbf{H}) \\
&=\arg \min \limits _{\mathbf{H}}\frac{\mu }{2}\left\Vert \mathbf{X}^{(k-1)} - \mathbf{H} \right\Vert _{F}^{2} + \frac{\lambda }{2}\varphi (\mathbf{H},\mathbf{P}). \tag{8}
\end{align*}
Problem (7) is convex, but we solve it by the gradient descent algorithm to facilitate the network implementation. The update is described as follows:
\begin{align*}
\mathbf{X}^{(k)} &= \mathbf{X}^{(k-1)} - \alpha (\mathbf{X}^{(k-1)}\mathbf{B} \\
&\qquad- \mathbf{Y})\mathbf{B}^{T}- \alpha \mu (\mathbf{X}^{(k-1)} - \mathbf{H}^{(k-1)})\\
&=\mathbf{X}^{(k-1)} - t_{1}\mathbf{G}^{(k-1)} - t_{2}(\mathbf{X}^{(k-1)} - \mathbf{H}^{(k-1)})\\
&=(1-t_{2})\mathbf{X}^{(k-1)} - t_{1}\mathbf{G}^{(k-1)} + t_{2}\mathbf{H}^{(k-1)} \tag{9}
\end{align*}
where $\mathbf{G}^{(k-1)} = (\mathbf{X}^{(k-1)}\mathbf{B}-\mathbf{Y})\mathbf{B}^{T}$ denotes the gradient of the data-fidelity term, $t_{1} = \alpha$, and $t_{2} = \alpha \mu$, with $\alpha$ being the step size. Problem (8) can be expressed as a proximal operation
\begin{align*}
\mathbf{H}^{(k)} = \text{prox}_{\varphi }\left(\mathbf{X}^{(k-1)},\frac{\lambda }{2\mu }\right). \tag{10}
\end{align*}
The entire process of solving the above problem (4) is detailed in Algorithm 1.
B. Network Structure
We propose MTNet to implement the fusion of HS and PAN images by unfolding the iterative process of Algorithm 1 as each layer of MTNet (see Fig. 1). MTNet consists of several stacked TransBlock modules, each corresponding to one iteration of Algorithm 1 and containing an Encoder–Decoder module and a GSNet module.
Interpretation of the components in the MTNet. (a) TransBlock. (b) Encoder–Decoder module in TransBlock. (c) GSNet. © indicates concatenation.
Algorithm 1: HQS Algorithm for Problem (4).
Input: LRHS image Y, PAN image P, B, R
Initialize $\mathbf{X}^{(0)}$ and $\mathbf{H}^{(0)}$
for $k = 1, 2, \ldots, K$ do
Update $\mathbf{X}^{(k)}$ by (9)
Update $\mathbf{H}^{(k)}$ by (10)
end for
Output: X
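For clarity, a minimal NumPy sketch of Algorithm 1 is given below. Since $\varphi$ is not available in closed form, the proximal step is replaced here by a hand-crafted placeholder that simply blends PAN-derived structure into the current estimate; in MTNet this step is learned by the GSNet module, and the initialization, step sizes, and blending weight shown are illustrative assumptions.
\begin{verbatim}
# Illustrative NumPy sketch of Algorithm 1 (HQS for problem (4)). Y, P, B, R are
# the quantities of the observation-model sketch above. The proximal step is a
# hand-crafted placeholder; in MTNet it is learned by the GSNet module.
import numpy as np

def hqs_pansharpen(Y, P, B, R, K=10, t1=0.1, t2=0.1, eta=0.5):
    """Run K HQS iterations (9)-(10) and return the estimated HRHS image X."""
    X = Y @ np.linalg.pinv(B)             # coarse back-projection initialization
    H = X.copy()
    for _ in range(K):
        # X-update, Eq. (9): gradient step on the data-fidelity term
        G = (X @ B - Y) @ B.T
        X = (1.0 - t2) * X - t1 * G + t2 * H
        # H-update, Eq. (10): placeholder proximal operation that injects
        # PAN structure through a pseudo-inverse of the spectral response
        H = (1.0 - eta) * X + eta * (np.linalg.pinv(R) @ P)
    return X

# Usage with the toy Y, P, B, R defined earlier:
# X_hat = hqs_pansharpen(Y, P, B, R)
\end{verbatim}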
Algorithm 2: Entire Procedure for MTNet.
Input: LRHS image Y, PAN image P
Initialize $\mathbf{X}^{(0)}$ and $\mathbf{H}^{(0)}$
for $k = 1, 2, \ldots, K$ do
Update $\mathbf{X}^{(k)}$ via the Encoder–Decoder module
Update $\mathbf{H}^{(k)}$ via the GSNet module
end for
Output: fused image $\mathbf{X}^{(K)}$
1) Encoder–Decoder: In problem (9), we use a gradient descent algorithm to achieve an iterative update of X. This approach has similarities with Gaussian and Laplacian pyramid structures, both of which require first simulating Gaussian blurring and downsampling and subsequently performing an inverse operation. The difference is that we cannot directly use a fixed spatial downsampling as B, since that would not guarantee the blindness of our proposed method. We may implement the local subtraction in problem (9) by using a neural network in a similar way to residual learning. Specifically, for
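Because the exact layer configuration of the Encoder–Decoder module is not spelled out above, the following PyTorch sketch should be read only as one plausible instantiation of the X-update (9): a strided convolution stands in for $\mathbf{B}$, a transposed convolution stands in for $\mathbf{B}^{T}$, and $t_{1}$ and $t_{2}$ are treated as learnable scalars. The kernel sizes and channel widths are illustrative assumptions.
\begin{verbatim}
# Hedged PyTorch sketch of one way to unfold the X-update (9). The strided
# convolution plays the role of B, the transposed convolution the role of B^T,
# and t1, t2 are learnable step sizes; all layer settings are illustrative.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, bands: int, ratio: int = 8):
        super().__init__()
        # "encoder": maps X to the LRHS grid, simulating X -> XB
        self.down = nn.Conv2d(bands, bands, kernel_size=ratio, stride=ratio)
        # "decoder": maps the residual back to the HRHS grid, simulating B^T
        self.up = nn.ConvTranspose2d(bands, bands, kernel_size=ratio, stride=ratio)
        self.t1 = nn.Parameter(torch.tensor(0.1))
        self.t2 = nn.Parameter(torch.tensor(0.1))

    def forward(self, X, H, Y):
        # Eq. (9): X <- (1 - t2) X - t1 (XB - Y) B^T + t2 H
        G = self.up(self.down(X) - Y)     # gradient of the data-fidelity term
        return (1.0 - self.t2) * X - self.t1 * G + self.t2 * H

# Toy usage: 31-band cubes, a 64x64 HRHS grid, and an 8x8 LRHS grid.
X = torch.rand(1, 31, 64, 64)
H = torch.rand(1, 31, 64, 64)
Y = torch.rand(1, 31, 8, 8)
print(EncoderDecoder(31)(X, H, Y).shape)  # torch.Size([1, 31, 64, 64])
\end{verbatim}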
2) GSNet Module: GSNet is a gradual sampling network structure, which plays an important role in the overall network and is responsible for learning the proximal operation in problem (10). This module achieves the update of H in Algorithm 1. It mainly consists of two parts, the Down-transformer [see Fig. 3(a)] and the Up-transformer [see Fig. 3(b)], both of which are formed by stacking transformers. Since the complexity of the transformer is quadratic in the spatial size, it is unwise to use the image directly as input to the transformer in HS pansharpening. It is also infeasible to cut the image into small patches as input to the transformer, as in [34], which would result in the loss of spatial information and poor quality of the fused image. Motivated by progressive sampling in image super-resolution [84], GSNet uses transformers to build a progressive upsampling and downsampling network structure. The downsampling, consisting of three Down-transformers, is divided into three stages. The Down-transformer retrieves the shallow global information of each stage, and a convolution is added afterward to capture the local and edge information missed during the Down-transformer extraction. The upsampling, consisting of two Up-transformers, is divided into two stages. The Up-transformer extracts the deeper global information. Moreover, skip connections and concatenation are applied in GSNet, which allows the dependencies between stages to be learned and enhances the ability of GSNet to acquire features. The GSNet module alleviates, to a certain extent, the high computational cost of the transformer while ensuring the quality of the images generated by the whole network.
Illustration of the GSNet module. (a) Down-transformer. (b) Up-transformer. (c) Transformer module used in the Down-transformer and Up-transformer.
As can be seen in Fig. 3, the Down-transformer and the Up-transformer share the same transformer module [see Fig. 3(c)]. In contrast to the original transformer [74], only the encoder part is used, and we add a linear layer and a residual connection to it. The linear layer maps the features to the desired output dimension. The residual connection improves the performance of the transformer by alleviating the information loss caused by deeper layers. In addition, “
Multihead attention plays an important role in the transformer. It is a self-attention mechanism with multiple heads that enables the network to capture various relationships from a global perspective. The typical process of self-attention is as follows:
\begin{align*}
\text{Q} &= \text{X}W_{q},\quad \text{K} = \text{X}W_{k},\quad \text{V} = \text{X}W_{v}\\
\text{Attention} &= \text{softmax}\left(\frac{\text{X}W_{q}W_{k}^{T}\text{X}^{T}}{\sqrt{d_{k}}}\right)\text{X}W_{v}\\
&= \text{softmax}\left(\frac{\text{Q}\text{K}^{T}}{\sqrt{d_{k}}}\right)\text{V} \tag{11}
\end{align*}
where $W_{q}$, $W_{k}$, and $W_{v}$ are learnable projection matrices and $d_{k}$ is the dimension of the keys. Multihead attention performs $h$ attention operations in parallel and concatenates their outputs
\begin{equation*}
\begin{split} \text{MultiHead}(\text{Q},\text{K},\text{V}) &= \text{Concat}(head_{1},head_{2},{\ldots },head_{h})W^{O}\\
head_{i} &= \text{Attention}(\text{Q}W_{i}^{Q},\text{K}W_{i}^{K},\text{V}W_{i}^{V}) \end{split} \tag{12}
\end{equation*}
where $W_{i}^{Q}$, $W_{i}^{K}$, and $W_{i}^{V}$ are the projection matrices of the $i$th head and $W^{O}$ is the output projection matrix.
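For reference, (11) and (12) correspond to the computation sketched below in NumPy; the token count, model dimension, and number of heads are arbitrary illustrative values.
\begin{verbatim}
# Illustrative NumPy implementation of self-attention (11) and multihead
# attention (12); token count, model dimension, and head count are arbitrary.
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (11)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V

def multihead(X, Wq, Wk, Wv, Wo):
    """Multihead attention, Eq. (12): h parallel heads, concatenated, then W^O."""
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n_tokens, d_model, h = 16, 64, 4
d_head = d_model // h
X = rng.standard_normal((n_tokens, d_model))
Wq = [rng.standard_normal((d_model, d_head)) for _ in range(h)]
Wk = [rng.standard_normal((d_model, d_head)) for _ in range(h)]
Wv = [rng.standard_normal((d_model, d_head)) for _ in range(h)]
Wo = rng.standard_normal((d_model, d_model))
print(multihead(X, Wq, Wk, Wv, Wo).shape)            # (16, 64)
\end{verbatim}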
Illustration of the attention mechanism in the transformer module. (a) Multihead self-attention. (b) Self-attention.
Notably, both the Down-transformer and the Up-transformer have a flatten operation before the data are fed into the transformer network (see Fig. 3). The flatten operation unfolds the data cube into a matrix so that each vector of the matrix represents the spectral information of the data cube. Then, the correlation of spatial information can be learned through the transformer. In the Down-transformer, we use average pooling as a downsampling operation to reduce the number of parameters and computations. In the Up-transformer, we use deconvolution as the upsampling operation.
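The data flow described above can be sketched as follows in PyTorch, with the transformer module containing the extra linear mapping layer and residual connection discussed earlier. The feature dimension, number of heads, and layer sizes are illustrative assumptions (in practice, a convolution would map the HS bands to the chosen feature dimension), not the configuration used in MTNet.
\begin{verbatim}
# Hedged PyTorch sketch of the Down-transformer / Up-transformer data flow:
# flatten the feature cube into tokens, apply an encoder-only transformer with
# the extra linear mapping and residual connection, then average-pool (Down) or
# deconvolve (Up). All dimensions are illustrative; in practice a convolution
# would first map the HS bands to the chosen feature dimension.
import torch
import torch.nn as nn

class TransModule(nn.Module):
    """Encoder-only transformer block with an added linear mapping and residual."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(),
                                 nn.Linear(2 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)
        self.mapping = nn.Linear(dim, dim)            # extra linear mapping layer

    def forward(self, tokens):                        # (batch, n_tokens, dim)
        x = self.norm1(tokens + self.attn(tokens, tokens, tokens)[0])
        x = self.norm2(x + self.ffn(x))
        return tokens + self.mapping(x)               # extra residual connection

def flatten_cube(x):                                  # (b, c, h, w) -> (b, h*w, c)
    b, c, h, w = x.shape
    return x.flatten(2).transpose(1, 2), (h, w)

def unflatten(tokens, hw):                            # (b, h*w, c) -> (b, c, h, w)
    return tokens.transpose(1, 2).reshape(tokens.shape[0], -1, *hw)

class DownTransformer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.trans = TransModule(dim)
        self.pool = nn.AvgPool2d(2)                   # downsampling by avg pooling
        self.conv = nn.Conv2d(dim, dim, 3, padding=1) # local / edge information

    def forward(self, x):
        tokens, hw = flatten_cube(x)
        return self.conv(self.pool(unflatten(self.trans(tokens), hw)))

class UpTransformer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.trans = TransModule(dim)
        self.deconv = nn.ConvTranspose2d(dim, dim, 2, stride=2)  # deconvolution

    def forward(self, x):
        tokens, hw = flatten_cube(x)
        return self.deconv(unflatten(self.trans(tokens), hw))

x = torch.rand(1, 32, 16, 16)                         # toy feature cube
y = UpTransformer(32)(DownTransformer(32)(x))
print(y.shape)                                        # torch.Size([1, 32, 16, 16])
\end{verbatim}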
In order to measure the difference between the network output $\mathbf{X}$ and the reference image $\mathbf{X}^{(\text{target})}$, the $\ell_{1}$ norm is adopted as the loss function
\begin{align*}
\text{Loss} = \left\Vert \mathbf{X} - \mathbf{X}^{(\text{target})}\right\Vert _{1} \tag{13}
\end{align*}
Experimental Results
In this section, we conducted experiments on three synthetic datasets and one real dataset to validate the effectiveness of MTNet.
A. Datasets
1) CAVE: The CAVE dataset consists of 32 indoor HS images of
2) Pavia Center (PC): The PC dataset is an image of the city of Pavia acquired by the Reflective Optics System Imaging Spectrometer (ROSIS). The ROSIS sensor has a spectral range of 0.43–
3) Botswana: The Botswana dataset is a scene of the Okavango Delta, Botswana, collected by the Hyperion sensor on the NASA EO-1 satellite, with a spatial resolution of 30 m. This image initially had
4) University of Houston (UH): The UH dataset is a real dataset released by the 2018 Data Fusion Contest of the IEEE Geoscience and Remote Sensing Society, which contains an HRMS image (RGB image) of size
According to Wald's protocol, we generated the LRHS and PAN images by using the given HS images as reference. The LRHS images were acquired by Gaussian blurring and downsampling the HS reference images, where a Gaussian filter with a mean of 0 and a standard deviation of 2 was used, and the sampling factor was 8.
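A minimal sketch of this simulation step is given below; the Gaussian standard deviation of 2 and the sampling factor of 8 follow the setting above, whereas synthesizing the PAN image by simple band averaging is only an assumption made for illustration.
\begin{verbatim}
# Sketch of the Wald-protocol simulation: Gaussian blur (sigma = 2) followed by
# downsampling with factor 8 gives the LRHS image. Synthesizing the PAN image
# by band averaging is an assumption made here purely for illustration.
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_inputs(ref_hs, ratio=8, sigma=2.0):
    """ref_hs: reference HRHS cube of shape (bands, H, W)."""
    blurred = np.stack([gaussian_filter(band, sigma=sigma) for band in ref_hs])
    lrhs = blurred[:, ::ratio, ::ratio]   # spatial downsampling
    pan = ref_hs.mean(axis=0)             # assumed spectral response: band mean
    return lrhs, pan

ref = np.random.rand(31, 128, 128)
lrhs, pan = simulate_inputs(ref)
print(lrhs.shape, pan.shape)              # (31, 16, 16) (128, 128)
\end{verbatim}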
B. Quality Measures
To assess the quality of the fusion results, eight popular metrics are used in this article. The first five are used for the synthetic datasets and the last three for the real dataset. For convenience, X represents the reference image, T represents the target image, P represents the PAN image, and Y represents the LRHS image, where M and N denote the spatial dimensions and B denotes the number of bands.
1) Spectral Angle Mapper (SAM) [85]
The SAM indicates the spectral quality of the fused image. The smaller the angle (in degrees), the better the spectral quality. The optimal value is 0.
\begin{align*}
\text{SAM}(\mathbf{X},\mathbf{T}) \!=\! \frac{1}{MN}\sum _{\text{i}=1}^{M}\sum _{\text{j}=1}^{N}\arccos \left(\frac{\mathbf{X}(\text{i},\text{j})\cdot \mathbf{T}(\text{i},\text{j})}{\left\Vert \mathbf{X}(\text{i},\text{j})\right\Vert _{2}\left\Vert \mathbf{T}(\text{i},\text{j})\right\Vert _{2}}\right) \tag{14}
\end{align*}
2) Peak Signal-to-Noise Ratio (PSNR) [43]
The PSNR measures the average spatial similarity between the target image and the reference image over all the bands. The higher the value, the lower the spatial distortion. The optimal value is $+\infty$.
\begin{align*}
\text{PSNR}(\mathbf{X},\mathbf{T}) = \frac{1}{B}\sum _{\text{t}=1}^{B}\text{PSNR}(\mathbf{X}(\text{t}),\mathbf{T}(\text{t})) \tag{15}
\end{align*}
3) Structural Similarity Index Measure (SSIM) [3]
SSIM calculates the average structural similarity in the spatial domain between the generated image and the reference image. The higher the value of SSIM, the more similar the spatial structure. The preferred value is 1.
\begin{align*}
\text{SSIM}(\mathbf{X},\mathbf{T}) = \frac{1}{B}\sum _{\text{t}=1}^{B}\text{SSIM}(\mathbf{X}(\text{t}),\mathbf{T}(\text{t})) \tag{16}
\end{align*}
4) Root Mean Squared Error (RMSE) [3]
RMSE is used to represent the difference between the target image and the reference image. The smaller the difference, the better the result. The ideal value is 0.
\begin{align*}
&\text{RMSE}(\mathbf{X},\mathbf{T}) \\
&= \sqrt{\frac{1}{BMN}\sum\nolimits _{\text{t}=1}^{B}\sum\nolimits _{\text{i}=1}^{M}\sum\nolimits _{\text{j}=1}^{N}(\mathbf{X}(\text{i},\text{j},\text{t})-\mathbf{T}(\text{i},\text{j},\text{t}))^{2}}. \tag{17}
\end{align*}
5) Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [3]
ERGAS is a global metric that reflects the overall quality of the fusion. The optimal value of ERGAS is 0, and the lower the value, the better the overall quality.
\begin{align*}
\text{ERGAS}(\mathbf{X},\mathbf{T}) = \frac{100}{r}\sqrt{\frac{1}{B}\sum \nolimits_{\text{t}=1}^{B}\frac{\text{MSE}(\mathbf{T}(\text{t}),\mathbf{X}(\text{t}))}{\text{MEAN}^{2}(\mathbf{T}(\text{t}))}} \tag{18}
\end{align*}
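The reference metrics translate directly into code, as sketched below for SAM (14), PSNR (15), RMSE (17), and ERGAS (18); SSIM is omitted because it requires a windowed computation. The images are assumed to be stored as (bands, H, W) arrays on a common scale, and the dynamic range and resolution ratio are illustrative defaults.
\begin{verbatim}
# Straightforward NumPy versions of the reference metrics (14), (15), (17), (18).
# Images are (bands, H, W) arrays on a common scale; max_val and the resolution
# ratio r are illustrative defaults, not necessarily the values used in the paper.
import numpy as np

def sam(x, t, eps=1e-12):                 # Eq. (14), reported in degrees
    dot = (x * t).sum(axis=0)
    denom = np.linalg.norm(x, axis=0) * np.linalg.norm(t, axis=0) + eps
    return np.degrees(np.arccos(np.clip(dot / denom, -1.0, 1.0))).mean()

def psnr(x, t, max_val=1.0):              # Eq. (15), averaged over bands
    mse = ((x - t) ** 2).mean(axis=(1, 2))
    return (10.0 * np.log10(max_val ** 2 / mse)).mean()

def rmse(x, t):                           # Eq. (17)
    return np.sqrt(((x - t) ** 2).mean())

def ergas(x, t, r=8):                     # Eq. (18)
    mse = ((x - t) ** 2).mean(axis=(1, 2))
    return 100.0 / r * np.sqrt((mse / t.mean(axis=(1, 2)) ** 2).mean())

x = np.random.rand(31, 64, 64)
t = np.random.rand(31, 64, 64)
print(sam(x, t), psnr(x, t), rmse(x, t), ergas(x, t))
\end{verbatim}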
6) $\text{D}_{s}$ [2]
The $\text{D}_{s}$ index measures the spatial distortion of the fused image, and its ideal value is 0. It is defined as
\begin{align*}
\text{D}_{s} = \sqrt[q]{\frac{1}{B}\sum \nolimits_{\text{t}=1}^{B}\left|\mathbf{Q}(\mathbf{T}(\text{t}), \mathbf{P})-\mathbf{Q}(\mathbf{Y}(\text{t}),\widetilde{\mathbf{P}})\right|^{q}} \tag{19}
\end{align*}
7) $\text{D}_{\lambda }$ [2]
The $\text{D}_{\lambda }$ index measures the spectral distortion of the fused image, and its ideal value is 0. It is defined as
\begin{align*}
\text{D}_{\lambda } = \sqrt[p]{\frac{1}{B(B-1)}\sum \nolimits_{\text{t}_{1}=1}^{B}\sum \nolimits_{\text{t}_{2}=1,\text{t}_{2}\ne \text{t}_{1}}^{B}\left| \mathbf{Q}_{1}^{\text{t}_{1},\text{t}_{2}} - \mathbf{Q}_{2}^{\text{t}_{1},\text{t}_{2}}\right|^{p}} \tag{20}
\end{align*}
8) Quality With No Reference (QNR) [2]
QNR is the product of the one's complements of the spatial and spectral distortion indices, each raised to a real-valued exponent that weights the relevance of spectral and spatial distortions in the overall quality. The two exponents jointly determine the nonlinearity of the response in the interval [0, 1], similar to a gamma correction, to achieve a better discrimination of the compared fusion results. The ideal value of QNR is 1. It is calculated as
\begin{align*}
\text{QNR} = (1-\text{D}_{\lambda })^{\alpha }(1-\text{D}_{s})^{\beta } \tag{21}
\end{align*}
C. Compared Methods and Implementation Details
Nine methods were compared with our proposed method, including the classical methods GSA [13], HySure [46], FUSE [39], and CNMF [43], and the deep-learning-based methods PanNet [22], DARN [25], Fusion-Net [64], DBDENet [62], and GPPNN [73]. The settings of the compared methods followed their original papers and the corresponding source codes.
In preparing the training samples, overlapping patches of
In problem (9), there are two parameters, $t_{1}$ and $t_{2}$.
D. Parameter Selection and Ablation Studies
In this section, we evaluate the effect of the number of iterations and conduct ablation studies.
1) Selection of the Number of Iterations:
2) Ablation Studies: As can be observed from Fig. 1, our proposed network consists of several TransBlock modules. To measure the importance of TransBlock, we replace it with a ResNet module, which has been widely used in recent years, and conduct ablation experiments to evaluate the effect of both modules. “ResNet” means that the TransBlock module is replaced with a ResNet module in MTNet. Figs. 5 and 6 show the performance gap between the two modules for different numbers of iterations. We can notice that, whichever module is applied, the PSNR and SAM results improve as the number of iterations increases. This demonstrates the effectiveness of the proposed framework: even when the TransBlock module is replaced, the entire network can still extract features at each stage and inject the feature information into the fused image. Furthermore, we can see that the results of “ResNet” on PSNR and SAM improve with the number of iterations, but the improvement is very small compared to “MTNet,” and there is even a decrease. Overall, the gap between “ResNet” and “MTNet” is large, demonstrating the superiority of TransBlock in feature extraction and its importance in the structure of MTNet.
MTNet employs a gradual sampling strategy to reduce computational complexity, and an ablation experiment is conducted to demonstrate the effectiveness of this strategy. For convenience, “MTNet-G” represents the network with the gradual sampling strategy, and “MTNet-N” represents the network without it. Fig. 7 and Table II present the results on the CAVE dataset. Specifically, PARAMS and FLOPs are the results for an LRHS image of size
E. Results of Experiments
In order to comprehensively evaluate the performance of our proposed method, the nine methods mentioned in Section IV-C are used for comparison. The results on each dataset are described quantitatively and visually in the following.
1) CAVE
Table III shows the average results of 12 test images on the five quality evaluation metrics, with the best results marked in bold. Clearly, our proposed method is superior to the compared methods on all the metrics. This demonstrates that MTNet can accomplish HS pansharpening with the least distortion. In addition, the proposed MTNet outperforms the five CNN-based HS pansharpening approaches, which further validates the superiority of the transformer modules.
The comparison between the reconstructed images and the corresponding error images on the “
Visual results of the CAVE dataset at band 20. Top row: reconstructed images (with a meaningful region marked and zoomed in four times for easy observation). Bottom row: error images. (a) GSA. (b) HySure. (c) FUSE. (d) CNMF. (e) PanNet. (f) DARN. (g) Fusion-Net. (h) DBDENet. (i) GPPNN. (j) MTNet. (k) Ground truth.
Comparison of spectral curves. (a) CAVE dataset. (b) PC dataset. (c) Botswana dataset.
Visual results of PC dataset at band 44. Top row: reconstructed images (with a meaningful region marked and zoomed in two times for easy observation). Bottom row: error images. (a) GSA. (b) HySure. (c) FUSE. (d) CNMF. (e) PanNet. (f) DARN. (g) Fusion-Net. (h) DBDENet. (i) GPPNN. (j) MTNet. (k) Ground truth.
PSNR as a function of spectral band. (a) CAVE dataset. (b) PC dataset. (c) Botswana dataset.
2) PC
Table IV lists the average results of the eight test images of
Fig. 10 shows the comparison of the reconstructed images and the corresponding error images for all the methods. By looking at the reconstructed images, we can conclude that those of GSA, HySure, FUSE, and DARN have varying degrees of brightness errors. This shows that these methods perform poorly in terms of spectral fidelity. The error images of PanNet, GPPNN, and MTNet are relatively close. However, the reconstructed images of the PanNet and GPPNN methods exhibit contour blurring, i.e., a loss of edge information. Fig. 9(b) shows the spectral curves for all the methods on the PC dataset. It can be seen that the MTNet curve overlaps the most with the reference. Fig. 11(b) shows the PSNR versus spectral band for all the methods on the PC dataset. Owing to the complex band information of the PC dataset, a large amount of topographic information is also reflected across the bands. In band 1 and bands 80–85, most methods have difficulty fusing well. However, the PanNet, DARN, GPPNN, and MTNet methods still maintain excellent results. This demonstrates the robustness of these methods and their ability to recover spectra. In general, our method achieves higher PSNR values than the other methods, showing that it retains the spectral information well on the PC dataset.
3) Botswana
For the Botswana dataset, Table V lists the average results of all the competing HS pansharpening methods. We can see that all the methods show relatively excellent results, but our method performs best. Fig. 12 shows the comparison between the reconstructed images and error images for all the methods. It can be seen that the reconstructed images from CNMF and PanNet are blurry. This indicates that the CNMF and PanNet methods lack the ability to preserve spectral information when applied to the Botswana dataset. Detailed information is lost in the reconstructed images of GSA and DARN, while those of HySure, FUSE, GPPNN, and MTNet are closer to the ground truth [see Fig. 12(k)]. The error images of GSA, HySure, FUSE, CNMF, and PanNet differ significantly from the ground truth, while those of DARN, Fusion-Net, DBDENet, GPPNN, and MTNet are closer, especially MTNet. Fig. 9(c) shows the spectral curves for all the methods on the Botswana dataset. DARN, Fusion-Net, DBDENet, GPPNN, and MTNet are closer to the reference, while the remaining methods differ slightly from it. The PSNR curves for Botswana are given in Fig. 11(c). All the methods perform well on all the bands, but our method achieves the best PSNR results. In summary, the experimental results and visual analysis show that the proposed MTNet method can achieve excellent pansharpening performance on the Botswana dataset.
Visual results of the Botswana dataset at band 52. Top row: reconstructed images (with a meaningful region marked and zoomed in two times for easy observation). Bottom row: error images. (a) GSA. (b) HySure. (c) FUSE. (d) CNMF. (e) PanNet. (f) DARN. (g) Fusion-Net. (h) DBDENet. (i) GPPNN. (j) MTNet. (k) Ground truth.
4) UH
For the UH dataset, Table VI lists the average results of all the competing HS pansharpening methods. It can be seen that all the methods show relatively good results. FUSE is the best on
Visual results of the UH dataset. (a) Real LRHS image. (b) Real PAN image. (c) GSA. (d) HySure. (e) FUSE. (f) CNMF. (g) PanNet. (h) DARN. (i) Fusion-Net. (j) DBDENet. (k) GPPNN. (l) MTNet.
F. Computational Efficiency
To clarify the computational efficiency of our proposed method, we discuss three evaluation metrics of each competing method on the CAVE dataset. For the training phase, all the competing methods are implemented with PyTorch and run on a single GeForce RTX 3060 12-GB graphics card. Specifically, Table VII shows the results of the experiment on the CAVE dataset. “PARAMS” and “FLOPs” are the results for an LRHS image of size
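For reference, the parameter count reported in the “PARAMS” column can be obtained directly from a PyTorch model, as sketched below; FLOPs are typically measured with an external profiler and are therefore omitted from this sketch.
\begin{verbatim}
# Minimal way to obtain the "PARAMS" figure for any PyTorch model; FLOPs are
# usually measured with an external profiler and are omitted from this sketch.
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_params(nn.Conv2d(31, 64, 3)))   # 31*64*9 + 64 = 17920
\end{verbatim}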
Conclusion
This article proposes a model-inspired deep approach with transformers for HS pansharpening. The approach combines a traditional model-based method with a transformer-based network. The model consists of a data-fidelity term and a regularization term; it is solved iteratively by the HQS algorithm and decomposed into two suboptimization problems, namely, the data-fidelity problem and the regularization problem. The data-fidelity problem is solved by a gradient descent algorithm, and the regularization problem is expressed as a proximal operation. Using the unfolding technique, the two optimization problems are implemented by two network modules. The Encoder–Decoder module performs the data-fidelity update by means of convolutional operations, and the GSNet module learns the proximal operation using the transformer network. Specifically, the transformer module extends the encoder part of the original transformer by adding a linear mapping layer and a residual connection, which makes it more effective in feature extraction and information supplementation. GSNet builds a gradual sampling structure using transformers. This structure enhances the fusion performance and alleviates the heavy computational cost of self-attention in transformers. The Encoder–Decoder and GSNet modules are assembled into a TransBlock module, and the stacking of several TransBlocks forms the MTNet corresponding to the iterative steps of the algorithm. The final fusion result is the output of the last TransBlock of MTNet. Experimental results on three simulated datasets demonstrated that our method outperforms state-of-the-art methods both quantitatively and qualitatively.
In future work, we will consider further enhancements on the transformer to reduce GPU consumption even more. In addition, the existing network could be enhanced more effectively to further improve performance.
ACKNOWLEDGMENT
The authors would like to thank the authors of [13], [22], [25], [39], [43], [46], [62], [64], and [73] for providing their codes, Prof. P. Gamba from the University of Pavia for providing the PC dataset, and the anonymous reviewers for their constructive comments on this article. The authors would also like to thank the National Centre for Airborne Laser Mapping and the Hyperspectral Image Analysis Lab.