Introduction
A multispectral image (MSI) generally refers to a satellite image with rich spectral information. It has been widely used in many applications, including land cover classification, change detection, and object recognition [1], [2]. However, due to the limitations of imaging sensors, its low spatial resolution restricts its application range and accuracy. Therefore, the coupled panchromatic (PAN) image, which has a higher spatial resolution, is usually fused with the low-resolution MSI (LRMSI) to obtain a high-resolution MSI (HRMSI), a process called pansharpening [3], [4], [5], [6], [7].
Many approaches have been developed to fuse the LRMSI and PAN images, and they can be grouped into four categories: component substitution (CS)-based, multiresolution analysis (MRA)-based, variational optimization (VO)-based, and deep learning (DL)-based. CS-based methods replace the intensity component simulated from the LRMSI with the histogram-matched PAN image to improve the spatial quality of the LRMSI; representative examples include intensity-hue-saturation (IHS) [8] and Gram–Schmidt adaptive (GSA) [9], [10]. MRA-based methods decompose the PAN image into different spatial scales and inject the corresponding high-frequency details into the LRMSI, including smoothing filter-based intensity modulation (SFIM) [11], the wavelet transform (Wavelet) [12], the modulation transfer function with generalized Laplacian pyramid (MTF_GLP) [23], and MTF_GLP with high-pass modulation (MTF_GLP_HPM) [23]. VO-based methods design a specifically constrained objective function and alternately optimize the variables to achieve the best fusion performance; coupled nonnegative matrix factorization (CNMF) [23] is the most representative one. However, CS-based and MRA-based methods tend to introduce spectral and spatial distortion into the fused results, while VO-based methods suffer from high computational cost.
In recent years, DL-based methods have been widely explored due to their powerful feature learning ability [16], [17], [18]. In particular, convolutional neural networks (CNNs), which can capture the complex local spatial and spectral features of an image, have been widely used in the image fusion field. For example, Masi et al. [16] first proposed a CNN-based pansharpening network, PNN, which consists of three convolutional layers. Then, Yang et al. [17] developed a deeper network, Pannet, which uses high-pass filtering to fuse the LRMSI and PAN image. Kwan et al. [19] evaluated 11 pansharpening algorithms for enhancing the images of the Mastcam left imager with its right imager. Li et al. [20] designed a detail injection network that injects the PAN details into the LR image by rectifying the incorrect data distribution, achieving strong fusion performance.
However, the training stage of the above supervised methods is performed on a reduced (low) resolution dataset, while testing is performed at the original high resolution scale. It is laborious and difficult to generate a low resolution dataset that preserves the scale-invariant properties of image features, so this training strategy may degrade the fused image quality at the original HR scale. Therefore, unsupervised fusion networks that perform training directly at the high resolution scale have been gradually developed in recent years. For example, Ma et al. [21] designed an unsupervised fusion network, PanGAN, with spectral and spatial discriminators to obtain the fused HRMSI. Zhou et al. [22] proposed an unsupervised generative adversarial framework that operates at the original high resolution scale; it extracts modality-specific features from the LRMSI and PAN images with a designed generator and then fuses them to obtain the HRMSI. In addition, based on the self-attention mechanism, Qu et al. [23] proposed an unsupervised deep-learning pansharpening method that achieves strong performance.
However, these unsupervised fusion networks rely on commonly used loss functions for training. For remote sensing images, space-invariant degradation kernels (e.g., bicubic) are typically used to constrain the spectral fidelity of the fused HRMSI [21], and the spatial constraint enforces consistency between the PAN image and the bandwise average of the fused HRMSI. These two constraints in the spectral and spatial domains may not be entirely correct and lack robustness to different image features. For example, as shown in Fig. 2, the spectral response functions (SRFs) of the MSI and PAN bands do not follow a strictly linear relationship, which illustrates the inadequacy of the bandwise-average consistency assumption. Therefore, a fusion network that can correctly learn and explicitly exploit the spatial and spectral correlations is essential for achieving excellent fusion performance.
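To make these constraints concrete, the following PyTorch sketch shows the two commonly used unsupervised losses discussed above (a space-invariant degradation constraint in the spectral domain and a bandwise-average constraint in the spatial domain). The function name and the choice of bicubic resampling and L1 distance are illustrative assumptions, not the exact formulations of the cited methods.

```python
import torch
import torch.nn.functional as F

def common_unsupervised_losses(fused_hrmsi, lrmsi, pan, scale=4):
    """Illustrative versions of the two widely used unsupervised constraints.

    fused_hrmsi: (B, C, H, W) network output
    lrmsi:       (B, C, H/scale, W/scale) input multispectral image
    pan:         (B, 1, H, W) input panchromatic image
    """
    # Spectral constraint: degrade the fused image with a space-invariant
    # kernel (here bicubic resampling) and compare it with the input LRMSI.
    degraded = F.interpolate(fused_hrmsi, scale_factor=1.0 / scale,
                             mode="bicubic", align_corners=False)
    spectral_loss = F.l1_loss(degraded, lrmsi)

    # Spatial constraint: assume the PAN image equals the bandwise average of
    # the HRMSI -- the linear relationship questioned in Fig. 2.
    intensity = fused_hrmsi.mean(dim=1, keepdim=True)
    spatial_loss = F.l1_loss(intensity, pan)
    return spectral_loss, spatial_loss
```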
Overall framework of the proposed fusion method. ‘concat’ means the bandwise concatenation operation on the image features generated from two cross-scale attention blocks.
Different bands' spectral response functions (SRF) of the WorldView2 satellite imaging sensor.
Furthermore, most fusion networks are built on the widely used CNN structure, whose limited local receptive field hinders the exploitation of long-range dependencies in image features. Recently, the transformer structure [24] has been introduced into computer vision tasks such as super-resolution [25], [26]; its main idea is to use nonlocal self-attention to enhance image features. In the image fusion field, several nonlocal attention-based methods have also been proposed and achieve strong performance. For example, Zhang et al. [27] proposed a multiscale spatial-spectral interaction transformer for the pansharpening task. Wele [28] adopted the transformer structure to inject PAN image details into a hyperspectral image. However, these methods compute nonlocal attention at the pixel level and may not sufficiently learn the local texture information of the image features, which can degrade the resulting image quality.
To effectively solve the above problems, we make several contributions, which can be summarized as follows.
First, we design a cross spectral-scale nonlocal attention module. This module maps the spectral correlation computed between the LRMSI and PAN image features to the high resolution scale and reconstructs the HRMSI feature. Instead of using an "average" function to constrain the spatial fidelity of the fused HRMSI, we explicitly learn and exploit the spectral band correlation through the designed network structure.
Second, to adapt to the spatial resolution difference between the LRMSI and PAN images, we propose a cross spatial-scale nonlocal attention module. This module first computes the patch-level nonlocal similarity between the encoded LRMSI and PAN images and then maps this similarity to the original LRMSI feature to reconstruct the HRMSI feature.
Finally, compared with other nonlocal attention-based fusion methods, the nonlocal similarity matrix in the designed cross spatial-scale block is computed at the patch level. The patch-level computation largely preserves local texture and context information, while the nonlocal similarity matrix captures global attention information. This combination of local and nonlocal learning is more effective and efficient in enhancing feature learning and improving fusion performance.
The rest of this article is organized as follows. Section II introduces the related work, including unsupervised pansharpening, the patch recurrence property, and nonlocal attention. Section III describes the designed network modules and loss functions. The results of the comparative experiments are given in Section IV. Section V presents the ablation study. Finally, Section VI concludes this article.
Related Work
A. Unsupervised Pansharpening
Traditional methods, including CS-, MRA-, and VO-based ones, are all unsupervised pansharpening methods. However, due to their limited spatial and spectral fidelity or their high computational cost, DL-based methods have been explored in recent years. DL-based unsupervised methods perform training at the high resolution scale and can use appropriate loss functions to achieve satisfactory fusion performance.
For example, Zhou et al. [29] proposed a pansharpening framework that adopts an auto-encoder and a perceptual loss to complete the fusion process. Qu et al. [23] proposed an unsupervised deep-learning framework based on the self-attention mechanism and achieved strong fusion performance. In addition, Seo et al. [30] designed an unsupervised learning framework with registration learning for pansharpening and introduced two novel loss functions to train the network. Diao et al. [31] designed a multiscale fusion network and adopted multiple GAN structures at different scales to improve the fusion performance. Xu et al. [32] proposed an iterative network based on spectral and texture loss constraints and a generative adversarial network to fuse the LRMSI and PAN images. Although these methods achieve strong fusion performance, the lack of exploring and utilizing the correct spectral and spatial correlations remains an obstacle to better fusion performance.
B. Patch Recurrence Property and Nonlocal-Based Attention
Image patches tend to recur within and across scales of the same image, which is called the patch recurrence property [33], [34]. This property has been widely used in image super-resolution to enhance an HR patch using similar patches from images of different resolutions. In a pioneering study, Glasner et al. [33] combined multiple-image SR and example-based SR to exploit recurring patches within and across multiscale images. Furthermore, Freedman et al. [35] extracted patches from localized regions of the input image, which reduces the computational complexity.
The patch recurrence property within and across scales reflects the long-range dependence of image patches, which is similar to the idea behind the transformer network, i.e., nonlocal self-attention [24]. Several studies have used nonlocal attention modules to super-resolve LR images. For example, Liu et al. [36] proposed a nonlocal recurrent network that incorporates nonlocal operations into a recurrent neural network for image restoration with fewer parameters. Dai et al. [37] proposed a channel attention module to adaptively rescale channelwise features and a nonlocally enhanced residual group to capture long-distance spatial information. Different from nonlocal self-attention in the super-resolution field, the fusion field requires cross-attention between images of different resolutions. In addition, the nonlocal attention in some methods is computed at the pixel level, which may be insufficient for learning local texture features.
Method
A. Overall Fusion Framework
The proposed fusion framework is shown in Fig. 1. We design cross spectral-scale and cross spatial-scale blocks to sufficiently learn the spectral and spatial relations between the two input images. These relations are then mapped to the high resolution scale to reconstruct the HRMSI features. Finally, a "Feature aggregation module" integrates the HRMSI features from the two cross-scale blocks and outputs the fused HRMSI.
B. Cross Spectral-Scale Nonlocal Attention Block
Some SRF estimation methods (the SRF being the mapping function from the MSI to the PAN image) estimate the SRF from downsampled images and apply it at the high resolution scale [38], which works well. This suggests that the spectral band correlation at the LR scale is approximately the same as that at the HR scale, which can be expressed as
\begin{equation*}
\psi (P_{h\times w}, M_{h\times w}^{j}) = \psi (\mathbb {P}_{H\times W}, \mathbb {M}_{H\times W}^{j}) \tag{1}
\end{equation*}
where $P_{h\times w}$ and $M_{h\times w}^{j}$ denote the downsampled PAN image and the $j$th band of the LRMSI, and $\mathbb {P}_{H\times W}$ and $\mathbb {M}_{H\times W}^{j}$ are their counterparts at the high resolution scale. The correlation $\psi (\cdot, \cdot)$ is computed as
\begin{equation*}
\psi (P_{h\times w}, M_{h\times w}^{j}) = \theta (P_{h\times w})\cdot \delta (M_{h\times w}^{j}) \tag{2}
\end{equation*}
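As a concrete illustration of (1) and (2), the snippet below estimates a linear band correlation (an SRF-style weight vector) from the low resolution image pair by least squares and reuses it at the high resolution scale. It is a simplified sketch of the idea behind [38], not the estimation procedure used there; all names are illustrative.

```python
import numpy as np

def estimate_band_weights(pan_lr, msi_lr):
    """Least-squares weights w such that pan_lr ~ sum_j w[j] * msi_lr[j].

    pan_lr: (h, w) PAN image downsampled to the LR scale
    msi_lr: (C, h, w) LRMSI with C bands
    """
    C = msi_lr.shape[0]
    A = msi_lr.reshape(C, -1).T            # (h*w, C)
    b = pan_lr.reshape(-1)                 # (h*w,)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w                               # (C,) per-band correlation weights

# Following (1), weights estimated at the LR scale can be reused at the HR
# scale, e.g., to synthesize an intensity image from a candidate HRMSI:
#   intensity_hr = np.tensordot(w, hrmsi, axes=1)   # (H, W)
```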
However, most methods only use the estimated spectral band correlation to constrain the spatial fidelity in the loss function, which does not make full use of it. Therefore, we design a cross spectral-scale nonlocal attention block that explicitly maps the bandwise nonlocal similarity computed at the LR scale to the HR scale and reconstructs the HRMSI feature, as shown in Fig. 3. In this block, we first downsample both input images with a single strided convolution to the same spatial size $h^{\prime }\times w^{\prime }$.
Network structure of the proposed cross spectral-scale nonlocal attention block. The “Feature downsampling block” includes a convolution layer with kernel size 3 and a LeakyRelu activation function. The strides of convolution are set to
Then we compute the bandwise similarity matrix $S_{1\times C}$ between the downsampled PAN and MSI features as
\begin{equation*}
\phi (P_{1\times h^{\prime }\times w^{\prime }}, M_{C\times h^{\prime }\times w^{\prime }}) = (P_{1\times h^{\prime }w^{\prime }})(M_{C\times h^{\prime }w^{\prime }})^{T} \Rightarrow S_{1\times C} \tag{3}
\end{equation*}
To map the similarity matrix computed at the LR scale to the HR scale, as shown in the right part of Fig. 3, we perform a matrix multiplication between the similarity matrix and the encoded HR PAN feature $P_{1\times H\times W}^{\prime }$ to reconstruct the HRMSI feature $\mathbb {M}_{C\times H\times W}^{\prime }$
\begin{equation*}
(S_{1\times C})^{T}\times P_{1\times H\times W}^{\prime} \Rightarrow \mathbb {M}_{C\times H\times W}^{\prime} \tag{4}
\end{equation*}
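For clarity, a minimal PyTorch sketch of (3) and (4) is given below. The strides of the downsampling convolutions, the softmax normalization of the similarity vector, and the single-layer HR PAN encoder are simplifying assumptions of this sketch; only the bandwise similarity in (3) and its reuse at the HR scale in (4) follow the text.

```python
import torch
import torch.nn as nn

class CrossSpectralScaleAttention(nn.Module):
    """Sketch of the cross spectral-scale block: the bandwise similarity is
    computed at the LR scale, (3), and reused at the HR scale, (4)."""

    def __init__(self, bands, ratio=4, stride=2):
        super().__init__()
        # "Feature downsampling block": strided conv + LeakyReLU (Fig. 3);
        # the PAN branch uses a larger stride so both features reach h' x w'.
        self.down_pan = nn.Sequential(
            nn.Conv2d(1, 1, 3, stride=stride * ratio, padding=1),
            nn.LeakyReLU(0.2))
        self.down_msi = nn.Sequential(
            nn.Conv2d(bands, bands, 3, stride=stride, padding=1),
            nn.LeakyReLU(0.2))
        # Encoder producing the HR PAN feature P' used in (4).
        self.enc_pan_hr = nn.Sequential(
            nn.Conv2d(1, 1, 3, padding=1), nn.LeakyReLU(0.2))

    def forward(self, lrmsi, pan):
        B, C, _, _ = lrmsi.shape
        _, _, H, W = pan.shape
        p_lr = self.down_pan(pan).flatten(2)        # (B, 1, h'*w')
        m_lr = self.down_msi(lrmsi).flatten(2)      # (B, C, h'*w')
        # Eq. (3): bandwise similarity S of size (B, 1, C).
        S = torch.bmm(p_lr, m_lr.transpose(1, 2))
        S = torch.softmax(S, dim=-1)                # normalization (assumption)
        # Eq. (4): map the similarity back onto the HR PAN feature.
        p_hr = self.enc_pan_hr(pan).flatten(2)      # (B, 1, H*W)
        m_hr = torch.bmm(S.transpose(1, 2), p_hr)   # (B, C, H*W)
        return m_hr.view(B, C, H, W)
```

For a 64 × 64 LRMSI and a 256 × 256 PAN patch (ratio 4), both branches are reduced to 32 × 32 in this sketch before the bandwise similarity is computed.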
C. Cross Spatial-Scale Nonlocal Attention Block
The spatial resolution gap between the LRMSI and the PAN image is another major obstacle to achieving higher fusion performance. Some methods preupsample the LRMSI and concatenate it with the PAN image along the spectral dimension before feeding it into the fusion model, which may introduce undesired spatial distortion or noise into the fused result. Therefore, to adapt to the spatial resolution gap between the LRMSI and PAN images and to benefit from both local and nonlocal feature learning, we propose a cross spatial-scale nonlocal attention block, as shown in Fig. 4. This block uses the spatially cross-scale nonlocal similarity between the LRMSI and PAN images to reconstruct the HRMSI feature.
Network structure of the proposed cross spatial-scale nonlocal attention block. Note that "p" is the unfolding patch size and "$n_{1}$" is the number of unfolded patches.
As shown in the left part of Fig. 4, we first compute the patchwise nonlocal similarity matrix between the encoded high-level features of the LRMSI and PAN images. Then, we map this similarity to the low-level feature of the LRMSI to obtain the low-level feature of the reconstructed HRMSI. Note that the patch-level nonlocal similarity retains more local texture information than pixelwise similarity while still exploring the long-range dependencies of the image features. This combination of local and nonlocal feature learning greatly improves the fused image quality.
The underlying assumption of this similarity mapping is that the similarity between images' high-level features is closely related to the similarity between their low-level features. Generally, the high-level feature carries deep semantic information, while the low-level feature represents shallow spatial texture and spectral information. For example, trees of the same category (high-level feature) usually have nearly the same spatial and spectral characteristics (low-level feature). Therefore, the computed spatially nonlocal similarity matrix can be shared between the high-level and low-level features
\begin{equation*}
\psi (M_{i,j}^{h}, P_{m,n}^{h}) = \psi (M_{i,j}^{l}, \mathbb {M}_{m,n}^{l}) \tag{5}
\end{equation*}
In detail, we first obtain the high-level features of the LRMSI and PAN images through the "high-level feature encoding block," whose detailed structure is shown at the bottom of Fig. 4. It consists of two cascaded convolution blocks, each comprising a convolution layer, batch normalization, and a LeakyReLU activation, followed by five "Resnet blocks" that extract the high-level features of the image. Then, we unfold the encoded high-level feature of the LRMSI into $n_{1}$ patches of size $p\times p$
\begin{equation*}
M^{h}_{c^{h}\times h\times w} \rightarrow unfold() \rightarrow M^{h}_{n_{1}\times c^{h}\times p\times p} \tag{6}
\end{equation*}
To compute the cross-scale patch-level similarity efficiently, we use the unfolded patches $M^{h}_{n_{1}\times c^{h}\times p\times p}$ as convolution kernels over the PAN high-level feature $P^{h}_{c^{h}\times H\times W}$ and obtain the similarity matrix $S_{n_{1}\times (H/p)\times (W/p)}$
\begin{equation*}
f_{M^{h}_{n_{1}\times c^{h}\times p\times p}}(P^{h}_{c^{h}\times H\times W}) \Rightarrow S_{n_{1}\times (H/p)\times (W/p)} \tag{7}
\end{equation*}
Similarly, the low-level LRMSI feature $M^{l}_{C\times h\times w}$ is unfolded into patches
\begin{equation*}
M^{l}_{C\times h\times w} \rightarrow unfold() \rightarrow M^{l}_{n_{1}\times C\times p\times p}. \tag{8}
\end{equation*}
In the next step, as indicated by [39], we map the similarity matrix $S_{n_{1}\times (H/p)\times (W/p)}$ to the unfolded low-level LRMSI patches $M^{l}_{n_{1}\times C\times p\times p}$ to reconstruct the low-level HRMSI feature
\begin{equation*}
f_{M^{l}_{n_{1}\times C\times p\times p}}(S_{n_{1}\times (H/p)\times (W/p)}) \Rightarrow \widehat{\mathrm{{M}}}^{l}_{C\times H\times W} \tag{9}
\end{equation*}
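The following function sketches (6)–(9) for a single sample. Using a patch stride equal to the patch size, applying a softmax over the patch dimension, and implementing the mapping in (9) with a transposed convolution are assumptions of this sketch; spatial sizes are assumed divisible by p.

```python
import torch
import torch.nn.functional as F

def cross_spatial_scale_attention(msi_high, msi_low, pan_high, p=3):
    """Sketch of (6)-(9); h, w, H, W are assumed divisible by p.

    msi_high: (c_h, h, w)  high-level LRMSI feature
    msi_low:  (C,   h, w)  low-level LRMSI feature
    pan_high: (c_h, H, W)  high-level PAN feature
    """
    c_h, h, w = msi_high.shape
    C = msi_low.shape[0]

    # Eq. (6): unfold the high-level LRMSI feature into n1 patches of size p x p.
    mh = F.unfold(msi_high.unsqueeze(0), kernel_size=p, stride=p)  # (1, c_h*p*p, n1)
    n1 = mh.shape[-1]
    mh_patches = mh.transpose(1, 2).reshape(n1, c_h, p, p)

    # Eq. (7): correlate the patches with the PAN high-level feature by using
    # them as convolution kernels; stride p yields an (H/p) x (W/p) grid.
    S = F.conv2d(pan_high.unsqueeze(0), mh_patches, stride=p)      # (1, n1, H/p, W/p)
    S = torch.softmax(S.flatten(2), dim=1).view_as(S)              # over patches (assumption)

    # Eq. (8): unfold the low-level LRMSI feature into matching patches.
    ml = F.unfold(msi_low.unsqueeze(0), kernel_size=p, stride=p)
    ml_patches = ml.transpose(1, 2).reshape(n1, C, p, p)

    # Eq. (9): scatter the low-level patches back with a transposed convolution
    # weighted by the similarity map to obtain the HR low-level feature.
    m_hat = F.conv_transpose2d(S, ml_patches, stride=p)            # (1, C, H, W)
    return m_hat.squeeze(0)
```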
Note that the proposed cross spatial-scale attention block is inspired by [39], but there are two main differences between that study and our proposed network.
On the one hand, the cross-scale attention in [39] is computed between the original image and its downsampled version, whereas ours is computed between the LRMSI and PAN images at their original spatial resolutions. The downsampling operation in [39] may introduce errors and degrade the accuracy of the computed similarity matrix.
On the other hand, the nonlocal similarity in [39] is computed at the low resolution scale and used at the high resolution scale, whereas in our model the similarity matrix is computed and used at the same resolution scale. The resolution gap in [39] may degrade the resulting image quality.
D. Feature Aggregation Module
After obtaining the reconstructed HRMSI feature from the cross spatial-scale attention block, we concatenate it with the feature from the cross spectral-scale block. Then, as shown in Fig. 1, the "Feature aggregation module," consisting of "Dwconv" layers with kernel sizes 11 and 1, integrates the image features and outputs the fused HRMSI. Note that we add a residual connection [40] from the upsampled LRMSI (bicubic interpolation) to the final result to accelerate the training process.
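A minimal sketch of this aggregation step is given below. Interpreting "Dwconv with kernel sizes 11 and 1" as a depthwise 11 × 11 convolution followed by a pointwise 1 × 1 convolution, as well as the channel widths, are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregation(nn.Module):
    """Sketch of the "Feature aggregation module" with a bicubic residual."""

    def __init__(self, feat_channels, out_bands):
        super().__init__()
        c = 2 * feat_channels                      # after bandwise concatenation
        self.dwconv = nn.Conv2d(c, c, kernel_size=11, padding=5, groups=c)
        self.pwconv = nn.Conv2d(c, out_bands, kernel_size=1)

    def forward(self, feat_spectral, feat_spatial, lrmsi, scale=4):
        x = torch.cat([feat_spectral, feat_spatial], dim=1)   # "concat" in Fig. 1
        x = self.pwconv(self.dwconv(x))
        # Residual connection from the bicubic-upsampled LRMSI [40].
        up = F.interpolate(lrmsi, scale_factor=scale, mode="bicubic",
                           align_corners=False)
        return x + up
```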
E. Loss Function
Different from supervised methods, which can directly use the difference between the label and the fused image to update the network parameters, our unsupervised network must rely on the input images themselves to constrain and improve the quality of the fused image. We therefore design several loss functions for the training stage of the proposed unsupervised fusion network.
1) Bicubic Loss
To guide the fused HRMSI to preserve the main spectral structure of the LRMSI, we adopt the “bicubic” upsampling consistency loss function
\begin{equation*}
L_{bic\_up} = \Vert {M\uparrow -\widehat{\mathrm{M}}}\Vert _{1} \tag{10}
\end{equation*}
2) Detail Consistency Loss
In fact, the spatial details in the HRMSI and PAN images are very similar. Therefore, we impose a detail consistency loss between the details of the fused HRMSI, $\widehat{\mathrm{M}}-M\uparrow $, and those of the PAN image, $P-P_{b}$ (with $P_{b}$ a low-pass version of the PAN image), as
\begin{equation*}
L_{dc} = r(\widehat{\mathrm{{M}}}-M\uparrow, P-P_{b}) \tag{11}
\end{equation*}
\begin{equation*}
r(x, y) = \frac{E((x-E(x))(y-E(y)))}{std(x)std(y)} \tag{12}
\end{equation*}
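A sketch of (11) and (12) is given below. Reducing the multiband MS detail to one band by averaging, and returning the negative correlation so that minimizing the loss maximizes the detail consistency, are assumptions of this sketch.

```python
import torch

def correlation_coefficient(x, y, eps=1e-8):
    """Pearson correlation r(x, y) from (12), computed over all elements."""
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).mean() / (x.std() * y.std() + eps)

def detail_consistency_loss(fused, lrmsi_up, pan, pan_blur):
    """Detail consistency between the injected MS details and the PAN details.

    fused:    (B, C, H, W) fused HRMSI
    lrmsi_up: (B, C, H, W) upsampled LRMSI
    pan:      (B, 1, H, W) PAN image
    pan_blur: (B, 1, H, W) low-pass version of the PAN image (P_b)
    """
    ms_detail = (fused - lrmsi_up).mean(dim=1, keepdim=True)   # bandwise average (assumption)
    pan_detail = pan - pan_blur
    return -correlation_coefficient(ms_detail, pan_detail)     # sign convention (assumption)
```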
3) Local Spectral and Spatial Consistency Loss
We further add local texture similarity losses in both the spectral and spatial domains using the SSIM index [41]
\begin{align*}
L_{SSIM\_spe} &= 1-SSIM(Down_{spa}(\widehat{\mathrm{M}}), M) \tag{13}
\\
L_{SSIM\_spa} &= 1-SSIM(Down_{spe}(\widehat{\mathrm{M}}), P) \tag{14}
\\
L_{SSIM} &= L_{SSIM\_spe} + L_{SSIM\_spa} \tag{15}
\end{align*}
4) Total Loss Function
The total loss function is the combination of the above three loss functions as
\begin{equation*}
L_{total} = \alpha _{1}L_{bic\_up} + \alpha _{2}L_{dc} + \alpha _{3}L_{SSIM} \tag{16}
\end{equation*}
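The sketch below assembles (10)–(16). The loss weights, the use of kornia's SSIM loss (any SSIM implementation would do), and implementing Down_spa as bicubic downsampling and Down_spe as an SRF-weighted band average are assumptions of this sketch; `detail_consistency_loss` refers to the sketch given earlier.

```python
import torch
import torch.nn.functional as F
from kornia.losses import ssim_loss   # returns a (1 - SSIM)-style loss

def total_loss(fused, lrmsi, pan, pan_blur, srf_weights,
               a1=1.0, a2=1.0, a3=1.0, scale=4):
    """Sketch of the total loss in (16) built from (10)-(15)."""
    lrmsi_up = F.interpolate(lrmsi, scale_factor=scale, mode="bicubic",
                             align_corners=False)
    # (10) bicubic upsampling consistency loss.
    l_bic = F.l1_loss(lrmsi_up, fused)
    # (11) detail consistency loss (see the earlier sketch).
    l_dc = detail_consistency_loss(fused, lrmsi_up, pan, pan_blur)
    # (13)-(15) local spectral and spatial SSIM losses.
    down_spa = F.interpolate(fused, scale_factor=1.0 / scale, mode="bicubic",
                             align_corners=False)
    down_spe = (fused * srf_weights.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
    l_ssim = ssim_loss(down_spa, lrmsi, window_size=11) \
        + ssim_loss(down_spe, pan, window_size=11)
    # (16) weighted combination.
    return a1 * l_bic + a2 * l_dc + a3 * l_ssim
```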
Experimental Results
In this section, we present the fusion experiments and results. First, we describe the two datasets used in this study and the training details. Then, we list the comparison methods and quality metrics. Finally, we show the qualitative and quantitative fusion results of the different methods on the two datasets at the reduced and full resolution scales.
A. Datasets
We select two datasets, WorldView2 (WV2) and GaoFen2 (GF2) satellite images, to verify the superiority of the proposed fusion method. The spatial resolutions of the two datasets are 0.5 and 0.8 m for the PAN image and 2.0 and 3.2 m for the LRMSI, respectively, and the numbers of bands are 8 and 4 in the fusion experiments. The spatial sizes are 5059 × 2145 and 6907 × 7300 pixels for the LRMSI, and 20236 × 8580 and 27628 × 29200 pixels for the PAN image.
The fusion experiments are conducted on reduced and full resolution datasets. The reduced resolution dataset is generated according to the Wald protocol [42] by downsampling the original LRMSI and PAN images with a Gaussian kernel; the original LRMSI is then regarded as the reference image. The patch sizes are set to 64 and 256 for the LRMSI and PAN images, respectively, and we randomly select 90% of the cropped patches for training and keep the remaining 10% for testing. Note that the reduced resolution experiments compare the different methods on a simulated dataset, which limits their practicality, whereas the full resolution experiments on the original dataset are closer to real-world applications.
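The Wald-protocol degradation can be sketched as a Gaussian blur followed by decimation, as below; the kernel size and standard deviation are illustrative values, not those used in the experiments (MTF-matched kernels are also common).

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=9, sigma=2.0):
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

def wald_degrade(img, ratio=4, size=9, sigma=2.0):
    """Gaussian blur + decimation by `ratio` on a (B, C, H, W) tensor."""
    c = img.shape[1]
    k = gaussian_kernel(size, sigma).view(1, 1, size, size).repeat(c, 1, 1, 1)
    blurred = F.conv2d(img, k, padding=size // 2, groups=c)
    return blurred[..., ::ratio, ::ratio]

# Example: the 64 x 64 LRMSI / 256 x 256 PAN training patches become
# 16 x 16 / 64 x 64 pairs, and the original LRMSI serves as the reference.
```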
B. Training Details
Due to the different numbers of training image pairs at the reduced and full resolution scales, we set the training epochs to 500 and 50 for these two datasets, respectively. The batch size is set to 10, and the learning rate is initialized to 1e-4 and decays by a factor of 0.1 at half of the training epochs. The ADAM optimizer is selected to update the model parameters.
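A sketch of the optimizer and schedule implied by these settings is given below; the Adam hyperparameters other than the learning rate are PyTorch defaults and an assumption of this sketch.

```python
import torch

def build_optimizer(model, epochs, lr=1e-4):
    """Adam with an initial learning rate of 1e-4, decayed by a factor of 0.1
    at half of the training epochs (500 epochs on the reduced resolution
    dataset, 50 on the full resolution dataset, batch size 10)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[epochs // 2], gamma=0.1)
    return optimizer, scheduler
```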
C. Comparison Methods and Quality Measures Metrics
We compare the proposed CSFNet with several state-of-the-art pansharpening methods. The traditional methods include IHS [8], SFIM [11], Wavelet [12], MTF_GLP [23], and MTF_GLP_HPM [23]. For the DL-based methods, we select both unsupervised and supervised ones. The unsupervised methods are LDPnet [43] and PanGAN [21]; like the proposed CSFNet, these two methods are trained on the reduced and full resolution datasets, respectively. The supervised methods are Pannet [17], TFNet [44], and PanFormer [45]. Because they require a reference image to supervise the training process, these three methods are trained only on the reduced resolution dataset, and the full resolution tests are performed with the parameters trained on the reduced dataset.
The quality indices used to evaluate the different methods on the reduced resolution datasets are the spectral angle mapper (SAM) [46], the relative dimensionless global error in synthesis (ERGAS) [47], the universal image quality index (UIQI) [48], and the root-mean-squared error (rmse). SAM measures the spectral distortion of the fused images, while ERGAS, UIQI, and rmse comprehensively evaluate the spectral and spatial quality of the fused results. For the full resolution experiments, due to the lack of a reference image, we choose the spectral index
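For reference, minimal NumPy implementations of the two reference-based indices most specific to pansharpening, SAM and ERGAS, are sketched below (UIQI, rmse, and the nonreference indices are omitted); these follow the standard definitions rather than any particular toolbox.

```python
import numpy as np

def sam(ref, fused, eps=1e-8):
    """Spectral angle mapper in degrees; ref and fused are (C, H, W) arrays."""
    dot = (ref * fused).sum(axis=0)
    norm = np.linalg.norm(ref, axis=0) * np.linalg.norm(fused, axis=0) + eps
    angle = np.arccos(np.clip(dot / norm, -1.0, 1.0))
    return np.degrees(angle).mean()

def ergas(ref, fused, ratio=4):
    """ERGAS (lower is better); ratio is the PAN/MS resolution ratio."""
    C = ref.shape[0]
    rmse2 = ((ref - fused) ** 2).reshape(C, -1).mean(axis=1)
    mean2 = ref.reshape(C, -1).mean(axis=1) ** 2
    return 100.0 / ratio * np.sqrt((rmse2 / mean2).mean())
```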
D. Fusion Results on Reduced Resolution Dataset
We first conduct experiments on the reduced WV2 and GF2 datasets to compare the results of the different methods visually and quantitatively. Note that these experiments are conducted on simulated datasets, which limits their practical significance for real-world applications.
The fused results on these two simulated datasets are shown in Figs. 5 and 6, respectively. We show both the fused results and the error maps to clearly compare the different methods; the error map is the average over all bands of the absolute difference between the reference and the fused result. For the WV2 dataset, as shown in Fig. 5(k), our result shows the best detail restoration and spectral fidelity among the unsupervised methods. Because the unsupervised methods have no reference image to guide the training process, they perform slightly worse than the supervised methods on the simulated datasets. The results of SFIM and Wavelet suffer from a blurring effect, especially at the edges of buildings and roads, as shown in Fig. 5(b) and (c), and severe spectral distortion occurs in the result of LDPnet. The quantitative results in Table I also show that our method achieves the best rank in three indices apart from the supervised methods, as indicated by the red font in the table, while LDPnet ranks worst in four indices, consistent with the visual result in Fig. 5(f). Note that the best result in each column is highlighted in bold and the inferior result is underlined; the red font indicates the best rank among the unsupervised methods.
Fused results (odd row) and error maps (even row) of different methods on the reduced resolution WV2 dataset. (a) IHS. (b) SFIM. (c) Wavelet. (d) MTF_GLP. (e) MTF_GLP_HPM. (f) LDPnet. (g) PanGAN. (h) Pannet. (i) TFNet. (j) PanFormer (k) Ours. (l) Ground truth.
Fused results (odd row) and error maps (even row) of different methods on the reduced resolution GF2 dataset. (a) IHS. (b) SFIM. (c) Wavelet. (d) MTF_GLP. (e) MTF_GLP_HPM. (f) LDPnet. (g) PanGAN. (h) Pannet. (i) TFNet. (j) PanFormer (k) Ours. (l) Ground truth.
We also perform experiments on the GF2 dataset, which contains different land covers, to demonstrate the robustness of our method. The qualitative results in Fig. 6 show that our method achieves the best spatial and spectral fidelity compared with the others (except for the supervised methods). Note that PanGAN suffers from spatial artifacts, as shown in Figs. 5(g) and 6(g); the reason may be that the spectral and spatial constraints used in its loss functions are not entirely appropriate.
E. Fusion Results on Full Resolution Datasets
In this part, we conduct the fusion experiments on the original full resolution datasets, which are more practical for real applications than the experiments on the reduced resolution datasets.
The qualitative fusion results on the WV2 and GF2 datasets are shown in Figs. 7 and 8, respectively. Our results show better spectral and spatial fidelity than those of the other methods. For the WV2 dataset, as shown in Fig. 7, the results of some methods, such as SFIM, Wavelet, and LDPnet, suffer from a spatial blurring effect, while several methods, such as LDPnet and PanGAN, suffer from spectral artifacts and distortion, as shown in Figs. 7(g) and 8(g) and (h). Note that the supervised methods are trained on the simulated reduced resolution dataset, so they perform poorly in restoring spatial details on the full resolution images, which is clearly visible in their visual results, especially for Pannet.
Fused results of different methods on the full resolution WV2 dataset. (a) Upsampled LRMSI. (b) IHS. (c) SFIM. (d) Wavelet. (e) MTF_GLP. (f) MTF_GLP_HPM. (g) LDPnet. (h) PanGAN. (i) Pannet. (j) TFNet. (k) PanFormer. (l) Ours.
Fused results of different methods on the full resolution GF2 dataset. (a) Upsampled LRMSI. (b) IHS. (c) SFIM. (d) Wavelet. (e) MTF_GLP. (f) MTF_GLP_HPM. (g) LDPnet. (h) PanGAN. (i) Pannet. (j) TFNet. (k) PanFormer. (l) Ours.
The quantitative results of the nonreference quality indices also show the superiority of our method over the others, as reported in the last three columns of Tables I and II. Our method achieves the best results on all indices except the
Discussion
This section presents the ablation study that demonstrates the effectiveness of the designed network structures and loss functions. Then, we test different unfolding kernel sizes to find the best balance between local and nonlocal feature learning. Finally, we compare the computational complexity of all methods.
A. Network Structures
To demonstrate the effectiveness of the proposed network modules, we remove the cross spectral-scale and cross spatial-scale nonlocal attention blocks in turn to measure their respective contributions. As shown in the first three rows of Table III, which reports results on the full resolution GF2 dataset, both cross-scale blocks improve the fused image quality, and the cross spectral-scale block contributes more than the cross spatial-scale block. For example, the
The quantitative results on the reduced GF2 dataset also demonstrate the effectiveness of the two cross-scale blocks: as shown in Table IV, all quality indices improve after the addition of each cross-scale block. Overall, these experimental results verify the effectiveness of the two designed network modules.
B. Loss Functions
We conduct experiments on the loss functions to demonstrate the necessity of each of them. First, we test the "bicubic" upsampling consistency loss, which preserves the original spectral structure of the LRMSI well, and compare it with a "bicubic" downsampling loss, in which the fused HRMSI is downsampled and required to be as consistent as possible with the input LRMSI. As shown in Tables V and VI, the fusion results on both the full and reduced resolution GF2 datasets demonstrate the effectiveness of the "bicubic" upsampling consistency loss.
Then, as shown in Tables VII and VIII, we remove each of the three proposed loss functions in turn to verify its effectiveness. It can be inferred that all of these loss functions contribute to the final fusion performance, especially the
C. Unfolding Kernel Size
To balance the local and nonlocal feature learning ability when computing the spatially nonlocal similarity matrix, we experiment with different unfolding patch sizes $p$ in the cross spatial-scale attention block. As shown in Tables IX and X, a kernel size of 3 achieves the best fusion performance, which means it extracts the global and similar local features effectively and achieves the best balance between learning local texture features and nonlocal feature similarity.
D. Effect of the Cross Scale Blocks
To verify the effect of the two designed cross-scale blocks, we visualize the features learned by the cross spectral-scale and cross spatial-scale blocks, as shown in Fig. 9. It can be clearly seen that the cross spectral-scale block mainly reconstructs the spectral features, while the cross spatial-scale block mainly reconstructs the spatial features. These visualization results verify the effectiveness of the proposed cross-scale blocks.
Visualization of the learned feature by two cross scale blocks on the full resolution GF2 dataset. (a) Upsampled LRMSI. (b) PAN image. (c) Learned feature by cross spectral-scale block. (d) Learned feature by cross spatial-scale block. (e) Fusion results.
E. Comparison of Computational Complexity
In this part, we report the inference time and the number of parameters of all methods to compare their fusion efficiency. The calculation is performed on the WV2 dataset with patch sizes of 8 × 64 × 64 (LRMSI) and 1 × 256 × 256 (PAN image).
As shown in Table XI, the traditional methods all suffer from long inference times, especially the
Conclusion
In this article, to correctly learn and utilize the spectral and spatial correlations among the LRMSI, HRMSI, and PAN images for reconstructing the fusion result, we designed cross spectral-scale and cross spatial-scale nonlocal attention blocks. The cross spectral-scale block computes the bandwise nonlocal similarity on the low resolution images and maps this similarity to the high resolution scale to reconstruct the HRMSI feature. The cross spatial-scale block computes the patch-level nonlocal similarity between the high-level features of the input images and then maps this similarity to the original LRMSI feature to reconstruct the HRMSI feature, thereby effectively combining the advantages of nonlocal and local feature learning. Finally, a "Feature aggregation module" integrates the HRMSI features produced by these two blocks and outputs the fused HRMSI. Experimental results on two datasets at reduced and full resolution demonstrate the effectiveness and superiority of the proposed fusion network.
ACKNOWLEDGMENT
The authors would like to thank the Key Laboratory of Natural Resources Monitoring in Tropical and Subtropical Area of South China, Ministry of Natural Resources. The authors would also like to thank Baidu AI Studio for providing the computing power supports.