Dear Editor,
In the pansharpening task, most existing deep-learning-based pansharpening methods fail to fully utilize features at different levels, inevitably leading to spectral or spatial distortions. To address this challenge, in this letter we propose a dual-branch multi-level feature aggregation network for pansharpening (DMFANet). The experimental results on the WorldView-II (WV-II) and QuickBird (QB) datasets confirm the notable superiority of our method over the current state-of-the-art methods from both quantitative and qualitative points of view. The source code is available at https://github.com/Gui-Cheng/DMFANet.
Introduction
Multispectral (MS) images, with their wealth of spectral information, have the potential to distinguish surface materials and thus enjoy broad remote sensing applications. Due to technical limitations, there exists a trade-off between the spatial and spectral resolutions of remote sensing sensors [1]. As a consequence, it is challenging to directly acquire images with both high spatial and high spectral resolution from a single sensor. What is widely available instead is the panchromatic (PAN) image with high spatial resolution and the corresponding low-resolution multispectral (LRMS) image, which cannot fully meet the needs of high-precision remote sensing applications. To address this challenge, the pansharpening technique integrates the spatial structure information from the PAN image and the spectral information from the LRMS image to generate a high-resolution multispectral (HRMS) image.
In the past few decades, numerous pansharpening methods have been proposed, which can be broadly divided into four major categories: 1) component substitution (CS)-based methods [2]; 2) multiresolution analysis (MRA)-based methods [3]; 3) hybrid methods [4]; and 4) deep-learning-based methods [5].
In recent years, CNN-based pansharpening methods have been developed and achieved promising results, such as PNN [6], MSDCNN [7], Pan-GAN [8], GTP-PNet [9], and GPPNN [10]. However, some problems remain to be solved: most existing deep-learning-based pansharpening methods fail to fully utilize features at different levels, inevitably leading to spectral or spatial distortions.
To address these challenges, we propose a dual-branch multi-level feature aggregation network for pansharpening, called DMFANet. The main branch of DMFANet is the MS image multi-level feature extraction and aggregation branch, which produces the final HRMS image. The other branch is the PAN image feature extraction branch, which provides high-resolution spatial structure information for the main branch. Specifically, we conduct multi-level feature fusion throughout the whole network to make better use of the multi-level spectral and spatial information from the MS and PAN images. Inspired by the highly efficient residual feature aggregation (RFA) framework [11], we also design two RFA-based feature extraction modules for the MS image and the PAN image, respectively, named the MS image feature extraction module (MSFEM) and the PAN image feature extraction module (PFEM). The MSFEM aims to extract spectral features from MS images, while the PFEM aims to extract spatial details from PAN images.
The main contributions of this study are summarized as follows: 1) we design a dual-branch network to fully extract the spectral features from the MS image and the spatial features from the PAN image; 2) we apply multi-level feature fusion throughout the whole network to take advantage of the multi-level effective information from the PAN and MS images; and 3) we design two highly efficient feature extraction modules, i.e., the MSFEM and the PFEM.
Problem Formulation
The target of our DMFANet is to extract the spectral features from the MS image and the spatial features from the PAN image as fully and as accurately as possible via a dual-branch network, fuse them at different feature levels, and aggregate the fused features so as to make full use of the multi-level spectral and spatial information when generating the fusion result. Fig. 1 presents the overall fusion framework of our DMFANet. Denoting the LRMS image as $I_{\mathrm{LRMS}}$, the PAN image as $I_{\mathrm{PAN}}$, and the network parameters as $\Theta$, the pansharpening process can be formulated as \begin{equation*}
I_{\mathrm{HRMS}}=f(I_{\mathrm{LRMS}},I_{\mathrm{PAN}};\Theta).
\tag{1}
\end{equation*}
To be more specific, we extract spectral and spatial features from the two branches and fuse them at each level. We formulate the multi-level fusion as follows:\begin{equation*}
D_{i}^{\mathrm{Fused}}=H\left(D_{i}^{\mathrm{MS}},D_{i}^{\mathrm{PAN}}\right)
\tag{2}
\end{equation*}
\begin{align*}
&D_{i}^{\mathrm{MS}}=f_{\mathrm{MS}}\left(D_{i-1}^{\mathrm{Fused}}\right)
\tag{3}\\
&D_{i}^{\mathrm{PAN}}=f_{\mathrm{PAN}}\left(D_{i-1}^{\mathrm{PAN}}\right)
\tag{4}
\end{align*}
where $i$ denotes the fusion level, $H(\cdot)$ is the fusion operation, and $f_{\mathrm{MS}}(\cdot)$ and $f_{\mathrm{PAN}}(\cdot)$ denote the feature extraction of the MS branch and the PAN branch, respectively.
Finally, an $N$-level fusion is conducted and the fused features are aggregated. The generated HRMS image is obtained by (5):\begin{equation*}
I_{\mathrm{HRMS}}=f_{\mathrm{conv}}\left(\mathrm{cat}\left(D_{1}^{\mathrm{MS}},D_{2}^{\mathrm{MS}},\ldots,D_{N}^{\mathrm{MS}}\right)\right)
\tag{5}
\end{equation*}
where $f_{\mathrm{conv}}(\cdot)$ denotes a convolutional layer and $\mathrm{cat}(\cdot)$ denotes channel-wise concatenation.
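To make the data flow concrete, the following is a minimal PyTorch sketch of the dual-branch multi-level scheme in (2)-(5). The plain convolutional blocks standing in for the MSFEM and PFEM, the channel widths, and the concatenation-plus-1 × 1-convolution form of the fusion operation $H$ are illustrative assumptions, not the exact DMFANet architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchFusionSketch(nn.Module):
    """Minimal sketch of the dual-branch multi-level fusion in (2)-(5).

    Plain conv blocks stand in for the MSFEM and PFEM, and the fusion
    H is modeled as concatenation followed by a 1x1 conv; all widths
    are illustrative assumptions.
    """

    def __init__(self, ms_bands=4, channels=32, levels=5):
        super().__init__()
        self.levels = levels
        self.ms_head = nn.Conv2d(ms_bands, channels, 3, padding=1)
        self.pan_head = nn.Conv2d(1, channels, 3, padding=1)
        # Per-level f_MS and f_PAN (stand-ins for MSFEM / PFEM).
        self.f_ms = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True)) for _ in range(levels)])
        self.f_pan = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True)) for _ in range(levels)])
        # Fusion H at each level: concatenate both branches, then reduce.
        self.fuse = nn.ModuleList([
            nn.Conv2d(2 * channels, channels, 1) for _ in range(levels)])
        # Final aggregation over all N levels, eq. (5).
        self.tail = nn.Conv2d(levels * channels, ms_bands, 3, padding=1)

    def forward(self, lrms, pan):
        # Bring the LRMS input up to the PAN spatial size first.
        ms = F.interpolate(lrms, size=pan.shape[-2:],
                           mode='bicubic', align_corners=False)
        d_fused = self.ms_head(ms)   # plays the role of D_0^Fused
        d_pan = self.pan_head(pan)   # plays the role of D_0^PAN
        ms_feats = []
        for i in range(self.levels):
            d_ms = self.f_ms[i](d_fused)                 # eq. (3)
            d_pan = self.f_pan[i](d_pan)                 # eq. (4)
            d_fused = self.fuse[i](
                torch.cat([d_ms, d_pan], dim=1))         # eq. (2)
            ms_feats.append(d_ms)
        return self.tail(torch.cat(ms_feats, dim=1))     # eq. (5)
```

With 64 × 64 × 4 LRMS and 256 × 256 × 1 PAN patches, as in the experiments below, this sketch produces a 256 × 256 × 4 output.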
MS Image Feature Extraction Module
Although the MS image contains rich spectral information, fully extracting this information is a challenging task. In this study, we propose an MS image feature extraction module (MSFEM) (Fig. 2) to accomplish this task. The proposed MSFEM combines residual channel attention blocks (RCABs) with the RFA framework. The RCAB [12] integrates channel attention into a residual block: the residual features are first extracted by two convolutional layers, and then the channel attention block extracts the channel statistics via a global pooling layer followed by two convolutional layers with a ReLU function and a Sigmoid function, respectively. Therefore, the MSFEM can better extract spectral features with enhanced discriminative ability. We apply multi-level MSFEMs in our network.
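As a rough illustration of this design, an RCAB along the lines described above could be written in PyTorch as follows; the channel count and the reduction ratio of the attention bottleneck are assumed values for the sketch, not parameters reported for DMFANet:

```python
import torch.nn as nn

class RCABSketch(nn.Module):
    """Sketch of a residual channel attention block (RCAB) [12].

    Two convolutions extract the residual features; channel attention
    (global pooling, then two 1x1 convs with ReLU and Sigmoid)
    rescales them before the residual addition. The reduction ratio
    r=8 is an assumed value.
    """

    def __init__(self, channels=32, reduction=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # channel statistics
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        res = self.body(x)
        # Rescale the residual features channel-wise, then add the input.
        return x + res * self.attention(res)
```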
PAN Image Feature Extraction Module
The proposed PAN image feature extraction module (PFEM) consists of four spatial attention (SA) blocks based on the RFA framework (Fig. 3(a)). The structure of the SA block is detailed in Fig. 3(b). The effectiveness of the spatial attention strategy, which focuses on the inter-spatial relationships of features, has been verified in many tasks [13]. The SA block first extracts features with a 1 × 1 convolutional layer followed by a ReLU function. Then, an average-pooling layer and a max-pooling layer are used to aggregate channel information. Finally, the two pooled feature maps are concatenated, and a 5 × 5 convolutional layer with a Sigmoid function is applied to generate the spatial attention map. The combination of SA blocks and the RFA framework allows better extraction of effective features along the spatial dimension of the PAN image. In this study, we implement multi-level PFEMs.
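A minimal PyTorch sketch of such an SA block is given below, assuming the attention map rescales the features produced by the 1 × 1 convolution; the channel width is illustrative:

```python
import torch
import torch.nn as nn

class SABlockSketch(nn.Module):
    """Sketch of the spatial attention (SA) block used in the PFEM.

    A 1x1 conv + ReLU extracts features; channel-wise average and max
    maps are concatenated and passed through a 5x5 conv with a Sigmoid
    to produce the spatial attention map.
    """

    def __init__(self, channels=32):
        super().__init__()
        self.feat = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True))
        self.conv = nn.Conv2d(2, 1, kernel_size=5, padding=2)

    def forward(self, x):
        f = self.feat(x)
        avg_map = f.mean(dim=1, keepdim=True)          # average over channels
        max_map = f.max(dim=1, keepdim=True).values    # max over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return f * attn  # emphasize informative spatial positions
```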
Experimental Setup
We perform experiments on the WV-II and QB datasets, each with four MS bands: blue, green, red, and NIR. The experimental LRMS and PAN image patches have sizes of 64 × 64 × 4 and 256 × 256 × 1, respectively. For the WV-II and QB datasets, the numbers of image patches for training, reduced-resolution testing, and full-resolution testing are 1254 and 308, 120 and 80, and 400 and 200, respectively.
These experiments are conducted on a desktop with two NVIDIA RTX 2080 Ti GPUs. Our proposed DMFANet and the comparison deep-learning-based methods are implemented with the PyTorch 1.5.1 library and Python 3.6.9. The Adam optimizer is applied to optimize the proposed method.
To verify the performance of our DMFANet, we conduct reduced-resolution testing based on Wald's protocol [14] as well as full-resolution testing, with both qualitative and quantitative evaluation in each case. We compare our method with eight mainstream fusion algorithms: three widely used traditional pansharpening algorithms, i.e., Brovey [2], Gram-Schmidt (GS), and MTF-GLP [3], and five deep-learning-based methods, i.e., MSDCNN [7], PNN [6], DIRCNN [15], GPPNN [10], and MUCNN [16].
We apply six widely used metrics, i.e., PSNR, SSIM, ERGAS, SAM, UIQI, and SCC, for the reduced-resolution testing. For the full-resolution testing, the quality with no reference (QNR) index is utilized to characterize the fusion performance. The QNR consists of two parts: the spectral distortion index $D_{\lambda}$ and the spatial distortion index $D_{S}$.
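For reference, the SAM metric used above can be computed directly from its standard definition (the mean spectral angle between reference and fused pixel vectors); the NumPy sketch below follows that definition and is not the paper's own evaluation code:

```python
import numpy as np

def sam_degrees(reference, fused, eps=1e-8):
    """Spectral angle mapper (SAM) in degrees; lower is better.

    Both inputs are (H, W, B) arrays. The angle between the reference
    and fused spectral vectors is averaged over all pixels.
    """
    ref = reference.reshape(-1, reference.shape[-1]).astype(np.float64)
    fus = fused.reshape(-1, fused.shape[-1]).astype(np.float64)
    dot = np.sum(ref * fus, axis=1)
    norms = np.linalg.norm(ref, axis=1) * np.linalg.norm(fus, axis=1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return np.degrees(np.mean(np.arccos(cos)))
```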
Results From the WV-II Dataset
We first present the qualitative and quantitative testing results on the WV-II dataset, at both reduced resolution and full resolution, to demonstrate the performance of each method.
The qualitative testing results under reduced resolution are shown in Fig. 4. Intuitively, our proposed DMFANet presents the highest consistency with the reference HRMS image. The Brovey and GS results clearly suffer from spectral distortion, while MTF_GLP suffers from severe spatial distortion, resulting in blurred details. Compared with the traditional methods, which exhibit notable spectral and spatial distortion, the comparison deep-learning-based methods better preserve spatial information but still suffer from slight spectral distortion. In contrast, our DMFANet largely preserves both the spectral distribution and the spatial structure, thanks to the MS feature extraction branch that learns the spectral information and the PAN feature extraction branch that preserves the spatial structure details. These results demonstrate that our DMFANet not only reconstructs a more accurate spectral distribution but also generates reasonable spatial structure details, outperforming the other selected methods.
Fig. 4. The qualitative testing results from the comparison methods under reduced resolution on the WV-II dataset.
The qualitative testing results under full resolution (Fig. 5) also demonstrate that our method better preserves spectral information and yields clearer texture details. For example, the spectral distribution of the land is largely consistent with the LRMS image, and the spatial structure details of the land are similar to those in the PAN image.
We further provide the quantitative testing results on the WV-II dataset (Table 1). DMFANet achieves the best average values of PSNR, SSIM, SAM, ERGAS, SCC, and UIQI, indicating that the fused results generated by our method are the most consistent with the reference HRMS image in terms of both spectral distribution and spatial structure details. In terms of the no-reference metrics, our DMFANet ranks second.
Results From the QB Dataset
To further validate the effectiveness of DMFANet, we conduct comparison experiments on the QB dataset. Fig. 6 shows the qualitative testing results under reduced resolution. Compared with the traditional methods, our DMFANet presents good spectral preservation, whereas the MTF_GLP results suffer from severe spatial distortion. Similarly, the deep-learning-based methods such as MSDCNN, PNN, and MUCNN cannot preserve the spectral information well. As for spatial information preservation, our proposed DMFANet rebuilds the spatial texture of buildings, outperforming the comparison methods. The quantitative results under reduced-resolution testing (Table 2) also confirm that our method performs best among the comparison methods. Furthermore, the qualitative comparison under full resolution in Fig. 7 demonstrates that the result of our proposed DMFANet is the most similar to the LRMS image in terms of spectral features and to the PAN image in terms of spatial features. Therefore, both the qualitative and quantitative results show that our proposed DMFANet achieves the best performance among the selected competing methods.
Fig. 5. The qualitative testing results from the comparison methods under full resolution on the WV-II dataset.
Ablation Study
To verify the effectiveness of each strategy in our proposed method, we perform ablation experiments on the WV-II dataset. Table 3 records the results of four variants of DMFANet. In the following, we provide a detailed analysis of each strategy.
Multi-level feature fusion: To determine the best feature fusion level, we perform comparison experiments with the fusion level ranging from 1 to 12 based on our proposed DMFANet (Table 4). The experiments show that increasing the fusion level from 1 to 5 notably improves the pansharpening performance, whereas the performance decreases when the fusion level increases further. The reason is that the input MS and PAN images can already be finely integrated with 5 fusion levels, and further increasing the fusion level makes the training inefficient.
Aggregation structure: To confirm the effectiveness of the aggregation structure, we compare the performance of DMFANet with and without the aggregation structure. From the results in the first row of Table 3, we observe that the performance of DMFANet drops when the aggregation structure is discarded; for example, the degradations in SSIM and ERGAS are 0.011 and 0.026, respectively. These results prove that the aggregation structure contributes to the performance of DMFANet.
Dual-branch structure: To verify the superiority of the dual-branch structure, we compare the performance of the dual-branch model and a single-branch model under the same settings of the other parameters. The single-branch model contains only the MS image multi-level feature extraction and aggregation branch of DMFANet. The results in the second row of Table 3 show that the dual-branch structure significantly improves the pansharpening performance.
MSFEM: To verify the effectiveness of the MSFEM, we replace the RCABs with convolutional blocks of the same filter size and other parameter settings in the MSFEM and repeat the experiments. As the third row of Table 3 shows, every metric degrades significantly, especially ERGAS, which worsens by 0.798, indicating increased spectral distortion. These results prove that the MSFEM contributes to the spectral feature extraction.
PFEM: Similarly, we replace the SA blocks with convolutional blocks of the same filter size and other parameter settings in the PFEM. The experimental results are recorded in the fourth row of Table 3: the reductions in SSIM and SCC are both 0.01, indicating more spatial distortion. These results confirm that the PFEM contributes to the spatial feature extraction.
Fig. 6. The qualitative testing results from the comparison methods under reduced resolution on the QB dataset.
Fig. 7. The qualitative testing results from the comparison methods under full resolution on the QB dataset.
Conclusion
In this letter, we propose a dual-branch multi-level feature aggregation network for pansharpening, called DMFANet. Our network consists of two branches designed with the residual feature aggregation framework. The purpose of our DMFANet is to extract the spectral distribution features and spatial structure features in an efficient and comprehensive manner via a dual-branch network, fuse them at multiple levels, and finally aggregate the fused features, thus taking full advantage of the complementary information to generate promising fusion results. Such a design allows not only the approximation of the HRMS reference image in terms of spectral distribution but also the reconstruction of reasonable spatial structure details. The experimental results on the WV-II and QB datasets demonstrate the notable superiority of our method over the current state-of-the-art methods from both quantitative and qualitative points of view.
ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China (42090012), 03 Special Research and 5G Project of Jiangxi Province in China (20212ABC03A09), and the Open Grants of the State Key Laboratory of Severe Weather (2021LASW-A17).