Introduction
With the growing number of satellite launches, remote sensing images are widely used in urban planning, resource exploration, land cover classification, environmental protection, and climate monitoring [1], [2], [3], [4]. Due to hardware limitations, satellite sensors can only acquire a high-resolution single-band image together with a matching low-resolution multispectral image. Remote sensing image fusion techniques are therefore mostly applied to panchromatic sharpening. In recent years, researchers have also made significant efforts to extend panchromatic sharpening to the fusion of hyperspectral and multispectral images [5].
Common pansharpening methods can be divided into three main categories: component substitution (CS), multiresolution analysis (MRA), and the super-resolution (SR) paradigm. CS-based methods separate the spatial and spectral components of the low-resolution multispectral (LRMS) image and then replace the spatial component with that of the panchromatic (PAN) image. Representative CS-based algorithms include principal component analysis (PCA) [6], Gram–Schmidt (GS) [7], partial replacement adaptive component substitution [8], and band-dependent spatial detail [9]. Although these methods preserve spatial details almost as rich as those of the PAN image, the fused image suffers from significant spectral distortion because the replaced components are not necessarily well matched.
The MRA-based methods mainly decompose the PAN and multispectral (MS) images into low-frequency and high-frequency components, design different fusion rules to fuse the low-frequency and high-frequency components separately, and finally obtain the fused image through the inverse transformation. Representative MRA-based algorithms include the decimated wavelet transform with an additive injection model (Indusion) [10], the additive à trous wavelet transform with a unitary injection model [11], additive wavelet luminance proportional (AWLP), a generalization of the additive wavelet luminance method to more than three bands [12], and the modulation transfer function generalized Laplacian pyramid (MTF-GLP) [13]. However, these methods are prone to distortion of the spatial structure.
The SR-based methods primarily regard the LRMS and PAN images as degraded, low-resolution versions of the high spatial resolution multispectral (HRMS) image and treat pansharpening as an image restoration process that recovers the HRMS image from these degraded observations. Representative SR-based methods include the sparse representation of injected details (SR-D) [14] and model-based fusion using PCA and wavelets [15], but such models are complex and computationally intensive. Most deep learning-based remote sensing image fusion methods are derived from super-resolution reconstruction [16]. The first three-layer convolutional neural network for feature extraction and fusion (PNN) was proposed in [17]. Yuan et al. [18] proposed a multiscale and multidepth convolutional neural network (MSDCNN) for pansharpening from a multiscale perspective. Since PAN and MS images contain different information, TFNet [19] uses a two-stream network to extract features from PAN and MS images and fuse them in the feature domain. Xiang et al. [20] proposed a weighted fusion module (MC-JAFN) that enhances important information and suppresses redundant information, greatly improving the effectiveness of pansharpening. Zhou et al. [21] proposed an unsupervised cycle-consistent generative adversarial network (UC-GAN) that learns pan-sharpening from full-scale images without ground truth. Finally, Yang et al. [22] proposed a multilevel dense parallel attention fusion framework (DPAFNet) that exploits the advantages of attention mechanisms for extracting spectral information and spatial details. However, the abovementioned methods consider neither the effect of the dimensional changes of the MS image on the fusion results nor the intrinsic correlation between PAN and MS images, and they rely on simple concatenation in the feature aggregation stage, which leads to significant information loss. To address these drawbacks, we propose a new remote sensing sharpening network. The contributions of this article are as follows.
We propose a three-stage progressive fusion structure to fully capture the spatial details from the PAN image and spectral information from the MS image. The features extracted at each stage are fused at different resolutions, allowing for the utilization of crucial information from different-resolution images.
We design a high-frequency information preservation block (HPB) that enhances the saliency of important information in the upsampled MS image while removing redundant information and reducing the number of trainable parameters in the network. We also propose a spatial aggregation module (SAM) and a bands aggregation module (BAM) to fully exploit spatial details and spectral information, respectively.
We design an enhancement fusion module (EFM), which consists of three encoders and two decoders. They are responsible for feature aggregation of input features as well as self-enhancement and interactive enhancement of features.
Proposed Method
In this section, we introduce MPEFNet, which generates HRMS images by fusing PAN and MS images. First, we describe the motivation of our study. Second, we explain the overall framework and the details of the network modules. Finally, we present the loss function used in the training phase.
A. Motivation
Due to the complex physical geography underlying the spatial distribution of land cover and the different spectral responses of different sensors, high spatial resolution MS images cannot be obtained directly from a single sensor; they can only be obtained by fusing high spatial resolution PAN images with MS images [23].
The objective of pan-sharpening is similar to that of super-resolution: to optimize the image across scale levels, which makes spatial scale a crucial factor in restoring high-resolution images. However, large changes in image size introduce a large amount of irrelevant information, hampering the capture of important details and impeding network training. Inspired by LapSRN [24], we design a sharpening network that progressively upscales the low-resolution image to improve both the accuracy and efficiency of the algorithm.
The overall framework of MPEFNet is shown in Fig. 1. It consists of four main modules: REB for feature extraction, SAM for extracting deeper spatial features based on the spatial attributes of PAN images, BAM for retaining more spectral information based on the spectral attributes of MS images, and EFM for enhancing the fused features. Denoting the PAN image by $P$ and the upsampled MS image by $\hat{M}$, the fusion process can be expressed as
\begin{equation*}
{\rm{HRMS\ }} = \hat{M} + \varphi \left({\hat{M},P} \right) \tag{1}
\end{equation*}
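To make the residual formulation in (1) concrete, the following minimal PyTorch sketch illustrates it; `fusion_net` stands in for the learned mapping $\varphi$, and the bicubic upsampling choice is an assumption rather than the paper's exact setting.

```python
import torch
import torch.nn.functional as F

def pansharpen(ms_lr, pan, fusion_net):
    """Residual formulation of Eq. (1): HRMS = upsampled MS + predicted detail.

    ms_lr:      (B, C, h, w) low-resolution multispectral image
    pan:        (B, 1, H, W) panchromatic image with H = 4h, W = 4w
    fusion_net: any callable predicting the detail image phi(M_hat, P)
    """
    m_hat = F.interpolate(ms_lr, size=pan.shape[-2:], mode='bicubic',
                          align_corners=False)   # upsampled MS, i.e., M_hat
    detail = fusion_net(m_hat, pan)              # phi(M_hat, P)
    return m_hat + detail                        # Eq. (1)
```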
B. Network Framework
In Fig. 1, the PAN image is first matched in size with the LRMS image by downsampling, producing a down-resolution PAN image.
The first two stages are multiresolution learning to fully exploit the spectral information and spatial details of MS images and PAN images of different resolutions, which emulates the structure of DBPN [25] in processing multiresolution images. In the last stage, we employ a nondownsampling approach to process the image information, aiming to improve the utilization of the original information. The details of each module in the network are as described below.
1) High-Frequency Preserving Block
A new HPB is proposed to reduce the resolution of the processed features and thereby reduce the computational cost. However, reducing the feature-map size often leads to the loss of image details, which results in visually unnatural fused images. To solve this issue, the HPB contains two components that we propose: the high-frequency filtering module (HFM) and the adaptive residual feature block (ARFB).
As shown in Fig. 2, the high-frequency feature $M_{\text{high}}$ obtained by the HFM is further processed as
\begin{equation*}
M^{\prime\prime}_{\text{high}} = \text{ARFB}\left({{M}_{\text{high}}}\right) \copyright \left\{ \uparrow \text{ARFB}^{\circlearrowright 5}\left({\downarrow {M}_{\text{high}}}\right) \right\} \tag{2}
\end{equation*}
A 1 × 1 convolution is then used to reduce the number of channels of the concatenated features.
The HFM in Fig. 2 is used to estimate the high-frequency information of the LRMS images. Here, k denotes the kernel size of the average pooling, and the value of each point on the intermediate feature map represents the average intensity of the corresponding pooled region of the input feature map. The intermediate feature map is then upsampled to the original input size and subtracted from the original input features to obtain their high-frequency information.
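A minimal PyTorch sketch of this high-frequency estimate is given below; the pooling kernel size and the upsampling mode are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HFM(nn.Module):
    """High-frequency filtering module (sketch): subtract a pooled-and-upsampled
    low-frequency approximation from the input features."""
    def __init__(self, k=2):
        super().__init__()
        # non-overlapping average pooling: each output is the mean of one region
        self.pool = nn.AvgPool2d(kernel_size=k)

    def forward(self, x):
        low = self.pool(x)                               # low-frequency map
        low = F.interpolate(low, size=x.shape[-2:],
                            mode='nearest')              # back to input size
        return x - low                                   # high-frequency residue
```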
As shown in Fig. 2, the ARFB consists of two residual units (RU) and two convolutional layers, where each RU contains two modules, reduction (RE) and expansion (EX), which reduce and restore the number of channels, respectively. The process can be expressed as follows:
\begin{equation*}
{y}_{\text{RU}} = {\lambda}_{\text{res}} \otimes EX\left({RE\left({{x}_{\text{RU}}}\right)}\right) \oplus {\lambda}_{\text{res}} \otimes {x}_{\text{RU}} \tag{3}
\end{equation*}
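For illustration, the following PyTorch sketch shows one residual unit consistent with (3); the channel reduction ratio, kernel sizes, and the use of two separate learnable scales are assumptions.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual unit of the ARFB (sketch): reduce channels, restore them, and
    blend the result with the input via learnable scalars, as in Eq. (3)."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, kernel_size=3, padding=1)  # RE
        self.expand = nn.Conv2d(mid, channels, kernel_size=3, padding=1)  # EX
        # separate scales for the two branches (an assumption; the paper writes
        # lambda_res for both weighted terms)
        self.scale_res = nn.Parameter(torch.ones(1))
        self.scale_id = nn.Parameter(torch.ones(1))

    def forward(self, x):
        res = self.expand(torch.relu(self.reduce(x)))    # EX(RE(x_RU))
        return self.scale_res * res + self.scale_id * x  # weighted residual sum
```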
2) Spatial Aggregation Module
The detailed structure of the Spatial Aggregation Module is shown in Fig. 3. Feature integration theory [27] suggests that human vision perceives objects by extracting basic contextual features and associating individual features through attention. However, a CNN that relies solely on local region perception is insufficient to learn diverse contextual features [28], [29]. Thus, we propose a spatial aggregation module to capture contextual multiorder interactions, which consists of two cascaded components, FDM and MGA. Given an input feature $X$, the process is formulated as
\begin{equation*}
Z = X + \text{MGA}\left({\text{FDM}\left({Norm\left(X \right)} \right)} \right) \tag{4}
\end{equation*}
\begin{align*}
&Y = \text{Conv}_{1 \times 1} \left(X \right) \tag{5}\\
&Z = \text{GELU}\left({Y + {\gamma }_s \odot \left({Y - GAP\left(Y \right)} \right)} \right) \tag{6}
\end{align*}
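A minimal sketch of the operation in (5) and (6), which we read as the feature decomposition step: GAP denotes global average pooling, and $\gamma_s$ is treated here as a learnable channel-wise scale. Both the class name and this exact parameterization are assumptions.

```python
import torch
import torch.nn as nn

class FeatureDecomposition(nn.Module):
    """Sketch of Eqs. (5)-(6): re-weight the difference between local features
    and their global average, then apply GELU."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)    # Eq. (5)
        self.gamma_s = nn.Parameter(torch.zeros(1, channels, 1, 1)) # learnable scale
        self.gap = nn.AdaptiveAvgPool2d(1)                          # GAP
        self.act = nn.GELU()

    def forward(self, x):
        y = self.proj(x)
        return self.act(y + self.gamma_s * (y - self.gap(y)))       # Eq. (6)
```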
3) Bands Aggregation Module
Inspired by the latest breakthroughs in transformers [33], we have reconsidered the basic convolutional blocks used for feature extraction in the pansharpening task, aiming to improve the ability to mine spectral information. As depicted in Fig. 4, the BAM consists of two parts: the multiscale large kernel attention module (MLKA) and the gate channel attention unit (GCAU). Assuming that the input feature is $X$, the calculation process of the BAM is as follows:
\begin{align*}
L =& \text{LN}\left(X \right)\\
X =& X + {F}_3\left({\text{MLKA}\left({{F}_1\left(L \right)} \right) \otimes {F}_2\left(L \right)} \right)\\
L =& \text{LN}\left(X \right)\\
X =& X + {F}_6\left({\text{GCAU}\left({{F}_4\left(L \right),{F}_5\left(L \right)} \right)} \right) \tag{7}
\end{align*}
Previous attention mechanisms cannot model local and long-range dependencies simultaneously, so, inspired by a recent visual attention method [34], we combine large-kernel decomposition and multiscale learning to address this problem. Specifically, MLKA comprises three main functions: large-kernel attention (LKA) to establish interdependence, a multiscale mechanism to acquire dependencies at heterogeneous scales, and gated aggregation for dynamic recalibration.
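For reference, large-kernel attention is commonly approximated by cascading a depthwise convolution, a depthwise dilated convolution, and a pointwise convolution; the single-scale sketch below follows this standard decomposition, with kernel sizes and dilation chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """One large-kernel attention branch (sketch): approximate a large receptive
    field with depthwise + depthwise-dilated + pointwise convolutions, then use
    the result as an attention map over the input."""
    def __init__(self, channels, k=7, dilation=3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=5, padding=2,
                            groups=channels)                      # local context
        self.dw_dilated = nn.Conv2d(channels, channels, kernel_size=k,
                                    padding=(k // 2) * dilation,
                                    dilation=dilation, groups=channels)  # long range
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)    # channel mixing

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return attn * x   # gate the input with the attention map
```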
Encouraged by [35] and [36], we combine simple channel attention and a gated linear unit in the proposed GCAU to realize an adaptive gating mechanism while minimizing parameters and computation. To capture spectral information more efficiently, we employ a single-layer depthwise convolution to weight the feature maps.
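A deliberately hypothetical sketch of how such a unit could gate one projected input in (7) with a channel-recalibrated second input; the exact structure of the paper's GCAU may differ.

```python
import torch
import torch.nn as nn

class GCAU(nn.Module):
    """Gate channel attention unit (hypothetical sketch): one branch is weighted
    by a single depthwise convolution and acts as the gate, the other is
    recalibrated by simple channel attention, and the two are multiplied as in
    a gated linear unit."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                            groups=channels)                 # depthwise weighting
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels, kernel_size=1))
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, gate_in, value_in):
        gate = self.dw(gate_in)
        value = value_in * self.ca(value_in)   # simple channel attention
        return self.proj(gate * value)         # gated aggregation
```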
4) Enhancement Fusion Module
Some methods [37], [38] utilize transformers in a Siamese structure to fuse images of two different modalities and enhance the quality of the fused images. However, these works use a single encoder and decoder to model the relationship between the two modalities and therefore do not implement multiple self-enhancements and interactive enhancements. We therefore propose an enhancement fusion transformer module composed of three encoders and two decoders for the self-enhancement and interactive enhancement of the aggregated features and the modality-specific features, respectively.
Unlike previous transformers, we separate the encoder and decoder [39]. Three separate encoders self-enhance the features from SAM and BAM as well as the specific features of PAN and LRMS obtained after convolutional layers, while two separate decoders further enhance these encoded features interactively. To reduce model complexity, we employ a single-head attention mechanism in which the k and v matrices share weights in both the encoder and the decoder.
We suppose that $X_1$, $X_2$, and $X_3$ denote the three input features of the encoders. The encoding process is expressed as
\begin{align*}
X_1^E =& \text{Encoder}\left({{X}_1} \right) \in {R}^{C \times H \times W}\\
X_2^E =& \text{Encoder}\left({{X}_2} \right) \in {R}^{C \times H \times W}\\
X_3^E =& \text{Encoder}\left({{X}_3} \right) \in {R}^{C \times H \times W} \tag{8}
\end{align*}
The decoder is designed to interactively enhance the aggregated features and the attribute-based specific features produced by the encoders. The decoding process is described as follows:
\begin{align*}
Y =& \left\{ {\text{Decoder}\left({X_1^E,\ X_2^E} \right) \oplus \text{Decoder}\left({X_3^E,\ X_2^E} \right)} \right\}\\
& \in {R}^{C \times H \times W} \tag{9}
\end{align*}
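A hedged sketch of the single-head attention with a shared key/value projection described above; tokenization, normalization, and feed-forward layers are omitted, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SingleHeadSharedKVAttention(nn.Module):
    """Single-head attention in which the key and value share one projection;
    with query == source it acts as self-enhancement (encoder), and with a
    second input it acts as interactive enhancement (decoder)."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim)   # shared projection for k and v
        self.scale = dim ** -0.5

    def forward(self, query, source):
        # query, source: (B, N, C) token sequences (flattened H*W positions)
        q = self.to_q(query)
        kv = self.to_kv(source)            # k and v use the same weights
        attn = torch.softmax(q @ kv.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ kv
```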
5) Loss Function
We optimize MPEFNet by reducing the error between the fused images and the GT. Following previous fusion methods [40], [41], we employ a pixel-level loss; specifically, we adopt the $\ell_1$ loss, defined as
\begin{equation*}
{l}_1 \left(\theta \right) = \frac{1}{N}\ \mathop \sum \limits_{i = 1}^N {\left| {\varphi \left({{P}^i,{M}^i;\theta } \right) - G{T}^i} \right|}_1 \tag{10}
\end{equation*}
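As a concrete reference, a minimal PyTorch sketch of the loss in (10); here the network output $\varphi(P, M; \theta)$ is assumed to be precomputed as `fused`.

```python
import torch

def l1_loss(fused: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean absolute error between fused images and ground truth, the usual
    implementation of the l1 loss in Eq. (10), averaged over pixels and batch."""
    return torch.mean(torch.abs(fused - gt))
```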
Experiments and Analysis
In this section, we provide details of the experimental procedure, including the datasets, comparison methods, and qualitative and quantitative analyses of the experimental results. We first introduce the dataset sources and their properties. Then, we give the equipment and parameter settings, and an ablation study is provided to demonstrate the effectiveness of each structure. Finally, the experimental results and a comparative analysis of the proposed method are presented.
A. Experimental Details
1) Datasets and Metrics
We choose two datasets to validate our method: IKONOS and WorldView-2 (WV-2). IKONOS contains four bands, red (R), green (G), blue (B), and near-infrared (NIR), and provides PAN images with a spatial resolution of 1 m and MS images of 4 m. WV-2 contains eight bands, red (R), green (G), blue (B), near-infrared 1 (NIR1), coastal blue, yellow, red edge, and near-infrared 2 (NIR2), and provides PAN images with a spatial resolution of 0.5 m and MS images of 2 m. Due to the lack of ground truth (GT), we degraded the MS and PAN images according to Wald's protocol [13] to obtain the down (D)-resolution data and used the original MS images as GT; thus, the size of the input MS image is 64 × 64 × B and that of the PAN image is 256 × 256, while the full (F)-resolution MS image is 256 × 256 × B and the corresponding PAN image is 1024 × 1024. For each dataset, we conducted simulated and real experiments. In the simulated experiments, we trained our method using 120 pairs of IKONOS images and 400 pairs of WV-2 images, and tested it with 80 pairs of IKONOS images and 100 pairs of WV-2 images, respectively. We evaluated the fusion results using subjective evaluation and objective indicators. First, we visualize the fusion results on the monitor so that they can be conveniently assessed. Then, we use several common metrics to quantitatively evaluate the fused images. The quantitative evaluation metrics in the simulated experiments include the erreur relative globale adimensionnelle de synthèse (ERGAS) [44], SAM [44], [45], the spatial correlation coefficient (SCC) [45], and the image quality index (Q) [46]. For the real experiments, we evaluate the fused images using the quality with no reference (QNR) index [47], which is calculated by
\begin{equation*}
\text{QNR} = {(1 - {D}_S)}^\alpha {(1 - {D}_\lambda)}^\beta . \tag{11}
\end{equation*}
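Given the spatial distortion index $D_S$ and the spectral distortion index $D_\lambda$, (11) combines them as follows; the common default $\alpha = \beta = 1$ is an assumption here.

```python
def qnr(d_s: float, d_lambda: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Quality with no reference, Eq. (11): higher is better, 1 is ideal."""
    return ((1.0 - d_s) ** alpha) * ((1.0 - d_lambda) ** beta)
```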
2) Comparison Method
We demonstrate the effectiveness of our method by comparing it with eight fusion methods: GS [7] from the CS class, MTF-GLP [13] from the MRA class, and MSDCNN [18], A-PNN [17], TFNet [19], MC-JAFN [20], UC-GAN [21], and DPAFNet [22]. The first two are traditional methods and the last six are DL-based methods. Among them, UC-GAN is an unsupervised algorithm and the others are supervised. For experimental fairness, all comparison methods were trained and tested under the same experimental settings, using the same training and testing datasets as this study.
3) Experimental Settings
All comparison methods are run on a single Nvidia GeForce RTX 2080Ti GPU using the PyTorch platform. We use the Adam optimizer to minimize the loss. The learning rate is set to 0.0001, each method is trained for 500 epochs separately, and the model is saved every 100 epochs.
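A hedged sketch of this optimization setup is given below; the model and data loader are placeholder arguments, and only the optimizer, learning rate, epoch count, and checkpoint interval follow the text.

```python
import torch

def train_mpefnet(model, train_loader, epochs=500, lr=1e-4, save_every=100):
    """Training loop matching the stated settings: Adam, lr 1e-4, 500 epochs,
    checkpoint every 100 epochs. `model` and `train_loader` are placeholders
    supplied by the caller (train_loader yields (pan, lrms, gt) batches)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for pan, lrms, gt in train_loader:
            fused = model(pan, lrms)
            loss = torch.mean(torch.abs(fused - gt))   # l1 loss, Eq. (10)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if (epoch + 1) % save_every == 0:              # save every 100 epochs
            torch.save(model.state_dict(), f"mpefnet_epoch{epoch + 1}.pth")
```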
B. Comparative Analysis
In this section, our algorithm is compared with eight methods described previously in experiments on simulated and real data, and the fusion quality of each method is evaluated using qualitative and quantitative analyses.
1) Simulated Experiments
In these experiments, the PAN and LRMS images were downsampled to simulate the down-resolution input PAN and MS images, while the original MS images were used as GT to evaluate the quality of the pan-sharpened results. The performance of the nine remote sensing image fusion algorithms is tested on the two datasets, WV-2 and IKONOS, through qualitative and quantitative analysis.
First, we compared the nine fusion methods through qualitative analysis. From the simulated experiments, two sets of images were selected that typically highlight the advantages and shortcomings of the various methods, as shown in Figs. 6 and 7, which contain the results of the nine compared algorithms together with the GT images. To represent the effect of each fusion method more intuitively, the absolute error map of each fusion result with respect to the GT is given for each band at the bottom of Figs. 6 and 7 (the darker the color, the better the fusion performance). As can be seen in Figs. 6 and 7, the fusion results of the GS method show a pronounced blue blurring. Compared with GS, MTF-GLP improves spectral retention, but there is still partial loss of spatial information, such as blurred edge structures of buildings and spatial distortion of the parking space lines. Comparing the images in Figs. 6 and 7, the results of the DL-based approaches most closely resemble the GT in both spatial detail and spectral fidelity, whereas the traditional methods show different degrees of spatial distortion, for instance, in the tiny neighborhood housing and factory buildings in Fig. 6 and the ground transportation routes in Fig. 7. Among the six deep learning methods, although overall spectral distortion is effectively alleviated, spatial blurring is still evident for MSDCNN [18], A-PNN [17], and TFNet [19], for example, in the spatial distribution of the lake edge and the woods within the lake on the right side of Fig. 6. In Fig. 7, although the six DL-based comparison methods achieve better recovery of spatial information and less spectral distortion in general, there are still shortcomings in the details; for example, the white building at the bottom right of Fig. 7 is confused with the rooftop sign, producing spatial blurring, and GS, MTF-GLP, MSDCNN, A-PNN, TFNet, and MC-JAFN lose the spectral information of the small green structure on the roof. On the other hand, UC-GAN, DPAFNet, and the proposed method not only have uniform spectral distributions but are also close to the GT.
Result images of the nine methods and the GT on the IKONOS simulated dataset, together with the absolute error images. Zoom in to see more details.
Result images of the nine methods and the GT on the WV-2 simulated dataset, together with the absolute error images. Zoom in to see more details.
Since subjective evaluation varies from person to person, specific quantitative indicators are also needed to provide a more reasonable and fair assessment of fusion quality, thus avoiding misjudgments of spectral artifacts and spatial distortions arising from subjective analysis. Detailed quantitative results for each method are given in Tables I and II. The left part of each table reports the measurements obtained on the simulated dataset, and the right part those obtained on the real dataset. The results in Tables I and II show that the DL-based methods achieve better quantitative performance than the traditional methods, while our proposed method outperforms the other comparative methods and retains more spectral information and spatial details.
2) Real Experiments
The fusion results of the real experiments are shown in Figs. 8 and 9. Due to the lack of GT in the real dataset, we provide enlarged views of some details below the fusion results, marked by red and blue boxes, respectively. As illustrated in the figures, the fusion results of GS and MTF-GLP show obvious spectral distortion and an overall whitening of the images. The fusion results of MSDCNN, A-PNN, TFNet, and MC-JAFN show obvious spatial and spectral distortion; the yellowish color of the fused houses and the over-sharpening of edges can be seen in the zoomed-in images below. Since it is difficult to objectively and fairly distinguish the fusion results of the remaining methods based only on visual subjective evaluation, we refer to the quantitative indicators in Tables I and II. From Table I, we can see that for the fusion results in Fig. 8, our method achieves the best results on the corresponding metrics.
Result images of the nine methods on the IKONOS real dataset; the lower part shows magnified details of the fused results (red and blue boxes). Zoom in to see more details.
Result images of the nine methods on the WV-2 real dataset; the lower part shows magnified details of the fused results (red and blue boxes). Zoom in to see more details.
C. Ablation Study
In order to verify the effectiveness of each module designed in this work, we conducted 8 ablation experiments on the WV-2 simulated dataset. By employing the controlled variable method, we compared the impact of different modules on the network (see Table III for details).
w/o HPB: This experiment tests the effect of the HPB on the network. Compared with MPEFNet, we remove all HPBs from the network; the channels of LRMS and PAN are concatenated directly after passing through the REB.
w/o BAM&SAM: Compared with MPEFNet, this ablation removes only the BAM and SAM, so the input feature maps are processed by 4 × REB and then added to form the output.
w/o EFM: We use addition instead of EFM, which means that three feature inputs are directly added as output.
w/o Stage 1 (S1)&S2: The abovementioned three ablation experiments keep the original framework unchanged. In contrast, this experiment removes the fusion procedures at the 64 × 64 and 128 × 128 image sizes from the original network.
MPEFNet: our complete proposed method, which contains all of the components removed in the abovementioned ablation experiments.
We also conducted ablation experiments to validate the effects of two or more modules simultaneously. Please refer to Table III for detailed information.
Table III shows the average index results of each ablation model, Fig. 10 shows the fusion results of the ablation experiments conducted on the WV-2 simulated dataset, and Fig. 11 presents a bar chart illustrating the changes in the objective indicators for the different ablation models. From Table III and Figs. 10 and 11, it can be observed that the network is most influenced by the EFM: its absence results in severe over-sharpening of spatial structures and loss of spectral information in the fused image. Additionally, through the controlled-variable comparisons, the impact of the other modules on the network can be determined. The absolute error maps show that the proposed model has smaller residuals than the other ablation models, which is further confirmed by the quantitative metrics in Table III.
Results images of different types of ablation experiments on the WV-2 simulated dataset, and the absolute error images.
The next most influential module is BAM&SAM; the absolute error maps show severe spatial ambiguity after the BAM and SAM are removed. Without the HPB, BAM, SAM, or S1&S2 components, the fusion results are also inferior to some extent. Therefore, these components are crucial to the performance of the proposed network. Their importance, from high to low, is as follows: EFM, BAM&SAM, S1&S2, and HPB.
Conclusion
In this article, we propose a novel remote sensing image pan-sharpening network called MPEFNet, which employs a segmented progressive fusion structure. While each stage has a similar processing procedure, the resolution of the input image in each stage varies and increases by a factor of two progressively. Before feeding the upsampled MS image into each stage, we first apply an HPB to reduce information redundancy and computational complexity caused by the variation in image size. Next, during the feature extraction process, we introduce SAM and BAM to enhance spatial detail features and spectral information features, respectively. Finally, in each stage, we introduce EFM for self-enhancement and mutual enhancement of important features.