Introduction
With the growing number of satellite launches, remote sensing images are widely used in urban planning, resource exploration, land cover classification, environmental protection, and climate monitoring [1], [2], [3], [4]. Due to hardware limitations, satellite sensors can only acquire a high-resolution single-band image together with a matching low-resolution multispectral image. Remote sensing image fusion techniques are therefore mostly applied to panchromatic sharpening. In recent years, researchers have also made significant efforts to extend panchromatic sharpening to the fusion of hyperspectral and multispectral images [5].
Common pansharpening methods can be divided into three main categories: component substitution (CS), multiresolution analysis (MRA), and the super-resolution (SR) paradigm. CS-based methods separate the spatial and spectral components of the low-resolution multispectral (LRMS) image and then replace the spatial component with that of the panchromatic (PAN) image. Representative CS-based algorithms include principal component analysis (PCA) [6], Gram–Schmidt (GS) [7], partial replacement adaptive component substitution [8], and band-dependent spatial detail [9]. Although these methods preserve spatial details almost as rich as those of the PAN image, the fused image suffers from significant spectral distortion because the replaced components are not necessarily well matched.
The MRA-based methods mainly decompose the PAN and multispectral (MS) images into low-frequency and high-frequency components, design different fusion rules to fuse the low-frequency and high-frequency components separately, and finally obtain the fused image through the inverse transformation. Representative MRA-based algorithms include the decimated wavelet transform with an additive injection model (Indusion) [10], the additive à trous wavelet transform with a unitary injection model [11], additive wavelet luminance proportional (AWLP), a generalization of the additive wavelet luminance method to more than three bands [12], and the modulation transfer function generalized Laplacian pyramid (MTF-GLP) [13]. However, these methods are prone to distortion of the spatial structure.
The SR-based methods primarily regard the LRMS and PAN images as degraded, low-resolution versions of the high spatial resolution multispectral (HRMS) image and treat pansharpening as an image restoration process that recovers the HRMS image from these degraded observations. Representative SR-based methods include the sparse representation of injected details (SR-D) [14] and model-based fusion using PCA and wavelets [15], but such models are complex and computationally intensive. Most deep learning-based remote sensing image fusion methods are derived from super-resolution reconstruction [16]. The first three-layer convolutional neural network for feature extraction and fusion (PNN) was proposed in [17]. Yuan et al. [18] proposed a multiscale and multidepth convolutional neural network (MSDCNN) for pansharpening from a multiscale perspective. Since PAN and MS images contain different information, TFNet [19] uses a two-stream network to extract features from PAN and MS images and fuse them in the feature domain. Xiang et al. [20] proposed a weighted fusion module (MC-JAFN) that enhances important information and suppresses redundant information, greatly improving the effectiveness of pansharpening. Zhou et al. [21] proposed an unsupervised cycle-consistent generative adversarial network (UC-GAN) that learns pan-sharpening from full-scale images without ground truth. Finally, Yang et al. [22] proposed a multilevel dense parallel attention fusion framework (DPAFNet) that exploits the advantages of attention mechanisms for extracting spectral information and spatial details. However, the abovementioned methods consider neither the effect of the dimensional changes of the MS image on the fusion results nor the intrinsic correlation between PAN and MS images, and they rely on simple concatenation in the feature aggregation stage, which leads to significant information loss. To address these drawbacks, we propose a new remote sensing sharpening network. The contributions of this article are as follows.
We propose a three-stage progressive fusion structure to fully capture the spatial details from the PAN image and spectral information from the MS image. The features extracted at each stage are fused at different resolutions, allowing for the utilization of crucial information from different-resolution images.
We design a high-frequency information preservation block (HPB) that enhances the saliency of important information in the upsampled MS image while removing redundant information and reducing the number of trainable parameters in the network. We also propose a spatial aggregation module (SAM) and a bands aggregation module (BAM) to fully exploit spatial details and spectral information, respectively.
We design an enhancement fusion module (EFM), which consists of three encoders and two decoders. They are responsible for feature aggregation of input features as well as self-enhancement and interactive enhancement of features.
Proposed Method
In this section, we introduce MPEFNet, which generates HRMS images by fusing PAN and MS images. First, we describe the motivation of our study. Second, we explain the overall framework and the details of the network modules. Finally, we present the loss function used in the training phase.
A. Motivation
Due to the complex physical geography underlying the spatial distribution of land cover and the different spectral responses of different sensors, high spatial resolution MS images cannot be obtained directly from a single sensor; they can only be obtained by fusing high spatial resolution PAN images with MS images [23].
The objective of pan-sharpening is similar to that of super-resolution: to optimize the image across scale levels, which makes spatial scale a crucial factor in restoring high-resolution images. However, large changes in image size introduce a large amount of irrelevant information, hampering the capture of important details and impeding network training. Inspired by LapSRN [24], we design a sharpening network that progressively upscales the low-resolution image to improve both the accuracy and efficiency of the algorithm.
The overall framework of MPEFNet is shown in Fig. 1. It consists of four main modules: REB for feature extraction, SAM for extracting deeper spatial features based on the spatial attributes of PAN images, BAM for retaining more spectral information based on the spectral attributes of MS images, and EFM for enhancing the fused features. Denoting the PAN image by $P$ and the upsampled MS image by $\hat{M}$, the fusion process can be expressed as
\begin{equation*}
{\rm{HRMS\ }} = \hat{M} + \varphi \left({\hat{M},P} \right) \tag{1}
\end{equation*}
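To make the residual formulation in (1) concrete, the following minimal PyTorch sketch illustrates it; `fusion_net` stands in for the learned mapping $\varphi$, and the bicubic upsampling choice is an assumption rather than the paper's exact setting.

```python
import torch
import torch.nn.functional as F

def pansharpen(ms_lr, pan, fusion_net):
    """Residual formulation of Eq. (1): HRMS = upsampled MS + predicted detail.

    ms_lr:      (B, C, h, w) low-resolution multispectral image
    pan:        (B, 1, H, W) panchromatic image with H = 4h, W = 4w
    fusion_net: any callable predicting the detail image phi(M_hat, P)
    """
    m_hat = F.interpolate(ms_lr, size=pan.shape[-2:], mode='bicubic',
                          align_corners=False)   # upsampled MS, i.e., M_hat
    detail = fusion_net(m_hat, pan)              # phi(M_hat, P)
    return m_hat + detail                        # Eq. (1)
```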
B. Network Framework
In Fig. 1, the PAN image is first matched in size with the LRMS image by downsampling, producing a down-resolution PAN image.
The first two stages are multiresolution learning to fully exploit the spectral information and spatial details of MS images and PAN images of different resolutions, which emulates the structure of DBPN [25] in processing multiresolution images. In the last stage, we employ a nondownsampling approach to process the image information, aiming to improve the utilization of the original information. The details of each module in the network are as described below.
1) High-Frequency Preserving Block
A new HPB is proposed to reduce the resolution of the processed features and thereby reduce the computational cost. However, reducing the feature-map size often leads to the loss of image details, which results in visually unnatural fused images. To solve this issue, the HPB contains two components that we propose: the high-frequency filtering module (HFM) and the adaptive residual feature block (ARFB).
As shown in Fig. 2, the high-frequency feature $M_{\text{high}}$ obtained by the HFM is further processed as
\begin{equation*}
M^{\prime\prime}_{\text{high}} = \text{ARFB}\left({{M}_{\text{high}}}\right) \copyright \left\{ \uparrow \text{ARFB}^{\circlearrowright 5}\left({\downarrow {M}_{\text{high}}}\right) \right\} \tag{2}
\end{equation*}
A 1 × 1 convolution is then used to reduce the number of channels of the concatenated features.
The HFM in Fig. 2 is used to estimate the high-frequency information of the LRMS images. Here, k denotes the kernel size of the average pooling, and the value of each point on the intermediate feature map represents the average intensity of the corresponding pooled region of the input feature map. The intermediate feature map is then upsampled to the original input size and subtracted from the original input features to obtain their high-frequency information.
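A minimal PyTorch sketch of this high-frequency estimate is given below; the pooling kernel size and the upsampling mode are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HFM(nn.Module):
    """High-frequency filtering module (sketch): subtract a pooled-and-upsampled
    low-frequency approximation from the input features."""
    def __init__(self, k=2):
        super().__init__()
        # non-overlapping average pooling: each output is the mean of one region
        self.pool = nn.AvgPool2d(kernel_size=k)

    def forward(self, x):
        low = self.pool(x)                               # low-frequency map
        low = F.interpolate(low, size=x.shape[-2:],
                            mode='nearest')              # back to input size
        return x - low                                   # high-frequency residue
```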
As shown in Fig. 2, the ARFB consists of two residual units (RU) and two convolutional layers, where each RU contains two modules, reduction (RE) and expansion (EX), which reduce and restore the number of channels, respectively. The process can be expressed as follows:
\begin{equation*}
{y}_{\text{RU}} = {\lambda}_{\text{res}} \otimes EX\left({RE\left({{x}_{\text{RU}}}\right)}\right) \oplus {\lambda}_{\text{res}} \otimes {x}_{\text{RU}} \tag{3}
\end{equation*}
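For illustration, the following PyTorch sketch shows one residual unit consistent with (3); the channel reduction ratio, kernel sizes, and the use of two separate learnable scales are assumptions.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual unit of the ARFB (sketch): reduce channels, restore them, and
    blend the result with the input via learnable scalars, as in Eq. (3)."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, kernel_size=3, padding=1)  # RE
        self.expand = nn.Conv2d(mid, channels, kernel_size=3, padding=1)  # EX
        # separate scales for the two branches (an assumption; the paper writes
        # lambda_res for both weighted terms)
        self.scale_res = nn.Parameter(torch.ones(1))
        self.scale_id = nn.Parameter(torch.ones(1))

    def forward(self, x):
        res = self.expand(torch.relu(self.reduce(x)))    # EX(RE(x_RU))
        return self.scale_res * res + self.scale_id * x  # weighted residual sum
```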
2) Spatial Aggregation Module
The detailed structure of the Spatial Aggregation Module is shown in Fig. 3. Feature integration theory [27] suggests that human vision perceives objects by extracting basic contextual features and associating individual features through attention. However, a CNN that relies solely on local region perception is insufficient to learn diverse contextual features [28], [29]. Thus, we propose a spatial aggregation module to capture contextual multiorder interactions, which consists of two cascaded components, FDM and MGA. Given an input feature $X$, the process is formulated as
\begin{equation*}
Z = X + \text{MGA}\left({\text{FDM}\left({Norm\left(X \right)} \right)} \right) \tag{4}
\end{equation*}
\begin{align*}
&Y = \text{Conv}_{1 \times 1} \left(X \right) \tag{5}\\
&Z = \text{GELU}\left({Y + {\gamma }_s \odot \left({Y - GAP\left(Y \right)} \right)} \right) \tag{6}
\end{align*}
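A minimal sketch of the operation in (5) and (6), which we read as the feature decomposition step: GAP denotes global average pooling, and $\gamma_s$ is treated here as a learnable channel-wise scale. Both the class name and this exact parameterization are assumptions.

```python
import torch
import torch.nn as nn

class FeatureDecomposition(nn.Module):
    """Sketch of Eqs. (5)-(6): re-weight the difference between local features
    and their global average, then apply GELU."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)    # Eq. (5)
        self.gamma_s = nn.Parameter(torch.zeros(1, channels, 1, 1)) # learnable scale
        self.gap = nn.AdaptiveAvgPool2d(1)                          # GAP
        self.act = nn.GELU()

    def forward(self, x):
        y = self.proj(x)
        return self.act(y + self.gamma_s * (y - self.gap(y)))       # Eq. (6)
```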
3) Bands Aggregation Module
Inspired by the latest breakthroughs in transformers [33], we have reconsidered the basic convolutional blocks used for feature extraction in the pansharpening task, aiming to improve the ability to mine spectral information. As depicted in Fig. 4, the BAM consists of two parts: the multiscale large kernel attention module (MLKA) and the gate channel attention unit (GCAU). Assuming that the input feature is $X$, the calculation process of the BAM is as follows:
\begin{align*}
L =& \text{LN}\left(X \right)\\
X =& X + {F}_3\left({\text{MLKA}\left({{F}_1\left(L \right)} \right) \otimes {F}_2\left(L \right)} \right)\\
L =& \text{LN}\left(X \right)\\
X =& X + {F}_6\left({\text{GCAU}\left({{F}_4\left(L \right),{F}_5\left(L \right)} \right)} \right) \tag{7}
\end{align*}
Previous attention mechanisms cannot model local and long-range dependencies simultaneously, so, inspired by a recent visual attention method [34], we combine large-kernel decomposition and multiscale learning to address this problem. Specifically, MLKA comprises three main functions: large-kernel attention (LKA) to establish interdependence, a multiscale mechanism to acquire dependencies at heterogeneous scales, and gated aggregation for dynamic recalibration.
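For reference, large-kernel attention is commonly approximated by cascading a depthwise convolution, a depthwise dilated convolution, and a pointwise convolution; the single-scale sketch below follows this standard decomposition, with kernel sizes and dilation chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """One large-kernel attention branch (sketch): approximate a large receptive
    field with depthwise + depthwise-dilated + pointwise convolutions, then use
    the result as an attention map over the input."""
    def __init__(self, channels, k=7, dilation=3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=5, padding=2,
                            groups=channels)                      # local context
        self.dw_dilated = nn.Conv2d(channels, channels, kernel_size=k,
                                    padding=(k // 2) * dilation,
                                    dilation=dilation, groups=channels)  # long range
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)    # channel mixing

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return attn * x   # gate the input with the attention map
```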
Encouraged by [35] and [36], we combine simple channel attention and a gated linear unit in the proposed GCAU to realize an adaptive gating mechanism while minimizing parameters and computation. To capture spectral information more efficiently, we employ a single-layer depthwise convolution to weight the feature maps.
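A deliberately hypothetical sketch of how such a unit could gate one projected input in (7) with a channel-recalibrated second input; the exact structure of the paper's GCAU may differ.

```python
import torch
import torch.nn as nn

class GCAU(nn.Module):
    """Gate channel attention unit (hypothetical sketch): one branch is weighted
    by a single depthwise convolution and acts as the gate, the other is
    recalibrated by simple channel attention, and the two are multiplied as in
    a gated linear unit."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                            groups=channels)                 # depthwise weighting
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels, kernel_size=1))
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, gate_in, value_in):
        gate = self.dw(gate_in)
        value = value_in * self.ca(value_in)   # simple channel attention
        return self.proj(gate * value)         # gated aggregation
```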
4) Enhancement Fusion Module
Some methods [37], [38] utilize transformers in a Siamese structure to fuse images of two different modalities and enhance the quality of the fused images. However, these works use a single encoder and decoder to model the relationship between the two modalities and therefore do not implement multiple self-enhancements and interactive enhancements. We therefore propose an enhancement fusion transformer module composed of three encoders and two decoders for the self-enhancement and interactive enhancement of the aggregated features and the modality-specific features, respectively.
Unlike previous transformers, we separate the encoder and decoder [39]. Three separate encoders self-enhance the features from SAM and BAM as well as the specific features of PAN and LRMS obtained after convolutional layers, while two separate decoders further enhance these encoded features interactively. To reduce model complexity, we employ a single-head attention mechanism in which the k and v matrices share weights in both the encoder and the decoder.
We suppose that $X_1$, $X_2$, and $X_3$ denote the three input features of the encoders. The encoding process is expressed as
\begin{align*}
X_1^E =& \text{Encoder}\left({{X}_1} \right) \in {R}^{C \times H \times W}\\
X_2^E =& \text{Encoder}\left({{X}_2} \right) \in {R}^{C \times H \times W}\\
X_3^E =& \text{Encoder}\left({{X}_3} \right) \in {R}^{C \times H \times W} \tag{8}
\end{align*}
The decoder is designed to interactively enhance the aggregated features and the attribute-based specific features produced by the encoders. The decoding process is described as follows:
\begin{align*}
Y =& \left\{ {\text{Decoder}\left({X_1^E,\ X_2^E} \right) \oplus \text{Decoder}\left({X_3^E,\ X_2^E} \right)} \right\}\\
& \in {R}^{C \times H \times W} \tag{9}
\end{align*}
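A hedged sketch of the single-head attention with a shared key/value projection described above; tokenization, normalization, and feed-forward layers are omitted, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SingleHeadSharedKVAttention(nn.Module):
    """Single-head attention in which the key and value share one projection;
    with query == source it acts as self-enhancement (encoder), and with a
    second input it acts as interactive enhancement (decoder)."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim)   # shared projection for k and v
        self.scale = dim ** -0.5

    def forward(self, query, source):
        # query, source: (B, N, C) token sequences (flattened H*W positions)
        q = self.to_q(query)
        kv = self.to_kv(source)            # k and v use the same weights
        attn = torch.softmax(q @ kv.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ kv
```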
5) Loss Function
We optimize MPEFNet by reducing the error between the fused images and the GT. Following previous fusion methods [40], [41], we employ a pixel-level loss; specifically, we adopt the $\ell_1$ loss, defined as
\begin{equation*}
{l}_1 \left(\theta \right) = \frac{1}{N}\ \mathop \sum \limits_{i = 1}^N {\left| {\varphi \left({{P}^i,{M}^i;\theta } \right) - G{T}^i} \right|}_1 \tag{10}
\end{equation*}
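As a concrete reference, a minimal PyTorch sketch of the loss in (10); here the network output $\varphi(P, M; \theta)$ is assumed to be precomputed as `fused`.

```python
import torch

def l1_loss(fused: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean absolute error between fused images and ground truth, the usual
    implementation of the l1 loss in Eq. (10), averaged over pixels and batch."""
    return torch.mean(torch.abs(fused - gt))
```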
Experiments and Analysis
In this section, we provide details of the experimental procedure, including the datasets, comparison methods, and qualitative and quantitative analyses of the experimental results. We first introduce the dataset sources and their properties. Then, we give the equipment and parameter settings, and an ablation study is provided to demonstrate the effectiveness of each structure. Finally, the experimental results and a comparative analysis of the proposed method are presented.
A. Experimental Details
1) Datasets and Metrics
We choose two datasets to validate our method: IKONOS and WorldView-2 (WV-2). IKONOS contains four bands, red (R), green (G), blue (B), and near-infrared (NIR), and provides PAN images with a spatial resolution of 1 m and MS images of 4 m. WV-2 contains eight bands, red (R), green (G), blue (B), near-infrared 1 (NIR1), coastal blue, yellow, red edge, and near-infrared 2 (NIR2), and provides PAN images with a spatial resolution of 0.5 m and MS images of 2 m. Due to the lack of ground truth (GT), we degraded the MS and PAN images according to Wald's protocol [13] to obtain the down (D)-resolution data and used the original MS images as GT; thus, the size of the input MS image is 64 × 64 × B and that of the PAN image is 256 × 256, while the full (F)-resolution MS image is 256 × 256 × B and the corresponding PAN image is 1024 × 1024. For each dataset, we conducted simulated and real experiments. In the simulated experiments, we trained our method using 120 pairs of IKONOS images and 400 pairs of WV-2 images, and tested it with 80 pairs of IKONOS images and 100 pairs of WV-2 images, respectively. We evaluated the fusion results using subjective evaluation and objective indicators. First, we visualize the fusion results on the monitor so that they can be conveniently assessed. Then, we use several common metrics to quantitatively evaluate the fused images. The quantitative evaluation metrics in the simulated experiments include the erreur relative globale adimensionnelle de synthèse (ERGAS) [44], SAM [44], [45], the spatial correlation coefficient (SCC) [45], and the image quality index (Q) [46]. For the real experiments, we evaluate the fused images using the quality with no reference (QNR) index [47], which is calculated by
\begin{equation*}
\text{QNR} = {(1 - {D}_S)}^\alpha {(1 - {D}_\lambda)}^\beta . \tag{11}
\end{equation*}
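Given the spatial distortion index $D_S$ and the spectral distortion index $D_\lambda$, (11) combines them as follows; the common default $\alpha = \beta = 1$ is an assumption here.

```python
def qnr(d_s: float, d_lambda: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Quality with no reference, Eq. (11): higher is better, 1 is ideal."""
    return ((1.0 - d_s) ** alpha) * ((1.0 - d_lambda) ** beta)
```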
2) Comparison Method
We demonstrate the effectiveness of our method by comparing it with eight fusion methods: GS [7] from the CS class, MTF-GLP [13] from the MRA class, and MSDCNN [18], A-PNN [17], TFNet [19], MC-JAFN [20], UC-GAN [21], and DPAFNet [22]. The first two are traditional methods and the last six are DL-based methods. Among them, UC-GAN is an unsupervised algorithm and the others are supervised. For experimental fairness, all comparison methods were trained and tested under the same experimental settings, using the same training and testing datasets as this study.
3) Experimental Settings
All comparison methods are run on a single Nvidia GeForce RTX 2080Ti GPU using the PyTorch platform. We use the Adam optimizer to minimize the loss. The learning rate is set to 0.0001, each method is trained for 500 epochs separately, and the model is saved every 100 epochs.
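A hedged sketch of this optimization setup is given below; the model and data loader are placeholder arguments, and only the optimizer, learning rate, epoch count, and checkpoint interval follow the text.

```python
import torch

def train_mpefnet(model, train_loader, epochs=500, lr=1e-4, save_every=100):
    """Training loop matching the stated settings: Adam, lr 1e-4, 500 epochs,
    checkpoint every 100 epochs. `model` and `train_loader` are placeholders
    supplied by the caller (train_loader yields (pan, lrms, gt) batches)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for pan, lrms, gt in train_loader:
            fused = model(pan, lrms)
            loss = torch.mean(torch.abs(fused - gt))   # l1 loss, Eq. (10)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if (epoch + 1) % save_every == 0:              # save every 100 epochs
            torch.save(model.state_dict(), f"mpefnet_epoch{epoch + 1}.pth")
```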
B. Comparative Analysis
In this section, our algorithm is compared with eight methods described previously in experiments on simulated and real data, and the fusion quality of each method is evaluated using qualitative and quantitative analyses.
1) Simulated Experiments
In these experiments, the PAN and LRMS images were downsampled to simulate the down-resolution input PAN and MS images, while the original MS images were used as GT to evaluate the quality of the pan-sharpened results. The performance of the nine remote sensing image fusion algorithms is tested on the two datasets, WV-2 and IKONOS, through qualitative and quantitative analysis.
First, we compared the nine fusion methods through qualitative analysis. From the simulated experiments, two sets of images were selected that typically highlight the advantages and shortcomings of the various methods, as shown in Figs. 6 and 7, which contain the results of the nine compared algorithms together with the GT images. To represent the effect of each fusion method more intuitively, the absolute error map of each fusion result with respect to the GT is given for each band at the bottom of Figs. 6 and 7 (the darker the color, the better the fusion performance). As can be seen in Figs. 6 and 7, the fusion results of the GS method show a pronounced blue blurring. Compared with GS, MTF-GLP improves spectral retention, but there is still partial loss of spatial information, such as blurred edge structures of buildings and spatial distortion of the parking space lines. Comparing the images in Figs. 6 and 7, the results of the DL-based approaches most closely resemble the GT in both spatial detail and spectral fidelity, whereas the traditional methods show different degrees of spatial distortion, for instance, in the tiny neighborhood housing and factory buildings in Fig. 6 and the ground transportation routes in Fig. 7. Among the six deep learning methods, although overall spectral distortion is effectively alleviated, spatial blurring is still evident for MSDCNN [18], A-PNN [17], and TFNet [19], for example, in the spatial distribution of the lake edge and the woods within the lake on the right side of Fig. 6. In Fig. 7, although the six DL-based comparison methods achieve better recovery of spatial information and less spectral distortion in general, there are still shortcomings in the details; for example, the white building at the bottom right of Fig. 7 is confused with the rooftop sign, producing spatial blurring, and GS, MTF-GLP, MSDCNN, A-PNN, TFNet, and MC-JAFN lose the spectral information of the small green structure on the roof. On the other hand, UC-GAN, DPAFNet, and the proposed method not only have uniform spectral distributions but are also close to the GT.
Result images of the nine methods and the GT on the IKONOS simulated dataset, together with the absolute error images. Zoom in to see more details.
Result images of the nine methods and the GT on the WV-2 simulated dataset, together with the absolute error images. Zoom in to see more details.
Since subjective evaluation varies from person to person, specific quantitative indicators are also needed to provide a more reasonable and fair assessment of fusion quality, thus avoiding misjudgments of spectral artifacts and spatial distortions arising from subjective analysis. Detailed quantitative results for each method are given in Tables I and II. The left part of each table reports the measurements obtained on the simulated dataset, and the right part those obtained on the real dataset. The results in Tables I and II show that the DL-based methods achieve better quantitative performance than the traditional methods, while our proposed method outperforms the other comparative methods and retains more spectral information and spatial details.
2) Real Experiments
The fusion results of the real experiments are shown in Figs. 8 and 9. Due to the lack of GT in the real dataset, we provide enlarged views of some details below the fusion results, marked by red and blue boxes, respectively. As illustrated in the figures, the fusion results of GS and MTF-GLP show obvious spectral distortion and an overall whitening of the images. The fusion results of MSDCNN, A-PNN, TFNet, and MC-JAFN show obvious spatial and spectral distortion; the yellowish color of the fused houses and the over-sharpening of edges can be seen in the zoomed-in images below. Since it is difficult to objectively and fairly distinguish the fusion results of the remaining methods based only on visual subjective evaluation, we refer to the quantitative indicators in Tables I and II. From Table I, we can see that for the fusion results in Fig. 8, our method achieves the best results on the corresponding metrics.
Result images of the nine methods on the IKONOS real dataset; the lower part shows magnified details of the fused results (red and blue boxes). Zoom in to see more details.
Result images of the nine methods on the WV-2 real dataset; the lower part shows magnified details of the fused results (red and blue boxes). Zoom in to see more details.
C. Ablation Study
In order to verify the effectiveness of each module designed in this work, we conducted 8 ablation experiments on the WV-2 simulated dataset. By employing the controlled variable method, we compared the impact of different modules on the network (see Table III for details).
w/o HPB: This experiment tests the effect of the HPB on the network. Compared with MPEFNet, we remove all HPBs from the network; the channels of LRMS and PAN are concatenated directly after passing through the REB.
w/o BAM&SAM: Compared with MPEFNet, this ablation removes only the BAM and SAM, so the input feature maps are processed by 4 × REB and then added to form the output.
w/o EFM: We use addition instead of EFM, which means that three feature inputs are directly added as output.
w/o Stage 1 (S1)&S2: The abovementioned three ablation experiments keep the original framework unchanged. In contrast, this experiment removes the fusion procedures at the 64 × 64 and 128 × 128 image sizes from the original network.
MPEFNet: our complete proposed method, which contains all of the components removed in the abovementioned ablation experiments.
We also conducted ablation experiments to validate the effects of two or more modules simultaneously. Please refer to Table III for detailed information.
Table III shows the average index results of each ablation model, Fig. 10 shows the fusion results of the ablation experiments conducted on the WV-2 simulated dataset, and Fig. 11 presents a bar chart illustrating the changes in the objective indicators for the different ablation models. From Table III and Figs. 10 and 11, it can be observed that the network is most influenced by the EFM: its absence results in severe over-sharpening of spatial structures and loss of spectral information in the fused image. Additionally, through the controlled-variable comparisons, the impact of the other modules on the network can be determined. The absolute error maps show that the proposed model has smaller residuals than the other ablation models, which is further confirmed by the quantitative metrics in Table III.
Results images of different types of ablation experiments on the WV-2 simulated dataset, and the absolute error images.
The next most influential module is BAM&SAM; the absolute error maps show severe spatial ambiguity after the BAM and SAM are removed. Without the HPB, BAM, SAM, or S1&S2 components, the fusion results are also inferior to some extent. Therefore, these components are crucial to the performance of the proposed network. Their importance, from high to low, is as follows: EFM, BAM&SAM, S1&S2, and HPB.
Conclusion
In this article, we propose a novel remote sensing image pan-sharpening network called MPEFNet, which employs a segmented progressive fusion structure. While each stage has a similar processing procedure, the resolution of the input image in each stage varies and increases by a factor of two progressively. Before feeding the upsampled MS image into each stage, we first apply an HPB to reduce information redundancy and computational complexity caused by the variation in image size. Next, during the feature extraction process, we introduce SAM and BAM to enhance spatial detail features and spectral information features, respectively. Finally, in each stage, we introduce EFM for self-enhancement and mutual enhancement of important features.