
Pansharpening Using Unsupervised Generative Adversarial Networks With Recursive Mixed-Scale Feature Fusion



Abstract:

Panchromatic sharpening (pansharpening) is an important technology for improving the spatial resolution of multispectral (MS) images. The majority of models are implemented at the reduced resolution, leading to unfavorable results at the full resolution. Moreover, the complicated relationship between MS and panchromatic (PAN) images is often ignored in detail injection. To address these problems, an unsupervised generative adversarial network with recursive mixed-scale feature fusion for pansharpening (RMFF-UPGAN) is modeled to boost the spatial resolution and preserve the spectral information. RMFF-UPGAN comprises a generator and two U-shaped discriminators. A dual-stream trapezoidal branch is designed in the generator to obtain multiscale information, and a recursive mixed-scale feature fusion subnetwork is designed. A prior fusion is performed on the extracted MS and PAN features of the same scale, and a mixed-scale fusion is conducted on the prior fusion results of the fine scale and coarse scale. The fusion is executed sequentially in this manner, building a recursive mixed-scale fusion structure and finally generating the key information. A compensation information mechanism is also designed to supplement the reconstruction of the key information. A nonlinear rectification block for the reconstructed information is developed to overcome the distortion induced by neglecting the complicated relationship between MS and PAN images. Two U-shaped discriminators are designed and a new composite loss function is defined. The presented model is validated on data from two satellites, and the outcomes are better than those of the prevalent approaches in terms of both visual assessment and objective indicators.
Topic: Radiation Modeling and Remote Sensing
Page(s): 3742 - 3759
Date of Publication: 20 March 2023

SECTION I.

Introduction

Remote sensing images are extensively utilized in geological exploration, terrain classification, agricultural yield prediction, pest detection, disaster prediction, national defense, environmental change detection, and so on [1], [2]. In these applications, images with high spatial resolution, high spectral resolution, or high temporal resolution are required. However, due to the limitations of sensor technology, we obtain low spatial resolution multispectral or hyperspectral (LRMS/LRHS) images, low temporal resolution multispectral or hyperspectral images, and low spectral resolution panchromatic (PAN) images [3], [4]. This requires fusion technology to fuse LRMS and PAN images to generate high spatial resolution multispectral (HRMS) images. This fusion technology is called panchromatic sharpening (pansharpening). The pansharpening techniques are generally divided into component substitution (CS) approaches, multiresolution analysis (MRA) techniques, variational optimization (VO) methods, and deep learning (DL) models [1], [5], [6].

CS techniques primarily involve intensity-hue-saturation (IHS) and its variants [7], Gram–Schmidt (GS) [8], GS adaptive (GSA) [9], principal component analysis (PCA) [10], and band-dependent spatial detail (BDSD) [11]. First, the LRMS image is projected into another spatial domain, where its spatial structure component is extracted and replaced with the high-resolution PAN image. Finally, the result is inversely transformed back into the original space to obtain the fused image. The strengths of CS methods are their simplicity, wide application, integration into individual software packages, ease of implementation, and strong enhancement of the spatial resolution of LRMS images. Their drawbacks include spectral distortion, oversharpening, aliasing, and blurring.

MRA approaches principally include the smoothing-filter-based intensity modulation (SFIM) [12], Laplacian pyramid (LP) transform [13], generalized LP (GLP) transform [14], curvelet transform [15], contourlet transform [16], nonsampled contourlet transform (NSCT) [17], and modulation transfer function-GLP (MTF-GLP) transform and variants [7]. The MRA approaches decompose the LRMS and PAN images, then fuse them through some rules and generate the fused images by inverse transformation. Compared with CS methods, MRA can preserve more spectral information and reduce spectral distortion, but their spatial resolution is relatively low.

VO methods consist of two parts: an energy function and an optimization method. The core is the optimization of a variational model, such as the panchromatic and multispectral image (P+XS) model [18], the nonlocal variational panchromatic sharpening model [19], and others [7], [20]. Compared with the CS and MRA methods, the VO methods have higher spectral fidelity, but their computation is more complex.

Convolutional neural networks (CNNs) and generative adversarial networks (GANs) have been widely applied in image processing, and some achievements have been made in the pansharpening of remote sensing images. Early on, a three-layer pansharpening CNN (PNN) was designed [21] based on superresolution reconstruction. The nonlinear mapping of the CNN is employed to generate HRMS images by feeding LRMS and PAN image pairs into the PNN. The PNN is relatively simple and easy to implement, but it is prone to overfitting. Subsequently, the target-adaptive CNN (TA-CNN) [22] was modeled, which utilizes a target-adaptive adjustment stage to solve the problems of mismatched data sources and insufficient training data. Yang et al. [23] presented a deep pansharpening network based on ResNet modules, i.e., PanNet, which employs the high-frequency information of the LRMS and PAN images as the input and outputs the residual between the HRMS and LRMS images. Nevertheless, PanNet overlooks the low-frequency information, causing spectral distortion. Wei et al. [24] modeled a deep residual pansharpening neural network (DRPNN) implemented on the ResNet block. Although the DRPNN exploits the powerful nonlinear capability of the CNN, the number of samples required should increase with increasing network depth to avoid overfitting, and because it is trained in the spatial domain, the generalization ability of the model still needs to be improved. Deng et al. [25] proposed the FusionNet model based on a CS and MRA detail injection model, in which the injected details are obtained with a deep CNN (DCNN). Unlike other networks, its input is the difference between the PAN image, replicated to the same number of channels as the LRMS image, and the LRMS image. Thus, this network can introduce multispectral information and reduce spectral distortion. Hu et al. [26] proposed a multiscale dynamic convolutional neural network (MDCNN), which mainly contains three modules: a filter generation network, a dynamic convolution network, and a weight generation network. The MDCNN uses multiscale dynamic convolution to extract multiscale features of the LRMS and PAN images and designs a weight generation network to adjust the relationship between features at different scales to improve the adaptability of the network. Although dynamic convolution improves the flexibility of the network, the network design is more complicated, and because the features of the LRMS and PAN images are extracted simultaneously, the network tends to lose effective detail and spectral information. Wu et al. [27] proposed RDFNet based on a distributed fusion structure and residual modules, which extracts multilevel features of the LRMS and PAN images, respectively. Then, the corresponding-level MS and PAN features and the fusion result of the previous step are fused gradually to obtain HRMS images. Although the network uses the multilevel LRMS and PAN features as much as possible, it is limited by the network depth and cannot recover more details and spectral information. Wu et al. [28] also designed TDPNet based on cross-scale fusion and multiscale detail compensation. GANs offer great potential for generating images [5]. Shao et al. [29] presented a supervised conditional GAN comprising a residual encoder–decoder, i.e., RED-cGAN, which enhances the sharpening ability under the restriction of PAN images. Liu et al.
[30] developed a deep CNN-based pansharpening GAN, i.e., PsGAN, consisting of a dual-stream generator and a discriminator that distinguishes the generated MS image from the reference image. Benzenati et al. [31] introduced a detail injection GAN (DIGAN) constructed from a dual-stream generator and a relativistic average discriminator. RED-cGAN, PsGAN, and DIGAN are supervised approaches trained on degraded-resolution data; nevertheless, their products are not satisfactory when applied to full-resolution data. Ozcelik et al. [32] constructed a self-supervised learning framework that treats pansharpening as colorization, i.e., PanColorGAN, which reduces blurring by color injection and random-scale downsampling. Li et al. [33] put forward a self-supervised approach using a cycle-consistent GAN trained on reduced-resolution data, which builds two generators and two discriminators. The LRMS and PAN images are fed into the first generator to yield the predicted image, and then the predicted image is input to the second generator to recover the PAN image, which remains consistent with the input PAN image. To cope with the problem of having no reference HRMS images, some unsupervised GANs have been presented. Ma et al. [34] suggested an unsupervised pansharpening GAN (Pan-GAN) composed of a generator and two discriminators (a spectral discriminator and a spatial discriminator). The generator produces HRMS images from the concatenated MS and PAN images. The spectral discriminator judges the spectral information between the HRMS and LRMS images, driving the generated HRMS data to be spectrally consistent with the LRMS data. The spatial discriminator discerns the spatial information between the HRMS and PAN images, enabling the generated HRMS image to agree with the spatial information of the PAN image. Pan-GAN uses two discriminators to better retain spectral and spatial structure information and solves the problem of the ambiguity caused by downsampling in supervised training. However, its input is the concatenated MS and PAN images, resulting in insufficient details and spectral information. Zhou et al. [35] proposed an unsupervised dual-discriminator GAN (PGMAN), which utilizes a dual-stream generator to yield the HRMS image and two discriminators to retain spectral information and details individually. Pan-GAN and PGMAN are trained directly on the original data with no reference images, which yields better results at full resolution, but the results on degraded-resolution data are not desirable, revealing the limited generalization ability of these models.

Although various scholars have proposed a variety of pansharpening networks and achieved certain fusion results, a majority of the models are trained on reduced-resolution data, which leads to spectral distortion and loss of details when fusing full-resolution data because of the change in resolution. Moreover, in the detail injection model, the details are directly added to the upsampled MS image, ignoring the complicated relationship between the MS image and the PAN image, which is likely to lead to spectral distortion or ringing. To address these problems, an unsupervised GAN with recursive mixed-scale feature fusion for pansharpening (RMFF-UPGAN) is modeled to boost the spatial resolution and preserve the spectral information; it is trained on the observed data without reference images. The main contributions of this article are as follows.

  1. A dual-stream trapezoidal branch is designed in the generator to obtain multiscale information. We employ a ResNeXt block and residual learning block to obtain the spatial structure and spectral information of four scales.

  2. A recursive mixed-scale feature fusion structure is designed, which executes prior fusion and mixed-scale fusion sequentially to generate the key information.

  3. A compensation information mechanism is also designed to supplement the reconstruction of the key information.

  4. A nonlinear rectification block for the reconstructed information is developed to overcome the distortion induced by ignoring the complicated relationship between MS and PAN images.

  5. Two U-shaped discriminators are designed and a new composite loss function is defined to better preserve spectral information and details.

The rest of this article is organized as follows. Section II describes related work. Section III describes the proposed model in detail. Section IV introduces datasets, evaluation indicators, experimental settings, and comparative experiments. Finally, Section V concludes this article.

SECTION II.

Related Work

A. MRA-Based Detail Injection Model

MRA methods [36], [37] are a class of image fusion methods and are particularly common in the field of remote sensing. These methods have good multiscale spatial frequency decomposition characteristics, singularity structure representation abilities, and visual perception characteristics. The efficient filter-bank implementation of the wavelet transform makes it feasible to process large-scale remote sensing image fusion. In MRA methods, the image is first decomposed into a low-frequency component and a high-frequency component by some decomposition method; then, the high-frequency and low-frequency components are fused according to a fusion rule. Finally, the fused components are reconstructed by the inverse transform to generate the fused image. An MRA-based detail injection model can be represented by a general detail injection framework, as shown in the following expression:
\begin{equation*} \hat{F}_{k}=\uparrow M_{k}+g_{k}\left(P-P_{L}\right) \quad k=1,2,\ldots, N \tag{1} \end{equation*}
where \hat{F}_{k} represents the kth-band fused HRMS image, \uparrow M_{k} represents the kth-band upsampled LRMS image, g_{k} is the kth-band detail injection gain, P represents the PAN image, P_{L} is the low-frequency component of the PAN image, and N is the number of bands of the MS image.
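As a concrete illustration of framework (1), the following NumPy/SciPy sketch injects the high-frequency part of the PAN image into an upsampled MS image. The bilinear upsampling, box low-pass filter, and unit gains g_k are illustrative assumptions, not the filters or gains of any specific MRA method discussed here.

```python
import numpy as np
from scipy import ndimage

def mra_detail_injection(ms, pan, ratio=4, gains=None):
    """ms: (h, w, N) LRMS image; pan: (h*ratio, w*ratio) PAN image."""
    up_ms = ndimage.zoom(ms, (ratio, ratio, 1), order=1)        # upsampled M_k (bilinear)
    pan_low = ndimage.uniform_filter(pan, size=2 * ratio + 1)   # P_L (box low-pass stand-in)
    details = pan - pan_low                                     # P - P_L
    n_bands = ms.shape[-1]
    if gains is None:
        gains = np.ones(n_bands)                                # g_k = 1 (plain additive injection)
    return np.stack([up_ms[..., k] + gains[k] * details for k in range(n_bands)], axis=-1)

# Toy usage with random data shaped like a 4-band sensor at a 1:4 resolution ratio.
ms = np.random.rand(64, 64, 4)
pan = np.random.rand(256, 256)
print(mra_detail_injection(ms, pan).shape)   # (256, 256, 4)
```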

B. ResNeXt

Xie et al. [38] proposed the ResNeXt structure, which is an improvement of ResNet [39]. The network uses group convolution to reduce the network complexity and improve the expression ability. The core of ResNeXt is the notion of cardinality, which is used to measure the complexity of the model. ResNeXt shows that, for similar computational complexity and model parameters, increasing the cardinality achieves better expression ability than increasing the depth or width of the network. The ResNeXt structure [38] takes advantage of the split-transform-merge idea, and the convolution operations of all the paths share the same topology, which reduces the computational complexity. The mathematical expression is as follows:
\begin{equation*} y=x+\sum _{i=1}^{C} \mathcal {T}_{i}(x) \tag{2} \end{equation*}
where C is the cardinality, i.e., the number of identical paths; x represents the input and y represents the output; and \mathcal {T}_{i}() represents the function of the ith path.
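A minimal tf.keras sketch of the split-transform-merge form in (2), i.e., y = x plus the sum of the C transformed paths, is given below. The cardinality, path width, and 1 × 1/3 × 3 path layout are illustrative assumptions; the exact channel settings used in this paper are those of Fig. 3(a).

```python
import tensorflow as tf
from tensorflow.keras import layers

def resnext_block(x, cardinality=16, path_width=4):
    paths = []
    for _ in range(cardinality):
        t = layers.Conv2D(path_width, 1, padding="same", activation="relu")(x)
        t = layers.Conv2D(path_width, 3, padding="same", activation="relu")(t)
        # Project each path back to the input channel count so that the sum matches x.
        t = layers.Conv2D(x.shape[-1], 1, padding="same")(t)
        paths.append(t)
    # y = x + sum_i T_i(x): identity shortcut plus the C transformed paths.
    return layers.Add()([x] + paths)

inp = layers.Input(shape=(256, 256, 32))
model = tf.keras.Model(inp, resnext_block(inp))
model.summary()
```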

SECTION III.

Methodology

RMFF-UPGAN is modeled to improve the spatial resolution and retain the spectral information. It is trained directly on the raw full-resolution data to decrease the effect of resolution variation on the results. The overall architecture of the RMFF-UPGAN is illustrated in Fig. 1; it is composed of one dual-stream generator and two U-shaped relative average discriminators (i.e., \rm U-RaLSD_{pe} and \rm U-RaLSD_{pa}). In Fig. 1, M and P stand for the raw MS and PAN images, \uparrow M refers to the upsampled MS image, and \rm HM is the fused image. For the generator, first, a dual-stream trapezoidal branch is designed to obtain multiscale information. A ResNeXt block extracts the low-level semantic information at the fine scale, and residual learning blocks extract the high-level semantic information at the mesoscale and coarse scales, yielding the spatial structure and spectral information of four scales. Second, a recursive mixed-scale feature fusion subnetwork is designed via residual learning. A prior fusion is performed on the extracted MS and PAN features of the same scale, and a mixed-scale fusion is conducted on the prior fusion results of the fine scale and coarse scale. The fusion is executed sequentially in this manner, building a recursive mixed-scale fusion structure and finally generating the key information. Then, the key information is reconstructed, and a compensation information mechanism is designed to supplement the reconstruction of the key information. Finally, a rectification block for the reconstructed information is developed to obtain the fused image, which overcomes the distortion induced by neglecting the complicated relationship between MS and PAN images. Two U-shaped discriminators are designed to better preserve spectral information and details. The \rm U-RaLSD_{pa} discriminator differentiates the details of the \rm HM image from those in the P image and prompts the details of the \rm HM image to be consistent with those in the P image. The \rm U-RaLSD_{pe} discriminator is applied to distinguish the spectral information of the \rm HM image from that in the M image, which drives the spectral information of the \rm HM image to be consistent with that in the M image.

Fig. 1. Implementation framework of the RMFF-UPGAN.

A. Dual-Stream Generator

The designed dual-stream generator consists of a dual-stream trapezoidal multiscale feature extraction module, a recursive mixed-scale feature fusion module, a dual-stream multiscale feature reconstruction module, and a reconstructed information rectification module. The architecture of each module is explained in detail as follows.

1) Dual-Stream Trapezoidal Multiscale Feature Extraction (DSTMFE)

The structure of the DSTMFE branch of the generator is shown in Fig. 2; it consists of two independent branches and differs from our previous work TDPNet [28]. We substitute the maxpooling operation with Conv4, i.e., a convolution operation with a kernel size of 4 and a stride of 2. The top branch extracts four scale features of the PAN image and the bottom branch extracts four scale features of the MS image, where P_{1}-P_{4} express the four scale features extracted from the PAN image, M_{1}-M_{4} represent the four scale features extracted from the MS image, and their sizes are 256 × 256 × 32, 128 × 128 × 64, 64 × 64 × 128, and 32 × 32 × 256. Because the information of the PAN and MS images represented by the low-level semantic features is the most abundant, the group convolution of ResNeXt provides multiple convolution branches, which offers a better way to retain information: it increases the cardinality and improves the network accuracy while reducing the network complexity. Therefore, to retain more original information and to reduce the network complexity, the ResNeXt module extracts the first-scale features P_{1} and M_{1}, respectively. At the latter three scales, residual learning blocks and downsampling operations (i.e., Conv4) extract the P_{2}-P_{4} and M_{2}-M_{4} features, respectively. The structures of the ResNeXt block [38] and residual learning block [39] used in RMFF-UPGAN are depicted in Fig. 3(a) and (b). In Fig. 3(a), the parameters of the ResNeXt block are given as 1(4), 1 × 1, 4, where 1(4) represents the number of channels of the PAN (MS) image, and 1 × 1 and 4 represent the kernel size and the number of convolutions. In Fig. 3(b), the leaky ReLU (LReLU) function is employed.

Fig. 2. Structure of the dual-stream trapezoidal multiscale feature extraction.

Fig. 3. (a) ResNeXt block. (b) Residual learning block.

The expressions that extract the features of the MS image and PAN image using the ResNeXt module are given in (3) and (4), respectively. The expressions that extract the features of the MS image and PAN image using the residual learning module are given in (5)–(8), with i=2, 3, 4.
\begin{align*} M_{1}&=\uparrow M+\sum _{j=1}^{16} \mathcal {T}_{j}(\uparrow M) \tag{3} \\ P_{1}&=P+\sum _{j=1}^{16} \mathcal {T}_{j}(P) \tag{4} \\ M_{i}&=\Phi _{m}\left(h\left(M_{i-1}\right)+\mathcal {F}\left(M_{i-1}, W_{mi}\right)\right) \tag{5} \\ h\left(M_{i-1}\right)&=W_{mi}^{\prime } \ast M_{i-1} \tag{6} \\ P_{i}&=\Phi _{p}\left(h\left(P_{i-1}\right)+\mathcal {F}\left(P_{i-1}, W_{pi}\right)\right) \tag{7} \\ h\left(P_{i-1}\right)&=W_{pi}^{\prime } \ast P_{i-1} \tag{8} \end{align*}
where P and \uparrow M represent the PAN image and the upsampled MS image at full resolution, and \mathcal {T}_{j}() represents the jth path function of the ResNeXt module. M_{i} and P_{i} denote the ith-scale features of the MS image and PAN image. h() represents the direct connection part of the residual learning module and \mathcal {F}() represents the residual part. W_{mi} and W_{pi} are convolutions with a kernel size of 3 × 3, W_{mi}^{\prime } and W_{pi}^{\prime } are convolutions with a kernel size of 1 × 1, and the numbers of convolutions are 64, 128, and 256, respectively. \ast indicates the convolution operation. \Phi _{m}() and \Phi _{p}() refer to the downsampling functions.
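The following tf.keras sketch shows one extraction stage of (5)–(8): a residual learning block with a 1 × 1 projection shortcut h(), followed by Conv4 (kernel size 4, stride 2) downsampling \Phi(). The activations inside \mathcal{F}() are assumptions consistent with Fig. 3(b).

```python
import tensorflow as tf
from tensorflow.keras import layers

def extraction_stage(x, filters):
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)        # h(.) in (6)/(8): 1x1 projection
    r = layers.Conv2D(filters, 3, padding="same")(x)               # residual part F(.)
    r = layers.LeakyReLU(0.2)(r)
    r = layers.Conv2D(filters, 3, padding="same")(r)
    y = layers.Add()([shortcut, r])
    y = layers.Conv2D(filters, 4, strides=2, padding="same")(y)    # Conv4 downsampling, i.e., Phi(.)
    return layers.LeakyReLU(0.2)(y)

# Building P_2-P_4 from P_1 (256 x 256 x 32) with 64, 128, and 256 filters, as in Fig. 2.
p1 = layers.Input(shape=(256, 256, 32))
x, feats = p1, []
for f in (64, 128, 256):
    x = extraction_stage(x, f)
    feats.append(x)
print([tuple(t.shape) for t in feats])   # [(None, 128, 128, 64), (None, 64, 64, 128), (None, 32, 32, 256)]
```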

2) Recursive Mixed-Scale Feature Fusion

According to the four-scale MS and PAN features generated in the DSTMFE stage, a recursive mixed-scale feature fusion (RMSFF) subnetwork is designed based on residual learning, as illustrated in Fig. 4, comprising prior fusion blocks and mixed-scale fusion blocks. For the four-scale features of the MS image and PAN image, the prior fusion block (PFB) is designed to aggregate the information of the MS image and PAN image. The PFB is helpful for learning multimodal information and fusing the preliminary features of the MS image and PAN image. A "concatenate+Conv3+residual block" mode is employed to build the PFB, as illustrated in Fig. 5(a). Conv3 is a convolution operation followed by an LReLU function that implements the primary fusion and adaptively adjusts the number of channels; the residual block then implements further fusion. The kernel size of the Conv3 and residual block is 3 × 3, and the stride is 1. The numbers of convolution kernels are 32, 64, 128, and 256, respectively. The mixed-scale fusion block (MSFB) performs the fusion of information from different scales, as displayed in Fig. 5(b). The MSFB is constructed using a scale transfer block (STB), concatenation, Conv3, and a residual block, where H_{i} represents a fine-scale image and L_{i+1} represents a coarse-scale image. The STB is shown in Fig. 6. The fine-scale image H_{i} is downsampled by the STB to generate an image with the same scale as L_{i+1}, and is then fused with L_{i+1}. The downsampling operation is conducted by Conv4, and the numbers of kernels are 64, 128, and 256, respectively. The mixed-scale fusion yields three-scale results, i.e., \text{Mix}\_{f}_{5}, \text{Mix}\_{f}_{9}, and \text{Mix}\_{f}_{13}.

Fig. 4. Structure of the recursive mixed-scale feature fusion subnetwork.

Fig. 5. (a) Prior fusion block (PFB). (b) Mixed-scale fusion block (MSFB). (c) Multiscale reconstruction block (MRB).

Fig. 6. Scale transfer block (STB).

As illustrated in Fig. 4, first, the same-scale features M_{i} and {P}_{i} (i=1,2,3,4) are fused by the PFB to generate P\_{M}_{i} (i=1,2,3,4). Then, the MSFB fuses the prior fusion result P\_{M}_{i} (i=1,2,3) with the next-scale result P\_{M}_{i+1} (i=1,2,3) to generate the feature \text{Mix}\_{f}_{i+4} (i=1,2,3) with the same scale as P\_{M}_{i+1} (i=1,2,3). The mixed-scale information fusion is realized sequentially in the aforementioned manner, and the recursive fusion is carried out to generate the key information \text{Mix}\_{f}_{13}. The entire fusion subnetwork constitutes a recursive mixed-scale fusion architecture, which utilizes the information of the MS and PAN images across modalities and scales to reduce the loss of information in the MS and PAN images.

The expression of the PFB is as follows:
\begin{equation*} P\_{M}_{i}={\text{PF}_{i}}\left(P_{i}, M_{i}, W_{\text{PF}_{i}}\right) \quad i=1, \dots, 4 \tag{9} \end{equation*}
where P\_{M}_{i} represents the prior fusion result of the ith-scale features P_{i} and M_{i}, \text{PF}_{i} indicates the function of the PFB, and W_{\text{PF}_{i}} represents the parameter.

The expression of the MSFB is as follows:
\begin{equation*} \text{Mix}\_{f}_{i+4}=\text{MF}_{i}\left(H_{i}, L_{i+1}, W_{\text{MF}_{i}}\right) \quad i=1,2,3,5,6,9 \tag{10} \end{equation*}
where \text{Mix}\_{f}_{i+4} means the mixed fusion result, \text{MF}_{i} represents the function of the MSFB, and W_{\text{MF}_{i}} is the parameter. H_{i} represents a fine-scale image and L_{i+1} means a coarse-scale image, i.e., H_{1} represents P\_{M}_{1}, L_{2} represents P\_{M}_{2}, and \text{Mix}\_{f}_{5} means the mixed fusion result of P\_{M}_{1} and P\_{M}_{2}; H_{5} represents \text{Mix}\_{f}_{5}, L_{6} represents \text{Mix}\_{f}_{6}, and \text{Mix}\_{f}_{9} means the mixed fusion result of \text{Mix}\_{f}_{5} and \text{Mix}\_{f}_{6}.
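A hedged tf.keras sketch of the PFB in (9) and the MSFB in (10), following the "concatenate+Conv3+residual block" pattern of Fig. 5(a) and (b), is given below. The internals of the residual block are assumptions consistent with Fig. 3(b).

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    r = layers.Conv2D(filters, 3, padding="same")(x)
    r = layers.LeakyReLU(0.2)(r)
    r = layers.Conv2D(filters, 3, padding="same")(r)
    return layers.Add()([shortcut, r])

def prior_fusion_block(p_i, m_i, filters):
    x = layers.Concatenate()([p_i, m_i])
    x = layers.Conv2D(filters, 3, padding="same")(x)    # Conv3: primary fusion, channel adjustment
    x = layers.LeakyReLU(0.2)(x)
    return residual_block(x, filters)                   # further fusion

def mixed_scale_fusion_block(h_i, l_next, filters):
    h_down = layers.Conv2D(filters, 4, strides=2, padding="same")(h_i)   # STB: Conv4 downsampling
    x = layers.Concatenate()([h_down, l_next])
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    return residual_block(x, filters)

# Toy usage: fuse P_1/M_1 and P_2/M_2, then mix the two prior-fusion results into Mix_f5.
p1, m1 = layers.Input((256, 256, 32)), layers.Input((256, 256, 32))
p2, m2 = layers.Input((128, 128, 64)), layers.Input((128, 128, 64))
pm1 = prior_fusion_block(p1, m1, 32)
pm2 = prior_fusion_block(p2, m2, 64)
mix_f5 = mixed_scale_fusion_block(pm1, pm2, 64)
print(tuple(mix_f5.shape))   # (None, 128, 128, 64)
```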

3) Dual-Stream Multiscale Feature Reconstruction

To obtain more precise reconstruction information, a dual-stream multiscale reconstruction (DSMR) subnetwork is designed to reconstruct the key information \text{Mix}\_{f}_{13}, as depicted in Fig. 7. Two branches reconstruct features of the same scale but different levels. To compensate for information, a compensation information mechanism (CIM) is designed for the reconstruction of information at each scale, as shown by the green arrows in Fig. 7. The CIM is fed with \text{Mix}\_{f}_{13}, the prior fusion results from the RMSFF stage at the same scale and the finer scale as the information to be reconstructed, and a mixed-scale fusion result at the same scale as the information to be reconstructed. The upper branch employs the reconstructed result of the previous step and the CIM to generate multiscale information through the multiscale reconstruction block (MRB). The bottom branch employs the reconstructed result of the previous step, the upper-branch result, \text{Mix}\_{f}_{13}, and the prior fusion results of the CIM to generate multiscale information. The reconstructed results M\_{R}_{2} and M\_{R}_{4} of the upper branch provide supplementary information for the reconstruction of M\_{R}_{3} and M\_{R}_{5}, respectively. The multiscale information gradually generates the final reconstruction information T_{R}.

Fig. 7. Structure of the dual-stream multiscale reconstruction subnetwork.

The MRB is presented in Fig. 5(c). Compared with the scale of the information to be reconstructed, H represents finer scale information, S represents the same-scale information, and L represents coarser-scale information. Multiscale information needs to be converted into information with the same scale before reconstruction, and the STB is presented in Fig. 6. The coarse-scale information is converted to fine-scale information through a deconvolution operation and the fine-scale information is converted to coarse-scale information through a downsampling operation. The size of the convolution kernels of the Conv3 and residual learning block used by the MRB is 3 × 3, the stride is 1, and the numbers are 128, 64, and 32, respectively.
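The scale conversions performed by the STB can be sketched as follows: a Conv4 (kernel size 4, stride 2) convolution for fine-to-coarse transfer and a stride-2 transposed convolution for coarse-to-fine transfer. The transposed-convolution kernel size and the filter counts below are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def stb_down(x, filters):
    # Fine scale -> coarse scale: kernel-4, stride-2 convolution (Conv4).
    return layers.LeakyReLU(0.2)(layers.Conv2D(filters, 4, strides=2, padding="same")(x))

def stb_up(x, filters):
    # Coarse scale -> fine scale: stride-2 transposed convolution (kernel size assumed to be 3).
    return layers.LeakyReLU(0.2)(layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x))

fine = layers.Input((128, 128, 64))
coarse = layers.Input((64, 64, 128))
print(tuple(stb_down(fine, 128).shape), tuple(stb_up(coarse, 64).shape))
# (None, 64, 64, 128) (None, 128, 128, 64)
```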

The proposed DSMR structure reuses the extracted low-level features for reconstruction through multiscale skip connections. The low-level features contain rich details, such as edges and contours, which can reduce the loss of details. In this way, the loss of details in the PAN image and MS image is reduced, and the spatial resolution is upgraded.

4) Reconstructed Information Rectification

Owing to the different physical imaging mechanisms of the sensors, the relationship between the MS image and the PAN image is very complex. The band ranges of the MS image and PAN image do not exactly overlap, so a linear combination of the MS image bands cannot accurately express the PAN image [4]. The detail injection model directly adds the injected details to the upsampled MS image, as in expression (1). It thus ignores the complex relationship between the PAN image and MS image, which may result in spectral distortion. Therefore, we design a "concatenate+Conv1+conv(3 × 3)" mode to construct a simple reconstructed information rectification block (RIRB), which builds a nonlinear injection relation. The RIRB is displayed in the orange box in Fig. 7. The kernel size of Conv1 is 1\times 1 and the number of kernels is 12, followed by an LReLU function. The kernel size of conv(3 × 3) is 3 × 3 and the number of kernels is 4. The \rm HM image is generated by the nonlinear mapping of the \uparrow M image and the reconstructed information T_{R}.
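A minimal tf.keras sketch of the RIRB, i.e., the "concatenate+Conv1+conv(3 × 3)" mode applied to the \uparrow M image and the reconstructed information T_{R}, is given below; the channel count of T_{R} is an assumption for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def rirb(up_ms, t_r):
    x = layers.Concatenate()([up_ms, t_r])
    x = layers.Conv2D(12, 1, padding="same")(x)     # Conv1: 12 kernels of size 1 x 1
    x = layers.LeakyReLU(0.2)(x)
    return layers.Conv2D(4, 3, padding="same")(x)   # conv(3 x 3): 4 kernels -> 4-band HM

up_ms = layers.Input((256, 256, 4))     # upsampled MS image
t_r = layers.Input((256, 256, 32))      # reconstructed information T_R (channel count assumed)
print(tuple(rirb(up_ms, t_r).shape))    # (None, 256, 256, 4)
```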

The expression for the generator of the pansharpening model is as follows:
\begin{equation*} \text{HM}=G_{P}\left(\uparrow M, P, W_{P}\right) \tag{11} \end{equation*}
where HM denotes the fused HRMS image, G_{P} indicates the function of the designed generator, and W_{P} is the parameter.

B. U-Shaped Relative Average Least-Squares Discriminator

To promote the performance and stability of the pansharpening model, we employ a relativistic average discriminator to distinguish the relative probabilities between the generated image and the real image, and we optimize the model using a least-squares loss function, i.e., the relativistic average least-squares discriminator (RaLSD). The architecture of the RaLSD is similar to that of Real-ESRGAN [40], which enhances the capability of the RaLSD using a U-shaped structure. However, the differences are that the residual structure is applied to replace the existing convolution operation, and we utilize the "concatenate+SN(conv1-1)+LReLU" mode to substitute the sum operation in the skip connection part to increase the discriminative capacity of the network. SN(conv1-1) indicates spectral normalization (SN) [41] applied to a convolution operation with a kernel size of 1 and a stride of 1. The structure of the proposed U-shaped RaLSD (U-RaLSD) network is illustrated in Fig. 8; it consists of a spectral discriminator \rm U-RaLSD_{pe} and a detail discriminator \rm U-RaLSD_{pa}, and the structures of the \rm U-RaLSD_{pe} and \rm U-RaLSD_{pa} are the same. The interpretation of the colored arrows in the U-shaped structure is presented in Fig. 8, where the SN operation is conducted for every convolution operation except that in the last layer. The architectures of the DRB and URB employed in the U-shaped structure are displayed in Fig. 9(a) and (b). In the DRB and URB, we utilize a convolution with a stride of 2 instead of a maxpooling operation for downsampling, i.e., SN(conv3-2) refers to a convolution operation with a kernel size of 3 and a stride of 2 on which an SN operation is performed. Moreover, we employ a deconvolution with a stride of 2 instead of an interpolation operation for upsampling, i.e., SN(deconv3-2) refers to a transposed convolution operation with a kernel size of 3 and a stride of 2 on which an SN operation is performed. The FURB operation is performed by a simple fusion, i.e., the "concatenate+SN(conv1-1)+LReLU" mode, followed by the URB. The original MS image or \rm DHM_{pa}, which is the spatially reduced version of the \rm HM image, is fed into the \rm U-RaLSD_{pe} to generate relativistic probabilities. \rm U-RaLSD_{pa} takes the original PAN image or \rm DHM_{pe}, which is the spectrally reduced version of the \rm HM image, as input.

Fig. 8. Architecture of the dual U-shaped RaLSDs.

Fig. 9. Structures of the DResidual block (DRB) and UResidual block (URB).

The expressions of the U-RaLSD are given in (12) and (13).
\begin{align*} D^{\text{URaLS}}\left(Z_{m}, Z_{g}\right)&=\sigma \left(C\left(Z_{m}\right)-\mathbb {E}_{Z_{g} \sim \mathcal {Q}}\left[C\left(Z_{g}\right)\right]\right) \tag{12} \\ D^{\text{URaLS}}\left(Z_{g}, Z_{m}\right)&=\sigma \left(C\left(Z_{g}\right)-\mathbb {E}_{Z_{m} \sim \mathcal {R}}\left[C\left(Z_{m}\right)\right]\right) \tag{13} \end{align*}
where D^{\text{URaLS}} means the relative probability of the U-RaLSD, \sigma represents the sigmoid function, and C() denotes the untranslated output of the U-RaLSD. \mathcal {R} and \mathcal {Q} indicate the distributions of the real data Z_{m} (the M or P image in Fig. 8) and the fake data Z_{g} (the \rm DHM_{pa} or \rm DHM_{pe} image in Fig. 8). \mathbb {E}_{Z_{m} \sim \mathcal {R}} and \mathbb {E}_{Z_{g} \sim \mathcal {Q}} indicate the mean operations over the real data and fake data in a batch.
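The relativistic average outputs in (12) and (13) reduce to the sigmoid of the difference between a critic score and the batch-mean critic score of the opposite class, as the following NumPy sketch shows; the random scores stand in for the untranslated critic output C().

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relativistic_average(c_real, c_fake):
    """c_real, c_fake: critic outputs C(Z) for a batch of real and fake samples."""
    d_real = sigmoid(c_real - np.mean(c_fake))   # D(Z_m, Z_g) in (12)
    d_fake = sigmoid(c_fake - np.mean(c_real))   # D(Z_g, Z_m) in (13)
    return d_real, d_fake

c_real = np.random.randn(8)   # e.g., scores of raw M or P patches
c_fake = np.random.randn(8)   # e.g., scores of DHM_pa or DHM_pe patches
print(relativistic_average(c_real, c_fake))
```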

C. Composite Loss Function

We build a new composite loss function composed of a spatial consistency loss function, a spectral consistency loss function, a no-reference loss function, and two adversarial loss functions.

The loss function of the spatial consistency is presented as follows:
\begin{align*} \mathcal {L}_{pc}=&\frac{1}{T} \sum _{t}^{T}\left\Vert h\left(\text{HM}_{t}\right)-h\left(P_{t}\right)\right\Vert _{F}^{2}\\ &+\frac{1}{T} \sum _{t}^{T}\left\Vert \nabla \left(\text{HM}_{t}\right)-\nabla \left(P_{t}\right)\right\Vert _{F}^{2} \tag{14} \end{align*}
where \mathcal {L}_{pc} means the spatial consistency loss function, T refers to the batch size, \text{HM}_{t} is the tth generated image, and ||\cdot ||_{F} denotes the Frobenius norm. h() and \nabla () indicate a high-pass filter and a gradient operator used to obtain the high-frequency information and gradient information of the image. The goal is to integrate the spatial information of the PAN images into the MS images. Since no reference image exists, we boost the spatial information of the MS images by utilizing the high-frequency information and gradient information of the PAN images.

To maintain the consistency of spectral information between the \text{HM}_{t} and raw MS images, the spectral consistency loss function is described as follows:
\begin{equation*} \mathcal {L}_{mc}=\frac{1}{T} \sum _{t}^{T}\left\Vert ds(\text{HM}_{t})- M_{t}\right\Vert _{F}^{2} \tag{15} \end{equation*}
where \mathcal {L}_{mc} indicates the spectral consistency loss function, ds() represents the down-resolution operation consisting of a blurring operation and a downsampling operation, and M_{t} is the raw MS image.
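A hedged TensorFlow sketch of the spatial consistency loss (14) and the spectral consistency loss (15) is given below. The Laplacian high-pass kernel, Sobel gradient operator, intensity averaging of HM, and average-pooling downsampling are illustrative stand-ins for h(), \nabla(), and ds(), which are not specified in detail here.

```python
import tensorflow as tf

def spatial_consistency_loss(hm, pan):
    lap = tf.reshape(tf.constant([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]), [3, 3, 1, 1])
    hm_i = tf.reduce_mean(hm, axis=-1, keepdims=True)            # intensity of HM (4 bands -> 1)
    hp_diff = tf.nn.conv2d(hm_i, lap, 1, "SAME") - tf.nn.conv2d(pan, lap, 1, "SAME")
    grad_diff = tf.image.sobel_edges(hm_i) - tf.image.sobel_edges(pan)
    return tf.reduce_mean(tf.square(hp_diff)) + tf.reduce_mean(tf.square(grad_diff))

def spectral_consistency_loss(hm, ms, ratio=4):
    ds_hm = tf.nn.avg_pool2d(hm, ratio, ratio, "VALID")          # ds(.): blur + downsample stand-in
    return tf.reduce_mean(tf.square(ds_hm - ms))

hm = tf.random.normal([2, 256, 256, 4])
pan = tf.random.normal([2, 256, 256, 1])
ms = tf.random.normal([2, 64, 64, 4])
print(spatial_consistency_loss(hm, pan).numpy(), spectral_consistency_loss(hm, ms).numpy())
```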

Since there are no reference data, we adopt the no-reference index QNR to measure the quality of the generated image. The desired value of QNR is 1, i.e., the generated image has neither spectral loss nor spatial detail loss. Therefore, the expression of the no-reference loss function is as follows:
\begin{equation*} \mathcal {L}_{q}=1 - \text{QNR} \tag{16} \end{equation*}
where \mathcal {L}_{q} stands for the no-reference loss function.

The QNR relates to the spectral loss metric D_{\lambda } and the spatial loss indicator D_{S}, and its representation is as follows:
\begin{equation*} \text{QNR}=(1 - D_{\lambda })^{l}(1-D_{S})^{v} \tag{17} \end{equation*}
where the expressions for D_{\lambda } and D_{S} are given in (18) and (19), and l and v are constants, generally set to 1.
\begin{equation*} D_\lambda =\sqrt{\frac{1}{B(B-1)} \sum \nolimits_{n=1}^{B} \sum \nolimits_{\substack{k=1 \\ k \ne n}}^{B}\left|Q\left(M_{n}, M_{k}\right)-Q\left(F_{n}, F_{k}\right)\right|} \tag{18} \end{equation*}
where B is the number of bands, M_{n} and F_{n} are the nth-band LRMS image and generated HRMS image, respectively, and Q is the image quality index defined in (20).
\begin{equation*} D_{S}=\sqrt{\frac{1}{B} \sum \nolimits_{n=1}^{B} \left|Q\left(F_{n}, P\right)-Q\left(M_{n}, \widetilde{P}\right)\right|} \tag{19} \end{equation*}
where P refers to the PAN image and \widetilde{P} is the low-resolution version of the PAN image.
\begin{equation*} Q(h,k)=\frac{4 \sigma _{h k} \cdot \bar{h} \cdot \bar{k}}{\left(\sigma _{h}^{2}+\sigma _{k}^{2}\right)\left[(\bar{h})^{2}+(\bar{k})^{2}\right]} \tag{20} \end{equation*}
where h and k are the inputs, \sigma _{h k} denotes the covariance between h and k, \sigma _{h}^{2} and \sigma _{k}^{2} represent the variances of h and k, and \bar{h} and \bar{k} indicate the means of h and k.
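The following NumPy sketch computes Q, D_{\lambda}, D_{S}, and QNR as defined in (17)–(20). Global (whole-image) statistics are used for brevity, whereas standard implementations compute Q over sliding windows.

```python
import numpy as np

def q_index(h, k):
    h, k = h.ravel(), k.ravel()
    cov = np.cov(h, k)[0, 1]
    return 4 * cov * h.mean() * k.mean() / ((h.var() + k.var()) * (h.mean() ** 2 + k.mean() ** 2))

def qnr(ms, fused, pan, pan_lr, l=1, v=1):
    b = ms.shape[-1]
    d_lambda = np.sqrt(np.mean([abs(q_index(ms[..., n], ms[..., k]) - q_index(fused[..., n], fused[..., k]))
                                for n in range(b) for k in range(b) if k != n]))      # (18)
    d_s = np.sqrt(np.mean([abs(q_index(fused[..., n], pan) - q_index(ms[..., n], pan_lr))
                           for n in range(b)]))                                       # (19)
    return d_lambda, d_s, (1 - d_lambda) ** l * (1 - d_s) ** v                        # (17)

ms = np.random.rand(64, 64, 4)        # LRMS
fused = np.random.rand(256, 256, 4)   # HM
pan = np.random.rand(256, 256)        # PAN
pan_lr = np.random.rand(64, 64)       # low-resolution PAN
print(qnr(ms, fused, pan, pan_lr))
```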

We optimize the adversarial model using a relativistic average least-squares (RaLS) loss function to improve the performance and stability of the model. The adversarial loss of the generator with respect to the \rm U-RaLSD_{pe} and \rm U-RaLSD_{pa} discriminators is expressed as
\begin{equation*} \mathcal {L}_{G}^{\text{URaLS}}=\mathcal {L}_{G D_{\text{pe}}}^{\text{URaLS}}+\mathcal {L}_{G D_{\text{pa}}}^{\text{URaLS}} \tag{21} \end{equation*}
where \mathcal {L}_{G}^{\text{URaLS}} represents the adversarial loss of the network, \mathcal {L}_{G D_{\text{pe}}}^{\text{URaLS}} denotes the adversarial loss with the \rm U-RaLSD_{pe} discriminator, and \mathcal {L}_{G D_{\text{pa}}}^{\text{URaLS}} denotes the adversarial loss with the \rm U-RaLSD_{pa} discriminator.

The expressions for \mathcal {L}_{G D_{\text{pe}}}^{\text{URaLS}} and \mathcal {L}_{G D_{\text{pa}}}^{\text{URaLS}} are presented as follows:
\begin{align*} \mathcal {L}_{G D_{\text{pe}}}^{\text{URaLS}}=&\mathbb {E}_{M \sim \mathcal {R}}\left[\left(C(M)-\mathbb {E}_{\text{HM}_{\text{pa}} \sim \mathcal {Q}}\left[C\left(\text{HM}_{\text{pa}}\right)\right]+1\right)^{2}\right] \\ &+\mathbb {E}_{\text{HM}_{\text{pa}} \sim \mathcal {Q}}\left[\left(C\left(\text{HM}_{\text{pa}}\right)-\mathbb {E}_{M \sim \mathcal {R}}[C(M)]-1\right)^{2}\right] \tag{22} \\ \mathcal {L}_{G D_{\text{pa}}}^{\text{URaLS}}=&\mathbb {E}_{P \sim \mathcal {R}}\left[\left(C(P)-\mathbb {E}_{\text{HM}_{\text{pe}} \sim \mathcal {Q}}\left[C\left(\text{HM}_{\text{pe}}\right)\right]+1\right)^{2}\right]\\ &+\mathbb {E}_{\text{HM}_{\text{pe}} \sim \mathcal {Q}}\left[\left(C\left(\text{HM}_{\text{pe}}\right)-\mathbb {E}_{P \sim \mathcal {R}}[C(P)]-1\right)^{2}\right] \tag{23} \end{align*}
where M refers to the raw MS image and \text{HM}_{\text{pa}} indicates the spatially reduced-resolution version of HM, i.e., \rm DHM_{pa} in Fig. 8. P denotes the raw PAN image and \text{HM}_{\text{pe}} means the spectrally reduced-resolution version of HM, i.e., \rm DHM_{pe} in Fig. 8.

The RaLS loss functions for the \rm U-RaLSD_{pe} and \rm U-RaLSD_{pa} discriminators are given as follows:
\begin{align*} \mathcal {L}_{D_{\text{pe}}}^{\text{URaLS}}=&\mathbb {E}_{M \sim \mathcal {R}}\left[\left(C(M)-\mathbb {E}_{\text{HM}_{\text{pa}} \sim \mathcal {Q}}\left[C\left(\text{HM}_{\text{pa}}\right)\right]-1\right)^{2}\right]\\ &+\mathbb {E}_{\text{HM}_{\text{pa}} \sim \mathcal {Q}}\left[\left(C\left(\text{HM}_{\text{pa}}\right)-\mathbb {E}_{M \sim \mathcal {R}}[C(M)]+1\right)^{2}\right] \tag{24} \\ \mathcal {L}_{D_{\text{pa}}}^{\text{URaLS}}=&\mathbb {E}_{P \sim \mathcal {R}}\left[\left(C(P)-\mathbb {E}_{\text{HM}_{\text{pe}} \sim \mathcal {Q}}\left[C\left(\text{HM}_{\text{pe}}\right)\right]-1\right)^{2}\right]\\ &+\mathbb {E}_{\text{HM}_{\text{pe}} \sim \mathcal {Q}}\left[\left(C\left(\text{HM}_{\text{pe}}\right)-\mathbb {E}_{P \sim \mathcal {R}}[C(P)]+1\right)^{2}\right] \tag{25} \end{align*}
where \mathcal {L}_{D_{\text{pe}}}^{\text{URaLS}} denotes the loss of the \rm U-RaLSD_{pe} discriminator, and \mathcal {L}_{D_{\text{pa}}}^{\text{URaLS}} denotes the loss of the \rm U-RaLSD_{pa} discriminator.

The total loss function is expressed as
\begin{equation*} \mathcal {L}_{t}=\lambda \mathcal {L}_{G D_{\text{pe}}}^{\text{URaLS}}+\mu \mathcal {L}_{G D_{\text{pa}}}^{\text{URaLS}}+\xi \mathcal {L}_{q}+\kappa \mathcal {L}_{pc}+\rho \mathcal {L}_{mc} \tag{26} \end{equation*}
where \mathcal {L}_{t} is the total loss, and \lambda, \mu, \xi, \kappa, and \rho are the coefficients.
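A hedged TensorFlow sketch of the RaLS losses (22)–(25) and the weighted total generator loss (26) is given below; c_real and c_fake stand for the untranslated critic outputs C() of the corresponding discriminator, and the default weights follow the settings reported in Section IV-C.

```python
import tensorflow as tf

def rals_generator_loss(c_real, c_fake):
    # (22)/(23): the generator pushes fake scores above the mean real score and vice versa.
    return (tf.reduce_mean(tf.square(c_real - tf.reduce_mean(c_fake) + 1.0)) +
            tf.reduce_mean(tf.square(c_fake - tf.reduce_mean(c_real) - 1.0)))

def rals_discriminator_loss(c_real, c_fake):
    # (24)/(25): the discriminator uses the opposite signs of the unit margins.
    return (tf.reduce_mean(tf.square(c_real - tf.reduce_mean(c_fake) - 1.0)) +
            tf.reduce_mean(tf.square(c_fake - tf.reduce_mean(c_real) + 1.0)))

def total_generator_loss(c_pe_real, c_pe_fake, c_pa_real, c_pa_fake, l_q, l_pc, l_mc,
                         lam=2e-4, mu=1e-4, xi=1.0, kappa=1.0, rho=1.0):
    # (26): weighted sum of the two adversarial terms and the three consistency terms.
    return (lam * rals_generator_loss(c_pe_real, c_pe_fake) +
            mu * rals_generator_loss(c_pa_real, c_pa_fake) +
            xi * l_q + kappa * l_pc + rho * l_mc)

c_real, c_fake = tf.random.normal([8]), tf.random.normal([8])
print(rals_generator_loss(c_real, c_fake).numpy(), rals_discriminator_loss(c_real, c_fake).numpy())
```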

SECTION IV.

Experimental Results

A. Datasets

To verify the pansharpening performance of the designed RMFF-UPGAN model, we employ data from Gaofen-2 (GF-2) and QuickBird satellites. For visual observation, red, green, and blue bands are used as R, G, and B channels.

Four-band GF-2 data were acquired from the regions of Beijing and Shenyang, China, giving a total of seven large images; one of them is used for testing and the other six for training and validation. The resolution ratio of the MS and PAN images is 4, the spatial resolutions of the PAN and MS images are 1 and 4 m, and the radiometric resolution is 10 bit. We generate 12 000 samples by randomly cropping the six training images, of which 9600 serve for training and 2400 for validation. We strictly adhere to Wald's protocol [42] to create 286 reduced-resolution testing samples and 286 full-resolution testing samples.

Four-band QuickBird data were acquired from the regions of Chengdu, Beijing, Shenyang, and Zhengzhou, China, giving a total of eight large images; one of them is used for testing and the other seven for training and validation. The resolution ratio of the MS and PAN images is 4, the spatial resolutions of the PAN and MS images are 0.6 and 2.4 m, and the radiometric resolution is 11 bit. We generate 8000 samples by randomly cropping the seven training images, of which 6400 serve for training and 1600 for validation. We create 158 reduced-resolution and 158 full-resolution testing samples.

The sizes of the MS and PAN images for training and validation are 64 × 64 × 4 and 256 × 256 × 1. The sizes of the MS and PAN images of the testing data at the degraded and full resolutions are 100 × 100 × 4 and 400 × 400 × 1.
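A hedged sketch of reduced-resolution pair generation in the spirit of Wald's protocol is given below: the original MS and PAN images are low-pass filtered and downsampled by the resolution ratio, and the original MS image serves as the reference. The Gaussian filter is an illustrative stand-in for a sensor MTF-matched filter.

```python
import numpy as np
from scipy import ndimage

def wald_reduce(ms, pan, ratio=4, sigma=1.0):
    ms_lp = ndimage.gaussian_filter(ms, sigma=(sigma, sigma, 0))   # low-pass filter each band
    pan_lp = ndimage.gaussian_filter(pan, sigma=sigma)
    ms_red = ms_lp[::ratio, ::ratio, :]    # degraded MS (network input)
    pan_red = pan_lp[::ratio, ::ratio]     # degraded PAN (network input)
    return ms_red, pan_red, ms             # the original MS serves as the reference (GT)

ms = np.random.rand(400, 400, 4)           # sizes follow the testing patches of Section IV-A
pan = np.random.rand(1600, 1600)
ms_red, pan_red, gt = wald_reduce(ms, pan)
print(ms_red.shape, pan_red.shape, gt.shape)   # (100, 100, 4) (400, 400) (400, 400, 4)
```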

B. Quality Evaluation Metrics

To verify the designed RMFF-UPGAN model, we carry out two types of experiments, i.e., reduced-resolution pansharpening and full-resolution pansharpening. In addition, subjective assessment and objective index evaluation of the pansharpening results are conducted. The subjective assessment mainly compares a pansharpened image (PI) with a reference image (RI), judging the retention of spatial details and spectral information. The objective evaluation indexes include full-reference metrics and no-reference indexes.

The full-reference indexes are applied to evaluate the reduced-resolution pansharpening and compare the PI with the RI. The metrics we utilize include the structural correlation coefficient (SCC) [42], structural similarity (SSIM) [43], universal image quality index (UIQI, abbreviated as Q) [44] extended to n bands (Qn) [45], [46], spectral angle mapping (SAM) [47], and erreur relative globale adimensionnelle de synthèse (ERGAS) [48]. Specifically, the SCC determines the structural correlation between the PI and RI. The SSIM measures the similarity between the PI and RI from three aspects: luminance, contrast, and structure. The Q comprehensively measures the difference between the PI and RI in terms of correlation loss, luminance distortion, and contrast distortion; the smaller the difference is, the closer the Q is to 1 and the better the PI is. The SAM evaluates the angle of the spectral vector between the PI and RI; the smaller the angle is, the closer the SAM is to 0 and the closer the PI is to the RI. The ERGAS evaluates the spectral quality of the bands in the spectral range, representing the overall condition of the spectral changes; the closer the value is to 0, the better the pansharpening effect in the spectral range.

The quality with no-reference (QNR) indicators do not need the RI when evaluating the pansharpening performance. They are used to evaluate the pansharpening results of full resolution. The metrics include D_{\lambda }, D_{S}, and QNR [49].

The ideal value of the SCC, SSIM, Q, Qn, and QNR is 1, the closer to 1, the better the pansharpening effect is. The ideal value of the SAM, ERGAS, D_{\lambda }, and D_{S} is 0. The closer to 0, the better the effect of pansharpening is.

C. Implementation Details

The implemented framework is TensorFlow, and the experimental setup involves an Intel Xeon CPU and an NVIDIA Tesla V100 PCIe GPU with 16-GB video memory. In the training phase, we optimize the model using the Adam optimizer [50], setting the batch size to 8 and the number of epochs to 40. The initial learning rate is set to 2 \times 10^{-4}. The factors in (26) are set following [35] as follows: \lambda =2\times 10^{-4}, \mu =1\times 10^{-4}, and \xi =\kappa =\rho =1. To compare the pansharpening results of the various approaches more fairly, the CNN-based approaches are run on the GPU and the CS/MRA-based approaches are run on the CPU.
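For reference, a minimal sketch of this training configuration (Adam optimizer, batch size 8, 40 epochs, initial learning rate 2 × 10^{-4}, and the loss weights of (26)) is given below; the one-optimizer-per-network setup is an assumption, and the generator, discriminators, and data pipeline are omitted.

```python
import tensorflow as tf

BATCH_SIZE, EPOCHS, INITIAL_LR = 8, 40, 2e-4
LAMBDA, MU, XI, KAPPA, RHO = 2e-4, 1e-4, 1.0, 1.0, 1.0   # loss weights in (26), following [35]

# One Adam optimizer per network: the generator and the two U-shaped discriminators (assumed setup).
opt_g = tf.keras.optimizers.Adam(learning_rate=INITIAL_LR)
opt_d_pe = tf.keras.optimizers.Adam(learning_rate=INITIAL_LR)
opt_d_pa = tf.keras.optimizers.Adam(learning_rate=INITIAL_LR)
```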

D. Reduced Resolution Experiments

To evaluate the pansharpening performance of the presented RMFF-UPGAN model, we have carried out comparative experiments with the advanced traditional methods and GAN-based methods. The compared traditional methods are GSA [9], BDSD [11], SFIM [12], and MTF-GLP [51]. The GAN-based methods include RED-cGAN [29], PsGAN [30], PGMAN [35], and DIGAN [31]. We carry out comparative experiments on the GF-2 and QuickBird data with reduced and full resolutions. The averages of the quantitative evaluation of the GF-2 and QuickBird experimental results at the reduced resolution are listed in Tables I and II. From Tables I and II, it is noticed that the proposed RMFF-UPGAN is optimal in all six indicators, demonstrating a superior pansharpening capability.

TABLE I Average of Quantitative Assessment of Experimental Results on GF-2 Data With the Degraded Resolution
TABLE II Average of Quantitative Assessment of Experimental Results on QuickBird Data With the Degraded Resolution

We display and analyze the experimental outcomes of GF-2 and QuickBird as follows. For better observation of the differences between the various comparative approaches, we represent the differences between the PI and RI via the average spectral difference map (ASDM), i.e., the spectral angle between the PI and RI, and the average intensity difference map (AIDM).

The pansharpening results of all the compared models on the GF-2 testing data at the degraded resolution are shown in Fig. 10. Fig. 10(a) is the upsampled degraded-resolution MS image, Fig. 10(b) is the corresponding PAN image, Fig. 10(c)–(k) illustrates the pansharpening results of the GSA, BDSD, SFIM, MTF-GLP, RED-cGAN, PsGAN, PGMAN, DIGAN, and RMFF-UPGAN models, respectively, and Fig. 10(l) is the reference image, i.e., the ground truth (GT). To visualize the details more distinctly, we magnify the contents of the red and yellow boxes in Fig. 10, as depicted in Fig. 11. From Fig. 10, it is evident that the pansharpening result of the RMFF-UPGAN model is the most similar to the GT in terms of both spectrum and structure. From Fig. 11, it is clearly observed that the result of the GSA approach exhibits spectral distortion in comparison with the GT. For the contents in the red box, spectral distortion caused by the blue rendering occurs for all the approaches except the proposed RMFF-UPGAN model. Furthermore, the outcomes of the RED-cGAN, PGMAN, and DIGAN approaches suffer from blurred edges. The result of the RMFF-UPGAN approach is the closest to the GT visually. As for the contents in the yellow box, the outcomes of the BDSD, RED-cGAN, and PGMAN are rather fuzzy. Moreover, only the RMFF-UPGAN accurately captures the spectral information in the white ellipse, whereas the others fail to express it precisely.

Fig. 10. Visual evaluation of the pansharpening results on the GF-2 testing data at the decreased resolution. (a) EXP. (b) D_P. (c) GSA. (d) BDSD. (e) SFIM. (f) MTF-GLP. (g) RED-cGAN. (h) PsGAN. (i) PGMAN. (j) DIGAN. (k) RMFF-UPGAN. (l) GT.

Fig. 11. Enlargement of the contents of the red and yellow boxes in Fig. 10. (a) EXP. (b) D_P. (c) GSA. (d) BDSD. (e) SFIM. (f) MTF-GLP. (g) RED-cGAN. (h) PsGAN. (i) PGMAN. (j) DIGAN. (k) RMFF-UPGAN. (l) GT.

Figs. 12 and 13 depict the ASDM and AIDM. To facilitate comparison, the differences are highlighted in colors; the color bar varies from dark blue to red, and the values vary from 0 to 1. From Figs. 12 and 13, we can clearly observe that the difference between the PI of the RMFF-UPGAN approach and the GT is the smallest, i.e., its fusion performance is optimal.

Fig. 12. ASDM between the PI and GT on the GF-2 testing data at the decreased resolution. (a) GSA. (b) BDSD. (c) SFIM. (d) MTF-GLP. (e) RED-cGAN. (f) PsGAN. (g) PGMAN. (h) DIGAN. (i) RMFF-UPGAN.

Fig. 13. AIDM between the PI and GT on the GF-2 testing data at the decreased resolution. (a) GSA. (b) BDSD. (c) SFIM. (d) MTF-GLP. (e) RED-cGAN. (f) PsGAN. (g) PGMAN. (h) DIGAN. (i) RMFF-UPGAN.

Table III lists the objective assessment indicators between the PI and GT on the GF-2 testing data at the decreased resolution. In Table III, the bold indicators reflect that the RMFF-UPGAN is superior for all the indices and has the best pansharpening performance among all the comparative approaches. The SAM and ERGAS of the RMFF-UPGAN approach are minimal, signifying that it has the least spectral distortion and its result retains more spectral information. Combining Figs. 10–13 with the SCC and SSIM indices in Table III, it is clear that the RMFF-UPGAN approach preserves the most structural information, i.e., the definition of its pansharpening results is superior.

The pansharpening results of the compared approaches for the QuickBird testing data at the decreased resolution are exhibited in Fig. 14. The corresponding ASDM and AIDM are presented in Figs. 15 and 16, respectively. The assessment indicators between the PI and GT on the QuickBird testing data at the decreased resolution are reported in Table IV. From Figs. 14 and 15, we can clearly see that there are relatively severe spectral distortions in the results of the GSA, BDSD, SFIM, MTF-GLP, and PGMAN. Combining Figs. 14 and 16, it is apparent that the results of the BDSD, SFIM, MTF-GLP, and PGMAN are relatively blurry. According to the visual presentation in Fig. 14, the outcomes are better for the RED-cGAN, PsGAN, DIGAN, and RMFF-UPGAN; however, in Table IV, RMFF-UPGAN achieves the optimal results for the indices Q4, Q, SCC, SSIM, SAM, and ERGAS. Consequently, the pansharpening result of the RMFF-UPGAN is optimal.

TABLE III Objective Indices Between the PI and GT on the GF-2 Testing Data At the Decreased Resolution
TABLE IV Objective Indices Between the PI and GT on the QuickBird Testing Data At the Decreased Resolution
Fig. 14. Visual evaluation of the pansharpening results on the QuickBird testing data at the decreased resolution. (a) EXP. (b) D_P. (c) GSA. (d) BDSD. (e) SFIM. (f) MTF-GLP. (g) RED-cGAN. (h) PsGAN. (i) PGMAN. (j) DIGAN. (k) RMFF-UPGAN. (l) GT.

Fig. 15. ASDM between the PI and GT on the QuickBird testing data at the decreased resolution. (a) GSA. (b) BDSD. (c) SFIM. (d) MTF-GLP. (e) RED-cGAN. (f) PsGAN. (g) PGMAN. (h) DIGAN. (i) RMFF-UPGAN.

Fig. 16. AIDM between the PI and GT on the QuickBird testing data at the decreased resolution. (a) GSA. (b) BDSD. (c) SFIM. (d) MTF-GLP. (e) RED-cGAN. (f) PsGAN. (g) PGMAN. (h) DIGAN. (i) RMFF-UPGAN.

E. Full-Resolution Experiments

This section presents the pansharpening experiments on the full-resolution testing data of GF-2 and QuickBird. Because there are no reference data, subjective visual estimation and objective index assessment are carried out. Table V presents the average of the objective indicators of the experimental results on the testing data of GF-2 and QuickBird. From Table V, it is evident that the proposed RMFF-UPGAN approach is optimal for the three metrics D_{\lambda }, D_{S}, and QNR, revealing that it is superior in preserving spectral and structural information and achieves the best pansharpening performance. The results of the full-resolution experiments performed on the GF-2 and QuickBird data are described as follows.

TABLE V Average of Quantitative Assessment of Experimental Results on GF-2 and QuickBird Data With the Full Resolution

Fig. 17 presents the visual assessment of the pansharpening results of the comparative approaches on the GF-2 testing data with the full resolution. Fig. 17(a) is the raw MS image, Fig. 17(b) is the upsampled MS image, Fig. 17(c) is the raw PAN image, and Fig. 17(d)–(l) presents the pansharpening results of all the comparative models. Table VI lists the objective assessment indexes of the PIs of the various comparative models on the GF-2 testing data with the full resolution. From Fig. 17, we note that the results of the GSA and MTF-GLP exhibit rather severe spectral distortion, as presented in the yellow ellipse of the figure. The outcomes of the BDSD and SFIM are blurry and exhibit ringing. Compared to the MS image, the results of the RED-cGAN and PsGAN exhibit darker colors. The edges of the result of DIGAN suffer from artifacts, such as the edge in the yellow ellipse in Fig. 17. The pansharpening results of the PGMAN and RMFF-UPGAN models are better in detail retention; however, for the preservation of spectral information, the RMFF-UPGAN approach is superior. Furthermore, from Table VI, it is noticed that the D_{\lambda }, D_{S}, and QNR of the RMFF-UPGAN model are optimal; therefore, the designed RMFF-UPGAN method is superior.

TABLE VI Objective Indexes of the Comparative Models on the GF-2 Testing Data With the Full Resolution

Fig. 17. Visual assessment of the pansharpening results on the GF-2 testing data with the full resolution. (a) MS. (b) EXP. (c) PAN. (d) GSA. (e) BDSD. (f) SFIM. (g) MTF-GLP. (h) RED-cGAN. (i) PsGAN. (j) PGMAN. (k) DIGAN. (l) RMFF-UPGAN.

Fig. 18 presents the visual evaluation of the pansharpening results on the QuickBird testing data with the full resolution. To provide a better view of the details, we magnify the contents of the red, yellow, and orange rectangles in Fig. 18; the enlargements are depicted in Fig. 19, corresponding to the upper left, upper right, and bottom, respectively. Table VII lists the objective evaluation indexes of the pansharpened images on the QuickBird testing data at the full resolution. From Figs. 18 and 19, it is noticeable that all models enhance the resolution of the results. However, the result of the BDSD approach suffers from severe overall spectral distortion; the result of the SFIM approach exhibits spectral distortion (upper right in Fig. 19) and structural distortion (upper left in Fig. 19); and artifacts arise in the results of the SFIM and MTF-GLP approaches. The outcome of the DIGAN model is rather blurry. The upper-right enlargements of RED-cGAN, PsGAN, and PGMAN suffer from structural distortion. Considering the retention of both spectral and structural information, the designed RMFF-UPGAN model is preferable. Table VII also confirms that RMFF-UPGAN achieves the best D_{\lambda}, D_{S}, and QNR values, indicating better preservation of spectral information and details and a more satisfactory pansharpening capability.

TABLE VII Objective Indexes of the Comparative Models on the QuickBird Testing Data With the Full Resolution

Fig. 18. Visual assessment of the results on the QuickBird testing data with the full resolution. (a) MS. (b) EXP. (c) PAN. (d) GSA. (e) BDSD. (f) SFIM. (g) MTF-GLP. (h) RED-cGAN. (i) PsGAN. (j) PGMAN. (k) DIGAN. (l) RMFF-UPGAN.

Fig. 19. Enlargement of the contents in Fig. 18. (a) MS. (b) EXP. (c) PAN. (d) GSA. (e) BDSD. (f) SFIM. (g) MTF-GLP. (h) RED-cGAN. (i) PsGAN. (j) PGMAN. (k) DIGAN. (l) RMFF-UPGAN.

F. Ablation Experiment

Four ablation experiments are conducted on the QuickBird testing data at the reduced and full resolutions to verify the usefulness of the structure of the presented network. The first validates the effectiveness of the ResNeXt block in the RMFF-UPGAN model, the second proves the usefulness of the CIM in the reconstruction phase, the third verifies the validity of the RIRB in the RMFF-UPGAN model, and the fourth confirms the validity of the U-shaped structure in the discriminator.

1) Effectiveness of the ResNeXt Module

The ResNeXt block is a vital component of the proposed RMFF-UPGAN model. We replace the ResNeXt block with a plain residual block while keeping the other parameters of the network unchanged; this variant is denoted NX. The objective metrics in Table VIII reveal that RMFF-UPGAN outperforms NX on the QuickBird testing data at the reduced and full resolutions, demonstrating the positive contribution of the ResNeXt block to feature extraction.

TABLE VIII Objective Indexes of the Ablation Experiments on the QuickBird Testing Data At the Reduced and Full Resolutions
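
To make the NX comparison concrete, the following PyTorch sketch contrasts a plain residual block with a ResNeXt-style block that splits the 3 × 3 convolution into grouped branches. The channel width, cardinality, and bottleneck size are illustrative assumptions, not the values used in RMFF-UPGAN.

import torch
import torch.nn as nn

class PlainResidualBlock(nn.Module):
    # Ordinary residual block (the NX variant): two 3x3 convolutions plus an identity skip.
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class ResNeXtBlock(nn.Module):
    # ResNeXt-style block: 1x1 reduce -> grouped 3x3 (cardinality branches) -> 1x1 expand, plus skip.
    def __init__(self, channels=64, cardinality=8, bottleneck=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1, groups=cardinality), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

The grouped convolution gives the aggregated-transformations behavior of ResNeXt at roughly the same parameter budget as the plain block, which is the property the NX ablation isolates.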

2) Validity of the CIM

The CIM in the reconstruction stage serves to complement the spectral and structural information. We remove the CIM for comparison while keeping the other parameters of the network unchanged; this variant is denoted NMF. As presented in Table VIII, the pansharpening performance of RMFF-UPGAN outperforms that of NMF on the reduced-resolution data. For the full-resolution data, NMF is preferable to RMFF-UPGAN on the D_{\lambda} index, but RMFF-UPGAN is superior to NMF on D_{S} and QNR and preserves more spatial structure information. It can also be noticed that NX is better than NMF across the metrics, indicating that the CIM plays the more effective role in retaining spectral and structural information during reconstruction.

3) Effectiveness of the RIRB

We generate HM by replacing the RIRB with a convolutional layer and an addition operation, keeping the other parameters of the network unchanged; this variant is named NRM. The metrics in Table VIII show that RMFF-UPGAN is superior to NRM on the data at the reduced and full resolutions, revealing that the RIRB reduces spectral and structural distortion.

4) Validity of the U-Shaped Discriminator

We remove the upsampling structure from the U-shaped discriminator and retain only the downsampling component to form a plain discriminator, holding the other parameters of the network unchanged; this variant is denoted NU. As presented in Table VIII, the pansharpening capability of RMFF-UPGAN is better than that of NU on the reduced- and full-resolution data, signifying that the U-shaped discriminator boosts the capability of the network. In terms of contribution to network performance, the CIM contributes the most, followed by the U-shaped discriminator, the ResNeXt block, and the RIRB.
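
The NU ablation can be visualized with the following PyTorch sketch: a U-shaped discriminator whose decoder and skip connections yield a per-pixel real/fake map, versus a plain discriminator that keeps only the downsampling path and outputs a single score. The layer widths, depths, and four-band input are illustrative assumptions, not those of the two discriminators in RMFF-UPGAN.

import torch
import torch.nn as nn

def down(cin, cout):
    # Stride-2 convolution halves the spatial resolution.
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True))

def up(cin, cout):
    # Stride-2 transposed convolution doubles the spatial resolution.
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True))

class UShapedDiscriminator(nn.Module):
    # Encoder-decoder discriminator with skip connections; outputs a per-pixel decision map.
    def __init__(self, in_ch=4):
        super().__init__()
        self.d1, self.d2, self.d3 = down(in_ch, 32), down(32, 64), down(64, 128)
        self.u1, self.u2, self.u3 = up(128, 64), up(128, 32), up(64, 32)
        self.out = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, x):
        e1 = self.d1(x)
        e2 = self.d2(e1)
        e3 = self.d3(e2)
        y = self.u1(e3)
        y = self.u2(torch.cat([y, e2], dim=1))  # skip connection from the encoder
        y = self.u3(torch.cat([y, e1], dim=1))  # skip connection from the encoder
        return self.out(y)

class PlainDiscriminator(nn.Module):
    # Downsampling path only (the NU variant): a single global real/fake score.
    def __init__(self, in_ch=4):
        super().__init__()
        self.body = nn.Sequential(down(in_ch, 32), down(32, 64), down(64, 128),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1))

    def forward(self, x):
        return self.body(x)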

G. Computation and Time

The model complexity is discussed in terms of model parameters and test time, as presented in Table IX. For the traditional models, only the test time is given; both the parameters and the test time are provided for the DL-based models. The test time is the average over 286 pairs of full-resolution GF-2 test data, where the MS and PAN images are of size 100 × 100 × 4 and 400 × 400 × 1, respectively. The experiments of the first four traditional models are run on the previously mentioned CPU, and those of the latter five DL-based models are run on the previously mentioned GPU. The unit M of the parameters in Table IX denotes 10^{6}, and the unit of time is seconds (s). From Table IX, it can be observed that the GSA and SFIM approaches are relatively simple and fast, whereas the BDSD and MTF-GLP approaches are relatively complex and slow. Among the DL-based models, RMFF-UPGAN has the most parameters, followed by PGMAN, while PGMAN has the shortest test time; this is because PGMAN processes only the high-frequency information of the MS and PAN images, whereas the other models process the entire images. The parameters of RED-cGAN and PsGAN are very close, and their test times are similar. Because of the downsampling in the RMFF-UPGAN model, the image size is reduced within the network, so its computation is relatively small. Considering model size and speed, the RMFF-UPGAN model is efficient.

TABLE IX Parameters of Comparative Models and Average Test Time on GF-2 Data
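
As a reference for how the two quantities in Table IX can be measured for a DL-based model, the following PyTorch sketch counts trainable parameters (in M = 10^{6}) and averages the per-pair inference time. The model signature model(ms, pan), the dummy GF-2-sized tensors, and the 286-pair loop are placeholders rather than the paper's test code.

import time
import torch

def count_parameters_m(model):
    # Total trainable parameters, expressed in millions (M = 1e6).
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def average_test_time(model, pairs, device="cuda"):
    # Average forward-pass time in seconds over a list of (MS, PAN) tensor pairs.
    model.eval().to(device)
    total = 0.0
    for ms, pan in pairs:
        ms, pan = ms.to(device), pan.to(device)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        _ = model(ms, pan)  # placeholder call signature; adapt to the model under test
        if device == "cuda":
            torch.cuda.synchronize()
        total += time.perf_counter() - start
    return total / len(pairs)

# Example with GF-2-sized inputs: a 100x100x4 MS patch and a 400x400x1 PAN patch per pair.
# pairs = [(torch.rand(1, 4, 100, 100), torch.rand(1, 1, 400, 400)) for _ in range(286)]
# print(count_parameters_m(model), average_test_time(model, pairs))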
SECTION V.

Conclusion

In this article, we propose the RMFF-UPGAN model to boost the spatial resolution and preserve the spectral information. RMFF-UPGAN comprises a generator and two U-shaped discriminators. A dual-stream trapezoidal branch is designed in the generator to obtain multiscale information, and the ResNeXt and residual learning blocks are employed to extract the spatial structure and spectral information at four scales. Further, a recursive mixed-scale feature fusion subnetwork is designed: a prior fusion is first performed on the extracted MS and PAN features of the same scale, a mixed-scale fusion is then conducted on the prior fusion results of the fine and coarse scales, and the fusion proceeds sequentially in this manner, building a recursive mixed-scale fusion structure and finally generating the key information. The CIM is also designed for the reconstruction of the key information to compensate for spectral and structural information. The nonlinear RIRB is developed to overcome the distortion induced by neglecting the complicated relationship between the MS and PAN images. Two U-shaped discriminators are designed, and a new composite loss function is defined. The RMFF-UPGAN model is validated on the GF-2 and QuickBird datasets, and the experimental outcomes are better than those of the prevalent approaches regarding both visual assessment and objective indicators. The RMFF-UPGAN model has superior performance in enhancing the spatial resolution and retaining the spectral information, which boosts the fusion quality. Our further work will investigate unsupervised models to further strengthen the capability of the pansharpening network.
