Introduction
Remote sensing images are extensively utilized in geological exploration, terrain classification, agricultural yield prediction, pest detection, disaster prediction, national defense, environmental change detection, and other fields [1], [2]. These applications require images with high spatial, spectral, or temporal resolution. However, owing to the limitations of sensor technology, we can only obtain low spatial resolution multispectral or hyperspectral (LRMS/LRHS) images, low temporal resolution multispectral or hyperspectral images, and low spectral resolution panchromatic (PAN) images [3], [4]. Fusion techniques are therefore required to combine LRMS and PAN images into high spatial resolution multispectral (HRMS) images; this process is called panchromatic sharpening (pansharpening). Pansharpening techniques are generally divided into component substitution (CS) approaches, multiresolution analysis (MRA) techniques, variational optimization (VO) methods, and deep learning (DL) models [1], [5], [6].
CS techniques primarily include intensity-hue-saturation (IHS) and its variants [7], Gram–Schmidt (GS) [8], GS adaptive (GSA) [9], principal component analysis (PCA) [10], and band-dependent spatial detail (BDSD) [11]. The LRMS image is first projected into another spatial domain, then its spatial structure component is extracted and replaced by that of the high-resolution PAN image, and finally the result is inversely transformed back to the original space to obtain the fused image. The strengths of CS methods are their simplicity, wide application, integration in existing software, easy implementation, and strong enhancement of the spatial resolution of LRMS images. Their drawbacks involve spectral distortion, oversharpening, aliasing, and blurring.
MRA approaches principally include smoothing-filter-based intensity modulation (SFIM) [12], the Laplacian pyramid (LP) transform [13], the generalized LP (GLP) transform [14], the curvelet transform [15], the contourlet transform [16], the nonsubsampled contourlet transform (NSCT) [17], and the modulation transfer function-GLP (MTF-GLP) transform and its variants [7]. MRA approaches decompose the LRMS and PAN images, fuse the decomposed components according to certain rules, and generate the fused image by inverse transformation. Compared with CS methods, MRA methods preserve more spectral information and reduce spectral distortion, but the spatial resolution of their results is relatively low.
VO methods consist of two parts: an energy functional and an optimization algorithm. Their core is the optimization of a variational model, such as the panchromatic and multispectral image (P+XS) model [18], the nonlocal variational pansharpening model [19], and others [7], [20]. Compared with CS and MRA methods, VO methods have higher spectral fidelity, but their computations are more complex.
Convolutional neural networks (CNNs) and generative adversarial networks (GANs) have been widely applied in image processing, and notable progress has been made in the pansharpening of remote sensing images. Early on, a three-layer pansharpening CNN (PNN) was designed [21] based on superresolution reconstruction. The nonlinear mapping of the CNN is employed to generate HRMS images by feeding LRMS and PAN image pairs into the PNN. The PNN is relatively simple and easy to implement, but it is prone to overfitting. Subsequently, the target-adaptive CNN (TA-CNN) [22] was proposed, which utilizes a target-adaptive adjustment stage to solve the problems of mismatched data sources and insufficient training data. Yang et al. [23] presented a deep pansharpening network based on ResNet modules, i.e., PanNet, which takes the high-frequency information of the LRMS and PAN images as input and outputs the residual between the HRMS and LRMS images. Nevertheless, PanNet overlooks the low-frequency information, causing spectral distortion. Wei et al. [24] modeled a deep residual pansharpening neural network (DRPNN) built on ResNet blocks. Although the DRPNN exploits the powerful nonlinear capability of the CNN, the number of training samples must grow with the network depth to avoid overfitting, and since training is performed in the spatial domain, the generalization ability of the model still needs to be improved. Deng et al. [25] proposed the FusionNet model based on the CS and MRA detail injection models, where the injected details are obtained with a deep CNN (DCNN). Unlike other networks, its input is the difference between the PAN image, duplicated to the same number of channels as the LRMS image, and the LRMS image; this network can thus introduce multispectral information and reduce spectral distortion. Hu et al. [26] proposed a multiscale dynamic convolutional neural network (MDCNN), which mainly contains three modules: a filter generation network, a dynamic convolution network, and a weight generation network. The MDCNN uses multiscale dynamic convolution to extract multiscale features of the LRMS and PAN images and designs a weight generation network to adjust the relationship between features at different scales, thereby improving the adaptability of the network. Although dynamic convolution improves the flexibility of the network, the network design is more complicated, and because the features of the LRMS and PAN images are extracted simultaneously, the network tends to lose effective detail and spectral information. Wu et al. [27] proposed RDFNet based on a distributed fusion structure and residual modules, which extracts multilevel features of the LRMS and PAN images separately; the MS and PAN features of the corresponding level and the fusion result of the previous step are then fused gradually to obtain the HRMS image. Although the network exploits the multilevel LRMS and PAN features as much as possible, it is limited by the network depth and cannot obtain more detail and spectral information. Wu et al. [28] also designed TDPNet based on cross-scale fusion and multiscale detail compensation. GANs offer great potential for generating images [5]. Shao et al. [29] presented a supervised conditional GAN comprising a residual encoder–decoder, i.e., RED-cGAN, which enhances the sharpening ability under the restriction of the PAN image. Liu et al.
[30] developed a deep CNN-based pansharpening GAN, i.e., PsGAN, consisting of a dual-stream generator and a discriminator that distinguishes the generated MS image from the reference image. Benzenati et al. [31] introduced a detail injection GAN (DIGAN) constructed from a dual-stream generator and a relativistic average discriminator. RED-cGAN, PsGAN, and DIGAN are supervised approaches trained on degraded-resolution data; nevertheless, their products are not satisfactory when applied to full-resolution data. Ozcelik et al. [32] constructed a self-supervised learning framework that regards pansharpening as a colorization task, i.e., PanColorGAN, which reduces blurring by color injection and random-scale downsampling. Li et al. [33] put forward a self-supervised approach using a cycle-consistent GAN trained on reduced-resolution data, which builds two generators and two discriminators: the LRMS and PAN images are fed into the first generator to yield the predicted image, and the predicted image is then input to the second generator to recover a PAN image that remains consistent with the input PAN. To address the lack of reference HRMS images, several unsupervised GANs have been presented. Ma et al. [34] suggested an unsupervised pansharpening GAN (Pan-GAN) composed of a generator and two discriminators (a spectral discriminator and a spatial discriminator). The generator produces HRMS images from concatenated MS and PAN images. The spectral discriminator judges the spectral information between the HRMS and LRMS images so that the generated HRMS data are spectrally consistent with the LRMS data, while the spatial discriminator discerns the spatial information between the HRMS and PAN images, enabling the generated HRMS image to agree with the spatial information of the PAN image. Pan-GAN uses two discriminators to better retain spectral and spatial structure information and solves the ambiguity caused by downsampling in the supervised training process. However, its input is the concatenated MS and PAN images, resulting in insufficient details and spectral information. Zhou et al. [35] proposed an unsupervised dual-discriminator GAN (PGMAN), which utilizes a dual-stream generator to yield the HRMS image and two discriminators to retain spectral information and details individually. Pan-GAN and PGMAN are trained directly on the original data with no reference images, which yields better results at full resolution, but their results on degraded-resolution data are not desirable, revealing the poor generalization ability of these models.
Although various scholars have proposed a variety of pansharpening networks and achieved certain fusion results, a majority of the models are trained on reduced-resolution data and therefore exhibit spectral distortion and loss of details when fusing full-resolution data, owing to the change in resolution. Moreover, in the detail injection model, the details are directly added to the upsampled MS image, ignoring the complicated relationship between the MS image and the PAN image, which is likely to lead to spectral distortion or ringing. To address these problems, an unsupervised GAN with recursive mixed-scale feature fusion for pansharpening (RMFF-UPGAN) is modeled to boost the spatial resolution and preserve the spectral information; it is trained on observed data without reference images. The main contributions of this article are as follows.
1) A dual-stream trapezoidal branch is designed in the generator to obtain multiscale information. A ResNeXt block and residual learning blocks are employed to obtain the spatial structure and spectral information at four scales.
2) A recursive mixed-scale feature fusion structure is designed, which executes prior fusion and mixed-scale fusion sequentially to generate key information.
3) A compensation information mechanism (CIM) is designed to supply complementary information during the reconstruction of the key information.
4) A nonlinear rectification block for the reconstructed information is developed to overcome the distortion induced by ignoring the complicated relationship between the MS and PAN images.
5) Two U-shaped discriminators are designed and a new composite loss function is defined to better preserve spectral information and details.
The rest of this article is organized as follows. Section II describes related work. Section III describes the proposed model in detail. Section IV introduces datasets, evaluation indicators, experimental settings, and comparative experiments. Finally, Section V concludes this article.
Related Work
A. MRA-Based Detail Injection Model
MRA methods [36], [37] are a class of image fusion methods that are particularly common in the remote sensing field. These methods have good multiscale spatial-frequency decomposition characteristics, singularity representation abilities, and properties consistent with visual perception. The efficient filter-bank implementation of the wavelet transform makes it feasible to process large-scale remote sensing image fusion. In MRA methods, the image is first decomposed into low-frequency and high-frequency components by some decomposition method; the high-frequency and low-frequency components are then fused according to a fusion rule; finally, the fused components are reconstructed by inverse transformation to generate the fused image. An MRA-based detail injection model can be represented by the general detail injection framework shown in the following expression:
\begin{equation*}
\hat{F}_{k}\,=\,\uparrow M_{k}+g_{k}\left(P-P_{L}\right) \quad k=1,2,\ldots, N \tag{1}
\end{equation*}
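As a concrete illustration of (1), the following numpy sketch implements the general detail-injection framework, assuming a Gaussian low-pass filter as a stand-in for P_L and unit injection gains g_k; both choices are illustrative and do not correspond to any specific MRA method such as MTF-GLP.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def detail_injection(ms, pan, gains=None, sigma=2.0):
    """Sketch of Eq. (1): F_k = upsample(M_k) + g_k * (P - P_L).
    ms:  (h, w, N) low-resolution multispectral image
    pan: (H, W) panchromatic image with H = 4h, W = 4w
    sigma: width of the illustrative Gaussian low-pass used for P_L
    """
    h, w, n_bands = ms.shape
    H, W = pan.shape
    ms_up = zoom(ms, (H / h, W / w, 1), order=3)    # upsampled MS bands
    pan_low = gaussian_filter(pan, sigma=sigma)     # low-pass PAN (P_L)
    details = pan - pan_low                         # extracted spatial details
    if gains is None:
        gains = np.ones(n_bands)                    # g_k = 1 (illustrative)
    return ms_up + details[..., None] * gains[None, None, :]
```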
B. ResNeXt
Xie et al. [38] proposed the ResNeXt structure, an improvement of ResNet [39]. The network uses group convolution to reduce network complexity and improve representation ability. The core of ResNeXt is the notion of cardinality, i.e., the size of the set of parallel transformations, which serves as an additional model dimension besides depth and width. ResNeXt shows that, for similar computational complexity and numbers of parameters, increasing the cardinality achieves better representation ability than increasing the depth or width of the network. The ResNeXt structure [38] takes advantage of the split-transform-merge idea, and the convolution topology of each path is the same, which reduces the computational complexity. The mathematical expression is as follows:
\begin{equation*}
y=x+\sum _{i=1}^{C} \mathcal {T}_{i}(x) \tag{2}
\end{equation*}
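A minimal sketch of the aggregated transformation in (2), written with tf.keras and assuming TensorFlow 2.x with grouped-convolution support; the bottleneck width and filter counts are illustrative, not the settings used in this article.

```python
import tensorflow as tf

def resnext_block(x, cardinality=32, bottleneck_width=4, kernel_size=3):
    """Sketch of Eq. (2): y = x + sum_i T_i(x).
    The C parallel paths are realized as one grouped convolution, the
    equivalent form introduced in the original ResNeXt paper."""
    in_channels = x.shape[-1]
    width = cardinality * bottleneck_width
    t = tf.keras.layers.Conv2D(width, 1, padding="same", activation="relu")(x)
    t = tf.keras.layers.Conv2D(width, kernel_size, padding="same",
                               groups=cardinality, activation="relu")(t)
    t = tf.keras.layers.Conv2D(in_channels, 1, padding="same")(t)
    return tf.keras.layers.Add()([x, t])            # identity shortcut of Eq. (2)
```

For instance, `y = resnext_block(tf.keras.Input(shape=(64, 64, 64)))` builds one block whose output keeps the input channel count, matching the identity shortcut in (2).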
Methodology
RMFF-UPGAN is modeled to improve spatial resolution and retain spectral information. It is trained directly on the raw full-resolution data to decrease the effect of resolution variation on the results. The overall architecture of RMFF-UPGAN is illustrated in Fig. 1 and is composed of one dual-stream generator and two U-shaped relative average discriminators (i.e.,
A. Dual-Stream Generator
The designed dual-stream generator consists of a dual-stream trapezoidal multiscale feature extraction module, a recursive mixed-scale feature fusion module, a dual-stream multiscale feature reconstruction module, and a reconstructed information rectification module. The architecture of each module is explained in detail as follows.
1) Dual-Stream Trapezoidal Multiscale Feature Extraction (DSTMFE)
The structure of the DSTMFE branch of the generator is shown in Fig. 2, which consists of two independent branches and differs from our previous work TDPNet [28]. We substitute the maxpooling operation with the Conv4, i.e., a convolution operation with a kernel size of 4 and a stride of 2. The top branch extracts four scale features of PAN images and the bottom branch extracts four scale features of MS images, where
The expressions that extract features of the MS image and PAN image using the ResNeXt module are displayed in (3) and (4), respectively. The expressions that extract features of the MS image and PAN image using the residual learning module are presented in (5)–(8),
\begin{align*}
M_{1}=& \uparrow M+\sum _{j=1}^{16} \mathcal {T}_{j}(\uparrow M) \tag{3}
\\
P_{1}=&P+\sum _{j=1}^{16} \mathcal {T}_{j}(P) \tag{4}
\\
M_{i}=&\Phi _{m}\left(h\left(M_{i-1}\right)+\mathcal {F}\left(M_{i-1}, W_{mi}\right)\right) \tag{5}
\\
&\qquad h\left(M_{i-1}\right)=W_{mi}^{\prime } \ast M_{i-1} \tag{6}
\\
P_{i}=&\Phi _{p}\left(h\left(P_{i-1}\right)+\mathcal {F}\left(P_{i-1}, W_{pi}\right)\right) \tag{7}
\\
&\qquad h\left(P_{i-1}\right)=W_{pi}^{\prime } \ast P_{i-1} \tag{8}
\end{align*}
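A sketch of one stream of the DSTMFE branch corresponding to (3)–(8), reusing the resnext_block sketch above; the filter counts and the internal layout of the residual branch F(·) are assumptions, while the strided 4 × 4 projection shortcut follows the Conv4 replacement of max-pooling described above.

```python
import tensorflow as tf

def strided_residual_block(x, filters):
    """Residual learning block of Eqs. (5)-(8): a strided projection
    shortcut h(.) (Conv4: 4x4 kernel, stride 2) plus a strided residual
    branch F(.). Filter counts are illustrative."""
    shortcut = tf.keras.layers.Conv2D(filters, 4, strides=2, padding="same")(x)  # h(X)
    r = tf.keras.layers.Conv2D(filters, 4, strides=2, padding="same")(x)
    r = tf.keras.layers.LeakyReLU()(r)
    r = tf.keras.layers.Conv2D(filters, 3, padding="same")(r)
    return tf.keras.layers.LeakyReLU()(shortcut + r)                              # Phi(h(X)+F(X))

def trapezoid_branch(x, base_filters=32):
    """One stream of DSTMFE: a ResNeXt block at the finest scale
    (cardinality 16, matching the sums in Eqs. (3)/(4)) followed by three
    strided residual blocks, yielding features at four scales."""
    f1 = resnext_block(x, cardinality=16)               # scale 1, Eqs. (3)/(4)
    f2 = strided_residual_block(f1, base_filters * 2)   # scale 2
    f3 = strided_residual_block(f2, base_filters * 4)   # scale 3
    f4 = strided_residual_block(f3, base_filters * 8)   # scale 4
    return f1, f2, f3, f4
```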
2) Recursive Mixed-Scale Feature Fusion
According to the four-scale MS and PAN features generated in the DSTMFE stage, a recursive mixed-scale feature fusion (RMSFF) subnetwork is designed based on residual learning, as illustrated in Fig. 4, comprising prior fusion blocks and mixed-scale fusion blocks. For the four-scale features of the MS and PAN images, the prior fusion block (PFB) is designed to aggregate the information of the MS and PAN images. The PFB facilitates the learning of multimodal information and the fusion of the preliminary features of the MS and PAN images. A “concatenate+Conv3+residual block” mode is employed to build the PFB, as illustrated in Fig. 5(a). Conv3 is a convolution operation followed by an LReLU function that implements the primary fusion and adaptively adjusts the number of channels; the residual block then implements further fusion. The kernel size of the Conv3 and residual block is 3 × 3, and the stride is 1. The numbers of convolution kernels are 32, 64, 128, and 256, respectively. The mixed-scale fusion block (MSFB) performs the fusion of information from different scales, as displayed in Fig. 5(b). The MSFB is constructed using a scale transfer block (STB), concatenation, Conv3, and a residual block, where
(a) Prior fusion block (PFB). (b) Mixed-scale fusion block (MSFB). (c) Multiscale reconstruction block (MRB).
As illustrated in Fig. 4, first, the same-scale features
The expression of the PFB is as follows:
\begin{equation*}
P\_M_{i}=\text{PF}_{i}\left(P_{i}, M_{i}, W_{\text{PF}_{i}}\right) \quad i=1,\ldots, 4 \tag{9}
\end{equation*}
The expression of the MSFB is as follows:
\begin{equation*}
\text{Mix\_f}_{i+4}=\text{MF}_{i}\left(H_{i}, L_{i+1}, W_{\text{MF}_{i}}\right) \quad i=1,2,3,5,6,9 \tag{10}
\end{equation*}
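The following sketch illustrates the PFB of (9) and the MSFB of (10); the residual block layout and the use of 2× nearest-neighbour upsampling for the scale transfer block (STB) are assumptions for illustration.

```python
import tensorflow as tf

def conv3_lrelu(x, filters):
    """Conv3 in the paper: a 3x3, stride-1 convolution followed by LReLU."""
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    return tf.keras.layers.LeakyReLU()(x)

def residual_block(x, filters):
    """Plain 3x3 residual block used inside PFB/MSFB (a sketch)."""
    r = conv3_lrelu(x, filters)
    r = tf.keras.layers.Conv2D(filters, 3, padding="same")(r)
    return tf.keras.layers.LeakyReLU()(x + r)

def prior_fusion_block(p_i, m_i, filters):
    """PFB of Eq. (9): concatenate + Conv3 + residual block.
    `filters` is 32/64/128/256 for scales 1-4."""
    f = tf.keras.layers.Concatenate()([p_i, m_i])
    f = conv3_lrelu(f, filters)
    return residual_block(f, filters)

def mixed_scale_fusion_block(high, low, filters):
    """MSFB of Eq. (10): bring the coarser feature to the finer scale
    (the STB is approximated by 2x nearest-neighbour upsampling, an
    assumption), then concatenate + Conv3 + residual block."""
    low_up = tf.keras.layers.UpSampling2D(size=2)(low)
    f = tf.keras.layers.Concatenate()([high, low_up])
    f = conv3_lrelu(f, filters)
    return residual_block(f, filters)
```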
3) Dual-Stream Multiscale Feature Reconstruction
To obtain more precise reconstruction information, a dual-stream multiscale reconstruction (DSMR) subnetwork is designed to reconstruct the key information
The MRB is presented in Fig. 5(c). Compared with the scale of the information to be reconstructed,
The proposed DSMR structure reuses the extracted low-level features for reconstruction through multiscale skip connections. The low-level features contain rich details, such as edges and contours, which can reduce the loss of details. In this way, the loss of details in the PAN image and MS image is reduced, and the spatial resolution is upgraded.
4) Reconstructed Information Rectification
Owing to the different physical imaging mechanisms of the sensors, the relationship between the MS image and the PAN image is very complex. The band ranges of the MS and PAN images do not exactly overlap, so a linear combination of MS bands cannot accurately express the PAN image [4]. The detail injection model directly adds the injected details to the upsampled MS image, as in (1), ignoring the complex relationship between the PAN and MS images, which may result in spectral distortion. Therefore, we design a “concatenate+Conv1+conv(3 × 3)” mode to construct a simple reconstructed information rectification block (RIRB), which builds a nonlinear injection relation. The RIRB is displayed in the orange box in Fig. 7. The kernel size of Conv1 is
The expression for the generator of the pansharpening model is as follows:
\begin{equation*}
\text{HM}=G_{P}\left(\uparrow M, P, W_{P}\right) \tag{11}
\end{equation*}
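A sketch of the RIRB that contributes to HM in (11); feeding the upsampled MS image together with the reconstructed information, and the intermediate filter count, are assumptions, while the “concatenate+Conv1+conv(3 × 3)” layout follows the description above.

```python
import tensorflow as tf

def rirb(ms_up, recon_info, out_bands=4, mid_filters=64):
    """Sketch of the reconstructed information rectification block (RIRB):
    a "concatenate + Conv1 + conv(3x3)" nonlinear injection replacing the
    direct addition of Eq. (1). Inputs and mid_filters are illustrative."""
    f = tf.keras.layers.Concatenate()([ms_up, recon_info])
    f = tf.keras.layers.Conv2D(mid_filters, 1, padding="same")(f)    # Conv1
    f = tf.keras.layers.LeakyReLU()(f)
    return tf.keras.layers.Conv2D(out_bands, 3, padding="same")(f)   # conv(3x3) -> HM
```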
B. U-Shaped Relative Average Least-Squares Discriminator
To promote the performance and stability of the pansharpening model, we employ a relativistic average discriminator, which estimates the relative probability that a real image is more realistic than a generated one, and optimize it with a least-squares loss function, i.e., the relativistic average least-squares discriminator (RaLSD). The architecture of the RaLSD is similar to that of Real-ESRGAN [40], and a U-shaped structure is used to enhance its capability. The differences are that a residual structure replaces the plain convolution operations, and in the skip connections we utilize a “concatenate+SN(conv1-1)+LReLU” mode instead of the sum operation to increase the discriminative capacity of the network. SN(conv1-1) indicates spectral normalization (SN) [41] applied to a convolution with a kernel size of 1 and a stride of 1. The structure of the proposed U-shaped RaLSD (U-RaLSD) network is illustrated in Fig. 8, which consists of a spectral discriminator
The expressions of the U-RaLSD are given in (12) and (13).
\begin{align*}
D^{{\text{URaLS} }}\left(Z_{m}, Z_{g}\right)=&\sigma \left(C\left(Z_{m}\right)-\mathbb {E}_{Z_{g} \sim \mathcal {Q}}\left[C\left(Z_{g}\right)\right]\right) \tag{12}
\\
D^{{\text{URaLS} }}\left(Z_{g}, Z_{m}\right)=&\sigma \left(C\left(Z_{g}\right)-\mathbb {E}_{Z_{m} \sim \mathcal {R}}\left[C\left(Z_{m}\right)\right]\right) \tag{13}
\end{align*}
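The following sketch shows the relativistic average scoring of (12)–(13) and the “concatenate+SN(conv1-1)+LReLU” skip connection of the U-RaLSD; spectral normalization is omitted in the sketch to keep it dependency-free, which is a simplification.

```python
import tensorflow as tf

def relativistic_average_scores(real_logits, fake_logits):
    """Relativistic average outputs of Eqs. (12)-(13): the critic score C(.)
    of one sample relative to the mean score of the opposite distribution."""
    d_real = tf.sigmoid(real_logits - tf.reduce_mean(fake_logits))   # Eq. (12)
    d_fake = tf.sigmoid(fake_logits - tf.reduce_mean(real_logits))   # Eq. (13)
    return d_real, d_fake

def u_skip_fusion(decoder_feat, encoder_feat, filters):
    """The "concatenate + SN(conv1-1) + LReLU" skip connection replacing the
    usual element-wise sum in the U-shaped discriminator. The 1x1 convolution
    would be wrapped with spectral normalization in practice; it is omitted
    here for brevity."""
    f = tf.keras.layers.Concatenate()([decoder_feat, encoder_feat])
    f = tf.keras.layers.Conv2D(filters, 1, padding="same")(f)
    return tf.keras.layers.LeakyReLU()(f)
```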
C. Composite Loss Function
We build a new composite loss function composed of a spatial consistency loss function, a spectral consistency loss function, a no-reference loss function, and two adversarial loss functions.
The loss function of the spatial consistency is presented as follows:
\begin{align*}
\mathcal {L}_{pc}=&\frac{1}{T} \sum _{t=1}^{T}\left\Vert h\left(\text{HM}_{t}\right)-h\left(P_{t}\right)\right\Vert _{F}^{2}\\
&+\frac{1}{T} \sum _{t=1}^{T}\left\Vert \nabla \left(\text{HM}_{t}\right)-\nabla \left(P_{t}\right)\right\Vert _{F}^{2} \tag{14}
\end{align*}
To maintain the consistency of spectral information between the
\begin{equation*}
\mathcal {L}_{mc}=\frac{1}{T} \sum _{t=1}^{T}\left\Vert ds(\text{HM}_{t})- M_{t}\right\Vert _{F}^{2} \tag{15}
\end{equation*}
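A sketch of the spatial and spectral consistency losses in (14) and (15); here h(·) is taken as the band-averaged intensity and ds(·) as average pooling, both of which are assumptions standing in for the exact operators defined in the article.

```python
import tensorflow as tf

def spatial_consistency_loss(hm, pan):
    """Sketch of Eq. (14). h(.) is approximated by the band-average intensity
    of HM (an assumption) and the gradient term uses tf.image.image_gradients.
    hm: (B, H, W, N), pan: (B, H, W, 1)."""
    intensity = tf.reduce_mean(hm, axis=-1, keepdims=True)
    l_int = tf.reduce_mean(tf.square(intensity - pan))
    dy_hm, dx_hm = tf.image.image_gradients(intensity)
    dy_p, dx_p = tf.image.image_gradients(pan)
    l_grad = tf.reduce_mean(tf.square(dy_hm - dy_p) + tf.square(dx_hm - dx_p))
    return l_int + l_grad

def spectral_consistency_loss(hm, ms, scale=4):
    """Sketch of Eq. (15): downsample HM back to the MS resolution (ds(.)) and
    compare with the observed MS image. Average pooling stands in for the
    sensor's degradation model, which is an assumption."""
    hm_ds = tf.nn.avg_pool2d(hm, ksize=scale, strides=scale, padding="VALID")
    return tf.reduce_mean(tf.square(hm_ds - ms))
```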
Since no reference data are available, we adopt the no-reference index QNR to measure the quality of the generated image. The desired value of QNR is 1, i.e., the generated image has neither spectral loss nor spatial detail loss. Therefore, the expression of the no-reference loss function is as follows:
\begin{equation*}
\mathcal {L}_{q}=1 - \text{QNR} \tag{16}
\end{equation*}
The QNR relates to the spectral loss metric
\begin{equation*}
\text{QNR}=(1 - D_{\lambda })^{l}(1-D_{S})^{v} \tag{17}
\end{equation*}
\begin{equation*}
D_\lambda \!=\!\sqrt{\frac{1}{B(B\!-\!1)} \sum \nolimits_{n=1}^{B} \sum \nolimits_{\substack{k=1 \\
k \ne n}}^{B}\left|Q\left(M_{n}, M_{k}\right)\!-\!Q\left(F_{n}, F_{k}\right)\right|} \tag{18}
\end{equation*}
\begin{equation*}
D_{S}=\sqrt{\frac{1}{B} \sum \nolimits_{n=1}^{B} \left|Q\left(F_{n}, P\right)-Q\left(M_{n}, \widetilde{P}\right)\right|} \tag{19}
\end{equation*}
\begin{equation*}
Q(h,k)=\frac{4 \sigma _{h k} \cdot \bar{h} \cdot \bar{k}}{\left(\sigma _{h}^{2}+\sigma _{k}^{2}\right)\left[(\bar{h})^{2}+(\bar{k})^{2}\right]} \tag{20}
\end{equation*}
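The following numpy sketch computes QNR according to (17)–(20), using a global (non-sliding-window) Q index for brevity and exponents l = v = 1, which are assumptions; `pan_lr` plays the role of the degraded PAN image in (19).

```python
import numpy as np

def uiqi(h, k):
    """Universal image quality index Q of Eq. (20), computed globally
    (the usual sliding-window variant is omitted for brevity)."""
    h, k = h.astype(np.float64).ravel(), k.astype(np.float64).ravel()
    mh, mk = h.mean(), k.mean()
    cov = ((h - mh) * (k - mk)).mean()
    return 4 * cov * mh * mk / ((h.var() + k.var()) * (mh**2 + mk**2))

def qnr(fused, ms, pan, pan_lr, l_exp=1, v_exp=1):
    """QNR of Eqs. (17)-(19). fused: (H, W, B), ms: (h, w, B),
    pan: (H, W), pan_lr: (h, w) PAN degraded to the MS scale."""
    B = ms.shape[-1]
    # spectral distortion D_lambda, Eq. (18)
    d_lambda = 0.0
    for n in range(B):
        for k in range(B):
            if k != n:
                d_lambda += abs(uiqi(ms[..., n], ms[..., k])
                                - uiqi(fused[..., n], fused[..., k]))
    d_lambda = np.sqrt(d_lambda / (B * (B - 1)))
    # spatial distortion D_S, Eq. (19)
    d_s = np.sqrt(sum(abs(uiqi(fused[..., n], pan) - uiqi(ms[..., n], pan_lr))
                      for n in range(B)) / B)
    return (1 - d_lambda) ** l_exp * (1 - d_s) ** v_exp, d_lambda, d_s
```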
We optimize the adversarial model using a relative average least-squares (RaLS) loss function to improve the performance and stability of the model. The adversarial loss of the generator with the
\begin{equation*}
\mathcal {L}_{G}^{{\text{URaLS}}}=\mathcal {L}_{G D_{\text{pe}}}^{{\text{URaLS}}}+\mathcal {L}_{G D_{\text{pa}}}^{{\text{URaLS}}} \tag{21}
\end{equation*}
The expressions for
\begin{align*}
\mathcal {L}_{G D_{\text{pe}}}^{{\text{URaLS}}}=&\mathbb {E}_{M \sim \mathcal {R}}\left[\left(C(M)-\mathbb {E}_{\text{HM}_{\text{pa}} \sim \mathcal {Q}}\left[C\left(\text{HM}_{\text{pa}}\right)\right]+1\right)^{2}\right] \\
&+\mathbb {E}_{\text{HM}_{\text{pa}} \sim \mathcal {Q}}\left[\left(C\left(\text{HM}_{\text{pa}}\right)-\mathbb {E}_{M \sim \mathcal {R}}[C(M)]-1\right)^{2}\right]\\
\tag{22}
\\
\mathcal {L}_{G D_{\text{pa}}}^{{\text{URaLS}}}=& \mathbb {E}_{P \sim \mathcal {R}}\left[\left(C(P)-\mathbb {E}_{\text{HM}_{\text{pe}} \sim \mathcal {Q}}\left[C\left(\text{HM}_{\text{pe}}\right)\right]+1\right)^{2}\right]\\
&+\mathbb {E}_{\text{HM}_{\text{pe}} \sim \mathcal {Q}}\left[\left(C\left(\text{HM}_{\text{pe}}\right)-\mathbb {E}_{P \sim \mathcal {R}}[C(P)]-1\right)^{2}\right]\\
\tag{23}
\end{align*}
The RaLS loss function for the
\begin{align*}
\mathcal {L}_{D_{\text{pe}}}^{{\text{URaLS}}}=&\mathbb {E}_{M \sim \mathcal {R}}\left[\left(C(M)-\mathbb {E}_{\text{HM}_{\text{pa}} \sim \mathcal {Q}}\left[C\left(\text{HM}_{\text{pa}}\right)\right]-1\right)^{2}\right]\\
&+\mathbb {E}_{\text{HM}_{\text{pa}} \sim \mathcal {Q}}\left[\left(C\left(\text{HM}_{\text{pa}}\right)-\mathbb {E}_{M \sim \mathcal {R}}[C(M)]+1\right)^{2}\right]\\
\tag{24}
\\
\mathcal {L}_{D_{\text{pa}}}^{{\text{URaLS}}}=&\mathbb {E}_{P \sim \mathcal {R}}\left[\left(C(P)-\mathbb {E}_{\text{HM}_{\text{pe}} \sim \mathcal {Q}}\left[C\left(\text{HM}_{\text{pe}}\right)\right]-1\right)^{2}\right]\\
&+\mathbb {E}_{\text{HM}_{\text{pe}} \sim \mathcal {Q}}\left[\left(C\left(\text{HM}_{\text{pe}}\right)-\mathbb {E}_{P \sim \mathcal {R}}[C(P)]+1\right)^{2}\right]\\
\tag{25}
\end{align*}
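A sketch of the RaLS adversarial losses in (22)–(25) for a single discriminator; the same two functions apply to the spectral and spatial discriminators with the corresponding real and generated inputs.

```python
import tensorflow as tf

def rals_generator_loss(real_logits, fake_logits):
    """RaLS adversarial loss seen by the generator for one discriminator
    (the form of Eqs. (22)/(23)): reals pushed towards -1 relative to fakes,
    fakes pushed towards +1 relative to reals."""
    mean_real = tf.reduce_mean(real_logits)
    mean_fake = tf.reduce_mean(fake_logits)
    return (tf.reduce_mean(tf.square(real_logits - mean_fake + 1.0))
            + tf.reduce_mean(tf.square(fake_logits - mean_real - 1.0)))

def rals_discriminator_loss(real_logits, fake_logits):
    """RaLS loss for one discriminator (the form of Eqs. (24)/(25)): the
    target signs are flipped with respect to the generator loss."""
    mean_real = tf.reduce_mean(real_logits)
    mean_fake = tf.reduce_mean(fake_logits)
    return (tf.reduce_mean(tf.square(real_logits - mean_fake - 1.0))
            + tf.reduce_mean(tf.square(fake_logits - mean_real + 1.0)))
```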
The total loss function is expressed as
\begin{equation*}
\mathcal {L}_{t}=\lambda \mathcal {L}_{G D_{\text{pe}}}^{{\text{URaLS}}}+\mu \mathcal {L}_{G D_{\text{pa}}}^{{\text{URaLS}}}+\xi \mathcal {L}_{q}+\kappa \mathcal {L}_{pc}+\rho \mathcal {L}_{mc} \tag{26}
\end{equation*}
Experimental Results
A. Datasets
To verify the pansharpening performance of the designed RMFF-UPGAN model, we employ data from Gaofen-2 (GF-2) and QuickBird satellites. For visual observation, red, green, and blue bands are used as R, G, and B channels.
The four-band GF-2 data were acquired over Beijing and Shenyang, China, comprising seven large scenes; one is used for testing and the other six for training and validation. The resolution ratio between the MS and PAN images is 4, the spatial resolutions of the PAN and MS images are 1 and 4 m, respectively, and the radiometric resolution is 10 bits. We generate 12 000 samples by randomly cropping the six training scenes, of which 9600 serve for training and 2400 for validation. We strictly adhere to Wald's protocol [42] to create 286 reduced-resolution and 286 full-resolution testing samples.
The four-band QuickBird data were acquired over Chengdu, Beijing, Shenyang, and Zhengzhou, China, comprising eight large scenes; one is used for testing and the other seven for training and validation. The resolution ratio between the MS and PAN images is 4, the spatial resolutions of the PAN and MS images are 0.6 and 2.4 m, respectively, and the radiometric resolution is 11 bits. We generate 8000 samples by randomly cropping the seven training scenes, of which 6400 serve for training and 1600 for validation. We create 158 reduced-resolution and 158 full-resolution testing samples.
The sizes of the MS and PAN images for training and validation are 64 × 64 × 4 and 256 × 256 × 1. The sizes of the MS and PAN images of the testing data at the degraded and full resolutions are 100 × 100 × 4 and 400 × 400 × 1.
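As a concrete illustration of the reduced-resolution data preparation mentioned above, the following sketch generates a training pair under Wald's protocol, assuming a Gaussian filter as a stand-in for the sensor MTF; the filter width is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def walds_protocol_pair(ms, pan, ratio=4, sigma=1.6):
    """Reduced-resolution pair under Wald's protocol: the original MS and PAN
    images are low-pass filtered and decimated by the resolution ratio, and
    the original MS serves as the reference.
    ms: (h, w, N), pan: (H, W) with H = ratio * h."""
    ms_lp = gaussian_filter(ms, sigma=(sigma, sigma, 0))   # filter spatial axes only
    pan_lp = gaussian_filter(pan, sigma=sigma)
    ms_lr = ms_lp[::ratio, ::ratio, :]                     # degraded MS input
    pan_lr = pan_lp[::ratio, ::ratio]                      # degraded PAN input
    reference = ms                                         # ground truth at reduced resolution
    return ms_lr, pan_lr, reference
```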
B. Quality Evaluation Metrics
To verify the designed RMFF-UPGAN model, we carry out two types of experiments, i.e., reduced-resolution pansharpening and full-resolution pansharpening. In addition, subjective assessment and objective index evaluation of the pansharpening results are conducted. The subjective assessment mainly compares a pansharpened image (PI) with a reference image (RI), judging the retention of spatial details and spectral information. The objective evaluation indexes include full-reference metrics and no-reference indexes.
The full-reference indexes are applied to evaluate the reduced-resolution pansharpening and compare the PI with the RI. The metrics we utilized include the structural correlation coefficient (SCC) [42], structural similarity (SSIM) [43], universal image quality index (UIQI, abbreviated as Q) [44] extended to
The quality with no-reference (QNR) indicators do not need the RI when evaluating the pansharpening performance. They are used to evaluate the pansharpening results of full resolution. The metrics include
The ideal value of the SCC, SSIM, Q, Qn, and QNR is 1, the closer to 1, the better the pansharpening effect is. The ideal value of the SAM, ERGAS,
C. Implementation Details
The implemented framework is TensorFlow, and the experimental setups involve an Intel Xeon CPU and an NVIDIA Tesla V100 PCIe GPU with 16-GB video memory. In the training phase, we optimize the model using the Adam optimizer [50], setting the batch size to 8 and the epoch to 40. The initial learning rate is set to
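A minimal training-setup sketch matching the reported settings (Adam optimizer, batch size 8, 40 epochs); the learning-rate value below is a placeholder, not the value used in this article.

```python
import tensorflow as tf

# Optimizers for the generator and the two discriminators (Adam, as reported).
# The learning rate here is a hypothetical placeholder.
generator_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)
discriminator_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)

BATCH_SIZE = 8   # as reported
EPOCHS = 40      # as reported
```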
D. Reduced Resolution Experiments
To evaluate the pansharpening performance of the presented RMFF-UPGAN model, we carry out comparative experiments with advanced traditional methods and GAN-based methods. The compared traditional methods are GSA [9], BDSD [11], SFIM [12], and MTF-GLP [51]. The GAN-based methods include RED-cGAN [29], PsGAN [30], PGMAN [35], and DIGAN [31]. The comparative experiments are conducted on the GF-2 and QuickBird data at reduced and full resolutions. The averages of the quantitative evaluations of the GF-2 and QuickBird results at reduced resolution are listed in Tables I and II, from which it can be seen that the proposed RMFF-UPGAN is optimal for all six indicators, demonstrating superior pansharpening capability.
We display and analyze the experimental results on GF-2 and QuickBird as follows. For better observation of the differences among the comparative approaches, we represent the differences between the PI and RI via the average spectral difference map (ASDM), i.e., the per-pixel spectral angle between the PI and RI, and the average intensity difference map (AIDM).
The pansharpening results of all the compared models on the GF-2 testing data at the degraded resolution are shown in Fig. 10. Fig. 10(a) is the upsampled degraded-resolution MS image, Fig. 10(b) is the corresponding PAN image, Fig. 10(c)–(k) illustrates the pansharpening results of the GSA, BDSD, SFIM, MTF-GLP, RED-cGAN, PsGAN, PGMAN, DIGAN, and RMFF-UPGAN models, respectively, and Fig. 10(l) is the reference image, i.e., the ground truth (GT). To visualize the details more distinctly, we magnify the contents of the red and yellow boxes in Fig. 10, as depicted in Fig. 11. From Fig. 10, it is evident that the result of the RMFF-UPGAN model is the most similar to the GT regarding both spectrum and structure. From Fig. 11, it is clearly observed that the result of the GSA approach exhibits spectral distortion in comparison with the GT. For the contents in the red box, spectral distortion appearing as a bluish rendering occurs for all the approaches except the proposed RMFF-UPGAN model. Furthermore, the results of the RED-cGAN, PGMAN, and DIGAN approaches suffer from blurred edges. The result of the RMFF-UPGAN approach is visually the closest to the GT. As to the contents in the yellow box, the results of the BDSD, RED-cGAN, and PGMAN are rather fuzzy. Moreover, only the RMFF-UPGAN accurately captures the spectral information in the white ellipse, whereas the other approaches fail to express it precisely.
Visual evaluation of the pansharpening results on the GF-2 testing data at the decreased resolution. (a) EXP. (b) D_P. (c) GSA. (d) BDSD. (e) SFIM. (f) MTF-GLP. (g) RED-cGAN. (h) PsGAN. (i) PGMAN. (j) DIGAN. (k) RMFF-UPGAN. (l) GT.
Enlargement of the contents of the red and yellow boxes in Fig. 10. (a) EXP. (b) D_P. (c) GSA. (d) BDSD. (e) SFIM. (f) MTF-GLP. (g) RED-cGAN. (h) PsGAN. (i) PGMAN. (j) DIGAN. (k) RMFF-UPGAN. (l) GT.
Figs. 12 and 13 depict the ASDM and AIDM, respectively. To facilitate comparison, the differences are highlighted in color; the color bar varies from dark blue to red, and the values vary from 0 to 1. From Figs. 12 and 13, we can clearly observe that the difference between the PI of the RMFF-UPGAN approach and the GT is the smallest, and its fusion performance is optimal.
ASDM between the PI and GT on the GF-2 testing data at the decreased resolution. (a) GSA. (b) BDSD. (c) SFIM. (d) MTF-GLP. (e) RED-cGAN. (f) PsGAN. (g) PGMAN. (h) DIGAN. (i) RMFF-UPGAN.
AIDM between the PI and GT on the GF-2 testing data at the decreased resolution. (a) GSA. (b) BDSD. (c) SFIM. (d) MTF-GLP. (e) RED-cGAN. (f) PsGAN. (g) PGMAN. (h) DIGAN. (i) RMFF-UPGAN.
Table III lists the objective assessment indicators between the PI and GT on the GF-2 testing data at the decreased resolution. In Table III, the bold values indicate that the RMFF-UPGAN is superior for all the indices and has the best pansharpening performance among all the comparative approaches. The SAM and ERGAS of the RMFF-UPGAN approach are the smallest, signifying the least spectral distortion and the retention of more spectral information. Combining Figs. 10–13 with the SCC and SSIM indices in Table III, we can clearly see that the RMFF-UPGAN approach preserves the most structural information, i.e., the definition (sharpness) of its pansharpening results is superior.
The pansharpening results of the compared approaches on the QuickBird testing data at the decreased resolution are exhibited in Fig. 14. The corresponding ASDM and AIDM are presented in Figs. 15 and 16, respectively. The assessment indicators between the PI and GT on the QuickBird testing data at the decreased resolution are reported in Table IV. From Figs. 14 and 15, we can clearly see that there are relatively severe spectral distortions in the results of the GSA, BDSD, SFIM, MTF-GLP, and PGMAN. Combining Figs. 14 and 16, it is apparent that the results of the BDSD, SFIM, MTF-GLP, and PGMAN are relatively blurry. According to the visual presentation in Fig. 14, the results of the RED-cGAN, PsGAN, DIGAN, and RMFF-UPGAN are better; however, Table IV shows that RMFF-UPGAN achieves the optimal results for the Q4, Q, SCC, SSIM, SAM, and ERGAS indices. Consequently, the pansharpening result of the RMFF-UPGAN is optimal.
Visual evaluation of the pansharpening results on the QuickBird testing data at the decreased resolution. (a) EXP. (b) D_P. (c) GSA. (d) BDSD. (e) SFIM. (f) MTF-GLP. (g) RED-cGAN. (h) PsGAN. (i) PGMAN. (j) DIGAN. (k) RMFF-UPGAN. (l) GT.
ASDM between the PI and GT on the QuickBird testing data at the decreased resolution. (a) GSA. (b) BDSD. (c) SFIM. (d) MTF-GLP. (e) RED-cGAN. (f) PsGAN. (g) PGMAN. (h) DIGAN. (i) RMFF-UPGAN.
AIDM between the PI and GT on the QuickBird testing data at the decreased resolution. (a) GSA. (b) BDSD. (c) SFIM. (d) MTF-GLP. (e) RED-cGAN. (f) PsGAN. (g) PGMAN. (h) DIGAN. (i) RMFF-UPGAN.
E. Full-Resolution Experiments
This section presents the pansharpening experiments on the full-resolution testing data of GF-2 and QuickBird. Because there are no reference data, subjective visual estimation and objective index assessment are carried out. Table V presents the average of the objective indicators of the experimental results for the testing data of GF-2 and QuickBird. From Table V, it is evident that the proposed RMFF-UPGAN approach is optimal for the three metrics,
Fig. 17 presents the visual assessment of the pansharpening results of the comparative approaches on the GF-2 testing data at full resolution. Fig. 17(a) is the raw MS image, Fig. 17(b) is the upsampled MS image, Fig. 17(c) is the raw PAN image, and Fig. 17(d)–(l) shows the pansharpening results of all the comparative models. Table VI lists the objective assessment indexes of the PIs of the various comparative models on the GF-2 testing data at full resolution. From Fig. 17, we note that the results of the GSA and MTF-GLP exhibit rather severe spectral distortion, as shown in the yellow ellipse of the figure. The results of the BDSD and SFIM are blurry and exhibit ringing. Compared with the MS image, the results of the RED-cGAN and PsGAN exhibit darker colors. The edges in the result of DIGAN suffer from artifacts, such as the edge in the yellow ellipse in Fig. 17. The pansharpening results of the PGMAN and RMFF-UPGAN models are better in detail retention; however, for the preservation of spectral information, the RMFF-UPGAN approach is superior. Furthermore, from Table VI, it is noticed that the
Visual assessment of the pansharpening results on the GF-2 testing data with the full resolution. (a) MS. (b) EXP. (c) PAN. (d) GSA. (e) BDSD. (f) SFIM. (g) MTF-GLP. (h) RED-cGAN. (i) PsGAN. (j) PGMAN. (k) DIGAN. (l) RMFF-UPGAN.
Fig. 18 shows the visual assessment of the pansharpening results on the QuickBird testing data at full resolution. To provide a better view of the details, we magnify the contents of the red, yellow, and orange rectangles in Fig. 18. The enlargements are depicted in Fig. 19, corresponding to the upper left, upper right, and bottom, respectively. Table VII lists the objective assessment indexes of the pansharpened images on the QuickBird testing data at full resolution. From Figs. 18 and 19, it is noticeable that the resolution of the results is enhanced for all the models. However, there is gross spectral distortion overall in the result of the BDSD approach; spectral distortion (as displayed in the upper right of Fig. 19) and structural distortion (as highlighted in the upper left of Fig. 19) exist in the result of the SFIM approach; and artifacts arise in the results of the SFIM and MTF-GLP approaches. The result of the DIGAN model is rather blurry. The upper-right enlargements of the RED-cGAN, PsGAN, and PGMAN suffer from structural distortion. Considering the retention of both spectral and structural information, the designed RMFF-UPGAN model is preferable. Table VII also validates that the result of the RMFF-UPGAN is the best in
Visual assessment of the results on the QuickBird testing data with the full resolution. (a) MS. (b) EXP. (c) PAN. (d) GSA. (e) BDSD. (f) SFIM. (g) MTF-GLP. (h) RED-cGAN. (i) PsGAN. (j) PGMAN. (k) DIGAN. (l) RMFF-UPGAN.
F. Ablation Experiment
Four ablation experiments are conducted to verify the usefulness of the structure of the presented network, completed on the QuickBird testing data with reduced and full resolutions. The first is to validate the effectiveness of the ResNeXt block in the RMFF-UPGAN model, the second is to prove the usefulness of the CIM in the reconstruction phase, the third is to verify the validity of the RIRB in the RMFF-UPGAN model, and the fourth is to confirm the validity of the U-shaped structure in the discriminator.
1) Effectiveness of the ResNeXt Module
The ResNeXt block is a vital component of the proposed RMFF-UPGAN model. We replace the ResNeXt block with a plain residual block, keeping the other parameters of the network invariant, recorded as NX. The objective metrics in Table VIII reveal that the results of the RMFF-UPGAN are better than those of the NX on the QuickBird testing data with reduced and full resolutions, demonstrating the positive impact of the ResNeXt block on feature extraction.
2) Validity of the CIM
The CIM in the reconstruction stage serves to complement the spectral and structural information. We remove the CIM for comparison, and the parameters of the network remain invariant, denoted as NMF. As presented in Table VIII, the pansharpening property of the RMFF-UPGAN outperforms the NMF on the reduced-resolution data. For the full-resolution data, the NMF is preferable to the RMFF-UPGAN for the
3) Effectiveness of the RIRB
We generate
4) Validity of the U-Shaped Discriminator
We remove the upsampling structure from the U-shaped discriminator and only retain the downsampling component to form a plain discriminator, holding the other parameters of the network invariant, denoted as NU. As presented in Table VIII, the pansharpening capability of the RMFF-UPGAN is better than that of the NU on the reduced and full-resolutions data, signifying that the U-shaped discriminator boosts the capability of the network. Furthermore, from the contribution to enhancing the performance of the network, the CIM contributes the most, followed by the U-shaped discriminator, the ResNeXt block, and the RIRB.
G. Computation and Time
The model complexity is discussed in terms of model parameters and test time, as presented in Table IX. For the traditional models, only the test time is given; both the parameters and the test time are provided for the DL-based models. The test time is the average over the 286 pairs of GF-2 test data at full resolution. The sizes of the MS and PAN images of the GF-2 test data are 100 × 100 × 4 and 400 × 400 × 1, respectively. The experiments of the first four traditional models are run on the abovementioned CPU, and those of the latter five DL-based models are run on the previously mentioned GPU. The unit M of parameters in Table IX denotes
Conclusion
In this article, we model the RMFF-UPGAN to boost spatial resolution and preserve spectral information. RMFF-UPGAN comprises a generator and two U-shaped discriminators. A dual-stream trapezoidal branch is designed in the generator to obtain multiscale information, and the ResNeXt block and residual learning blocks are employed to obtain the spatial structure and spectral information at four scales. Furthermore, a recursive mixed-scale feature fusion subnetwork is designed: prior fusion is performed on the extracted MS and PAN features of the same scale, and mixed-scale fusion is then conducted on the prior fusion results of the fine and coarse scales. The fusion is executed sequentially in this manner, building a recursive mixed-scale fusion structure and finally generating the key information. The CIM is also designed to compensate for information during the reconstruction of the key information. The nonlinear RIRB is developed to overcome the distortion induced by neglecting the complicated relationship between the MS and PAN images. Two U-shaped discriminators are designed and a new composite loss function is defined. The RMFF-UPGAN model is validated on the GF-2 and QuickBird datasets, and the experimental results are better than those of prevalent approaches in terms of both visual assessment and objective indicators. The RMFF-UPGAN model has superior performance in enhancing spatial resolution and retaining spectral information, which boosts the fusion quality. Our future work will investigate unsupervised models to further strengthen the capability of the pansharpening network.