Introduction
Due to physical constraints [1], many satellites, such as QuickBird, GaoFen-1/2, and WorldView-1/2, can only acquire a pair of modalities simultaneously: multispectral (MS) images at a low spatial resolution and panchromatic (PAN) images at a high spatial resolution but a low spectral resolution. Many practical applications, however, require high-resolution (HR) MS images. Pansharpening, which combines the strengths of an MS image and a PAN image to generate an HR MS image, provides a good solution to this problem.
Over the past few decades, researchers in the remote sensing community have developed various methods for pansharpening. These methods, which we call traditional pansharpening methods to distinguish them from the recently proposed deep learning models, can be divided into three main categories: component substitution (CS) based methods, multiresolution analysis (MRA) based methods, and model-based methods. CS methods transform MS images into a new space, replace one component with the spatial component of the PAN image, and then apply an inverse transformation to obtain the pansharpened images. The intensity-hue-saturation technique (IHS-based methods [2]), principal component analysis (PCA-based methods [3], [4]), and Gram–Schmidt orthogonalization (the GS method [5]) are widely adopted transformations. MRA methods apply multiresolution algorithms to extract the spatial information of PAN images and then inject it into MS images; representative methods include the modulation transfer function [6], [7] and smoothing filter based intensity modulation (SFIM [8]). Model-based methods attempt to build interpretable mathematical models relating the input PAN and MS images to the ideal HR MS image and usually solve an optimization problem to estimate the model parameters. A typical example is the band-dependent spatial detail (BDSD [9]) model. These traditional methods are widely used in practice; however, their ability to model highly nonlinear mappings is limited, so they often suffer from spatial or spectral distortions.
Recently, deep learning techniques have achieved great success in various computer vision tasks, from low-level image processing to high-level image understanding [10]–[12]. Convolutional neural networks (CNNs) have shown a powerful ability to model complex nonlinear mappings and excel at image-enhancement problems such as single-image superresolution [13], [14]. Inspired by this, many deep learning models have been developed for pansharpening. PNN [15] introduces SRCNN [16] to pansharpening and designs a fusion network that is also a three-layer CNN. PanNet [17] borrows the idea of skip connections from ResNet [18] to build deeper networks and trains the model in the high-frequency domain to learn the residual between the upsampled low-resolution (LR) MS image and the desired HR MS image. DRPNN [19] also draws on ResNet [18] and designs a deeper CNN with 11 layers. MSDCNN [20] explores multiscale structures in images by using filters of different sizes and combining a shallow network with a deep network. TFNet [21] builds a two-stream fusion network and designs a variant of UNet [22] to solve the problem. PSGAN [23] improves TFNet [21] by using generative adversarial training [24].
These deep learning methods for pansharpening have achieved satisfactory performance. However, they cannot be optimized without supervision and therefore struggle to obtain optimal results on full-resolution images. Specifically, existing works require ideal HR MS images, which do not exist, to train the networks. To optimize the networks, they downsample the PAN and MS images and take the original MS images as targets to form training samples. In the testing stage, evaluations are also conducted on the downsampled images. For remote sensing images, however, this protocol may cause a gap between the downsampled images and the original ones. Unlike natural images, remote sensing images usually have deeper bit depths and distinct pixel distributions. These supervised methods may perform well in the downsampled image domain but generalize poorly to the original full-scale images, which limits their practicality.
To overcome this drawback, we propose an unsupervised generative multiadversarial network, termed PGMAN. PGMAN focuses on unsupervised learning and is trained on the original data without downsampling or any other preprocessing steps, making full use of the original spatial and spectral information. Our method is inspired by CycleGAN [25]. We use a two-stream generator to extract modality-specific features from the PAN and MS images, respectively. Since we have no target images to compute losses against, the only way to verify the quality of the generated images is the consistency property between the pansharpened image and the inputs: the spatially degraded and spectrally degraded versions of the HR MS image should be as close as possible to the MS and PAN images, respectively. To realize this, we build two discriminators: one distinguishes the downsampled fusion results from the input MS images, and the other distinguishes the grayed fusion images from the input PAN images. Furthermore, inspired by the nonreference metric QNR [26], we introduce a novel loss function to boost the quality of the pansharpened images. Our major contributions can be summarized as follows.
We design an unsupervised generative multiadversarial network for pansharpening, termed PGMAN, which can be trained on the full-resolution PAN and MS images without any preprocessing. It takes advantage of the rich spatial and spectral information of the original data and is consistent with the real application environment.
To maintain consistency with the original PAN and MS images, we transform the fusion result back into PAN and LR MS images and design a dual-discriminator architecture to preserve the spatial and spectral information.
Inspired by the QNR metric, we introduce a novel loss to optimize the network under the unsupervised learning framework without reference images.
We conduct extensive experiments on GaoFen-2, QuickBird, and WorldView-3 images to compare our proposed model with state-of-the-art methods. Experimental results demonstrate that the proposed method achieves the best results on full-resolution images, which clearly shows its practical value.
The rest of this article is organized as follows. The related works and background knowledge are introduced in Section II. The details of our proposed method are described in Section III. Section IV shows the experiments and the results. Finally, Section V concludes this article.
Related Work
Deep learning techniques have achieved great success in diverse computer vision tasks, inspiring us to design deep learning models for the pansharpening problem. Observing that pansharpening and single-image superresolution share a similar spirit, and motivated by Dong et al. [16], Masi et al. [15] propose a three-layer CNN-based pansharpening method. Following this work, increasing research effort has been devoted to deep learning based pansharpening. For instance, Zhong et al. [27] present a CNN-based hybrid pansharpening method. Recent studies [28], [29] have suggested that deeper networks achieve better performance on vision tasks. The first attempt at applying residual networks to pansharpening is PanNet [17], which adopts a similar idea to Rao et al. [30] and Wei et al. [9] but employs ResNet [18] to predict the details of the image. In this way, both spatial and spectral information can be preserved well.
Generative adversarial networks (GANs), proposed by Goodfellow et al. [24], have achieved attractive performance in various image generation tasks. The main idea of GANs is to train a generator and a discriminator adversarially: the generator learns to output realistic images to cheat the discriminator, whereas the discriminator learns to distinguish the generated images from real ones. However, stable training of GANs remains difficult. DCGAN [31] introduces CNNs to GANs and removes pooling layers, which improves the performance. LSGAN [32] replaces the sigmoid cross entropy loss with a least squares loss to avoid the vanishing gradient problem. WGAN [33] leverages the Wasserstein distance as the objective function and uses weight clipping to stabilize the training process. WGAN-GP [34] penalizes the norm of the discriminator's gradients with respect to its input instead of using weight clipping. SAGAN [35] adds self-attention modules for long-range dependence modeling. To speed up convergence and ease the training process, we choose WGAN-GP as the basic GAN framework for our model.
Recently, researchers have moved beyond the one-generator-one-discriminator architecture and designed multiple generators and discriminators for difficult tasks. GMAN [36] extends GANs to multiple discriminators and endows them with two roles, formidable adversaries and forgiving teachers: one is a stronger discriminator, whereas the other is weaker. CycleGAN [25] designs two pairs of generators and discriminators and proposes a cycle consistency loss to reduce the space of possible mapping functions. MsCGAN [37] is a multiscale adversarial network consisting of two generators and two discriminators handling different levels of visual features. SinGAN [38] uses a pyramid of generators and discriminators to learn the multiscale patch distribution within a single image. Considering the domain-specific knowledge of pansharpening, we design two discriminators to train against one generator for spectral and spatial preservation.
Method
A. Network Architecture
1) Generator Architecture
We design the generator based on the architecture of TFNet [21] and make the following modifications to further improve the quality of the pansharpened images. First, inspired by PanNet [17], the generator is trained in the high-pass domain, and its output is added to the upsampled LR MS image for better spectral preservation. The high-pass domain of an image generally contains more spatial details, and learning the residual between the LR MS image and the final HR MS image stabilizes the training process. Second, considering that the input PAN and MS images are of different sizes, we build two independent feature extraction (FE) subnetworks. The PAN FE subnetwork has two stride-2 convolutions for downsampling, whereas the MS FE subnetwork has two stride-1 convolutions to maintain the feature map resolution without downsampling. We concatenate the feature maps produced by these two subnetworks and append a residual block [18] to achieve fusion. Finally, two successive fractionally strided convolutions, both with a stride of 1/2, upsample the fused features back to the PAN resolution.
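To make the data flow concrete, the following is a minimal PyTorch sketch of the two-stream generator described above. It is an illustration rather than our exact implementation: the channel widths, activation functions, number of residual blocks, and the box-blur high-pass filter are assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Plain residual block used for fusing the concatenated features."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    def __init__(self, ms_bands=4, ch=32):
        super().__init__()
        # PAN stream: two stride-2 convolutions downsample by 4 overall.
        self.pan_fe = nn.Sequential(
            nn.Conv2d(1, ch, 3, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, 2, 1), nn.LeakyReLU(0.2, inplace=True))
        # MS stream: stride-1 convolutions keep the LR MS resolution.
        self.ms_fe = nn.Sequential(
            nn.Conv2d(ms_bands, ch, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True))
        self.fuse = ResBlock(2 * ch)
        # Two fractionally strided (transposed) convolutions upsample by 4.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(ch, ms_bands, 4, 2, 1))

    def forward(self, pan, lr_ms):
        # High-pass inputs: subtract a low-pass (box-blurred) version of each image.
        pan_hp = pan - F.avg_pool2d(pan, 5, stride=1, padding=2)
        ms_hp = lr_ms - F.avg_pool2d(lr_ms, 5, stride=1, padding=2)
        feats = torch.cat([self.pan_fe(pan_hp), self.ms_fe(ms_hp)], dim=1)
        residual = self.up(self.fuse(feats))
        # The predicted residual is added to the upsampled LR MS image.
        ms_up = F.interpolate(lr_ms, scale_factor=4, mode='bicubic',
                              align_corners=False)
        return ms_up + residual
```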
Architecture of our proposed model, PGMAN. The generator takes the original LR MS and PAN images as inputs and generates the HR MS image. The pansharpened result will be degraded spatially and spectrally to form a pair of fake inputs for multiadversarial learning. According to the feedback from these two discriminators, the generator will minimize the distance between the real and fake distributions and further improve the spatial and spectral quality of the fusion results. The parameters of each layer in the neural network are given on the right side.
2) Discriminator Architecture
We use two discriminators to verify consistency in the pansharpening process. First, we downsample the fused images to the same spatial resolution as the LR MS images and then apply discriminator-1 to distinguish these downsampled results from the real input MS images. Second, we spectrally degrade (gray) the fused images and apply discriminator-2 to distinguish them from the real input PAN images.
As can be seen, our generator and discriminators are fully convolutional, which makes our model easy to train and allows it to accept PAN and MS images of arbitrary sizes in the testing phase.
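The two degradations that feed the discriminators can be summarized by the short sketch below. The exact downsampling kernel and graying weights are implementation details not fixed by the description above, so average pooling and a uniform band average are assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_degrade(fused, ratio=4):
    """Downsample the pansharpened image to the LR MS resolution (fake MS input)."""
    return F.avg_pool2d(fused, kernel_size=ratio, stride=ratio)

def spectral_degrade(fused):
    """Average the spectral bands of the pansharpened image (fake PAN input)."""
    return fused.mean(dim=1, keepdim=True)

# Discriminator-1 compares spatial_degrade(G(pan, ms)) against the real LR MS image;
# discriminator-2 compares spectral_degrade(G(pan, ms)) against the real PAN image.
```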
B. Loss Function
For simplicity and convenience, Table I lists the key notations used in the rest of this article.
1) Q-Loss
Supervised learning-based methods usually employ pixel-wise losses computed against reference HR MS images; such references are not available in our unsupervised setting.
Recall that in the pansharpening paradigm, the interrelationships, measured by the quality index (QI), between any couple of spectral bands of the MS image should remain unchanged after fusion; otherwise, the pansharpened MS image may suffer spectral distortions. Furthermore, the interrelationship between each band of the MS image and a PAN image of the same size should be preserved across scales. Therefore, spatial and spectral consistencies can be calculated directly from the pansharpened image, the LR MS image, and the PAN image without a ground truth. The underlying assumption of this cross-scale consistency is supported by the fact that true HR MS data, whenever available, exhibit spectral and spatial distortions that are both zero, within the approximations of the model, and definitely lower than those attained by any fusion method. To describe this quantitatively, we introduce a loss that requires no ground truth, built on top of QNR [26], which is defined as follows:
\begin{equation*}
\mathcal {L}_{Q} = 1 - \text{QNR}. \tag{1}
\end{equation*}
QNR is the abbreviation of quality with no reference; it combines the spectral distortion index $D_\lambda$ and the spatial distortion index $D_s$ as
\begin{equation*}
\text{QNR} = (1 - D_\lambda) (1 - D_s). \tag{2}
\end{equation*}
The optimal value of QNR is 1, which is attained when both $D_\lambda$ and $D_s$ equal 0. Here, $P$ denotes the pansharpened image, $X$ the LR MS image, $Y$ the PAN image, $\tilde{Y}$ the PAN image degraded to the MS resolution, and $K$ the number of spectral bands. The two distortion indices are computed from the quality index $Q$ as
\begin{equation*}
D_\lambda = \sqrt{ \frac{2}{K(K-1)} \sum _{i=1}^K \sum _{\substack{j=1 \\ j \ne i}}^K \left| Q(P_i, P_j) - Q(X_i, X_j) \right| } \tag{3}
\end{equation*}
\begin{equation*}
Q(x, y) = \frac{4 \sigma _{xy} \cdot \bar{x} \cdot \bar{y} }{ (\sigma _x^2 + \sigma _y^2) (\bar{x}^2 + \bar{y}^2) } \tag{4}
\end{equation*}
\begin{equation*}
D_s = \sqrt{ \frac{1}{K} \sum \nolimits_{i=1}^K | Q(P_i, Y) - Q(X_i, \tilde{Y}) | } \tag{5}
\end{equation*}
This loss function enables us to measure the quality of the fused images using only the input PAN and MS images, without ground truth HR MS images.
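As an illustration, a possible PyTorch implementation of the Q-loss in (1)–(5) is sketched below. For simplicity it computes the Q index globally per band pair rather than over local sliding windows as in the original QNR formulation, so it should be read as an approximation under that assumption.

```python
import torch

def q_index(x, y, eps=1e-8):
    """Universal image quality index between two single-band images of shape (B, H, W)."""
    x = x.flatten(1)
    y = y.flatten(1)
    mx, my = x.mean(dim=1), y.mean(dim=1)
    vx, vy = x.var(dim=1, unbiased=False), y.var(dim=1, unbiased=False)
    cov = ((x - mx[:, None]) * (y - my[:, None])).mean(dim=1)
    return (4 * cov * mx * my) / ((vx + vy) * (mx ** 2 + my ** 2) + eps)

def q_loss(fused, lr_ms, pan, pan_lr):
    """L_Q = 1 - QNR, from the fused image P, LR MS X, PAN Y, and degraded PAN Y~."""
    K = fused.shape[1]
    d_lambda, d_s = 0.0, 0.0
    for i in range(K):
        for j in range(K):
            if i != j:
                d_lambda = d_lambda + torch.abs(
                    q_index(fused[:, i], fused[:, j]) - q_index(lr_ms[:, i], lr_ms[:, j]))
        d_s = d_s + torch.abs(
            q_index(fused[:, i], pan[:, 0]) - q_index(lr_ms[:, i], pan_lr[:, 0]))
    d_lambda = torch.sqrt(2.0 / (K * (K - 1)) * d_lambda)  # normalization follows (3) as written
    d_s = torch.sqrt(d_s / K)                              # follows (5)
    qnr = (1 - d_lambda) * (1 - d_s)
    return (1 - qnr).mean()
```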
2) Adversarial Loss
We design two discriminators, $D_1$ and $D_2$, to enforce spectral and spatial consistency, respectively: $D_1$ compares the spatially degraded fusion result $\tilde{P}$ with the LR MS image $X$, and $D_2$ compares the spectrally degraded (grayed) fusion result $\hat{P}$ with the PAN image $Y$. Combining the adversarial terms with the Q-loss, the generator loss is defined as
\begin{align*}
\mathcal {L}_G =& \frac{1}{N} \sum _{n=1}^N - \alpha D_1(\tilde{P}^{(n)}) - \beta D_2(\hat{P}^{(n)}) \\
&+ \mathcal {L}_{Q} (P^{(n)}, X^{(n)}, Y^{(n)}) \tag{6}
\end{align*}
To stabilize the training, we employ WGAN-GP [34] as the basic framework, i.e., using the Wasserstein distance [33] and applying a gradient penalty, denoted $\mathrm{GP}(\cdot)$ and weighted by $\lambda$, to the discriminators. The loss functions of $D_1$ and $D_2$ are defined as
\begin{align*}
\mathcal {L}_{D_1} =& \frac{1}{N} \sum _{n=1}^N - D_1(X^{(n)}) + D_1(\tilde{P}^{(n)}) \\
&+ \lambda \text{GP}(D_1, X^{(n)}, \tilde{P}^{(n)}) \tag{7}\\
\mathcal {L}_{D_2} =& \frac{1}{N} \sum _{n=1}^N - D_2(Y^{(n)}) + D_2(\hat{P}^{(n)}) \\
&+ \lambda \text{GP}(D_2, Y^{(n)}, \hat{P}^{(n)}) \tag{8}
\end{align*}
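A minimal sketch of how (6)–(8) can be implemented with WGAN-GP is given below. The interpolation scheme follows the standard WGAN-GP recipe, and the default values of lam, alpha, and beta shown here are placeholders rather than the values used in our experiments.

```python
import torch

def gradient_penalty(disc, real, fake):
    """Standard WGAN-GP penalty on gradients w.r.t. interpolated samples."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    inter = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = disc(inter)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=inter,
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def d_losses(d1, d2, lr_ms, pan, fake_ms, fake_pan, lam=10.0):
    """Discriminator losses (7) and (8): Wasserstein terms plus gradient penalties."""
    loss_d1 = (-d1(lr_ms).mean() + d1(fake_ms).mean()
               + lam * gradient_penalty(d1, lr_ms, fake_ms))
    loss_d2 = (-d2(pan).mean() + d2(fake_pan).mean()
               + lam * gradient_penalty(d2, pan, fake_pan))
    return loss_d1, loss_d2

def g_loss(d1, d2, fake_ms, fake_pan, lq, alpha=1.0, beta=1.0):
    """Generator loss (6): weighted adversarial terms plus the Q-loss."""
    return -alpha * d1(fake_ms).mean() - beta * d2(fake_pan).mean() + lq
```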
C. Training Details
Our method is implemented in PyTorch [42] and trained on a single NVIDIA Titan 1080Ti GPU. The batch size and learning rate are set as 8 and 1e-4, respectively. The hyperparameters in (6)–(8) are set as
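For completeness, a schematic training loop is sketched below, reusing the hypothetical helpers from the previous sketches (Generator, spatial_degrade, spectral_degrade, q_loss, d_losses, g_loss). The optimizer choice (Adam) and the single discriminator update per generator update are assumptions.

```python
import torch

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d1_opt = torch.optim.Adam(disc1.parameters(), lr=1e-4)
d2_opt = torch.optim.Adam(disc2.parameters(), lr=1e-4)

for pan, lr_ms in loader:                      # batches of full-resolution PAN/MS pairs
    fused = generator(pan, lr_ms)
    fake_ms, fake_pan = spatial_degrade(fused), spectral_degrade(fused)

    # Update the two discriminators on detached fake samples.
    loss_d1, loss_d2 = d_losses(disc1, disc2, lr_ms, pan,
                                fake_ms.detach(), fake_pan.detach())
    d1_opt.zero_grad(); loss_d1.backward(); d1_opt.step()
    d2_opt.zero_grad(); loss_d2.backward(); d2_opt.step()

    # Update the generator with the adversarial terms and the Q-loss.
    lq = q_loss(fused, lr_ms, pan, spatial_degrade(pan))
    loss_g = g_loss(disc1, disc2, fake_ms, fake_pan, lq)
    g_opt.zero_grad(); loss_g.backward(); g_opt.step()
```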
Experiments and Results
A. Datasets
We conduct extensive experiments on three datasets with images collected from the GaoFen-2, QuickBird, and WorldView-3 satellites. Detailed information about the datasets is given in Table II. To compare the supervised and unsupervised methods, we build the datasets under both reduced-scale (based on Wald's protocol [44]) and full-scale settings. Wald's protocol is widely used for the assessment of pansharpening methods: the original MS and PAN images are spatially degraded before being fed into the models, with a reduction factor equal to the ratio between their spatial resolutions, and the original MS images are used as reference images for comparison. Following previous works [39], [45], we implement it by blurring the full-resolution images with a Gaussian filter and then downsampling them by a factor of 4. Under Wald's protocol, the supervised models can be trained on reduced-resolution images with the original MS images as labels. Under the full-resolution setting, there are no reference images, so only unsupervised models can be trained with the full-resolution images as inputs. Although the training setting depends on the type of model, testing is unconstrained: we can test all models on both reduced-scale and full-scale images regardless of whether they need supervised labels.
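A minimal sketch of this reduced-scale sample construction is shown below. Wald's protocol only prescribes blurring and downsampling by the resolution ratio, so the Gaussian width and the decimation scheme used here are assumptions.

```python
import torch
import torchvision.transforms.functional as TF

def walds_protocol(pan, ms, ratio=4, sigma=1.0):
    """Return (reduced PAN, reduced MS, reference MS) for supervised training."""
    kernel = 2 * int(2 * sigma) + 1                  # odd kernel covering about +/- 2 sigma
    pan_blur = TF.gaussian_blur(pan, [kernel, kernel], [sigma, sigma])
    ms_blur = TF.gaussian_blur(ms, [kernel, kernel], [sigma, sigma])
    pan_lr = pan_blur[..., ::ratio, ::ratio]         # decimate by the resolution ratio
    ms_lr = ms_blur[..., ::ratio, ::ratio]
    return pan_lr, ms_lr, ms                         # the original MS serves as the label
```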
Moreover, because of the very large size of remote sensing images, for example, the PAN and MS images of GaoFen-2 have a size of about
B. Metrics
In order to examine the performance of models on both reduced-scale and full-scale images, we use reference and nonreference metrics, respectively.
1) Nonreference Metrics
At full resolution, there is no ground truth, so we use nonreference metrics, which can be computed without a target image.
2) Reference Metrics
Under Wald's protocol [44], we can also validate models at reduced resolution, where the PAN and MS images are downsampled so that the original MS image serves as the ground truth. Therefore, at reduced resolution, we use the following reference metrics.
The spectral angle mapper (SAM) [49] measures spectral distortions of the pansharpened image compared with the reference image:
\begin{equation*}
\text{SAM}(x_1, x_2) = \arccos {\left(\frac{x_1 \cdot x_2}{\Vert x_1 \Vert \cdot \Vert x_2 \Vert }\right)} \tag{9}
\end{equation*}
where $x_1$ and $x_2$ are two spectral vectors taken from the pansharpened result and the reference image, respectively.

The Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [50], also known as the relative global dimensional synthesis error, is a commonly used global quality index. It is given by
\begin{equation*}
\mathrm{ERGAS} = 100\frac{h}{l}\sqrt{\frac{1}{N}\sum ^N_{i=1}\left(\frac{\mathrm{RMSE}(B_i)}{M(B_i)}\right)^2} \tag{10}
\end{equation*}
where $h$ and $l$ are the spatial resolutions of the PAN and MS images, respectively; $\mathrm{RMSE}(B_i)$ is the root-mean-square error between the $i$th band of the fused image and the reference image; and $M(B_i)$ is the mean value of the original MS band $B_i$.

SSIM [51] is a widely used metric that models loss and distortion according to similarities in luminance, contrast, and structure:
\begin{equation*}
\text{SSIM}(x, y) = \frac{ (2 \mu _x \mu _y + c_1) (2 \sigma _{xy} + c_2) }{ (\mu _x^2 + \mu _y^2 + c_1) (\sigma _x^2 + \sigma _y^2 + c_2) } \tag{11}
\end{equation*}
where $\mu_x$ and $\mu_y$ are the means of $x$ and $y$, respectively; $\sigma_x^2$ and $\sigma_y^2$ are the variances of $x$ and $y$, respectively; $\sigma_{xy}$ is the covariance between $x$ and $y$; and $c_1$ and $c_2$ are fixed constants.
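For reference, the SAM and ERGAS metrics in (9) and (10) can be computed as in the following sketch, assuming tensors of shape (bands, H, W); the numerical epsilons are assumptions, and SSIM implementations are widely available (e.g., structural_similarity in scikit-image) and therefore omitted.

```python
import torch

def sam(fused, ref, eps=1e-8):
    """Mean spectral angle (in radians) between per-pixel spectral vectors."""
    dot = (fused * ref).sum(dim=0)
    angle = torch.acos(torch.clamp(
        dot / (fused.norm(dim=0) * ref.norm(dim=0) + eps), -1.0, 1.0))
    return angle.mean()

def ergas(fused, ref, ratio=4, eps=1e-8):
    """ERGAS = 100 * (h/l) * sqrt(mean_i (RMSE(B_i) / M(B_i))^2), with h/l = 1/ratio."""
    rmse = torch.sqrt(((fused - ref) ** 2).mean(dim=(1, 2)))
    means = ref.mean(dim=(1, 2))
    return 100.0 / ratio * torch.sqrt(((rmse / (means + eps)) ** 2).mean())
```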
C. Comparison With the State of the Art
Fourteen methods, including seven traditional methods, five supervised methods, and two unsupervised methods, are employed for comparison.
Traditional methods include BDSD [9], GS [5], IHS [2], Brovey [46], HPF [47], LMM [48], and SFIM [8].
Supervised methods are PNN [15], DRPNN [19], MSDCNN [20], PanNet [17], and PSGAN [23]. These methods are state-of-the-art supervised deep learning based pansharpening methods. The first four models are CNN-based models and the last one is based on GAN.
Unsupervised methods include Pan-GAN [39] and our proposed PGMAN.
D. Quantitative Results
For the nonreference metrics, $D_\lambda$, $D_s$, and QNR are reported; smaller $D_\lambda$ and $D_s$ values and a larger QNR indicate better quality.
Table III shows the quantitative assessments on GaoFen-2. The best value in each group is in bold font. For the nonreference metrics, our proposed PGMAN takes advantage of unsupervised learning and obtains the best average values on $D_\lambda$, $D_s$, and QNR.
Table IV shows the quantitative assessments on QuickBird. Similar to the results on the GaoFen-2 dataset, our proposed PGMAN maintains its superiority and achieves the best values on all nonreference metrics, exceeding the performance of the other traditional and supervised methods. Among the unsupervised methods, PGMAN also obtains the best value in terms of SAM, while falling slightly behind Pan-GAN on ERGAS and SSIM.
We also evaluate the eight-band images from the WorldView-3 dataset; Table V shows the quantitative assessments. The results show that our proposed PGMAN still maintains its superiority and achieves the best values on all nonreference metrics, demonstrating its effectiveness and great practical value. For the reference metrics, both unsupervised methods fall behind considerably. On the eight-band WorldView-3 dataset, the generalization from the full-scale data to the reduced-scale data may be affected by the additional spectral bands, which remains a challenge for unsupervised methods trained on full-scale data.
Generally, traditional models do not rely on any training data and deliver moderate performance regardless of the testing environment. Supervised methods tend to perform better on the reduced-scale images because they make full use of the supervision information and directly minimize the loss between the predicted pansharpened images and the ground truth images. However, this may cause them to overfit the reduced-scale images and generalize poorly to full-resolution test images; sometimes these methods even perform worse than traditional models. Among the supervised methods, the best nonreference metrics are achieved by PNN (GaoFen-2 dataset, Table III) and PanNet (QuickBird dataset, Table IV), possibly because they have fewer parameters (see Table VIII) and are therefore less prone to overfitting. The unsupervised models take advantage of learning from the full-scale images and obtain the best nonreference metrics. Furthermore, our proposed PGMAN surpasses all other unsupervised methods. Our method significantly improves the results, which indicates that full-resolution images indeed provide rich spatial and spectral information that helps improve the quality of the pansharpened images. Although our method falls behind the supervised methods when tested on reduced-scale images, the results are quite promising considering that no ground truth is used. More importantly, the proposed PGMAN obtains the best QNR when applied to full-resolution images, which shows its great practical value.
E. Visual Results
Fig. 2 shows some example results on GaoFen-2 full-resolution images, produced from the original PAN and LR MS images. The results of Brovey [see Fig. 2(f)] and LMM [see Fig. 2(h)] do not preserve the spectral information of the LR MS image [see Fig. 2(b)] very well and tend to produce notable color distortions. The supervised methods, including PNN [see Fig. 2(j)], DRPNN [see Fig. 2(k)], MSDCNN [see Fig. 2(l)], PanNet [see Fig. 2(m)], and PSGAN [see Fig. 2(n)], also show degraded spatial details. The knowledge learned from reduced-scale images appears hard to generalize to real scenarios. This is because, unlike natural images, remote sensing images usually have deeper bit depths and different pixel distributions; downsampling the original PAN and LR MS images for supervised training changes the data distribution and discards part of the original information. In contrast, when trained on the full-scale images, the quality is improved significantly for PGMAN [see Fig. 2(p)]. This strongly motivates an unsupervised model, since pansharpening is usually performed on full-resolution images in real applications.
Visual results on the GaoFen-2 full-resolution dataset. (a) PAN. (b) LR MS. (c) BDSD [9]. (d) GS [5]. (e) IHS [2]. (f) Brovey [46]. (g) HPF [47]. (h) LMM [48]. (i) SFIM [8]. (j) PNN [15]. (k) DRPNN [19]. (l) MSDCNN [20]. (m) PanNet [17]. (n) PSGAN [23]. (o) Pan-GAN [39]. (p) PGMAN.
Fig. 3 shows some example results on QuickBird full-resolution images. GS [see Fig. 3(d)], IHS [see Fig. 3(e)], and Brovey [see Fig. 3(f)] distort the spectral information of the LR MS image [see Fig. 3(b)], as can be seen from the color of the roof. The supervised methods, including PNN [see Fig. 3(j)] and MSDCNN [see Fig. 3(l)], also produce unsatisfactory results with some distortions. Our method generates a pansharpened image [see Fig. 3(p)] with good spatial and spectral quality. The visual results on these two datasets demonstrate the superiority of our proposed model and show the advantage of unsupervised models trained on the original data distribution.
Visual results on the QuickBird full-resolution dataset. (a) PAN. (b) LR MS. (c) BDSD [9]. (d) GS [5]. (e) IHS [2]. (f) Brovey [46]. (g) HPF [47]. (h) LMM [48]. (i) SFIM [8]. (j) PNN [15]. (k) DRPNN [19]. (l) MSDCNN [20]. (m) PanNet [17]. (n) PSGAN [23]. (o) Pan-GAN [39]. (p) PGMAN.
F. Ablation Study
We conduct ablation studies on GaoFen-2 dataset to verify the effectiveness of the proposed Q-loss. The quantitative results are listed in Table VI.
It is observed that using only the adversarial loss to optimize the network degrades the performance except for D
Furthermore, we also conduct parameter analysis on GaoFen-2 dataset, where we try six different combinations of
G. Efficiency Study
We evaluate the computational time and memory cost of our method and the comparison methods. All traditional methods are implemented in MATLAB and run on an Intel Core i7-7700HQ CPU, and the deep models are implemented in PyTorch and tested on a single NVIDIA GeForce RTX 2080Ti GPU. Table VIII shows the inference speed and number of parameters of all comparative models. The inference speed is evaluated on 286 test samples, and each pair consists of a PAN image with
BDSD is the slowest method, requiring about 1.2613 s per image. IHS and Brovey require only about 0.01 s and are the fastest among the traditional methods.
Benefiting from advances in GPU architectures, the inference time of the deep learning models is satisfactory: almost all deep models require less than 0.001 s to pansharpen one image, except for DRPNN, which is the largest deep model with a deeper network architecture and much larger convolution filters. PNN is the fastest since it consists of only three convolution layers. PGMAN is efficient in both model size and inference speed.
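As an illustration of how such measurements can be obtained, the following sketch counts parameters and times inference in PyTorch. The warm-up and synchronization details are generic conventions rather than the exact benchmarking code used here.

```python
import time
import torch

def count_parameters(model):
    """Total number of learnable parameters of a model."""
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def time_inference(model, pan, lr_ms, runs=100):
    """Average per-image inference time in seconds on a fixed input pair."""
    model.eval()
    model(pan, lr_ms)                      # warm-up pass
    if pan.is_cuda:
        torch.cuda.synchronize()           # make sure queued GPU work is finished
    start = time.time()
    for _ in range(runs):
        model(pan, lr_ms)
    if pan.is_cuda:
        torch.cuda.synchronize()
    return (time.time() - start) / runs
```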
Conclusion
In this article, we propose a novel unsupervised generative multiadversarial model for pansharpening, called PGMAN. PGMAN consists of one generator and two discriminators to reduce both spectral and spatial distortions. To further improve the performance, we also introduce a QNR-related loss into the unsupervised framework. As one of the advantages of unsupervised methods, our proposed method can be trained on either reduced-scale or full-scale images without ground truth. Our model operates on the original PAN and LR MS images without any preprocessing step, keeping it consistent with the real application environment. The attractive performance in full-scale testing and the satisfactory results on reduced-scale images demonstrate the capability of our proposed model.
However, there is still a gap between the supervised and unsupervised models on the reference metrics. When we use the GAN-based framework for spectral preservation, we simply downsample the output to form a fake LR MS image for the discriminator to recognize. The choice of downsampling operator is itself an ill-posed problem and may affect the performance of the model; better ways to model the degradation process remain to be explored. In our future work, we will continue to study the architecture of unsupervised pansharpening models and further improve their performance.