Introduction
Due to physical constraints [1], many satellites, such as QuickBird, GaoFen-1/2, and WorldView-1/2, can only acquire a pair of modalities simultaneously: multispectral (MS) images at a low spatial resolution and panchromatic (PAN) images at a high spatial resolution but a low spectral resolution. Many practical applications, however, require high-resolution (HR) MS images. Pansharpening, which combines the strengths of an MS image and a PAN image to generate an HR MS image, provides a good solution to this problem.
Over the past few decades, researchers in the remote sensing community have developed various methods for pansharpening. These methods, which we call traditional pansharpening methods to distinguish them from the recently proposed deep learning models, can be divided into three main categories: component substitution (CS) based methods, multiresolution analysis (MRA) based methods, and model-based methods. CS methods transform MS images into a new space, replace one component with the spatial component of the PAN image, and then apply an inverse transformation to obtain the pansharpened images. The intensity-hue-saturation technique (IHS-based methods [2]), principal component analysis (PCA-based methods [3], [4]), and Gram–Schmidt orthogonalization (the GS method [5]) are widely adopted transformations. MRA methods apply multiresolution algorithms to extract the spatial information of PAN images and then inject it into MS images; representative methods include the modulation transfer function [6], [7] and smoothing filter based intensity modulation (SFIM [8]). Model-based methods attempt to build interpretable mathematical models relating the input PAN and MS images to the ideal HR MS image and usually solve an optimization problem to estimate the model parameters. A typical example is the band-dependent spatial detail (BDSD [9]) model. These traditional methods are widely used in practice; however, their ability to model highly nonlinear mappings is limited, so they often suffer from spatial or spectral distortions.
Recently, deep learning techniques have achieved great success in various computer vision tasks, from low-level image processing to high-level image understanding [10]–[12]. Convolutional neural networks (CNNs) have shown a powerful ability to model complex nonlinear mappings and excel at image-enhancement problems such as single-image superresolution [13], [14]. Inspired by this, many deep learning models have been developed for pansharpening. PNN [15] introduces SRCNN [16] to pansharpening and designs a fusion network that is also a three-layer CNN. PanNet [17] borrows the idea of skip connections from ResNet [18] to build deeper networks and trains the model in the high-frequency domain to learn the residual between the upsampled low-resolution (LR) MS image and the desired HR MS image. DRPNN [19] also draws on ResNet [18] and designs a deeper CNN with 11 layers. MSDCNN [20] explores multiscale structures in images by using filters of different sizes and combining a shallow network with a deep network. TFNet [21] builds a two-stream fusion network and designs a variant of UNet [22] to solve the problem. PSGAN [23] improves TFNet [21] by using generative adversarial training [24].
These deep learning methods for pansharpening have achieved satisfactory performance. However, they cannot be optimized without supervision and therefore struggle to obtain optimal results on full-resolution images. Specifically, existing works require ideal HR MS images, which do not exist, to train the networks. To optimize the networks, they downsample the PAN and MS images and take the original MS images as targets to form training samples. In the testing stage, evaluations are also conducted on the downsampled images. For remote sensing images, however, this protocol may cause a gap between the downsampled images and the original ones. Unlike natural images, remote sensing images usually have deeper bit depths and distinct pixel distributions. These supervised methods may perform well in the downsampled image domain but generalize poorly to the original full-scale images, which limits their practicality.
To overcome this drawback, we propose an unsupervised generative multiadversarial network, termed PGMAN. PGMAN focuses on unsupervised learning and is trained on the original data without downsampling or any other preprocessing steps, making full use of the original spatial and spectral information. Our method is inspired by CycleGAN [25]. We use a two-stream generator to extract modality-specific features from the PAN and MS images, respectively. Since we have no target images to compute losses against, the only way to verify the quality of the generated images is the consistency property between the pansharpened image and the inputs: the spatially degraded and spectrally degraded versions of the HR MS image should be as close as possible to the MS and PAN images, respectively. To realize this, we build two discriminators: one distinguishes the downsampled fusion results from the input MS images, and the other distinguishes the grayed fusion images from the input PAN images. Furthermore, inspired by the nonreference metric QNR [26], we introduce a novel loss function to boost the quality of the pansharpened images. Our major contributions can be summarized as follows.
We design an unsupervised generative multiadversarial network for pansharpening, termed PGMAN, which can be trained on the full-resolution PAN and MS images without any preprocessing. It takes advantage of the rich spatial and spectral information of the original data and is consistent with the real application environment.
To maintain consistency with the original PAN and MS images, we transform the fusion result back into PAN and LR MS images and design a dual-discriminator architecture to preserve the spatial and spectral information.
Inspired by the QNR metric, we introduce a novel loss to optimize the network under the unsupervised learning framework without reference images.
We conduct extensive experiments on GaoFen-2, QuickBird, and WorldView-3 images to compare our proposed model with state-of-the-art methods. Experimental results demonstrate that the proposed method achieves the best results on full-resolution images, which clearly shows its practical value.
The rest of this article is organized as follows. The related works and background knowledge are introduced in Section II. The details of our proposed method are described in Section III. Section IV shows the experiments and the results. Finally, Section V concludes this article.
Related Work
Deep learning techniques have achieved great success in diverse computer vision tasks, inspiring us to design deep learning models for the pansharpening problem. Observing that pansharpening and single-image superresolution share a similar spirit, and motivated by Dong et al. [16], Masi et al. [15] propose a three-layer CNN-based pansharpening method. Following this work, increasing research effort has been devoted to deep learning based pansharpening. For instance, Zhong et al. [27] present a CNN-based hybrid pansharpening method. Recent studies [28], [29] have suggested that deeper networks achieve better performance on vision tasks. The first attempt at applying residual networks to pansharpening is PanNet [17], which adopts a similar idea to Rao et al. [30] and Wei et al. [9] but employs ResNet [18] to predict the details of the image. In this way, both spatial and spectral information can be preserved well.
Generative adversarial networks (GANs), proposed by Goodfellow et al. [24], have achieved attractive performance in various image generation tasks. The main idea of GANs is to train a generator and a discriminator adversarially: the generator learns to output realistic images to cheat the discriminator, whereas the discriminator learns to distinguish the generated images from real ones. However, stable training of GANs remains difficult. DCGAN [31] introduces CNNs to GANs and removes pooling layers, which improves the performance. LSGAN [32] replaces the sigmoid cross entropy loss with a least squares loss to avoid the vanishing gradient problem. WGAN [33] leverages the Wasserstein distance as the objective function and uses weight clipping to stabilize the training process. WGAN-GP [34] penalizes the norm of the discriminator's gradients with respect to its input instead of using weight clipping. SAGAN [35] adds self-attention modules for long-range dependence modeling. To speed up convergence and ease the training process, we choose WGAN-GP as the basic GAN framework for our model.
Recently, researchers have moved beyond the one-generator-one-discriminator architecture and designed multiple generators and discriminators for difficult tasks. GMAN [36] extends GANs to multiple discriminators and endows them with two roles, formidable adversaries and forgiving teachers: one is a stronger discriminator, whereas the other is weaker. CycleGAN [25] designs two pairs of generators and discriminators and proposes a cycle consistency loss to reduce the space of possible mapping functions. MsCGAN [37] is a multiscale adversarial network consisting of two generators and two discriminators handling different levels of visual features. SinGAN [38] uses a pyramid of generators and discriminators to learn the multiscale patch distribution within a single image. Considering the domain-specific knowledge of pansharpening, we design two discriminators to train against one generator for spectral and spatial preservation.
Method
A. Network Architecture
1) Generator Architecture
We design the generator based on the architecture of TFNet [21] and make the following modifications to further improve the quality of the pansharpened images. First, inspired by PanNet [17], the generator is trained in the high-pass domain, and its output is added to the upsampled LR MS image for better spectral preservation. The high-pass domain of an image generally contains more spatial details, and learning the residual between the LR MS image and the final HR MS image stabilizes the training process. Second, considering that the input PAN and MS images are of different sizes, we build two independent feature extraction (FE) subnetworks. The PAN FE subnetwork has two stride-2 convolutions for downsampling, whereas the MS FE subnetwork has two stride-1 convolutions to maintain the feature map resolution without downsampling. We concatenate the feature maps produced by these two subnetworks and append a residual block [18] to achieve fusion. Finally, two successive fractionally strided convolutions, both with a stride of 1/2, upsample the fused features back to the PAN resolution.
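To make the data flow concrete, the following is a minimal PyTorch sketch of the two-stream generator described above. It is an illustration rather than our exact implementation: the channel widths, activation functions, number of residual blocks, and the box-blur high-pass filter are assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Plain residual block used for fusing the concatenated features."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    def __init__(self, ms_bands=4, ch=32):
        super().__init__()
        # PAN stream: two stride-2 convolutions downsample by 4 overall.
        self.pan_fe = nn.Sequential(
            nn.Conv2d(1, ch, 3, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, 2, 1), nn.LeakyReLU(0.2, inplace=True))
        # MS stream: stride-1 convolutions keep the LR MS resolution.
        self.ms_fe = nn.Sequential(
            nn.Conv2d(ms_bands, ch, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True))
        self.fuse = ResBlock(2 * ch)
        # Two fractionally strided (transposed) convolutions upsample by 4.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(ch, ms_bands, 4, 2, 1))

    def forward(self, pan, lr_ms):
        # High-pass inputs: subtract a low-pass (box-blurred) version of each image.
        pan_hp = pan - F.avg_pool2d(pan, 5, stride=1, padding=2)
        ms_hp = lr_ms - F.avg_pool2d(lr_ms, 5, stride=1, padding=2)
        feats = torch.cat([self.pan_fe(pan_hp), self.ms_fe(ms_hp)], dim=1)
        residual = self.up(self.fuse(feats))
        # The predicted residual is added to the upsampled LR MS image.
        ms_up = F.interpolate(lr_ms, scale_factor=4, mode='bicubic',
                              align_corners=False)
        return ms_up + residual
```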
Architecture of our proposed model, PGMAN. The generator takes the original LR MS and PAN images as inputs and generates the HR MS image. The pansharpened result will be degraded spatially and spectrally to form a pair of fake inputs for multiadversarial learning. According to the feedback from these two discriminators, the generator will minimize the distance between the real and fake distributions and further improve the spatial and spectral quality of the fusion results. The parameters of each layer in the neural network are given on the right side.
2) Discriminator Architecture
We use two discriminators to verify consistency in the pansharpening process. First, we downsample the fused images to the same spatial resolution as the LR MS images and then apply discriminator-1 to distinguish these downsampled results from the real input MS images. Second, we spectrally degrade (gray) the fused images and apply discriminator-2 to distinguish them from the real input PAN images.
As can be seen, our generator and discriminators are fully convolutional, which makes our model easy to train and allows it to accept PAN and MS images of arbitrary sizes in the testing phase.
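The two degradations that feed the discriminators can be summarized by the short sketch below. The exact downsampling kernel and graying weights are implementation details not fixed by the description above, so average pooling and a uniform band average are assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_degrade(fused, ratio=4):
    """Downsample the pansharpened image to the LR MS resolution (fake MS input)."""
    return F.avg_pool2d(fused, kernel_size=ratio, stride=ratio)

def spectral_degrade(fused):
    """Average the spectral bands of the pansharpened image (fake PAN input)."""
    return fused.mean(dim=1, keepdim=True)

# Discriminator-1 compares spatial_degrade(G(pan, ms)) against the real LR MS image;
# discriminator-2 compares spectral_degrade(G(pan, ms)) against the real PAN image.
```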
B. Loss Function
For simplicity and convenience, Table I lists the key notations used in the rest of this article.
1) Q-Loss
Supervised learning-based methods usually employ pixel-wise losses computed against reference HR MS images; such references are not available in our unsupervised setting.
Recall that in the pansharpening paradigm, the interrelationships, measured by the quality index (QI), between any couple of spectral bands of the MS image should remain unchanged after fusion; otherwise, the pansharpened MS image may suffer spectral distortions. Furthermore, the interrelationship between each band of the MS image and a PAN image of the same size should be preserved across scales. Therefore, spatial and spectral consistencies can be calculated directly from the pansharpened image, the LR MS image, and the PAN image without a ground truth. The underlying assumption of this cross-scale consistency is supported by the fact that true HR MS data, whenever available, exhibit spectral and spatial distortions that are both zero, within the approximations of the model, and definitely lower than those attained by any fusion method. To describe this quantitatively, we introduce a loss that requires no ground truth, built on top of QNR [26], which is defined as follows:
\begin{equation*}
\mathcal {L}_{Q} = 1 - \text{QNR}. \tag{1}
\end{equation*}
QNR is the abbreviation of quality with no reference; it combines the spectral distortion index $D_\lambda$ and the spatial distortion index $D_s$ as
\begin{equation*}
\text{QNR} = (1 - D_\lambda) (1 - D_s). \tag{2}
\end{equation*}
The optimal value of QNR is 1, which is attained when both $D_\lambda$ and $D_s$ equal 0. Here, $P$ denotes the pansharpened image, $X$ the LR MS image, $Y$ the PAN image, $\tilde{Y}$ the PAN image degraded to the MS resolution, and $K$ the number of spectral bands. The two distortion indices are computed from the quality index $Q$ as
\begin{equation*}
D_\lambda = \sqrt{ \frac{2}{K(K-1)} \sum _{i=1}^K \sum _{\substack{j=1 \\ j \ne i}}^K \left| Q(P_i, P_j) - Q(X_i, X_j) \right| } \tag{3}
\end{equation*}
\begin{equation*}
Q(x, y) = \frac{4 \sigma _{xy} \cdot \bar{x} \cdot \bar{y} }{ (\sigma _x^2 + \sigma _y^2) (\bar{x}^2 + \bar{y}^2) } \tag{4}
\end{equation*}
\begin{equation*}
D_s = \sqrt{ \frac{1}{K} \sum \nolimits_{i=1}^K | Q(P_i, Y) - Q(X_i, \tilde{Y}) | } \tag{5}
\end{equation*}
This loss function enables us to measure the quality of the fused images using only the input PAN and MS images, without ground truth HR MS images.
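As an illustration, a possible PyTorch implementation of the Q-loss in (1)–(5) is sketched below. For simplicity it computes the Q index globally per band pair rather than over local sliding windows as in the original QNR formulation, so it should be read as an approximation under that assumption.

```python
import torch

def q_index(x, y, eps=1e-8):
    """Universal image quality index between two single-band images of shape (B, H, W)."""
    x = x.flatten(1)
    y = y.flatten(1)
    mx, my = x.mean(dim=1), y.mean(dim=1)
    vx, vy = x.var(dim=1, unbiased=False), y.var(dim=1, unbiased=False)
    cov = ((x - mx[:, None]) * (y - my[:, None])).mean(dim=1)
    return (4 * cov * mx * my) / ((vx + vy) * (mx ** 2 + my ** 2) + eps)

def q_loss(fused, lr_ms, pan, pan_lr):
    """L_Q = 1 - QNR, from the fused image P, LR MS X, PAN Y, and degraded PAN Y~."""
    K = fused.shape[1]
    d_lambda, d_s = 0.0, 0.0
    for i in range(K):
        for j in range(K):
            if i != j:
                d_lambda = d_lambda + torch.abs(
                    q_index(fused[:, i], fused[:, j]) - q_index(lr_ms[:, i], lr_ms[:, j]))
        d_s = d_s + torch.abs(
            q_index(fused[:, i], pan[:, 0]) - q_index(lr_ms[:, i], pan_lr[:, 0]))
    d_lambda = torch.sqrt(2.0 / (K * (K - 1)) * d_lambda)  # normalization follows (3) as written
    d_s = torch.sqrt(d_s / K)                              # follows (5)
    qnr = (1 - d_lambda) * (1 - d_s)
    return (1 - qnr).mean()
```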
2) Adversarial Loss
We design two discriminators, $D_1$ and $D_2$, to enforce spectral and spatial consistency, respectively: $D_1$ compares the spatially degraded fusion result $\tilde{P}$ with the LR MS image $X$, and $D_2$ compares the spectrally degraded (grayed) fusion result $\hat{P}$ with the PAN image $Y$. Combining the adversarial terms with the Q-loss, the generator loss is defined as
\begin{align*}
\mathcal {L}_G =& \frac{1}{N} \sum _{n=1}^N - \alpha D_1(\tilde{P}^{(n)}) - \beta D_2(\hat{P}^{(n)}) \\
&+ \mathcal {L}_{Q} (P^{(n)}, X^{(n)}, Y^{(n)}) \tag{6}
\end{align*}
To stabilize the training, we employ WGAN-GP [34] as the basic framework, i.e., using the Wasserstein distance [33] and applying a gradient penalty, denoted $\mathrm{GP}(\cdot)$ and weighted by $\lambda$, to the discriminators. The loss functions of $D_1$ and $D_2$ are defined as
\begin{align*}
\mathcal {L}_{D_1} =& \frac{1}{N} \sum _{n=1}^N - D_1(X^{(n)}) + D_1(\tilde{P}^{(n)}) \\
&+ \lambda \text{GP}(D_1, X^{(n)}, \tilde{P}^{(n)}) \tag{7}\\
\mathcal {L}_{D_2} =& \frac{1}{N} \sum _{n=1}^N - D_2(Y^{(n)}) + D_2(\hat{P}^{(n)}) \\
&+ \lambda \text{GP}(D_2, Y^{(n)}, \hat{P}^{(n)}) \tag{8}
\end{align*}
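A minimal sketch of how (6)–(8) can be implemented with WGAN-GP is given below. The interpolation scheme follows the standard WGAN-GP recipe, and the default values of lam, alpha, and beta shown here are placeholders rather than the values used in our experiments.

```python
import torch

def gradient_penalty(disc, real, fake):
    """Standard WGAN-GP penalty on gradients w.r.t. interpolated samples."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    inter = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = disc(inter)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=inter,
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def d_losses(d1, d2, lr_ms, pan, fake_ms, fake_pan, lam=10.0):
    """Discriminator losses (7) and (8): Wasserstein terms plus gradient penalties."""
    loss_d1 = (-d1(lr_ms).mean() + d1(fake_ms).mean()
               + lam * gradient_penalty(d1, lr_ms, fake_ms))
    loss_d2 = (-d2(pan).mean() + d2(fake_pan).mean()
               + lam * gradient_penalty(d2, pan, fake_pan))
    return loss_d1, loss_d2

def g_loss(d1, d2, fake_ms, fake_pan, lq, alpha=1.0, beta=1.0):
    """Generator loss (6): weighted adversarial terms plus the Q-loss."""
    return -alpha * d1(fake_ms).mean() - beta * d2(fake_pan).mean() + lq
```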
C. Training Details
Our method is implemented in PyTorch [42] and trained on a single NVIDIA Titan 1080Ti GPU. The batch size and learning rate are set as 8 and 1e-4, respectively. The hyperparameters in (6)–(8) are set as
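For completeness, a schematic training loop is sketched below, reusing the hypothetical helpers from the previous sketches (Generator, spatial_degrade, spectral_degrade, q_loss, d_losses, g_loss). The optimizer choice (Adam) and the single discriminator update per generator update are assumptions.

```python
import torch

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d1_opt = torch.optim.Adam(disc1.parameters(), lr=1e-4)
d2_opt = torch.optim.Adam(disc2.parameters(), lr=1e-4)

for pan, lr_ms in loader:                      # batches of full-resolution PAN/MS pairs
    fused = generator(pan, lr_ms)
    fake_ms, fake_pan = spatial_degrade(fused), spectral_degrade(fused)

    # Update the two discriminators on detached fake samples.
    loss_d1, loss_d2 = d_losses(disc1, disc2, lr_ms, pan,
                                fake_ms.detach(), fake_pan.detach())
    d1_opt.zero_grad(); loss_d1.backward(); d1_opt.step()
    d2_opt.zero_grad(); loss_d2.backward(); d2_opt.step()

    # Update the generator with the adversarial terms and the Q-loss.
    lq = q_loss(fused, lr_ms, pan, spatial_degrade(pan))
    loss_g = g_loss(disc1, disc2, fake_ms, fake_pan, lq)
    g_opt.zero_grad(); loss_g.backward(); g_opt.step()
```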
Experiments and Results
A. Datasets
We conduct extensive experiments on three datasets with images collected from the GaoFen-2, QuickBird, and WorldView-3 satellites. Detailed information about the datasets is given in Table II. To compare the supervised and unsupervised methods, we build the datasets under both reduced-scale (based on Wald's protocol [44]) and full-scale settings. Wald's protocol is widely used for the assessment of pansharpening methods: the original MS and PAN images are spatially degraded before being fed into the models, with a reduction factor equal to the ratio between their spatial resolutions, and the original MS images are used as reference images for comparison. Following previous works [39], [45], we implement it by blurring the full-resolution images with a Gaussian filter and then downsampling them by a factor of 4. Under Wald's protocol, the supervised models can be trained on reduced-resolution images with the original MS images as labels. Under the full-resolution setting, there are no reference images, so only unsupervised models can be trained with the full-resolution images as inputs. Although the training setting depends on the type of model, testing is unconstrained: we can test all models on both reduced-scale and full-scale images regardless of whether they need supervised labels.
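A minimal sketch of this reduced-scale sample construction is shown below. Wald's protocol only prescribes blurring and downsampling by the resolution ratio, so the Gaussian width and the decimation scheme used here are assumptions.

```python
import torch
import torchvision.transforms.functional as TF

def walds_protocol(pan, ms, ratio=4, sigma=1.0):
    """Return (reduced PAN, reduced MS, reference MS) for supervised training."""
    kernel = 2 * int(2 * sigma) + 1                  # odd kernel covering about +/- 2 sigma
    pan_blur = TF.gaussian_blur(pan, [kernel, kernel], [sigma, sigma])
    ms_blur = TF.gaussian_blur(ms, [kernel, kernel], [sigma, sigma])
    pan_lr = pan_blur[..., ::ratio, ::ratio]         # decimate by the resolution ratio
    ms_lr = ms_blur[..., ::ratio, ::ratio]
    return pan_lr, ms_lr, ms                         # the original MS serves as the label
```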
Moreover, because of the very large size of remote sensing images, for example, the PAN and MS images of GaoFen-2 have a size of about
B. Metrics
In order to examine the performance of models on both reduced-scale and full-scale images, we use reference and nonreference metrics, respectively.
1) Nonreference Metrics
At full resolution, there is no ground truth, so we use nonreference metrics, which can be computed without a target image.
2) Reference Metrics
Under Wald's protocol [44], we can also validate models at reduced resolution, where the PAN and MS images are downsampled so that the original MS image serves as the ground truth. Therefore, at reduced resolution, we use the following reference metrics.
The spectral angle mapper (SAM) [49] measures spectral distortions of the pansharpened image compared with the reference image:
\begin{equation*}
\text{SAM}(x_1, x_2) = \arccos {\left(\frac{x_1 \cdot x_2}{\Vert x_1 \Vert \cdot \Vert x_2 \Vert }\right)} \tag{9}
\end{equation*}
where $x_1$ and $x_2$ are two spectral vectors taken from the pansharpened result and the reference image, respectively.

The Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [50], also known as the relative global dimensional synthesis error, is a commonly used global quality index. It is given by
\begin{equation*}
\mathrm{ERGAS} = 100\frac{h}{l}\sqrt{\frac{1}{N}\sum ^N_{i=1}\left(\frac{\mathrm{RMSE}(B_i)}{M(B_i)}\right)^2} \tag{10}
\end{equation*}
where $h$ and $l$ are the spatial resolutions of the PAN and MS images, respectively; $\mathrm{RMSE}(B_i)$ is the root-mean-square error between the $i$th band of the fused image and the reference image; and $M(B_i)$ is the mean value of the original MS band $B_i$.

SSIM [51] is a widely used metric that models loss and distortion according to similarities in luminance, contrast, and structure:
\begin{equation*}
\text{SSIM}(x, y) = \frac{ (2 \mu _x \mu _y + c_1) (2 \sigma _{xy} + c_2) }{ (\mu _x^2 + \mu _y^2 + c_1) (\sigma _x^2 + \sigma _y^2 + c_2) } \tag{11}
\end{equation*}
where $\mu_x$ and $\mu_y$ are the means of $x$ and $y$, respectively; $\sigma_x^2$ and $\sigma_y^2$ are the variances of $x$ and $y$, respectively; $\sigma_{xy}$ is the covariance between $x$ and $y$; and $c_1$ and $c_2$ are fixed constants.
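For reference, the SAM and ERGAS metrics in (9) and (10) can be computed as in the following sketch, assuming tensors of shape (bands, H, W); the numerical epsilons are assumptions, and SSIM implementations are widely available (e.g., structural_similarity in scikit-image) and therefore omitted.

```python
import torch

def sam(fused, ref, eps=1e-8):
    """Mean spectral angle (in radians) between per-pixel spectral vectors."""
    dot = (fused * ref).sum(dim=0)
    angle = torch.acos(torch.clamp(
        dot / (fused.norm(dim=0) * ref.norm(dim=0) + eps), -1.0, 1.0))
    return angle.mean()

def ergas(fused, ref, ratio=4, eps=1e-8):
    """ERGAS = 100 * (h/l) * sqrt(mean_i (RMSE(B_i) / M(B_i))^2), with h/l = 1/ratio."""
    rmse = torch.sqrt(((fused - ref) ** 2).mean(dim=(1, 2)))
    means = ref.mean(dim=(1, 2))
    return 100.0 / ratio * torch.sqrt(((rmse / (means + eps)) ** 2).mean())
```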
C. Comparison With the State of the Art
Fourteen methods, including seven traditional methods, five supervised methods, and two unsupervised methods, are employed for comparison.
Traditional methods include BDSD [9], GS [5], IHS [2], Brovey [46], HPF [47], LMM [48], and SFIM [8].
Supervised methods are PNN [15], DRPNN [19], MSDCNN [20], PanNet [17], and PSGAN [23]. These methods are state-of-the-art supervised deep learning based pansharpening methods. The first four models are CNN-based models and the last one is based on GAN.
Unsupervised methods include Pan-GAN [39] and our proposed PGMAN.
D. Quantitative Results
For the nonreference metrics, $D_\lambda$, $D_s$, and QNR are reported; smaller $D_\lambda$ and $D_s$ values and a larger QNR indicate better quality.
Table III shows the quantitative assessments on GaoFen-2. The best value in each group is in bold font. For the nonreference metrics, our proposed PGMAN takes advantage of unsupervised learning and obtains the best average values on $D_\lambda$, $D_s$, and QNR.
Table IV shows the quantitative assessments on QuickBird. Similar to the results on the GaoFen-2 dataset, our proposed PGMAN maintains its superiority and achieves the best values on all nonreference metrics, exceeding the performance of the other traditional and supervised methods. Among the unsupervised methods, PGMAN also obtains the best value in terms of SAM, while falling slightly behind Pan-GAN on ERGAS and SSIM.
We also evaluate the eight-band images from the WorldView-3 dataset; Table V shows the quantitative assessments. The results show that our proposed PGMAN still maintains its superiority and achieves the best values on all nonreference metrics, demonstrating its effectiveness and great practical value. For the reference metrics, both unsupervised methods fall behind considerably. On the eight-band WorldView-3 dataset, the generalization from the full-scale data to the reduced-scale data may be affected by the additional spectral bands, which remains a challenge for unsupervised methods trained on full-scale data.
Generally, traditional models do not rely on any training data and deliver moderate performance regardless of the testing environment. Supervised methods tend to perform better on the reduced-scale images because they make full use of the supervision information and directly minimize the loss between the predicted pansharpened images and the ground truth images. However, this may cause them to overfit the reduced-scale images and generalize poorly to full-resolution test images; sometimes these methods even perform worse than traditional models. Among the supervised methods, the best nonreference metrics are achieved by PNN (GaoFen-2 dataset, Table III) and PanNet (QuickBird dataset, Table IV), possibly because they have fewer parameters (see Table VIII) and are therefore less prone to overfitting. The unsupervised models take advantage of learning from the full-scale images and obtain the best nonreference metrics. Furthermore, our proposed PGMAN surpasses all other unsupervised methods. Our method significantly improves the results, which indicates that full-resolution images indeed provide rich spatial and spectral information that helps improve the quality of the pansharpened images. Although our method falls behind the supervised methods when tested on reduced-scale images, the results are quite promising considering that no ground truth is used. More importantly, the proposed PGMAN obtains the best QNR when applied to full-resolution images, which shows its great practical value.
E. Visual Results
Fig. 2 shows some example results on GaoFen-2 full-resolution images, produced from the original PAN and LR MS images. The results of Brovey [see Fig. 2(f)] and LMM [see Fig. 2(h)] do not preserve the spectral information of the LR MS image [see Fig. 2(b)] very well and tend to produce notable color distortions. The supervised methods, including PNN [see Fig. 2(j)], DRPNN [see Fig. 2(k)], MSDCNN [see Fig. 2(l)], PanNet [see Fig. 2(m)], and PSGAN [see Fig. 2(n)], also show degraded spatial details. The knowledge learned from reduced-scale images appears hard to generalize to real scenarios. This is because, unlike natural images, remote sensing images usually have deeper bit depths and different pixel distributions; downsampling the original PAN and LR MS images for supervised training changes the data distribution and discards part of the original information. In contrast, when trained on the full-scale images, the quality is improved significantly for PGMAN [see Fig. 2(p)]. This strongly motivates an unsupervised model, since pansharpening is usually performed on full-resolution images in real applications.
Visual results on the GaoFen-2 full-resolution dataset. (a) PAN. (b) LR MS. (c) BDSD [9]. (d) GS [5]. (e) IHS [2]. (f) Brovey [46]. (g) HPF [47]. (h) LMM [48]. (i) SFIM [8]. (j) PNN [15]. (k) DRPNN [19]. (l) MSDCNN [20]. (m) PanNet [17]. (n) PSGAN [23]. (o) Pan-GAN [39]. (p) PGMAN.
Fig. 3 shows some example results on QuickBird full-resolution images. GS [see Fig. 3(d)], IHS [see Fig. 3(e)], and Brovey [see Fig. 3(f)] distort the spectral information of the LR MS image [see Fig. 3(b)], as can be seen from the color of the roof. The supervised methods, including PNN [see Fig. 3(j)] and MSDCNN [see Fig. 3(l)], also produce unsatisfactory results with some distortions. Our method generates a pansharpened image [see Fig. 3(p)] with good spatial and spectral quality. The visual results on these two datasets demonstrate the superiority of our proposed model and show the advantage of unsupervised models trained on the original data distribution.
Visual results on the QuickBird full-resolution dataset. (a) PAN. (b) LR MS. (c) BDSD [9]. (d) GS [5]. (e) IHS [2]. (f) Brovey [46]. (g) HPF [47]. (h) LMM [48]. (i) SFIM [8]. (j) PNN [15]. (k) DRPNN [19]. (l) MSDCNN [20]. (m) PanNet [17]. (n) PSGAN [23]. (o) Pan-GAN [39]. (p) PGMAN.
F. Ablation Study
We conduct ablation studies on GaoFen-2 dataset to verify the effectiveness of the proposed Q-loss. The quantitative results are listed in Table VI.
It is observed that using only the adversarial loss to optimize the network degrades the performance except for D
Furthermore, we also conduct parameter analysis on GaoFen-2 dataset, where we try six different combinations of
G. Efficiency Study
We evaluate the computational time and memory cost of our method and the comparison methods. All traditional methods are implemented in MATLAB and run on an Intel Core i7-7700HQ CPU, and the deep models are implemented in PyTorch and tested on a single NVIDIA GeForce RTX 2080Ti GPU. Table VIII shows the inference speed and number of parameters of all comparative models. The inference speed is evaluated on 286 test samples, and each pair consists of a PAN image with
BDSD is the slowest method, requiring about 1.2613 s per image. IHS and Brovey require only about 0.01 s and are the fastest among the traditional methods.
Benefiting from advances in GPU architectures, the inference time of the deep learning models is satisfactory: almost all deep models require less than 0.001 s to pansharpen one image, except for DRPNN, which is the largest deep model with a deeper network architecture and much larger convolution filters. PNN is the fastest since it consists of only three convolution layers. PGMAN is efficient in both model size and inference speed.
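As an illustration of how such measurements can be obtained, the following sketch counts parameters and times inference in PyTorch. The warm-up and synchronization details are generic conventions rather than the exact benchmarking code used here.

```python
import time
import torch

def count_parameters(model):
    """Total number of learnable parameters of a model."""
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def time_inference(model, pan, lr_ms, runs=100):
    """Average per-image inference time in seconds on a fixed input pair."""
    model.eval()
    model(pan, lr_ms)                      # warm-up pass
    if pan.is_cuda:
        torch.cuda.synchronize()           # make sure queued GPU work is finished
    start = time.time()
    for _ in range(runs):
        model(pan, lr_ms)
    if pan.is_cuda:
        torch.cuda.synchronize()
    return (time.time() - start) / runs
```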
Conclusion
In this article, we propose a novel unsupervised generative multiadversarial model for pansharpening, called PGMAN. PGMAN consists of one generator and two discriminators to reduce both spectral and spatial distortions. To further improve the performance, we also introduce a QNR-related loss into the unsupervised framework. As one of the advantages of unsupervised methods, our proposed method can be trained on either reduced-scale or full-scale images without ground truth. Our model operates on the original PAN and LR MS images without any preprocessing step, keeping it consistent with the real application environment. The attractive performance in full-scale testing and the satisfactory results on reduced-scale images demonstrate the capability of our proposed model.
However, there is still a gap between the supervised and unsupervised models on the reference metrics. When we use the GAN-based framework for spectral preservation, we simply downsample the output to form a fake LR MS image for the discriminator to recognize. The choice of downsampling operator is itself an ill-posed problem and may affect the performance of the model; better ways to model the degradation process remain to be explored. In our future work, we will continue to study the architecture of unsupervised pansharpening models and further improve their performance.