Introduction
Nowadays, numerous remote sensing images are acquired to monitor agriculture, forestry, oceans, land use, environmental protection, and meteorology [1]. Most earth observation satellites provide two kinds of images: single-band panchromatic (PAN) images with high spatial resolution and multispectral (MS) images with higher spectral resolution but lower spatial resolution, a tradeoff imposed by limits on the image signal-to-noise ratio (SNR) and on data storage and transmission. Naturally, techniques for fusing PAN and MS images have been proposed and developed. This technology, known as pansharpening, integrates the complementary spatial and spectral information of PAN and MS images, respectively, to obtain high spatial resolution MS images. Fused images with both high spectral and spatial resolution lead to better results in subsequent tasks, such as image classification and object detection [2].
In early research, many traditional pansharpening methods were proposed, and most of them can be grouped into three categories.
1) Methods based on component substitution (CS) [3] attempt to transform MS images and PAN images into a new space in which the structural component of MS images can be substituted by PAN images to achieve spatial information injection. Representative attempts include principal component analysis (PCA) [4], intensity-hue-saturation (IHS) [5], and Gram–Schmidt adaptive (GSA) transform [6].
2) Multiresolution-analysis-based methods utilize the high frequencies of PAN images to restore the spatial details in MS images. To extract this high-frequency information from PAN images, various transform algorithms are applied, such as the Laplacian pyramid transform [7], discrete wavelet transform (DWT) [8], and support value transform [9].
3) Model-based methods [10] treat pansharpening as the inverse of a degradation process in which the ideal high-resolution multispectral (HRMS) image degrades into a PAN image and a low-resolution multispectral (LRMS) image. One typical example is the band-dependent spatial detail (BDSD) method [11].
In the past decade, deep learning approaches, especially convolutional neural networks (CNNs), have achieved excellent performance in various fields, including computer vision and image processing [13]. Pioneering methods have applied CNNs to the pansharpening task; typical examples include PNN [14], PanNet [15], PSGAN [16], RED-cGAN [17], and TFNet [18]. These supervised learning methods use an end-to-end network to learn the pansharpening process and achieve desirable performance with high spatial resolution and few spectral distortions. However, two major problems remain in most CNN-based methods. First, most networks rely on supervised learning, with training data generated following Wald's protocol [19]: the MS images are spatially blurred and downsampled to obtain the LRMS inputs, and the original MS images are treated as ground truth. These operations may not be consistent with the degradation processes encountered in real scenes. Second, these schemes do not effectively utilize the rich spatial information of PAN images [20] and ignore the relation between MS and PAN images.
To address these problems, we propose a novel unsupervised pansharpening network based on a two-stream CNN architecture with two learnable degradation processes, dubbed LDP-Net. Pansharpening can be regarded as a superresolution or deblurring problem [21] with an additional PAN image: the goal is to restore the spatial details from the PAN image while maintaining the spectral information of the LRMS image. Owing to the lack of ground truth, we model the inverse of pansharpening as two degradation processes: one uses a spectral response function to transform the HRMS image into a single grayed image similar to the PAN image, and the other models a spatial blurring operation, with a blurring kernel, from the HRMS image to an upsampled LRMS image. In the proposed LDP-Net, we adopt two CNN modules to learn these degradation processes. Moreover, according to the relation between MS and PAN images, we propose a new loss function that effectively constrains both spatial and spectral information. Furthermore, a KL divergence loss is proposed to keep the distributions of the differences between the MS and PAN images similar at two resolutions, which, to our knowledge, has not been explored before. As a result, our model achieves desirable performance: the predicted HRMS image preserves the high spatial resolution of the PAN image and the rich spectral information of the LRMS image under unsupervised conditions. The main contributions of this article are summarized as follows.
1) An unsupervised pansharpening model is proposed based on a two-stream end-to-end network that is trained without supervised labels. The hyperparameters of the model can be easily tuned in the training phase.
2) Different from models with hand-specified degradation operators, our proposed model learns the degradation processes in a data-driven manner.
3) A novel hybrid loss function consisting of three parts is proposed. The first two parts maintain the spatial and spectral consistency between the inputs and the predicted HRMS image at two different resolutions. The third part constrains the differences between the MS and PAN images at different resolutions to have similar distributions.
4) Extensive experiments on different remote sensing datasets demonstrate the effectiveness and robustness of our method compared with several state-of-the-art methods in both qualitative and quantitative terms.
The rest of this article is organized as follows. Section II reviews related work on pansharpening. Section III introduces the framework of the proposed unsupervised model and the loss function for training without labels. Section IV presents extensive experiments comparing our method with several representative traditional, supervised, and unsupervised learning based approaches. Finally, Section V concludes this article.
Related Works
Numerous pansharpening methods have emerged in recent decades. This section briefly reviews them, including classic approaches, supervised learning based approaches, and unsupervised learning based approaches.
A. Classic Methods
Traditional pansharpening methods can be roughly classified into three categories. First, early pansharpening studies focused on CS: some components of the upsampled LRMS image are substituted by the corresponding components of the PAN image in a specific transform domain. The spectral and spatial information are separated using a simple and fast transformation, such as IHS [5], the principal components transform [22], and the GSA transform [6]. Moreover, Dou et al. [23] proposed a general framework that implements these CS-based methods systematically. Such methods effectively achieve high spatial resolution but may cause spectral distortions in the pansharpened results. The second category is multiresolution-analysis-based methods, which apply multiscale decomposition techniques to inject the high-frequency information of the PAN image into the upsampled LRMS image. The high-frequency spatial information is usually extracted by transform algorithms such as the wavelet transform [24], Laplacian pyramid transform [7], curvelet transform [25], and contourlet transform [4]. Although these methods improve spectral fidelity, they may also cause aliasing distortion and blurring of spatial details. The third type is model-based methods. For instance, Garzelli et al. [11] presented two linear injection models, the single spatial detail (SSD) model and the BDSD model, and optimized them by minimizing the squared error between the original MS image and the pansharpened result. Another pansharpening model, proposed by Wright, achieved fast image fusion with a Markov random field [26]. In addition, Guo et al. [27] adopted an online coupled dictionary learning approach to model the relation between LRMS and PAN images, reducing spectral distortion and restoring spatial details. Recently, Guo et al. [28] developed a new posterior probability model based on Bayesian theory to achieve better spectral and spatial fusion.
B. Supervised Learning Based Approaches
These deep learning methods design CNN-based networks driven by large quantities of paired training data and achieve better performance than traditional methods. Motivated by the superresolution convolutional neural network (SRCNN) model [29], Masi et al. [14] first proposed a three-layer CNN named PNN tailored to the characteristics of remote sensing images. Later, Yang et al. [15] directly added the upsampled LRMS image to the output of the network to maintain spectral consistency and used the high-pass components of the PAN and LRMS images as the network inputs to restore spatial details. However, introducing only high-frequency information and superimposing the upsampled LRMS image on the result can cause a blurring effect and make the training difficult to converge. Scarpa et al. [30] adopted a target-adaptive usage modality so that a lightweight network can be applied to different remote sensing sensors. The deep residual pansharpening neural network (DRPNN) model [31] introduced residual learning to form a very deep convolutional network, further improving pansharpening performance. He et al. [32] introduced a new detail injection strategy into CNN-based pansharpening methods. Subsequently, Deng et al. [33] exploited a detail injection-based network aided by the difference between the PAN image and the upsampled LRMS image. Recently, Liu et al. [18] incorporated residual learning into a two-stream CNN architecture to fuse the features extracted from both MS and PAN images. Zhang et al. [34] designed a triple-double network with a level-domain-based loss function to fully exploit the spatial details of the PAN image. Jin et al. [35] utilized a Laplacian pyramid network to recover crucial spatial information at multiple scales. Moreover, several generative adversarial network (GAN)-based methods employ a discriminator to distinguish the generated images from the ground-truth images. In PSGAN [16], the authors first attempted to produce high-quality pansharpened images with GANs, designing a two-stream fusion architecture as the generator and a fully convolutional network as the discriminator. In RED-cGAN [17], a residual encoder–decoder conditional GAN was proposed to produce sharper images with more details. However, as mentioned above, these methods require HRMS images for supervised learning and still suffer from spectral distortions or blurring effects.
C. Unsupervised Learning Based Approaches
To address the unreality of simulated data and bridge the gap between classic and supervised learning based approaches, several unsupervised learning based approaches have been developed. Ma et al. [20] achieved unsupervised pansharpening using one generator and two discriminators designed to distinguish the spatial and spectral characteristics of generated and real images, respectively. Zhou et al. [36] then combined a generative multiadversarial network with a no-reference loss function to improve unsupervised pansharpening. Motivated by priors about downsampling and blurring, further methods have been developed. For instance, a deep learning prior based on spatial downsampling with blurring was used to construct the loss function for image fusion in [37]; the authors embedded the semantic features extracted from the guidance PAN image by an encoder–decoder network into another deep decoder to generate the output image. Similarly, Luo et al. [38] designed an iterative network architecture with a PAN-guided strategy and a set of skip connections to continuously extract and fuse features from the input and then used a fixed unidimensional Gaussian kernel to obtain a blurred version of the fused HRMS image. However, these prior-based methods are limited by their handcrafted training data and cannot be effectively applied to real scenes.
In this article, we propose an unsupervised learning model based on a two-stream CNN network incorporated with two learnable degradation modules that can be adaptive to complex simulated and real situations. Moreover, we specifically design a hybrid spectral loss to effectively maintain spectral consistency between the output and input LRMS images.
Method
A. Problem Formulation and Framework
Unsupervised pansharpening aims to obtain the pansharpened HRMS image by fusing the LRMS image and the high-resolution (HR) PAN image without any ground truth. We denote the LRMS image by $m$, its upsampled version by $\uparrow m$, the HR PAN image by $\widetilde{P}$, the ideal HRMS image by $M$, and the pansharpened result by $\widehat{M}$.
Our proposed LDP-Net is based on a two-stream encoder–decoder fusion network. As shown in Fig. 1, the network mainly consists of several modules, including the feature extraction block (FEB), dense encoder–decoder block (DEDB), reconstruction block (REC), graying block (GB), and reblurring block (RB). First, we interpolate the LRMS image $m$ to the size of the PAN image to obtain $\uparrow m$; then, $\uparrow m$ and the PAN image $\widetilde{P}$ are fed into the two-stream network to predict the HRMS image
\begin{equation*}
\widehat{M} = f\left(\uparrow m,\widetilde{P};\Theta \right) \tag{1}
\end{equation*}
where $f(\cdot;\Theta)$ denotes the two-stream fusion network with trainable parameters $\Theta$.
Overview of the proposed LDP-Net for pansharpening. FEB denotes the feature extraction block. DEDB denotes the dense encoder–decoder block. RB and GB represent the reblurring block and graying block, respectively. REC stands for the reconstruction block.
Since HRMS images are not available as labels, to achieve unsupervised learning, we model two degradation processes: the degradation from the ideal HRMS image $M$ to the PAN image (a graying process) and the degradation from $M$ to the upsampled LRMS image (a blurring process). They can be written as
\begin{equation*}
P = \sum \limits _{i = 1}^{C} {{\alpha _{i}}{M_{i}}} \tag{2}
\end{equation*}
where $M_i$ denotes the $i$th band of the HRMS image, $\alpha_i$ is the corresponding spectral response weight, and $C$ is the number of bands, and
\begin{equation*}
\uparrow m = k * M \tag{3}
\end{equation*}
where $k$ is a blurring kernel and $*$ denotes convolution.
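To make the two degradation models concrete, the following sketch simulates (2) as a per-band weighted sum and (3) as a per-band convolution with a blur kernel. The band weights, the kernel, and the tensor sizes are illustrative assumptions, not the quantities learned by LDP-Net.

```python
import torch
import torch.nn.functional as F

def gray_degradation(hrms, alpha):
    """Eq. (2): weighted sum of the C spectral bands yields a PAN-like image.
    hrms: (B, C, H, W); alpha: (C,) assumed spectral-response weights."""
    return (hrms * alpha.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)

def blur_degradation(hrms, kernel):
    """Eq. (3): depthwise convolution of every band with a blur kernel k."""
    c = hrms.shape[1]
    weight = kernel.expand(c, 1, -1, -1).contiguous()   # one copy of k per band
    pad = kernel.shape[-1] // 2
    return F.conv2d(hrms, weight, padding=pad, groups=c)

# Toy example with assumed values.
hrms = torch.rand(1, 4, 64, 64)                          # 4-band "ideal" HRMS
alpha = torch.full((4,), 0.25)                           # assumed equal band weights
gauss = torch.tensor([[1., 2., 1.],
                      [2., 4., 2.],
                      [1., 2., 1.]]) / 16.0              # assumed 3x3 blur kernel
pan_like = gray_degradation(hrms, alpha)                 # (1, 1, 64, 64), cf. Eq. (2)
up_m_like = blur_degradation(hrms, gauss)                # (1, 4, 64, 64), cf. Eq. (3)
```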
B. Loss Function
Given the upsampled LRMS image $\uparrow m$, the PAN image $\widetilde{P}$, and the predicted HRMS image $\widehat{M}$, we apply the graying block $G(\cdot)$ and the reblurring block $B(\cdot)$ to obtain the degraded versions used in the loss terms:
\begin{align*}
{\widehat{M}_{gray}} &= G\left({\widehat{M}} \right) \tag{4}
\\
{\widehat{M}_{blur}} &= B\left({\widehat{M}} \right) \tag{5}
\\
\uparrow {m_{gray}} &= G\left({ \uparrow m} \right) \tag{6}
\\
{\widetilde{P}_{blur}} &= B\left({\widetilde{P}} \right) \tag{7}
\end{align*}
1) Spatial Loss
The degradation relationship between the MS image and PAN image can be used to restore the high-resolution spatial information of the output HRMS image. Thus, the spatial loss of our method, which can be divided into spatial constraints at both low and high resolutions, is defined as
\begin{equation*}
{L_{spatial}} = \left\Vert {{{\widetilde{P}}_{blur}} - \uparrow {m_{gray}}} \right\Vert _{2}^{2}+ \delta * \left\Vert {\widetilde{P} - {{\widehat{M}}_{gray}}} \right\Vert _{2}^{2} \tag{8}
\end{equation*}
where $\delta$ is a weighting factor balancing the constraints at the two resolutions.
2) Spectral Loss
Another degradation between the HRMS image and the upsampled LRMS image can be regarded as the blurring operation, which can be used to maintain the spectral consistency between the output HRMS image and the input upsampled LRMS image at different resolutions. Then, similar to (8), the spectral loss is defined as
\begin{equation*}
{L_{spectral}} = \left\Vert { \uparrow m - {{\widehat{M}}_{blur}}} \right\Vert _{2}^{2} + \gamma * \left\Vert {m - \downarrow \widehat{M}} \right\Vert _{2}^{2} \tag{9}
\end{equation*}
where $\downarrow \widehat{M}$ denotes the spatially downsampled version of $\widehat{M}$ and $\gamma$ is a weighting factor.
3) Spectral KL Divergence Loss
On the other hand, we consider the inverse process of graying degradation and note that the spectral information of MS images in different spectral bands should follow a specific pattern. The difference between the MS image and PAN image at different resolutions should have similar distributions. Based on this consideration, we use the softmax function to transform the residual terms into a form of probability distribution. Then, the spectral Kullback–Leibler (KL) divergence loss is added to constrain the distribution of the residual terms at different resolutions, which is formulated as follows:
\begin{equation*}
{L_{KL}} = KL\left(p\left({x_{low}}\right)\,\|\,q\left(x\right)\right) \tag{10}
\end{equation*}
where $p(x_{low})$ and $q(x)$ denote the softmax-normalized distributions of the residuals between the MS and PAN images at the low and high resolutions, respectively.
In summary, we utilize the spatial loss and spectral loss to simultaneously restore the spatial details and preserve the spectral information of the inputs. Moreover, an additional spectral KL divergence loss further adjusts the spectral quality. Finally, our proposed unsupervised model is trained by minimizing the following loss function:
\begin{equation*}
L = \alpha {L_{spatial}} + \beta {L_{spectral}} + \mu {L_{KL}} \tag{11}
\end{equation*}
where $\alpha$, $\beta$, and $\mu$ are weighting hyperparameters.
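For reference, the sketch below shows one possible PyTorch implementation of the terms in (8)–(11), using mean-squared errors and a channel-wise softmax followed by a KL divergence. The weighting factors and the exact construction of the residual terms in (10) are assumptions for illustration rather than the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def spatial_loss(pan, pan_blur, up_m_gray, hrms_gray, delta=1.0):
    """Eq. (8): spatial consistency at low and high resolution."""
    low_res = F.mse_loss(pan_blur, up_m_gray)      # || P~_blur - (up m)_gray ||^2
    high_res = F.mse_loss(pan, hrms_gray)          # || P~ - M^_gray ||^2
    return low_res + delta * high_res

def spectral_loss(up_m, hrms_blur, m, hrms_down, gamma=1.0):
    """Eq. (9): spectral consistency at high and low resolution."""
    high_res = F.mse_loss(up_m, hrms_blur)         # || up m - M^_blur ||^2
    low_res = F.mse_loss(m, hrms_down)             # || m - (down) M^ ||^2
    return high_res + gamma * low_res

def kl_loss(residual_low, residual_high):
    """Eq. (10): softmax each residual over the channel dimension to form a
    distribution, then penalise KL(p(x_low) || q(x))."""
    p = F.softmax(residual_low.flatten(2), dim=1)          # target distribution
    log_q = F.log_softmax(residual_high.flatten(2), dim=1)
    return F.kl_div(log_q, p, reduction="batchmean")

def total_loss(l_spatial, l_spectral, l_kl, alpha=1.0, beta=1.0, mu=0.1):
    """Eq. (11): weighted combination; the weights here are assumptions."""
    return alpha * l_spatial + beta * l_spectral + mu * l_kl
```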
C. Network Architecture
As mentioned in Section III-A, several CNN-based blocks are designed to implement the proposed framework, including FEB, DEDB, GB, and RB. Specifically, FEB extracts shallow features from the upsampled LRMS image and the HR PAN image for the subsequent fusion step. Thus, given $\uparrow m$ and $\widetilde{P}$, the shallow features are obtained as
\begin{equation*}
F_{m} = {f_{FEB}} (\uparrow m) \tag{12}
\end{equation*}
\begin{equation*}
F_{p} = {f_{FEB}}(\widetilde{P}) \tag{13}
\end{equation*}
Structure of (a) FEB, (b) DEDB, (c) GB, (d) RB, and (e) REC, where k3n128s1 denotes a convolution layer with a 3 × 3 kernel size, 128 channels, and stride 1.
The role of DEDB is to learn more high-level features and fuse sufficient spatial and spectral information. As shown in Fig. 2(b), we adopt four convolutional layers with dense connections to enhance the fusion and inference abilities. Then, the fused features are fed into a deconvolutional layer for upsampling before concatenation with the two residual connections. To reconstruct the output HRMS image, we use a reconstruction block (REC) that consists of two convolutional layers followed by a ReLU activation layer as demonstrated in Fig. 2(e).
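A minimal sketch of the DEDB and REC blocks described above is given below. The channel width (128, following the k3n128s1 convention of Fig. 2), the 2× upsampling factor of the deconvolution, and the intermediate widths are assumptions; the concatenation with the two residual connections is assumed to happen outside these modules.

```python
import torch
import torch.nn as nn

class DEDB(nn.Module):
    """Dense encoder-decoder block: four 3x3 convolutions with dense
    connections, followed by a deconvolution for upsampling (sketch)."""
    def __init__(self, channels=128):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels * (i + 1), channels, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(4)
        )
        self.deconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))  # dense connections
        return self.deconv(feats[-1])                     # upsample fused features

class REC(nn.Module):
    """Reconstruction block: two convolution layers followed by a ReLU,
    as described in the text; the intermediate width is an assumption."""
    def __init__(self, in_channels=128, out_channels=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1),
            nn.Conv2d(64, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)
```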
GB and RB are vital parts of our proposed unsupervised model. Taking the output HRMS image or the upsampled LRMS image as the input, GB is implemented with the aid of a channel attention mechanism, as shown in Fig. 2(c). First, we adopt two convolutional layers to transform the input into weight features and use global average pooling (GAP) and fully connected layers to obtain the channel weight vector, which is used to simulate the graying process. Finally, we obtain the stacked output by copying the grayed result along the channel dimension. RB is implemented with a single convolution layer to simulate the spatial degradation, as illustrated in Fig. 2(d). These modules are jointly optimized to adaptively learn the degradation in the training phase.
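The sketch below illustrates one way to realize GB (channel attention producing per-band graying weights, with the grayed image stacked along the channel dimension) and RB (a single convolution). The layer widths, the kernel size, and the softmax normalization of the band weights are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GrayingBlock(nn.Module):
    """GB: channel attention estimates per-band graying weights; the weighted
    bands are summed into one grayed image, which is then stacked."""
    def __init__(self, channels=4, mid=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 3, padding=1),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)                 # global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Softmax(dim=1),
        )
        self.channels = channels

    def forward(self, ms):
        w = self.fc(self.gap(self.features(ms)).flatten(1))   # (B, C) band weights
        gray = (ms * w.unsqueeze(-1).unsqueeze(-1)).sum(1, keepdim=True)
        return gray.repeat(1, self.channels, 1, 1)             # stacked output

class ReblurringBlock(nn.Module):
    """RB: a single convolution layer that learns the spatial blurring."""
    def __init__(self, channels=4, kernel_size=5):
        super().__init__()
        self.blur = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, hrms):
        return self.blur(hrms)
```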
Experiments and Evaluations
A. Experimental Setup
1) Datasets and Metrics
To evaluate the performance of the proposed method, we conduct experiments on three datasets: GaoFen-2 (GF-2), WorldView-2 (WV-2), and WorldView-3 (WV-3). The spatial resolutions of the MS and PAN images are 3.2 m and 0.8 m for GF-2, 1.84 m and 0.46 m for WV-2, and 1.2 m and 0.31 m for WV-3, respectively. GF-2 provides four MS bands, while the latter two satellites provide eight bands. We produced the training data following Wald's protocol [19], cropping the PAN and upsampled LRMS images into patch pairs of size 256 × 256 for the training phase. In addition, pairs of size 512 × 512 were selected for the reduced-resolution and full-resolution test experiments. The partitions of the datasets are listed in Table I.
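As a sketch of how reduced-resolution training pairs can be generated under Wald's protocol, the snippet below blurs and decimates the original MS and PAN images by the PAN/MS resolution ratio of 4 and keeps the original MS image as the reference; the binomial filter is an assumed stand-in for the sensors' modulation transfer functions.

```python
import torch
import torch.nn.functional as F

def wald_reduce(ms, pan, ratio=4):
    """Generate reduced-resolution training pairs following Wald's protocol:
    blur and decimate both inputs by `ratio` and keep the original MS image
    as the reference. The filter choice is an assumption."""
    def blur_down(x):
        k = torch.tensor([[1., 4., 6., 4., 1.]])
        k = (k.t() @ k) / 256.0                           # separable binomial kernel
        k = k.expand(x.shape[1], 1, -1, -1).contiguous()  # one copy per band
        x = F.conv2d(x, k, padding=2, groups=x.shape[1])
        return x[..., ::ratio, ::ratio]                   # decimation
    lrms = blur_down(ms)                                  # simulated LRMS input
    pan_lr = blur_down(pan)                               # simulated PAN input
    return lrms, pan_lr, ms                               # inputs + reference
```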
The performance of the different methods in the reduced-resolution and full-resolution experiments is evaluated with different quantitative metrics. In reduced-resolution testing, four widely used reference-based metrics are adopted, namely, the spectral angle mapper (SAM) [42], spatial correlation coefficient (SCC) [43], relative global synthesis error (ERGAS) [44], and the 4-band extension of the universal image quality index (Q4) [45], while the quality with no reference (QNR) [46] and its spectral distortion index $D_\lambda$ and spatial distortion index $D_s$ are used in full-resolution testing.
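As an example of the reference-based metrics, a compact SAM implementation is sketched below (angle in degrees, averaged over all pixels); the remaining metrics follow their cited definitions [42]–[46].

```python
import torch

def sam(pred, ref, eps=1e-8):
    """Spectral angle mapper: mean angle (in degrees) between the spectral
    vectors of the fused image and the reference at every pixel.
    pred, ref: (B, C, H, W) tensors of the same shape."""
    dot = (pred * ref).sum(dim=1)
    norms = pred.norm(dim=1) * ref.norm(dim=1)
    cos = (dot / (norms + eps)).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()
```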
2) Implementation Details
No postprocessing operations were applied to the output HRMS image. The network was trained for approximately 50 epochs. The Adam optimizer [47] was used to minimize the loss function, with an initial learning rate of 1e
3) Comparison Methods
In our experiments, we compared the proposed LDP-Net with several state-of-the-art methods, including PCA [4], IHS [5], Brovey [48], GS [49], BDSD [11], additive wavelet luminance proportional (AWLP) [50], PNN [14], DiCNN [32], PanNet [15], DMDNet [51], FusionNet [33], PGMAN [36], and Pan-GAN [20]. The first six are traditional methods; PNN, DiCNN, PanNet, DMDNet, and FusionNet are supervised learning based methods; and Pan-GAN and PGMAN are recently proposed unsupervised methods. For a fair comparison, these methods were reimplemented in the PyTorch framework according to their publicly available codes and retrained on the same training datasets at the reduced resolution.
B. Comparison at Reduced Resolution
The experiments were performed on the three datasets at reduced resolution following Wald's protocol, so the original MS image can be used as the reference. Figs. 3–5 show three examples cropped from the results on GF-2, WV-2, and WV-3 processed by different methods. In each case, one region marked by a red rectangle is magnified to visualize the differences among the results. As observed in Figs. 3–5, the traditional methods restore spatial details effectively but still exhibit some blurring effects and spectral distortions. For example, the results of BDSD suffer from severe spectral distortions and some blurring, while AWLP reduces the blurring effect but introduces some spatial artifacts. Supervised learning based methods improve the spectral quality of the pansharpened results but still exhibit spatial blurring. Among the unsupervised methods, Pan-GAN successfully achieves unsupervised pansharpening, but its results contain spatial blurring and obvious spectral distortions, especially on the WV-2 and WV-3 datasets. In Fig. 3(n), PGMAN recovers more spatial details but still shows some spectral distortions. Moreover, the hyperparameters of GAN-based pansharpening methods are difficult to tune, and these methods easily generate spatial and spectral artifacts. As shown in the magnified regions in Figs. 3(o) and 5(o), compared with the other methods, our proposed LDP-Net effectively recovers spatial details and preserves spectral information without introducing artifacts, and its fusion results are more vivid and much closer to the ground truth.
Pansharpened results from different methods on the GF-2 dataset at reduced resolution. (a) Upsampled LRMS. (b) PCA. (c) IHS. (d) Brovey. (e) GS. (f) BDSD. (g) AWLP. (h) PNN. (i) DiCNN1. (j) PanNet. (k) DMDNet. (l) FusionNet. (m) Pan-GAN. (n) PGMAN. (o) Ours. (p) Ground truth.
Pansharpened results from different methods on the WV-2 dataset at reduced resolution. (a) Upsampled LRMS. (b) PCA. (c) IHS. (d) Brovey. (e) GS. (f) BDSD. (g) AWLP. (h) PNN. (i) DiCNN1. (j) PanNet. (k) DMDNet. (l) FusionNet. (m) Pan-GAN. (n) PGMAN. (o) Ours. (p) Ground truth.
Pansharpened results from different methods on the WV-3 dataset at reduced resolution. (a) Upsampled LRMS. (b) PCA. (c) IHS. (d) Brovey. (e) GS. (f) BDSD. (g) AWLP. (h) PNN. (i) DiCNN1. (j) PanNet. (k) DMDNet. (l) FusionNet. (m) Pan-GAN. (n) PGMAN. (o) Ours. (p) Ground truth.
Tables II–IV report the average quantitative results of the different methods on the three datasets. The methods are classified into three groups (traditional, supervised, and unsupervised), and the best result in each group is highlighted in bold. Compared with Pan-GAN and PGMAN, the proposed LDP-Net achieves better scores on most metrics. Among the CNN-based methods, the proposed method approaches the performance of the supervised methods; in particular, its SCC, ERGAS, and Q4 scores are close to those of the supervised methods, which verifies that our method can effectively fuse spatial and spectral information without a reference.
C. Comparison at Full Resolution
In this section, all the methods were validated on real data. Figs. 6–8 illustrate representative results on the real GF-2, WV-2, and WV-3 data. To verify the robustness of the proposed LDP-Net, the models trained on reduced-resolution images were used directly for the full-resolution test, i.e., no new models were trained for the full-resolution datasets. In these cases, most traditional methods significantly restore the spatial information compared with the LRMS images, but most still suffer from a certain degree of spectral shift. In contrast, AWLP reduces the spectral distortion in the results while introducing noticeable spatial artifacts. Compared with these traditional methods, the CNN-based models effectively maintain spectral consistency and improve the spatial resolution across the datasets. However, PanNet and DMDNet generate perceptible blurring effects and artifacts. DiCNN1 restores the spatial details well with high spectral fidelity, but spectral distortions are still observed in some regions; as shown in Figs. 7(j) and 8(j), the light blue mark and the cyan buildings are not as vividly colored as those obtained by other methods. Compared with the other supervised methods, FusionNet further reduces the spatial blurring and spectral distortions. Pan-GAN, which achieves unsupervised learning using spatial and spectral discriminators, improves the spatial and spectral resolution but still exhibits spatial blurring and introduces spectral distortions, as shown in Figs. 7(n) and 8(n). Although PGMAN maintains spectral consistency with the upsampled LRMS image, noticeable distortions of spatial details remain in its pansharpened results. In the magnified regions indicated by red boxes, our proposed method preserves better spatial details and maintains higher spectral consistency than the other methods; as shown in Figs. 6(p) and 8(p), our pansharpened images are clearer and more vivid than those of all the other methods.
Pansharpened results from different methods on the GF-2 dataset at full resolution. (a) Upsampled LRMS. (b) PAN. (c) PCA. (d) IHS. (e) Brovey. (f) GS. (g) BDSD. (h) AWLP. (i) PNN. (j) DiCNN1. (k) PanNet. (l) DMDNet. (m) FusionNet. (n) Pan-GAN. (o) PGMAN. (p) Ours.
Pansharpened results from different methods on the WV-2 dataset at full resolution. (a) Upsampled LRMS. (b) PAN. (c) PCA. (d) IHS. (e) Brovey. (f) GS. (g) BDSD. (h) AWLP. (i) PNN. (j) DiCNN1. (k) PanNet. (l) DMDNet. (m) FusionNet. (n) Pan-GAN. (o) PGMAN. (p) Ours.
Pansharpened results from different methods on the WV-3 dataset at full resolution. (a) Upsampled LRMS. (b) PAN. (c) PCA. (d) IHS. (e) Brovey. (f) GS. (g) BDSD. (h) AWLP. (i) PNN. (j) DiCNN1. (k) PanNet. (l) DMDNet. (m) FusionNet. (n) Pan-GAN. (o) PGMAN. (p) Ours.
Due to the lack of ground truth, QNR, $D_\lambda$, and $D_s$ are adopted as no-reference metrics to quantitatively evaluate the pansharpened results at full resolution.
D. Ablation Study of Loss Function
In this section, several experiments were conducted to investigate the impact of each component of our loss function. Built on the two learnable degradation processes, the loss function plays an important role in our unsupervised training. The proposed loss function can be subdivided into five parts, namely, the spatial loss at high resolution, the spatial loss at low resolution, the spectral loss at high resolution, the spectral loss at low resolution, and the spectral KL divergence loss; different combinations of these parts, denoted as combinations I–VIII, were evaluated.
Pansharpened results from the ablation study of the loss functions. (a) Ground truth. (b) Combination I. (c) Combination II. (d) Combination III. (e) Combination IV. (f) Combination V. (g) Combination VI. (h) Combination VII. (i) Combination VIII.
E. Efficiency Study
In this section, the computational efficiency of all comparison methods is evaluated. As mentioned in Section IV-A, all deep learning based methods were implemented in PyTorch and tested on an Nvidia GeForce GTX 1080Ti GPU, while all traditional methods were implemented in MATLAB R2019b and run on the CPU. Table VII lists the running times and the numbers of parameters of the different approaches. The running times are obtained by averaging the inference time over the test set of the reduced-resolution experiment. Compared with the other methods, our model has a small number of parameters, but its running time is at a middle level, mainly because the proposed network contains two additional degradation modules and a deeper structure. Compared with GAN-based unsupervised pansharpening methods, our model is easier to tune in the training phase. Overall, in addition to its superior performance, the proposed unsupervised model makes a reasonable tradeoff between performance and computational cost.
Conclusion
In this article, we propose an unsupervised pansharpening method based on two learnable degradation processes. The method adaptively learns the degradation processes with two corresponding CNN-based modules and successfully achieves unsupervised pansharpening. Moreover, we model the degradation processes at different resolutions and present a novel hybrid loss that effectively maintains spatial and spectral consistency. This unsupervised training strategy improves the spatial details and reduces the spectral distortion in the results. Extensive experiments on different-resolution images from three datasets demonstrate the superiority of the proposed method over other state-of-the-art methods.