Introduction
With the advent of deep learning, many deep-learning-based methods have been proposed to solve various image restoration problems, e.g., super-resolution [7], [18], [20], [22], [35], showing state-of-the-art performance in terms of reconstruction quality. Likewise, deep learning has recently seen growing use in satellite imagery research. Satellite images capture diverse scenes around the world, and research areas for satellite imagery include prediction of forest growth, classification of crops, buildings and roads, environmental monitoring, and many other applications. To achieve high performance on such problems, it is essential to obtain high-quality, high-resolution satellite image datasets. However, due to the constraints of intrinsic satellite sensor resolutions and transmission bandwidths, most satellites acquire multi-spectral images with varying resolutions for the same geographical regions. In general, satellite imagery consists of pairs of low-resolution (LR) multi-spectral (MS) images of a larger ground sample distance (GSD) and high-resolution (HR) panchromatic (PAN) images of a smaller GSD. Pan-sharpening, or pan-colorization, is the task of generating pan-sharpened (PS) multi-spectral images that have the same spatial resolution as the PAN images, by fusing the high-frequency details from the PAN images with the color information from the MS images. Fig. 1 shows an example pair of PAN and MS images along with the PS results from various pan-sharpening approaches, including the proposed method.
Recently, several pan-sharpening methods based on convolutional neural networks (CNN) have been proposed [4], [6], [12], [15], [19], [21], [28], [33], [38], [43], [44]. These methods rely on supervised learning (Fig. 2-a), which requires a degradation model to prepare a training dataset of PAN-MS pairs. For this, the original PAN-MS pairs are degraded (down-scaled) to LR PAN-MS pairs, which are then used as inputs to the networks, and the original MS images serve as pseudo ground truth for training. In doing so, the networks are trained to produce PS images at the scale of the original MS inputs, i.e., under a lower-scale scenario. When these networks are then tested under the original-scale scenario, where the PS outputs must be at the scale of the input PAN images, they perform poorly. To overcome this scale (resolution) mismatch between training and testing, we propose an effective unsupervised learning framework for pan-sharpening, where no ground truth is required for training. This enables the network to be trained and tested at the same scale, resulting in better visual quality.
Comparison between two different learning frameworks: (a) conventional supervised learning framework for pan-sharpening; (b) unsupervised learning framework.
Since ground truth data are not available in pan-sharpening, conventional supervised PS methods have had no choice but to rely on the lower-scale scenario. These methods optimize their PS outputs with a mean absolute error (MAE) or mean squared error (MSE) loss against the pseudo-ground-truth MS images. In our unsupervised PS framework (Fig. 2-b), where no ground truth image is required, we design two novel loss functions so that our UPSNet can effectively learn the high-frequency details from PAN inputs and the color information from MS inputs in the original-scale scenario without any pseudo ground truth: one is a dual-gradient detail loss between network outputs and PAN inputs; the other is a guided-filter-based color loss between network outputs and our aligned MS targets.
One of the main difficulties of the pan-sharpening task is misalignment between PAN and MS image pairs. PAN and MS images are often misaligned by several pixels due to inherent limitations of satellite sensor arrays and acquisition time differences. A misaligned training dataset often leads to undesired artifacts in pan-sharpened results, such as double edges and color spreading. To remedy this problem, we incorporate a preprocessing step, used only during training, in which each MS image is registered to its corresponding PAN image by maximizing their correlation. The aligned MS images are not used as inputs to the network but as targets for the color loss. By doing so, our UPSNet learns to implicitly match the high-frequency information from PAN inputs and the color information from misaligned MS inputs during training, without any dedicated registration module. The trained UPSNet can then properly handle misaligned PAN-MS input pairs during testing. As shown in Fig. 1, the structures and colors of objects in the UPSNet output are better aligned than in the other five methods. We can also observe that the pan-sharpened image produced by UPSNet has the color most similar to that of the input MS image while preserving the strong edges of the corresponding PAN image.
Furthermore, we found that a patch-based normalization can effectively deal with non-stationary PAN and MS input images whose pixel intensity distributions vary with geographical features, which often leads to color distortion in the pan-sharpened results. Similar to batch normalization [13], this reduces the internal covariate shift and enables faster and more stable training of the network, which can result in higher performance. In addition, applying local normalization helps preserve the color information of the MS input. This allows a network trained on images acquired by a specific satellite to generalize well to unseen images from other satellites. Our contributions can be summarized as follows:
We propose a novel unsupervised learning framework for pan-sharpening where our proposed UPSNet can achieve state-of-the-art performance for most metrics and shows significantly better visual quality when tested on the original scale.
Two novel loss functions for pan-sharpening are proposed, which effectively fuse the high-frequency details from PAN images and the color information from MS images: a dual-gradient detail loss and a guided-filter-based color loss. The dual-gradient detail loss appropriately handles the different signal characteristics of PAN and MS images, so that UPSNet can effectively learn the details of PAN images. The guided-filter-based color loss allows UPSNet to effectively learn the color information from aligned and upscaled target MS images.
With a preprocessing step of correlation-based alignment between PAN and MS images only for training, UPSNet can be trained to implicitly handle the inherent misalignment between PAN and MS input images without the preprocessing step in testing.
We propose a simple yet very effective patch-based normalization technique that boosts the generalization capability of our UPSNet for PAN-MS images from various satellites.
Related Works
A. Traditional Pan-Sharpening Methods
Before the advent of deep learning, pan-sharpening algorithms were based on component substitution, multiresolution analysis, or model learning. Component substitution methods [5], [9], [17], [34], [42] apply a spectral transformation to an interpolated MS input and replace its spatial (intensity) component with a modified PAN image. Multiresolution-analysis-based methods [27], [36] fuse the high-frequency details of PAN images into up-sampled MS input images. To extract such high-frequency components, wavelet or undecimated decomposition techniques are utilized, and the decomposed components are injected into the interpolated MS input images to form pan-sharpened images. These methods have relatively low computational complexity but tend to produce results with mismatched spectral information and artifacts because they do not consider the local properties of MS and PAN images. Model-learning-based methods [11], [29], [31] learn pan-sharpening models with the help of regularization terms. In these methods, pan-sharpening is formulated as an ill-posed problem, where a model is optimized to generate an output image that maximizes a similarity metric between the output and the target pan-sharpened image. These methods tend to produce pan-sharpened images of better quality with well-preserved spectral information, but at the cost of higher computational complexity than the aforementioned methods.
B. Deep-Learning-Based Pan-Sharpening Methods
Recent pan-sharpening methods incorporate various types of CNN structures. Pan-sharpening CNN (PNN) [28] is known as the first CNN-based pan-sharpening method, showing competitive performance compared to conventional methods. PNN adopted the shallow 3-layer structure of SRCNN [7], the first CNN-based super-resolution method. Inspired by the success of ResNet [10] in classification, Yang et al. [43] proposed PanNet, which adopts the ResNet structure as its backbone, where residual connections enable the network to focus on preserving high-frequency details. PanNet applies high-pass filtering to the MS and PAN inputs and uses their edge components as network inputs. This improves generalization, making the network robust to unseen satellite datasets.
By adopting the network architecture of the state-of-the-art SR network, EDSR [22], Lanaras et al. [19] proposed a deep network (DSen2) and a deeper network (VDSen2) for super-resolution of the Sentinel-2 satellite images. DSen2 and VDSen2 are not exactly pan-sharpening methods since they super-resolve the images in 9 lower-resolution bands using the images in 4 higher-resolution bands as guidance. PAN images are not included in the Sentinel-2 dataset. PanNet and DSen2 show top performance in various quantitative metrics, producing PS images with high visual quality. Zhang et al. proposed a bidirectional pyramid network [45] that processes the MS and PAN images in two separate branches, which allows the spatial detail features from the PAN branch to be fused into the spectral information features of the MS branch, finally generating the output pan-sharpened images. This type of feature fusion has improved the preservation of high-frequency spatial information from PAN images.
Recently, Choi et al. proposed the S3 loss [6], which considers the correlation between PAN and MS images. The S3 loss is applied adaptively to image areas according to the correlation values between MS and PAN images, thus reducing the ghosting artifacts around moving objects such as cars on roads. Although the aforementioned deep-learning-based methods have greatly improved performance and visual quality over traditional methods, they share a common limitation: they are trained at lower scales in a supervised manner, resulting in suboptimal PS outputs.
Recently, a few attempts have been made to tackle the drawbacks that come from supervised learning with pseudo ground truth. Ma et al. [26] proposed an unsupervised scheme based on a generative adversarial network with spatial and spectral discriminators. PercepPan [46] adopted an auto-encoder architecture for its unsupervised PS network and utilized a perceptual loss to improve visual quality. Qu et al. incorporated a self-attention mechanism [32] that estimates spatially varying detail extraction and injection functions. Luo et al. also proposed an unsupervised pan-sharpening method [25] with an iterative fusion network. Although these unsupervised PS methods resolve the drawback of training at lower scales, none of them considers the inherent misalignment between MS and PAN inputs.
Proposed Method
As mentioned above, pan-sharpening (PS) is the task of obtaining high-quality PS images from high-resolution (HR) PAN images and their corresponding low-resolution (LR) MS images. The resulting PS images should preserve the high-frequency details of the PAN images and the color information of the MS images as faithfully as possible. To avoid the drawbacks that come from training PS networks with pseudo-ground-truth images, our UPSNet learns pan-sharpening in the original-scale scenario, as shown in Fig. 2(b). Another root cause of the inferior visual quality of previous pan-sharpening methods is misalignment between PAN and MS input pairs. To allow UPSNet to implicitly handle the misalignment between PAN and MS images, which we call “registration learning”, a data preparation step is introduced with a correlation-based alignment between PAN and MS images, which is only used during training. To effectively train our UPSNet, we present two types of loss functions, which allow the network to learn spatial information from PAN inputs and spectral information from MS inputs to produce high-quality PS images. Note that the training of UPSNet is done at the original scales of the PAN and MS images, which is also where testing takes place. To handle the diverse characteristics of PAN and MS images taken from different satellites, we propose a simple but very effective patch-based normalization technique that gives UPSNet generalization capability for PAN-MS images from various satellites. The loss functions, registration method, and normalization are explained in detail in the following subsections.
A. Formulations
In general, satellite imagery datasets include PAN images of higher resolution (smaller GSD), denoted as
B. Unsupervised Learning Framework for Pan-Sharpening
One of the main limitations of previous CNN-based pan-sharpening methods is that the PAN-MS pairs are downscaled to enable supervised learning. These networks are trained only in the lower-scale scenario, so they perform poorly when tested in the original-scale scenario, which is the realistic use case. Since the misalignment between MS and PAN images is more severe at their original scales, networks trained in such a lower-scale scenario cannot appropriately handle PAN and MS input images with larger misalignment.
In contrast, the proposed unsupervised learning framework overcomes this problem, as our network is trained and tested under the same original-scale scenario. The conceptual difference between conventional methods and the proposed framework is depicted in Fig. 2.
Unlike the conventional methods in Fig. 2-(a) for pan-sharpening that are trained under a lower-scale scenario, UPSNet is trained and tested under the same original scale as depicted in Fig. 2-(b). For the training, unlike the lower-scale scenario, the original PAN images are used as targets for a detail loss, and the aligned MS images of the same scale as PAN images are used as targets for a color loss. By doing so, our UPSNet can be trained in the original scale scenario. Here, one of the main points is how to obtain the aligned MS images of the same scale as the PAN and PS images. This will be detailed in the following subsections.
C. Registration
Conventional pan-sharpening methods trained with L1 or L2 loss functions on misaligned datasets tend to produce PS images of inferior visual quality, with double-edge and color-spreading artifacts. To remedy this, it is necessary to use aligned datasets for training pan-sharpening networks. For the alignment between PAN and MS images, we propose a novel correlation-based PAN-MS registration at the PAN scale, which is performed off-line. The resulting MS images have the same size as the PAN images and are aligned to them. It should be noted that the aligned MS images are used as targets in the color loss function during training, not as inputs to the network. In doing so, UPSNet internally learns the registration for misaligned PAN-MS input pairs; that is, the aligned MS image is not required during testing.
Fig. 3 shows the off-line alignment steps. For a given pair of an original PAN image
The details of the searching process are as follows: First, we obtain a grayed MS image (Fig. 3-(b)) where a searching window of size
The above correlation-maximization-based registration involves two hyper-parameters: the searching window size (
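To make the correlation-maximization search concrete, the following sketch aligns an MS image (already upscaled to the PAN grid) to its PAN image by exhaustively testing integer shifts and keeping the one that maximizes the normalized cross-correlation. The search radius, the use of normalized cross-correlation as the similarity measure, and the wrap-around shifting via np.roll are simplifying assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def align_ms_to_pan(pan, ms_up, max_shift=8):
    """Exhaustive integer-shift search that aligns an upscaled MS image to its
    PAN image by maximizing normalized cross-correlation with the grayed MS.
    max_shift is a hypothetical search radius; the paper's window size is not
    specified in this excerpt."""
    gray = ms_up.mean(axis=-1)                  # grayed MS image used for matching
    a = pan - pan.mean()
    best_score, best_shift = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(gray, (dy, dx), axis=(0, 1))
            b = shifted - shifted.mean()
            score = (a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-8)
            if score > best_score:
                best_score, best_shift = score, (dy, dx)
    # apply the best shift to every MS band to obtain the aligned color target
    aligned = np.roll(ms_up, best_shift, axis=(0, 1))
    return aligned, best_shift
```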
It is also worth mentioning the alternative of performing the alignment at the MS scale. In this case, the PS images and the aligned MS images have different resolutions, so the computation of the color loss has two possible options for matching the scale (resolution). The first option is to downscale the PS images to the MS scale by applying a degradation model, which causes the resulting trained PS networks to yield PS outputs with checkerboard artifacts. The second option is to upscale the aligned MS images (aligned at the MS scale) to the same resolution as the PS images. However, this introduces a new misalignment due to the upscaling process, thus degrading the quality of the PS output. Experimental results for these options are provided in Sec. IV-C2.
D. Loss Functions
Previous deep-learning-based methods in supervised learning have applied a degradation model to the input images
1) Detail Loss
We now define a detail loss that minimizes spatial distortions between the network output and the PAN input $\mathbf{P}_{0}$: \begin{equation*} L_{d}=\sum ||d(\mathbf{S}_{0}^{g})-d(\mathbf{P}_{0})||^{1}_{1}\tag{1}\end{equation*}
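As a minimal illustration of Eq. (1), the sketch below evaluates the detail loss with np.gradient standing in for the operator d(·) and a simple channel mean standing in for the grayscale conversion of the PS output; both choices are assumptions made only for this example.

```python
import numpy as np

def detail_loss(ps, pan):
    """L_d of Eq. (1): L1 distance between the gradients of the grayscale PS
    output (S_0^g) and the gradients of the PAN input (P_0). np.gradient and
    the channel-mean grayscale conversion are stand-ins, not the paper's exact
    operators."""
    ps_gray = ps.mean(axis=-1)                  # grayscale version of the PS output
    gy_s, gx_s = np.gradient(ps_gray)
    gy_p, gx_p = np.gradient(pan)
    return np.abs(gy_s - gy_p).mean() + np.abs(gx_s - gx_p).mean()
```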
One of the difficulties in pan-sharpening is the inherent difference in signal characteristics between PAN and MS images. PAN images generally cover a wide range of wavelengths by merging a broad spectrum of visible light into a single-channel image. Therefore, luminance values in MS images differ considerably from those in PAN images. For example, certain objects that appear bright in an MS image (e.g., water) can appear dark in the corresponding PAN image, or vice versa (e.g., trees, grass). When the three bands (R, G, B) of MS images are considered separately, the luminance difference between each band and the PAN image is even larger than that between the grayscale version of the MS image and the PAN image.
This inherent luminance difference between PAN and MS images produces not only dissimilar luminance values but also opposite directions of intensity gradients between them, which hinders deep-learning networks from properly learning the task of pan-sharpening. To solve this, we propose a novel loss function, called a dual-gradient detail loss, which is specially designed to handle such opposite gradient directions. Together with the vanilla detail loss, this loss enforces the PS outputs to have edge details similar to those of the PAN images. Our dual-gradient detail loss is defined as \begin{align*} L_{dg}=&\sum \min (||d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{R})||^{1}_{1}, ||-d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{R})||^{1}_{1}) \\&+\,\min (||d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{G})||^{1}_{1}, ||-d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{G})||^{1}_{1}) \\&+\,\min (||d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{B})||^{1}_{1}, ||-d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{B})||^{1}_{1}),\tag{2}\end{align*}
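The sketch below illustrates one reading of Eq. (2): for each band, the L1 gradient difference is computed against both +d(P_0) and -d(P_0), and the smaller of the two is kept, so a band whose gradients point opposite to the PAN gradients is not penalized. The gradient operator and the level at which the minimum is taken (over aggregated band-wise terms rather than per pixel) are assumptions.

```python
import numpy as np

def dual_gradient_detail_loss(ps, pan):
    """L_dg of Eq. (2): per-band minimum of L1 gradient differences against the
    PAN gradient and its negation. Here the minimum is taken over aggregated
    band-wise terms, which is one possible reading of the summation in Eq. (2)."""
    gy_p, gx_p = np.gradient(pan)
    loss = 0.0
    for c in range(ps.shape[-1]):               # R, G, B bands of the PS output
        gy_c, gx_c = np.gradient(ps[..., c])
        pos = np.abs(gy_c - gy_p).mean() + np.abs(gx_c - gx_p).mean()
        neg = np.abs(gy_c + gy_p).mean() + np.abs(gx_c + gx_p).mean()
        loss += min(pos, neg)
    return loss
```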
2) Color Loss
In addition to the two detail loss functions, we propose a guided-filter-based color loss to impose color similarity between the MS input and the network's PS output. Here we utilize the previously aligned PAN-resolution MS images $\widetilde{\mathbf{M}}_{0}$.
However, in our unsupervised learning setting (original-scale scenario), there exists no ground truth, so the color loss is computed between the network output $\mathbf{S}_{0}$ and the aligned MS target $\widetilde{\mathbf{M}}_{0}$: \begin{equation*} L_{c}=\sum ||GF(\mathbf{S}_{0})-b(\widetilde{\mathbf{M}}_{0})||^{1}_{1}\tag{3}\end{equation*}
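A possible implementation of Eq. (3) is sketched below, using the standard guided filter of He et al. for GF(·) and a box blur as a stand-in for b(·). The guide image (assumed here to be the PAN input), the filter radius, and the exact form of the blur are not specified in this excerpt and are therefore assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(guide, src, radius=8, eps=1e-3):
    """Standard guided filter (He et al.) for a single 2D band; the guide image,
    radius, and eps used in the paper are assumptions here."""
    size = 2 * radius + 1
    box = lambda x: uniform_filter(x, size=size, mode='reflect')
    mean_g, mean_s = box(guide), box(src)
    cov_gs = box(guide * src) - mean_g * mean_s
    var_g = box(guide * guide) - mean_g * mean_g
    a = cov_gs / (var_g + eps)
    b = mean_s - a * mean_g
    return box(a) * guide + box(b)

def color_loss(ps, ms_aligned, pan, radius=8):
    """L_c of Eq. (3): L1 distance between the guided-filtered PS output and a
    blurred version of the aligned MS target, computed per band at PAN resolution."""
    loss = 0.0
    for c in range(ps.shape[-1]):
        gf_ps = guided_filter(pan, ps[..., c], radius)
        blur_ms = uniform_filter(ms_aligned[..., c], size=2 * radius + 1, mode='reflect')
        loss += np.abs(gf_ps - blur_ms).mean()
    return loss
```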
3) Total Loss
The total loss function to train the network is defined as a weighted sum of the aforementioned loss functions, which is given by \begin{equation*} L_{total}=L_{d}+w_{dg}L_{dg}+w_{c}L_{c}\tag{4}\end{equation*}
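Putting the sketches above together, Eq. (4) reduces to a weighted sum. The weight values $w_{dg}$ and $w_{c}$ are not given in this excerpt, so they are left as arguments rather than filled in.

```python
def total_loss(ps, pan, ms_aligned, w_dg, w_c):
    """L_total of Eq. (4), combining the detail, dual-gradient, and color loss
    sketches above. The weight values come from the paper and are not reproduced here."""
    return (detail_loss(ps, pan)
            + w_dg * dual_gradient_detail_loss(ps, pan)
            + w_c * color_loss(ps, ms_aligned, pan))
```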
E. Normalization
Throughout the entire training and test process, the inputs are normalized by the mean and standard deviation values at each pixel, computed within a local patch around that pixel. We conducted extensive experiments with various types of normalization, such as uniform normalization of all images using dataset-wide statistics and global normalization using per-image mean and standard deviation values, but per-patch local normalization has proven to be the most effective.
As mentioned earlier, PAN and MS input images are non-stationary, having various pixel intensity distributions depending on geographical features. Also, pixel intensity distributions can be very different according to satellite sensor types. It is time-consuming and costly to train dedicated PS networks for different satellite datasets. Motivated by this, we propose a simple but effective patch-based normalization technique that allows the network trained on the images acquired by a specific satellite to be well generalized for unseen images of other satellites. Applying our normalization helps maintain the color information of the MS input.
Our proposed normalization downscales the PAN and aligned MS images to the MS scale, and computes the mean and variance values in a local window of size
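As an illustration of the per-pixel local statistics used in the patch-based normalization, the sketch below normalizes a single band by its local mean and standard deviation computed in a square window. The window size is a placeholder, and the description above (computing the statistics at the MS scale and handling the de-normalization of the output) is not reproduced in full.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def patch_normalize(band, win=31, eps=1e-6):
    """Normalize one band by the local mean and standard deviation computed in a
    (win x win) window around each pixel. win is a hypothetical value; the
    paper's window size is not given in this excerpt. Returning the statistics
    allows the normalization to be inverted on the network output."""
    mean = uniform_filter(band, size=win, mode='reflect')
    sq_mean = uniform_filter(band * band, size=win, mode='reflect')
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))
    return (band - mean) / (std + eps), mean, std
```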
F. Network Architecture and Training Details
Our network, UPSNet, consists of 28 residual blocks, each of which has one leaky ReLU (negative slope
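For concreteness, one residual block of the kind described (convolutions with a single leaky ReLU and a skip connection) might look like the sketch below, written with tf.keras since the paper reports a TensorFlow implementation. The channel width, kernel size, leaky-ReLU slope, and the omitted input/output heads are all assumptions, not values taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels=64, slope=0.2):
    """One residual block: conv -> leaky ReLU -> conv, plus a skip connection.
    channels, kernel size, and slope are placeholders, not the paper's values."""
    skip = x
    x = layers.Conv2D(channels, 3, padding='same')(x)
    x = tf.nn.leaky_relu(x, alpha=slope)
    x = layers.Conv2D(channels, 3, padding='same')(x)
    return x + skip

def build_trunk(inp, num_blocks=28, channels=64):
    """Stack the 28 residual blocks mentioned in the text; the layers mapping
    the PAN-MS inputs to features and the features to the PS image are omitted."""
    x = layers.Conv2D(channels, 3, padding='same')(inp)
    for _ in range(num_blocks):
        x = residual_block(x, channels)
    return x
```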
Evaluation
A. Experiment Setting
1) Datasets
We evaluate the performance of UPSNet using two remote sensing image datasets captured by the WorldView-3 (WV3) and KOMPSAT-3A (K3A) satellite sensors. The WorldView-3 satellite provides 0.31 m PAN resolution and 1.24 m MS resolution; the KOMPSAT-3A satellite provides 0.55 m PAN resolution and 2.2 m MS resolution. Both sensors have a resolution ratio of 4 between PAN and MS images. Randomly cropped PAN-MS patch pairs were used to train the networks, with various data augmentations applied on the fly. Each cropped MS image patch has a size of
2) Training
We trained UPSNet using the ADAMW [23] optimization technique with an initial learning rate of $10^{-4}$ and a weight decay of $10^{-7}$. For training the other deep-learning-based PS methods, we followed the training details provided in their original papers. We employed the uniform weight initialization technique in [14] for training. All the networks were implemented using TensorFlow [1], and were trained and tested on NVIDIA TITAN
B. Results and Discussions
1) Quantitative Comparison
a: PS Methods for Comparison
We compare our UPSNet with seven non-deep-learning PS methods, namely Brovey transform [9], affinity PS [37], guided-filtering-based PS [39], intensity-hue-saturation (IHS) PS [5], principal component analysis (PCA) PS [34], P+XS PS [3], and variational PS [8], and five deep-learning-based PS methods, namely PNN [28], PanNet [43], DSen2 [19], and the variants of the latter two trained with the S3 loss [6], called PanNet-S3 and DSen2-S3, respectively. A variant of UPSNet trained without registration learning (UPSNet w/o align) is also evaluated; it uses the bicubic-interpolated original MS images, instead of the aligned MS images, as targets for the guided-filter-based color loss.
b: Lower-Scale Validations
Due to the unavailability of ground-truth pan-sharpened images, we evaluate the performance of UPSNet and the other PS methods under two different settings: lower-scale and full-scale (original-scale) validations. For the lower-scale validation, we use full-reference metrics following Wald's protocol [40]: the downscaled versions of the PAN and MS images are fed as inputs to all the methods under comparison, and the resulting lower-scale PS outputs are compared with their corresponding pseudo-ground-truth original MS images. Four metrics are used for the lower-scale validations: (i) spatial correlation coefficient (SCC) [47]; (ii) erreur relative globale adimensionnelle de synthèse (ERGAS) [24]; (iii) Q index [41]; and (iv) peak signal-to-noise ratio (PSNR).
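For reference, a commonly used formulation of ERGAS [24] for such lower-scale validations is sketched below (lower is better); the exact variant and implementation used in the paper may differ.

```python
import numpy as np

def ergas(pred, ref, ratio=4):
    """ERGAS: 100/ratio * sqrt(mean over bands of (RMSE_k / mean_k)^2), where
    ratio is the PAN/MS resolution ratio (4 for WorldView-3 and KOMPSAT-3A)."""
    terms = []
    for k in range(ref.shape[-1]):
        rmse = np.sqrt(np.mean((pred[..., k] - ref[..., k]) ** 2))
        terms.append((rmse / (ref[..., k].mean() + 1e-12)) ** 2)
    return 100.0 / ratio * np.sqrt(np.mean(terms))
```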
c: Full-Scale Validations
For the full-scale validation, SCC is measured between the original PAN inputs and the grayscale versions of the PS output images. The SCC values measured at full scale indicate how well a pan-sharpening method maintains the sharpness of the input PAN images in the PS outputs. We also measure the quality with no reference (QNR) [2], a no-reference metric for pan-sharpening, and the joint quality measure (JQM) [30], another no-reference metric that is known to coincide better with the perceived visual quality of PS output images than QNR.
d: Misalignment Issue Between PAN and MS Images
In general, PAN and MS images are misaligned due to the inevitable acquisition time difference and mosaicked sensor arrays. However, none of the above seven metrics for the lower- and full-scale validations considers this inherent misalignment. UPSNet, on the other hand, is designed to correct the inherent misalignment by aligning the color (MS) of an object with the object's details (PAN), so it produces output PS images in which the colors and shapes of objects are very well aligned. In this case, it is important to note that directly measuring the spectral distortion of the PS output with respect to the color of the original MS input is meaningless for the aligned PS output, because the colors in the PS output generated by UPSNet are moved (aligned) to match the shapes (details). Therefore, in addition to the conventional direct measures with respect to the original MS inputs, we also measure the distortions with respect to the aligned MS images created by the alignment method in Section III-C for a fair and meaningful comparison.
e: Analysis for Experimental Results
Tables 1 and 2 show the average metric scores for 100 randomly chosen test image pairs from the WorldView-3 dataset measured with respect to the original MS input without alignment and with the aligned MS image, respectively.
2) Qualitative Comparison
Figs. 6 and 7 show visual comparisons of our UPSNet against the previous state-of-the-art methods. It is clearly shown in Figs. 6-(p) and 7-(p) that the PS output of UPSNet well preserves the high-frequency details of the PAN inputs and closely matches the color information of the MS inputs, with minimal distortions. The effectiveness of the registration (alignment) learning of our UPSNet can be clearly seen around the pool area in Fig. 6-(h). Since the pool is located slightly up and to the right in the MS image (Fig. 6-(b)) compared to the PAN image (Fig. 6-(a)), most of the previous SOTA PS methods show artifacts (the color of the water is shifted slightly up and to the right relative to the shape of the pool) due to this misalignment, whereas the output PS image of UPSNet shows no such artifacts. UPSNet also produces the color most similar to the original MS image, especially for the water in the pool (Fig. 6-(h)). The effectiveness of the registration learning is even more evident in Fig. 7-(h). As can be seen in Fig. 7-(a) and (b), the color of the orange roof in the MS image is placed slightly above the shape of the roof in the PAN image. UPSNet is the only method able to fuse the colors of the orange roof from the MS image with their appropriate shapes in the corresponding PAN image. More visual comparisons are provided in Figs. 13 and 14.
Visual comparison of various pan-sharpening methods including their QNR and JQM values.
Visual comparison of various pan-sharpening methods including their QNR and JQM values.
Qualitative comparison between our proposed UPSNet and its variants trained with registration in MS scale for a cropped region of an image ‘AOI_2_Vegas_Roads_Test_public_img161.tif’ in WorldView-3 dataset.
Qualitative comparison between our proposed UPSNet and other SOTA methods for a cropped region of an image ‘AOI_3_Shanghai_Bldg_Test_public_img2434.tif’ in the WorldView-3 dataset.
Qualitative comparison between our proposed UPSNet and other SOTA methods for a cropped region of an image ‘AOI_2_Khartoum_Bldg_Test_public_img1522.tif’ in the WorldView-3 dataset.
3) Considerations for No-Reference Metrics: QNR and JQM
In this paper, we have utilized two full-scale no-reference metrics, QNR and JQM. However, several previous works have pointed out drawbacks and unexpected properties of QNR [16], [30], [40], especially when perfect alignment between the MS and PAN images is not assured. Since PAN and MS images in the WorldView-3 dataset are not well aligned, it can be expected that QNR values do not agree well with the observed visual quality.
We have intensively investigated this discrepancy between the QNR metric and the subjective quality of PS outputs. Figs. 8 and 9 show visual comparisons of PS outputs obtained by various pan-sharpening methods. Although the PS output images of PNN, PanNet, and DSen2 exhibit relatively higher QNR scores than those of PanNet-S3, DSen2-S3, and UPSNet, their perceived visual qualities are much worse, showing severe ghost artifacts in Fig. 8 and misalignment between colors and shapes (details) in Fig. 9. It is also worth pointing out that the PS output of UPSNet in Fig. 9 shows the best visual quality but has the lowest QNR value.
To remedy this problem, we additionally adopted another metric (JQM), which is known to agree better with the perceived visual quality of PS images [32]. As shown in Figs. 8 and 9, the JQM values agree very well with the perceived visual quality of the PS outputs. In contrast to the QNR metric, PNN, PanNet, and DSen2 exhibit relatively lower JQM scores than PanNet-S3, DSen2-S3, and UPSNet in Figs. 8 and 9. In both figures, the PS outputs from our UPSNet yield the highest JQM scores, coinciding with the perceived visual quality. The PS outputs of DSen2-S3 and PanNet-S3 are ranked second and third in terms of JQM values, which agrees well with their perceived visual qualities.
The discrepancy between QNR and perceived visual quality comes from the fact that QNR does not directly reflect the spectral and spatial distortions in its calculation form [2]. The spectral distortion term (
C. Ablation Studies
Ablation studies were conducted in several different settings to show the effectiveness of the key components of our proposed UPSNet. Throughout these experiments, only one component is changed at a time while the others remain the same. Evaluation of the different models is conducted at full scale, using the original MS and PAN images as network inputs. We measure two criteria on the output PS images: high-frequency detail similarity with the PAN images (SCC) and color similarity with the MS images (ERGAS). ERGAS is measured between the aligned MS images and the PS output images, which we denote as ERGAS-A.
1) Learning Framework
First, we provide ablation study results on learning framework including unsupervised learning, training in original scales, and alignment. Experiment conditions are as follows.
Condition 1 trains on lower scales using our unsupervised framework and tests on original scales. Condition 2 trains without alignment, using the bicubic-interpolated original MS image as the target for the color loss. In Condition 3, we train UPSNet in a supervised manner, similarly to PanNet [43] and DSen2 [19], where each training pair of PAN and MS images is downscaled by a factor of 4 and the original MS input is used as pseudo ground truth. The network in Condition 3 is regularized by an L1 loss between the output PS images and the original MS inputs to match the settings of PanNet [43] and DSen2 [19].
As shown in Table 3, all three conditions entail substantial performance drops in terms of all metrics. Fig. 10 shows the visual comparison for Conditions 1, 2, and 3. Due to the scale mismatch between training and testing and the absence of alignment between MS and PAN images, the results in Fig. 10-(b), (c), and (d) clearly suffer from misaligned colors, especially in the areas indicated by the red arrows. As can be seen in Table 3, UPSNet trained in a supervised manner shows a substantial performance drop, especially in terms of SCC. Fig. 10-(d) clearly shows that supervised training at lower scales causes inferior visual quality, also introducing artifacts in homogeneous regions.
2) Registration Scale
In Sec. III-C, we have discussed other possible alignment options to perform the registration step in the MS scale. The aligned MS, the output of the registration step, is only used as a target for the proposed guided-filter-based color loss and has the same size as the PAN image, as explained in Sec. III-C. Then, the PS output images from UPSNet and their corresponding aligned MS images are compared by the guided-filter-based color loss function without any scale conversion. However, when the registration is performed in the MS scale, aligned MS images would have the same size as input MS images. Therefore there exists a scale mismatch between the PS images and the corresponding aligned MS images. In this case, the computation of color loss can have two possible options for matching the scale (resolution).
The first option is to downscale the PS images to the MS scale by applying a degradation model, and the second option is to upscale the aligned MS images (aligned in the MS scale) to have the same resolution as their corresponding PS images. Since both options require scale conversion, a new type of misalignment is introduced inevitably during the scale matching process.
Table 4 provides the quantitative results for UPSNet and its variants trained under the two options mentioned above. The ERGAS-A and SCC scores both degrade under the two options. Fig. 11 shows the artifacts introduced by the scale conversion. UPSNet can effectively handle the misalignment between the PAN and MS images, especially on the moving cars, but the variants of UPSNet that involve scale conversion (Fig. 11-(c), (d)) fail because they cannot properly learn to handle the misalignment. Overall, the experimental results show that registration at the PAN scale yields the best pan-sharpening performance from both quantitative and qualitative perspectives.
3) Loss Functions
In this section, we discuss the effectiveness of the proposed loss functions. Two loss functions have been newly proposed to train our UPSNet: a guided-filter-based color loss ($L_{c}$) and a dual-gradient detail loss ($L_{dg}$).
Ablation studies were conducted under two conditions to show the effectiveness of the proposed loss functions. Condition 1 trains the network without the dual-gradient detail loss. Condition 2 applies a Gaussian blur kernel instead of the guided filter used for the color loss. We denote the Gaussian blur kernel-based color loss as
Table 5 shows the average ERGAS-A and SCC scores. Performance drops are observed for both Condition 1 and Condition 2, showing that the proposed loss functions are essential for training our UPSNet. Fig. 12 shows the visual comparison for the ablation study on the loss functions. Condition 1 produces reasonable visual quality overall, but it tends to over-enhance local contrast, introducing artifacts in the output PS images. Condition 2 introduces rainbow-like artifacts in all images. Both the quantitative and qualitative experiments demonstrate the effectiveness of our proposed loss functions in training UPSNet.
4) Cross-Dataset Experiment
Cross-dataset experiments were conducted to show the generalization capability of our UPSNet. Each pan-sharpening network is trained and tested in four different settings using the datasets acquired from two different satellites, KOMPSAT-3A (K3A) and WorldView-3, as described in Tables 6, 7, 8, and 9. The upward and downward arrows
Conclusion
In this work, we proposed an effective unsupervised learning framework with registration learning for pan-sharpening, called UPSNet. To resolve the misalignment between PAN and MS images, we first proposed a simple correlation-based PAN-MS registration that obtains an aligned, PAN-resolution MS target from each misaligned PAN-MS input pair. The aligned MS target is then used as the target of the color loss, forcing the network to learn how to handle the misalignment between PAN and MS images. It should be noted that this registration is required only for training, not for testing. Additionally, we designed two loss functions for training our network: a guided-filter-based color loss between the network's PS outputs and our aligned MS targets, and a dual-gradient detail loss between the network's PS outputs and the PAN inputs. Extensive experimental results show that our UPSNet generates pan-sharpened images with remarkable improvements in terms of color similarity and texture details compared to state-of-the-art pan-sharpening methods.