Introduction
With the advent of deep learning, many deep-learning-based methods have been proposed to solve various image restoration problems, e.g., super-resolution [7], [18], [20], [22], [35], showing state-of-the-art performance in terms of reconstruction quality. Likewise, deep learning has recently seen growing use in satellite imagery research. Satellite images capture diverse scenes around the world, and research areas for satellite imagery include prediction of forest growth, classification of crops, buildings and roads, environmental monitoring, and many other applications. To achieve high performance on such problems, it is essential to obtain high-quality, high-resolution satellite image datasets. However, due to the constraints of intrinsic satellite sensor resolutions and transmission bandwidths, most satellites acquire multi-spectral images with varying resolutions for the same geographical regions. In general, satellite imagery consists of pairs of low-resolution (LR) multi-spectral (MS) images of a larger ground sample distance (GSD) and high-resolution (HR) panchromatic (PAN) images of a smaller GSD. Pan-sharpening, or pan-colorization, is the task of generating pan-sharpened (PS) multi-spectral images that have the same spatial resolution as the PAN images, by fusing the high-frequency details from the PAN images with the color information from the MS images. Fig. 1 shows an example pair of PAN and MS images along with the PS results from various pan-sharpening approaches, including the proposed method.
Recently, several pan-sharpening methods based on convolutional neural networks (CNN) have been proposed [4], [6], [12], [15], [19], [21], [28], [33], [38], [43], [44]. These methods rely on supervised learning (Fig. 2-a), which requires a degradation model to prepare a training dataset of PAN-MS pairs. For this, the original PAN-MS pairs are degraded (down-scaled) to LR PAN-MS pairs, which are then used as inputs to the networks, and the original MS images serve as pseudo ground truth for training. In doing so, the networks are trained to produce PS images at the scale of the original MS inputs, i.e., under a lower-scale scenario. When these networks are then tested under the original-scale scenario, where the PS outputs must be at the scale of the input PAN images, they perform poorly. To overcome this scale (resolution) mismatch between training and testing, we propose an effective unsupervised learning framework for pan-sharpening, where no ground truth is required for training. This enables the network to be trained and tested at the same scale, resulting in better visual quality.
Comparison between two different learning frameworks: (a) conventional supervised learning framework for pan-sharpening; (b) unsupervised learning framework.
Since ground truth data are not available in pan-sharpening, conventional supervised PS methods have had no choice but to rely on the lower-scale scenario. These methods optimize their PS outputs with a mean absolute error (MAE) or mean squared error (MSE) loss against the pseudo-ground-truth MS images. In our unsupervised PS framework (Fig. 2-b), where no ground truth image is required, we design two novel loss functions so that our UPSNet can effectively learn the high-frequency details from PAN inputs and the color information from MS inputs in the original-scale scenario without any pseudo ground truth: one is a dual-gradient detail loss between network outputs and PAN inputs; the other is a guided-filter-based color loss between network outputs and our aligned MS targets.
One of the main difficulties of the pan-sharpening task is misalignment between PAN and MS image pairs. PAN and MS images are often misaligned by several pixels due to inherent limitations of satellite sensor arrays and acquisition time differences. A misaligned training dataset often leads to undesired artifacts in pan-sharpened results, such as double edges and color spreading. To remedy this problem, we incorporate a preprocessing step, used only during training, in which each MS image is registered to its corresponding PAN image by maximizing their correlation. The aligned MS images are not used as inputs to the network but as targets for the color loss. By doing so, our UPSNet learns to implicitly match the high-frequency information from PAN inputs and the color information from misaligned MS inputs during training, without any dedicated registration module. The trained UPSNet can then properly handle misaligned PAN-MS input pairs during testing. As shown in Fig. 1, the structures and colors of objects in the UPSNet output are better aligned than in the other five methods. We can also observe that the pan-sharpened image produced by UPSNet has the color most similar to that of the input MS image while preserving the strong edges of the corresponding PAN image.
Furthermore, we found that a patch-based normalization can effectively deal with non-stationary PAN and MS input images whose pixel intensity distributions vary with geographical features, which often leads to color distortion in the pan-sharpened results. Similar to batch normalization [13], this reduces the internal covariate shift and enables faster and more stable training of the network, which can result in higher performance. In addition, applying local normalization helps preserve the color information of the MS input. This allows a network trained on images acquired by a specific satellite to generalize well to unseen images from other satellites. Our contributions can be summarized as follows:
We propose a novel unsupervised learning framework for pan-sharpening where our proposed UPSNet can achieve state-of-the-art performance for most metrics and shows significantly better visual quality when tested on the original scale.
Two novel loss functions for pan-sharpening are proposed, which effectively fuse the high-frequency details from PAN images and the color information from MS images: a dual-gradient detail loss and a guided-filter-based color loss. The dual-gradient detail loss appropriately handles the different signal characteristics of PAN and MS images, so that UPSNet can effectively learn the details of PAN images. The guided-filter-based color loss allows UPSNet to effectively learn the color information from aligned and upscaled target MS images.
With a preprocessing step of correlation-based alignment between PAN and MS images only for training, UPSNet can be trained to implicitly handle the inherent misalignment between PAN and MS input images without the preprocessing step in testing.
We propose a simple yet very effective patch-based normalization technique that boosts the generalization capability of our UPSNet for PAN-MS images from various satellites.
Related Works
A. Traditional Pan-Sharpening Methods
Before the advent of deep learning, pan-sharpening algorithms were based on component substitution, multiresolution analysis, or model learning. Component substitution methods [5], [9], [17], [34], [42] apply a spectral transformation to an interpolated MS input and replace its spatial (intensity) component with a modified PAN image. Multiresolution-analysis-based methods [27], [36] fuse the high-frequency details of PAN images into up-sampled MS input images. To extract such high-frequency components, wavelet or undecimated decomposition techniques are utilized, and the decomposed components are injected into the interpolated MS input images to form pan-sharpened images. These methods have relatively low computational complexity but tend to produce results with mismatched spectral information and artifacts because they do not consider the local properties of MS and PAN images. Model-learning-based methods [11], [29], [31] learn pan-sharpening models with the help of regularization terms. In these methods, pan-sharpening is formulated as an ill-posed problem, where a model is optimized to generate an output image that maximizes a similarity metric between the output and the target pan-sharpened image. These methods tend to produce pan-sharpened images of better quality with well-preserved spectral information, but at the cost of higher computational complexity than the aforementioned methods.
B. Deep-Learning-Based Pan-Sharpening Methods
Recent pan-sharpening methods incorporate various types of CNN structures. Pan-sharpening CNN (PNN) [28] is known as the first CNN-based pan-sharpening method, showing competitive performance compared to conventional methods. PNN adopted the shallow 3-layer structure of SRCNN [7], the first CNN-based super-resolution method. Inspired by the success of ResNet [10] in classification, Yang et al. [43] proposed PanNet, which adopts the ResNet structure as its backbone, where residual connections enable the network to focus on preserving high-frequency details. PanNet applies high-pass filtering to the MS and PAN inputs and uses their edge components as network inputs. This improves generalization, making the network robust to unseen satellite datasets.
By adopting the network architecture of the state-of-the-art SR network, EDSR [22], Lanaras et al. [19] proposed a deep network (DSen2) and a deeper network (VDSen2) for super-resolution of the Sentinel-2 satellite images. DSen2 and VDSen2 are not exactly pan-sharpening methods since they super-resolve the images in 9 lower-resolution bands using the images in 4 higher-resolution bands as guidance. PAN images are not included in the Sentinel-2 dataset. PanNet and DSen2 show top performance in various quantitative metrics, producing PS images with high visual quality. Zhang et al. proposed a bidirectional pyramid network [45] that processes the MS and PAN images in two separate branches, which allows the spatial detail features from the PAN branch to be fused into the spectral information features of the MS branch, finally generating the output pan-sharpened images. This type of feature fusion has improved the preservation of high-frequency spatial information from PAN images.
Recently, Choi et al. proposed the S3 loss [6], which considers the correlation between PAN and MS images. The S3 loss is applied adaptively to image areas according to the correlation values between MS and PAN images, thus reducing the ghosting artifacts around moving objects such as cars on roads. Although the aforementioned deep-learning-based methods have greatly improved performance and visual quality over traditional methods, they share a common limitation: they are trained at lower scales in a supervised manner, resulting in suboptimal PS outputs.
Recently, a few attempts have been made to tackle the drawbacks that come from supervised learning with pseudo ground truth. Ma et al. [26] proposed an unsupervised scheme based on a generative adversarial network with spatial and spectral discriminators. PercepPan [46] adopted an auto-encoder architecture for its unsupervised PS network and utilized a perceptual loss to improve visual quality. Qu et al. incorporated a self-attention mechanism [32] that estimates spatially varying detail extraction and injection functions. Luo et al. also proposed an unsupervised pan-sharpening method [25] with an iterative fusion network. Although these unsupervised PS methods resolve the drawback of training at lower scales, none of them considers the inherent misalignment between MS and PAN inputs.
Proposed Method
As mentioned above, pan-sharpening (PS) is the task of obtaining high-quality PS images from high-resolution (HR) PAN images and their corresponding low-resolution (LR) MS images. The resulting PS images should preserve the high-frequency details of the PAN images and the color information of the MS images as faithfully as possible. To avoid the drawbacks that come from training PS networks with pseudo-ground-truth images, our UPSNet learns pan-sharpening in the original-scale scenario, as shown in Fig. 2(b). Another root cause of the inferior visual quality of previous pan-sharpening methods is misalignment between PAN and MS input pairs. To allow UPSNet to implicitly handle the misalignment between PAN and MS images, which we call “registration learning”, a data preparation step is introduced with a correlation-based alignment between PAN and MS images, which is only used during training. To effectively train our UPSNet, we present two types of loss functions, which allow the network to learn spatial information from PAN inputs and spectral information from MS inputs to produce high-quality PS images. Note that the training of UPSNet is done at the original scales of the PAN and MS images, which is also where testing takes place. To handle the diverse characteristics of PAN and MS images taken from different satellites, we propose a simple but very effective patch-based normalization technique that gives UPSNet generalization capability for PAN-MS images from various satellites. The loss functions, registration method, and normalization are explained in detail in the following subsections.
A. Formulations
In general, satellite imagery datasets include PAN images of higher resolution (smaller GSD), denoted as
B. Unsupervised Learning Framework for Pan-Sharpening
One of the main limitations of previous CNN-based pan-sharpening methods is that the PAN-MS pairs are downscaled to enable supervised learning. These networks are trained only in the lower-scale scenario, so they perform poorly when tested in the original-scale scenario, which is the realistic use case. Since the misalignment between MS and PAN images is more severe at their original scales, networks trained in such a lower-scale scenario cannot appropriately handle PAN and MS input images with larger misalignment.
In contrast, the proposed unsupervised learning framework overcomes this problem, as our network is trained and tested under the same original-scale scenario. The conceptual difference between conventional methods and the proposed framework is depicted in Fig. 2.
Unlike the conventional methods in Fig. 2-(a) for pan-sharpening that are trained under a lower-scale scenario, UPSNet is trained and tested under the same original scale as depicted in Fig. 2-(b). For the training, unlike the lower-scale scenario, the original PAN images are used as targets for a detail loss, and the aligned MS images of the same scale as PAN images are used as targets for a color loss. By doing so, our UPSNet can be trained in the original scale scenario. Here, one of the main points is how to obtain the aligned MS images of the same scale as the PAN and PS images. This will be detailed in the following subsections.
C. Registration
Conventional pan-sharpening methods trained with L1 or L2 loss functions on misaligned datasets tend to produce PS images of inferior visual quality, with double-edge and color-spreading artifacts. To remedy this, it is necessary to use aligned datasets for training pan-sharpening networks. For the alignment between PAN and MS images, we propose a novel correlation-based PAN-MS registration at the PAN scale, which is performed off-line. The resulting MS images have the same size as the PAN images and are aligned to them. It should be noted that the aligned MS images are used as targets in the color loss function during training, not as inputs to the network. In doing so, UPSNet internally learns the registration for misaligned PAN-MS input pairs; that is, the aligned MS image is not required during testing.
Fig. 3 shows the off-line alignment steps. For a given pair of an original PAN image
The details of the searching process are as follows: First, we obtain a grayed MS image (Fig. 3-(b)) where a searching window of size
The above correlation-maximization-based registration involves two hyper-parameters: the searching window size (
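To make the correlation-maximization search concrete, the following sketch aligns an MS image (already upscaled to the PAN grid) to its PAN image by exhaustively testing integer shifts and keeping the one that maximizes the normalized cross-correlation. The search radius, the use of normalized cross-correlation as the similarity measure, and the wrap-around shifting via np.roll are simplifying assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def align_ms_to_pan(pan, ms_up, max_shift=8):
    """Exhaustive integer-shift search that aligns an upscaled MS image to its
    PAN image by maximizing normalized cross-correlation with the grayed MS.
    max_shift is a hypothetical search radius; the paper's window size is not
    specified in this excerpt."""
    gray = ms_up.mean(axis=-1)                  # grayed MS image used for matching
    a = pan - pan.mean()
    best_score, best_shift = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(gray, (dy, dx), axis=(0, 1))
            b = shifted - shifted.mean()
            score = (a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-8)
            if score > best_score:
                best_score, best_shift = score, (dy, dx)
    # apply the best shift to every MS band to obtain the aligned color target
    aligned = np.roll(ms_up, best_shift, axis=(0, 1))
    return aligned, best_shift
```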
It is also worth mentioning the alternative of performing the alignment at the MS scale. In this case, the PS images and the aligned MS images have different resolutions, so the computation of the color loss has two possible options for matching the scale (resolution). The first option is to downscale the PS images to the MS scale by applying a degradation model, which causes the resulting trained PS networks to yield PS outputs with checkerboard artifacts. The second option is to upscale the aligned MS images (aligned at the MS scale) to the same resolution as the PS images. However, this introduces a new misalignment due to the upscaling process, thus degrading the quality of the PS output. Experimental results for these options are provided in Sec. IV-C2.
D. Loss Functions
Previous deep-learning-based methods in supervised learning have applied a degradation model to the input images
1) Detail Loss
We now define a detail loss that minimizes spatial distortions between the network output and the PAN input $\mathbf{P}_{0}$: \begin{equation*} L_{d}=\sum ||d(\mathbf{S}_{0}^{g})-d(\mathbf{P}_{0})||^{1}_{1}\tag{1}\end{equation*}
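As a minimal illustration of Eq. (1), the sketch below evaluates the detail loss with np.gradient standing in for the operator d(·) and a simple channel mean standing in for the grayscale conversion of the PS output; both choices are assumptions made only for this example.

```python
import numpy as np

def detail_loss(ps, pan):
    """L_d of Eq. (1): L1 distance between the gradients of the grayscale PS
    output (S_0^g) and the gradients of the PAN input (P_0). np.gradient and
    the channel-mean grayscale conversion are stand-ins, not the paper's exact
    operators."""
    ps_gray = ps.mean(axis=-1)                  # grayscale version of the PS output
    gy_s, gx_s = np.gradient(ps_gray)
    gy_p, gx_p = np.gradient(pan)
    return np.abs(gy_s - gy_p).mean() + np.abs(gx_s - gx_p).mean()
```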
One of the difficulties in pan-sharpening is the inherent difference in signal characteristics between PAN and MS images. PAN images generally cover a wide range of wavelengths by merging a broad spectrum of visible light into a single-channel image. Therefore, luminance values in MS images differ considerably from those in PAN images. For example, certain objects that appear bright in an MS image (e.g., water) can appear dark in the corresponding PAN image, or vice versa (e.g., trees, grass). When the three bands (R, G, B) of MS images are considered separately, the luminance difference between each band and the PAN image is even larger than that between the grayscale version of the MS image and the PAN image.
This inherent luminance difference between PAN and MS images produces not only dissimilar luminance values but also opposite directions of intensity gradients between them, which hinders deep-learning networks from properly learning the task of pan-sharpening. To solve this, we propose a novel loss function, called a dual-gradient detail loss, which is specially designed to handle such opposite gradient directions. Together with the vanilla detail loss, this loss enforces the PS outputs to have edge details similar to those of the PAN images. Our dual-gradient detail loss is defined as \begin{align*} L_{dg}=&\sum \min (||d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{R})||^{1}_{1}, ||-d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{R})||^{1}_{1}) \\&+\,\min (||d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{G})||^{1}_{1}, ||-d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{G})||^{1}_{1}) \\&+\,\min (||d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{B})||^{1}_{1}, ||-d(\mathbf{P}_{0})-d(\mathbf{S}_{0}^{B})||^{1}_{1}),\tag{2}\end{align*}
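The sketch below illustrates one reading of Eq. (2): for each band, the L1 gradient difference is computed against both +d(P_0) and -d(P_0), and the smaller of the two is kept, so a band whose gradients point opposite to the PAN gradients is not penalized. The gradient operator and the level at which the minimum is taken (over aggregated band-wise terms rather than per pixel) are assumptions.

```python
import numpy as np

def dual_gradient_detail_loss(ps, pan):
    """L_dg of Eq. (2): per-band minimum of L1 gradient differences against the
    PAN gradient and its negation. Here the minimum is taken over aggregated
    band-wise terms, which is one possible reading of the summation in Eq. (2)."""
    gy_p, gx_p = np.gradient(pan)
    loss = 0.0
    for c in range(ps.shape[-1]):               # R, G, B bands of the PS output
        gy_c, gx_c = np.gradient(ps[..., c])
        pos = np.abs(gy_c - gy_p).mean() + np.abs(gx_c - gx_p).mean()
        neg = np.abs(gy_c + gy_p).mean() + np.abs(gx_c + gx_p).mean()
        loss += min(pos, neg)
    return loss
```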
2) Color Loss
In addition to the two detail loss functions, we propose a guided-filter-based color loss to impose color similarity between the MS input and the network's PS output. Here we utilize the previously aligned PAN-resolution MS images $\widetilde{\mathbf{M}}_{0}$.
However, in our unsupervised learning setting (original-scale scenario), there exists no ground truth, so the color loss is computed between the network output $\mathbf{S}_{0}$ and the aligned MS target $\widetilde{\mathbf{M}}_{0}$: \begin{equation*} L_{c}=\sum ||GF(\mathbf{S}_{0})-b(\widetilde{\mathbf{M}}_{0})||^{1}_{1}\tag{3}\end{equation*}
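A possible implementation of Eq. (3) is sketched below, using the standard guided filter of He et al. for GF(·) and a box blur as a stand-in for b(·). The guide image (assumed here to be the PAN input), the filter radius, and the exact form of the blur are not specified in this excerpt and are therefore assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(guide, src, radius=8, eps=1e-3):
    """Standard guided filter (He et al.) for a single 2D band; the guide image,
    radius, and eps used in the paper are assumptions here."""
    size = 2 * radius + 1
    box = lambda x: uniform_filter(x, size=size, mode='reflect')
    mean_g, mean_s = box(guide), box(src)
    cov_gs = box(guide * src) - mean_g * mean_s
    var_g = box(guide * guide) - mean_g * mean_g
    a = cov_gs / (var_g + eps)
    b = mean_s - a * mean_g
    return box(a) * guide + box(b)

def color_loss(ps, ms_aligned, pan, radius=8):
    """L_c of Eq. (3): L1 distance between the guided-filtered PS output and a
    blurred version of the aligned MS target, computed per band at PAN resolution."""
    loss = 0.0
    for c in range(ps.shape[-1]):
        gf_ps = guided_filter(pan, ps[..., c], radius)
        blur_ms = uniform_filter(ms_aligned[..., c], size=2 * radius + 1, mode='reflect')
        loss += np.abs(gf_ps - blur_ms).mean()
    return loss
```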
3) Total Loss
The total loss function to train the network is defined as a weighted sum of the aforementioned loss functions, which is given by \begin{equation*} L_{total}=L_{d}+w_{dg}L_{dg}+w_{c}L_{c}\tag{4}\end{equation*}
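Putting the sketches above together, Eq. (4) reduces to a weighted sum. The weight values $w_{dg}$ and $w_{c}$ are not given in this excerpt, so they are left as arguments rather than filled in.

```python
def total_loss(ps, pan, ms_aligned, w_dg, w_c):
    """L_total of Eq. (4), combining the detail, dual-gradient, and color loss
    sketches above. The weight values come from the paper and are not reproduced here."""
    return (detail_loss(ps, pan)
            + w_dg * dual_gradient_detail_loss(ps, pan)
            + w_c * color_loss(ps, ms_aligned, pan))
```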
E. Normalization
Throughout the entire training and test process, the inputs are normalized by the mean and standard deviation values at each pixel, computed within a local patch around that pixel. We conducted extensive experiments with various types of normalization, such as uniform normalization of all images using dataset-wide statistics and global normalization using per-image mean and standard deviation values, but per-patch local normalization has proven to be the most effective.
As mentioned earlier, PAN and MS input images are non-stationary, having various pixel intensity distributions depending on geographical features. Also, pixel intensity distributions can be very different according to satellite sensor types. It is time-consuming and costly to train dedicated PS networks for different satellite datasets. Motivated by this, we propose a simple but effective patch-based normalization technique that allows the network trained on the images acquired by a specific satellite to be well generalized for unseen images of other satellites. Applying our normalization helps maintain the color information of the MS input.
Our proposed normalization downscales the PAN and aligned MS images to the MS scale, and computes the mean and variance values in a local window of size
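As an illustration of the per-pixel local statistics used in the patch-based normalization, the sketch below normalizes a single band by its local mean and standard deviation computed in a square window. The window size is a placeholder, and the description above (computing the statistics at the MS scale and handling the de-normalization of the output) is not reproduced in full.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def patch_normalize(band, win=31, eps=1e-6):
    """Normalize one band by the local mean and standard deviation computed in a
    (win x win) window around each pixel. win is a hypothetical value; the
    paper's window size is not given in this excerpt. Returning the statistics
    allows the normalization to be inverted on the network output."""
    mean = uniform_filter(band, size=win, mode='reflect')
    sq_mean = uniform_filter(band * band, size=win, mode='reflect')
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))
    return (band - mean) / (std + eps), mean, std
```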
F. Network Architecture and Training Details
Our network, UPSNet, consists of 28 residual blocks, each of which has one leaky ReLU (negative slope
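For concreteness, one residual block of the kind described (convolutions with a single leaky ReLU and a skip connection) might look like the sketch below, written with tf.keras since the paper reports a TensorFlow implementation. The channel width, kernel size, leaky-ReLU slope, and the omitted input/output heads are all assumptions, not values taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels=64, slope=0.2):
    """One residual block: conv -> leaky ReLU -> conv, plus a skip connection.
    channels, kernel size, and slope are placeholders, not the paper's values."""
    skip = x
    x = layers.Conv2D(channels, 3, padding='same')(x)
    x = tf.nn.leaky_relu(x, alpha=slope)
    x = layers.Conv2D(channels, 3, padding='same')(x)
    return x + skip

def build_trunk(inp, num_blocks=28, channels=64):
    """Stack the 28 residual blocks mentioned in the text; the layers mapping
    the PAN-MS inputs to features and the features to the PS image are omitted."""
    x = layers.Conv2D(channels, 3, padding='same')(inp)
    for _ in range(num_blocks):
        x = residual_block(x, channels)
    return x
```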
Evaluation
A. Experiment Setting
1) Datasets
We evaluate the performance of UPSNet using two remote sensing image datasets captured by the WorldView-3 (WV3) and KOMPSAT-3A (K3A) satellite sensors. The WorldView-3 satellite provides 0.31 m PAN resolution and 1.24 m MS resolution; the KOMPSAT-3A satellite provides 0.55 m PAN resolution and 2.2 m MS resolution. Both sensors have a resolution ratio of 4 between PAN and MS images. Randomly cropped PAN-MS patch pairs were used to train the networks, with various data augmentations applied on the fly. Each cropped MS image patch has a size of
2) Training
We trained UPSNet using the ADAMW [23] optimization technique with an initial learning rate of $10^{-4}$ and a weight decay of $10^{-7}$. For training the other deep-learning-based PS methods, we followed the training details provided in their original papers. We employed the uniform weight initialization technique in [14] for training. All the networks were implemented using TensorFlow [1], and were trained and tested on NVIDIA TITAN
B. Results and Discussions
1) Quantitative Comparison
a: PS Methods for Comparison
We compare our UPSNet with seven non-deep-learning PS methods, namely Brovey transform [9], affinity PS [37], guided-filtering-based PS [39], intensity-hue-saturation (IHS) PS [5], principal component analysis (PCA) PS [34], P+XS PS [3], and variational PS [8], and five deep-learning-based PS methods, namely PNN [28], PanNet [43], DSen2 [19], and the variants of the latter two trained with the S3 loss [6], called PanNet-S3 and DSen2-S3, respectively. A variant of UPSNet trained without registration learning (UPSNet w/o align) is also evaluated; it uses the bicubic-interpolated original MS images, instead of the aligned MS images, as targets for the guided-filter-based color loss.
b: Lower-Scale Validations
Due to the unavailability of ground-truth pan-sharpened images, we evaluate the performance of UPSNet and the other PS methods under two different settings: lower-scale and full-scale (original-scale) validations. For the lower-scale validation, we use full-reference metrics following Wald's protocol [40]: the downscaled versions of the PAN and MS images are fed as inputs to all the methods under comparison, and the resulting lower-scale PS outputs are compared with their corresponding pseudo-ground-truth original MS images. Four metrics are used for the lower-scale validations: (i) spatial correlation coefficient (SCC) [47]; (ii) erreur relative globale adimensionnelle de synthèse (ERGAS) [24]; (iii) Q index [41]; and (iv) peak signal-to-noise ratio (PSNR).
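For reference, a commonly used formulation of ERGAS [24] for such lower-scale validations is sketched below (lower is better); the exact variant and implementation used in the paper may differ.

```python
import numpy as np

def ergas(pred, ref, ratio=4):
    """ERGAS: 100/ratio * sqrt(mean over bands of (RMSE_k / mean_k)^2), where
    ratio is the PAN/MS resolution ratio (4 for WorldView-3 and KOMPSAT-3A)."""
    terms = []
    for k in range(ref.shape[-1]):
        rmse = np.sqrt(np.mean((pred[..., k] - ref[..., k]) ** 2))
        terms.append((rmse / (ref[..., k].mean() + 1e-12)) ** 2)
    return 100.0 / ratio * np.sqrt(np.mean(terms))
```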
c: Full-Scale Validations
For the full-scale validation, SCC is measured between the original PAN inputs and the grayscale versions of the PS output images. The SCC values measured at full scale indicate how well a pan-sharpening method maintains the sharpness of the input PAN images in the PS outputs. We also measure the quality with no reference (QNR) [2], a no-reference metric for pan-sharpening, and the joint quality measure (JQM) [30], another no-reference metric that is known to coincide better with the perceived visual quality of PS output images than QNR.
d: Misalignment Issue Between PAN and MS Images
In general, PAN and MS images are misaligned due to the inevitable acquisition time difference and mosaicked sensor arrays. However, none of the above seven metrics for the lower- and full-scale validations considers this inherent misalignment. UPSNet, on the other hand, is designed to correct the inherent misalignment by aligning the color (MS) of an object with the object's details (PAN), so it produces output PS images in which the colors and shapes of objects are very well aligned. In this case, it is important to note that directly measuring the spectral distortion of the PS output with respect to the color of the original MS input is meaningless for the aligned PS output, because the colors in the PS output generated by UPSNet are moved (aligned) to match the shapes (details). Therefore, in addition to the conventional direct measures with respect to the original MS inputs, we also measure the distortions with respect to the aligned MS images created by the alignment method in Section III-C for a fair and meaningful comparison.
e: Analysis for Experimental Results
Tables 1 and 2 show the average metric scores for 100 randomly chosen test image pairs from the WorldView-3 dataset measured with respect to the original MS input without alignment and with the aligned MS image, respectively.
2) Qualitative Comparison
Figs. 6 and 7 show visual comparisons of our UPSNet against the previous state-of-the-art methods. It is clearly shown in Figs. 6-(p) and 7-(p) that the PS output of UPSNet well preserves the high-frequency details of the PAN inputs and closely matches the color information of the MS inputs, with minimal distortions. The effectiveness of the registration (alignment) learning of our UPSNet can be clearly seen around the pool area in Fig. 6-(h). Since the pool is located slightly up and to the right in the MS image (Fig. 6-(b)) compared to the PAN image (Fig. 6-(a)), most of the previous SOTA PS methods show artifacts (the color of the water is shifted slightly up and to the right relative to the shape of the pool) due to this misalignment, whereas the output PS image of UPSNet shows no such artifacts. UPSNet also produces the color most similar to the original MS image, especially for the water in the pool (Fig. 6-(h)). The effectiveness of the registration learning is even more evident in Fig. 7-(h). As can be seen in Fig. 7-(a) and (b), the color of the orange roof in the MS image is placed slightly above the shape of the roof in the PAN image. UPSNet is the only method able to fuse the colors of the orange roof from the MS image with their appropriate shapes in the corresponding PAN image. More visual comparisons are provided in Figs. 13 and 14.
Visual comparison of various pan-sharpening methods including their QNR and JQM values.
Visual comparison of various pan-sharpening methods including their QNR and JQM values.
Qualitative comparison between our proposed UPSNet and its variants trained with registration in MS scale for a cropped region of an image ‘AOI_2_Vegas_Roads_Test_public_img161.tif’ in WorldView-3 dataset.
Qualitative comparison between our proposed UPSNet and other SOTA methods for a cropped region of an image ‘AOI_3_Shanghai_Bldg_Test_public_img2434.tif’ in the WorldView-3 dataset.
Qualitative comparison between our proposed UPSNet and other SOTA methods for a cropped region of an image ‘AOI_2_Khartoum_Bldg_Test_public_img1522.tif’ in the WorldView-3 dataset.
3) Considerations for No-Reference Metrics: QNR and JQM
In this paper, we have utilized two full-scale no-reference metrics, QNR and JQM. However, several previous works have pointed out drawbacks and unexpected properties of QNR [16], [30], [40], especially when perfect alignment between the MS and PAN images is not assured. Since PAN and MS images in the WorldView-3 dataset are not well aligned, it can be expected that QNR values do not agree well with the observed visual quality.
We have intensively investigated this discrepancy between the QNR metric and the subjective quality of PS outputs. Figs. 8 and 9 show visual comparisons of PS outputs obtained by various pan-sharpening methods. Although the PS output images of PNN, PanNet, and DSen2 exhibit relatively higher QNR scores than those of PanNet-S3, DSen2-S3, and UPSNet, their perceived visual qualities are much worse, showing severe ghost artifacts in Fig. 8 and misalignment between colors and shapes (details) in Fig. 9. It is also worth pointing out that the PS output of UPSNet in Fig. 9 shows the best visual quality but has the lowest QNR value.
To remedy this problem, we additionally adopted another metric (JQM), which is known to agree better with the perceived visual quality of PS images [32]. As shown in Figs. 8 and 9, the JQM values agree very well with the perceived visual quality of the PS outputs. In contrast to the QNR metric, PNN, PanNet, and DSen2 exhibit relatively lower JQM scores than PanNet-S3, DSen2-S3, and UPSNet in Figs. 8 and 9. In both figures, the PS outputs from our UPSNet yield the highest JQM scores, coinciding with the perceived visual quality. The PS outputs of DSen2-S3 and PanNet-S3 are ranked second and third in terms of JQM values, which agrees well with their perceived visual qualities.
The discrepancy between QNR and perceived visual quality comes from the fact that QNR does not directly reflect the spectral and spatial distortions in its calculation form [2]. The spectral distortion term (
C. Ablation Studies
Ablation studies were conducted in several different settings to show the effectiveness of the key components of our proposed UPSNet. Throughout these experiments, only one component is changed at a time while the others remain the same. Evaluation of the different models is conducted at full scale, using the original MS and PAN images as network inputs. We measure two criteria on the output PS images: high-frequency detail similarity with the PAN images (SCC) and color similarity with the MS images (ERGAS). ERGAS is measured between the aligned MS images and the PS output images, which we denote as ERGAS-A.
1) Learning Framework
First, we provide ablation study results on learning framework including unsupervised learning, training in original scales, and alignment. Experiment conditions are as follows.
Condition 1 trains on lower scales using our unsupervised framework and tests on original scales. Condition 2 trains without alignment, using the bicubic-interpolated original MS image as the target for the color loss. In Condition 3, we train UPSNet in a supervised manner, similarly to PanNet [43] and DSen2 [19], where each training pair of PAN and MS images is downscaled by a factor of 4 and the original MS input is used as pseudo ground truth. The network in Condition 3 is regularized by an L1 loss between the output PS images and the original MS inputs to match the settings of PanNet [43] and DSen2 [19].
As shown in Table 3, all three conditions entail substantial performance drops in terms of all metrics. Fig. 10 shows the visual comparison for Conditions 1, 2, and 3. Due to the scale mismatch between training and testing and the absence of alignment between MS and PAN images, the results in Fig. 10-(b), (c), and (d) clearly suffer from misaligned colors, especially in the areas indicated by the red arrows. As can be seen in Table 3, UPSNet trained in a supervised manner shows a substantial performance drop, especially in terms of SCC. Fig. 10-(d) clearly shows that supervised training at lower scales causes inferior visual quality, also introducing artifacts in homogeneous regions.
2) Registration Scale
In Sec. III-C, we have discussed other possible alignment options to perform the registration step in the MS scale. The aligned MS, the output of the registration step, is only used as a target for the proposed guided-filter-based color loss and has the same size as the PAN image, as explained in Sec. III-C. Then, the PS output images from UPSNet and their corresponding aligned MS images are compared by the guided-filter-based color loss function without any scale conversion. However, when the registration is performed in the MS scale, aligned MS images would have the same size as input MS images. Therefore there exists a scale mismatch between the PS images and the corresponding aligned MS images. In this case, the computation of color loss can have two possible options for matching the scale (resolution).
The first option is to downscale the PS images to the MS scale by applying a degradation model, and the second option is to upscale the aligned MS images (aligned in the MS scale) to have the same resolution as their corresponding PS images. Since both options require scale conversion, a new type of misalignment is introduced inevitably during the scale matching process.
Table 4 provides the quantitative results for UPSNet and its variants trained under the two options mentioned above. The ERGAS-A and SCC scores both degrade under the two options. Fig. 11 shows the artifacts introduced by the scale conversion. UPSNet can effectively handle the misalignment between the PAN and MS images, especially on the moving cars, but the variants of UPSNet that involve scale conversion (Fig. 11-(c), (d)) fail because they cannot properly learn to handle the misalignment. Overall, the experimental results show that registration at the PAN scale yields the best pan-sharpening performance from both quantitative and qualitative perspectives.
3) Loss Functions
In this section, we discuss the effectiveness of the proposed loss functions. Two loss functions have been newly proposed to train our UPSNet: a guided-filter-based color loss ($L_{c}$) and a dual-gradient detail loss ($L_{dg}$).
Ablation studies were conducted under two conditions to show the effectiveness of the proposed loss functions. Condition 1 trains the network without the dual-gradient detail loss. Condition 2 applies a Gaussian blur kernel instead of the guided filter used for the color loss. We denote the Gaussian blur kernel-based color loss as
Table 5 shows the average ERGAS-A and SCC scores. Performance drops are observed for both Condition 1 and Condition 2, showing that the proposed loss functions are essential for training our UPSNet. Fig. 12 shows the visual comparison for the ablation study on the loss functions. Condition 1 produces reasonable visual quality overall, but it tends to over-enhance local contrast, introducing artifacts in the output PS images. Condition 2 introduces rainbow-like artifacts in all images. Both the quantitative and qualitative experiments demonstrate the effectiveness of our proposed loss functions in training UPSNet.
4) Cross-Dataset Experiment
Cross-dataset experiments were conducted to show the generalization capability of our UPSNet. Each pan-sharpening network is trained and tested in four different settings using the datasets acquired from two different satellites, KOMPSAT-3A (K3A) and WorldView-3, as described in Tables 6, 7, 8, and 9. The upward and downward arrows
Conclusion
In this work, we proposed an effective unsupervised learning framework with registration learning for pan-sharpening, called UPSNet. To resolve the misalignment between PAN and MS images, we first proposed a simple correlation-based PAN-MS registration that obtains an aligned, PAN-resolution MS target from each misaligned PAN-MS input pair. The aligned MS target is then used as the target of the color loss, forcing the network to learn how to handle the misalignment between PAN and MS images. It should be noted that this registration is required only for training, not for testing. Additionally, we designed two loss functions for training our network: a guided-filter-based color loss between the network's PS outputs and our aligned MS targets, and a dual-gradient detail loss between the network's PS outputs and the PAN inputs. Extensive experimental results show that our UPSNet generates pan-sharpened images with remarkable improvements in terms of color similarity and texture details compared to state-of-the-art pan-sharpening methods.