Introduction
Nowadays, numerous remote sensing images are acquired to monitor agriculture, forestry, oceans, land use, environmental protection, and meteorology [1]. Most earth observation satellites provide two kinds of images: single-band panchromatic (PAN) images with high spatial resolution and multispectral (MS) images with higher spectral resolution but lower spatial resolution, a tradeoff imposed by limits on the image signal-to-noise ratio (SNR) and on data storage and transmission. Naturally, techniques for fusing PAN and MS images have been proposed and developed. This technology, known as pansharpening, integrates the complementary spatial and spectral information of PAN and MS images, respectively, to obtain high spatial resolution MS images. Fused images with both high spectral and spatial resolution lead to better results in subsequent tasks, such as image classification and object detection [2].
In early research, many traditional pansharpening methods were proposed, and most of them can be grouped into three categories.
1) Methods based on component substitution (CS) [3] attempt to transform MS images and PAN images into a new space in which the structural component of MS images can be substituted by PAN images to achieve spatial information injection. Representative attempts include principal component analysis (PCA) [4], intensity-hue-saturation (IHS) [5], and Gram–Schmidt adaptive (GSA) transform [6].
2) Multiresolution-analysis-based methods utilize the high frequencies of PAN images to restore the spatial details in MS images. To extract this high-frequency information from PAN images, various transform algorithms are applied, such as the Laplacian pyramid transform [7], discrete wavelet transform (DWT) [8], and support value transform [9].
3) Model-based methods [10] treat pansharpening as the inverse of a degradation process in which the ideal high-resolution multispectral (HRMS) image degrades into a PAN image and a low-resolution multispectral (LRMS) image. One typical example is the band-dependent spatial detail (BDSD) method [11].
In the past decade, deep learning approaches, especially convolutional neural networks (CNNs), have achieved excellent performance in various fields, including computer vision and image processing [13]. Pioneering methods have applied CNNs to the pansharpening task; typical examples include PNN [14], PanNet [15], PSGAN [16], RED-cGAN [17], and TFNet [18]. These supervised learning methods use an end-to-end network to learn the pansharpening process and achieve desirable performance with high spatial resolution and few spectral distortions. However, two major problems remain in most CNN-based methods. First, most networks rely on supervised learning, with training data generated following Wald's protocol [19]: the MS images are spatially blurred and downsampled to obtain the LRMS inputs, and the original MS images are treated as ground truth. These operations may not be consistent with the degradation processes encountered in real scenes. Second, these schemes do not effectively utilize the rich spatial information of PAN images [20] and ignore the relation between MS and PAN images.
To address these problems, we propose a novel unsupervised pansharpening network based on a two-stream CNN architecture with two learnable degradation processes, dubbed LDP-Net. Pansharpening can be regarded as a superresolution or deblurring problem [21] with an additional PAN image: the goal is to restore the spatial details from the PAN image while maintaining the spectral information of the LRMS image. Owing to the lack of ground truth, we model the inverse of pansharpening as two degradation processes: one uses a spectral response function to transform the HRMS image into a single grayed image similar to the PAN image, and the other models a spatial blurring operation, with a blurring kernel, from the HRMS image to an upsampled LRMS image. In the proposed LDP-Net, we adopt two CNN modules to learn these degradation processes. Moreover, according to the relation between MS and PAN images, we propose a new loss function that effectively constrains both spatial and spectral information. Furthermore, a KL divergence loss is proposed to keep the distributions of the differences between the MS and PAN images similar at two resolutions, which, to our knowledge, has not been explored before. As a result, our model achieves desirable performance: the predicted HRMS image preserves the high spatial resolution of the PAN image and the rich spectral information of the LRMS image under unsupervised conditions. The main contributions of this article are summarized as follows.
1) An unsupervised pansharpening model is proposed based on a two-stream end-to-end network that is trained without supervised labels. The hyperparameters of the model can be easily tuned in the training phase.
2) Different from models with hand-specified degradation operators, our proposed model learns the degradation processes in a data-driven manner.
3) A novel hybrid loss function consisting of three parts is proposed. The first two parts maintain the spatial and spectral consistency between the inputs and the predicted HRMS image at two different resolutions. The third part constrains the differences between the MS and PAN images at different resolutions to have similar distributions.
4) Extensive experiments on different remote sensing datasets demonstrate the effectiveness and robustness of our method compared with several state-of-the-art methods in both qualitative and quantitative terms.
The rest of this article is organized as follows. Section II reviews related work on pansharpening. Section III introduces the framework of the proposed unsupervised model and the loss function for training without labels. Section IV presents extensive experiments comparing our method with several representative traditional, supervised, and unsupervised learning based approaches. Finally, Section V concludes this article.
Related Works
Numerous pansharpening methods have emerged in recent decades. This section briefly reviews them, including classic approaches, supervised learning based approaches, and unsupervised learning based approaches.
A. Classic Methods
Traditional pansharpening methods can be roughly classified into three categories. First, early pansharpening studies focused on CS: some components of the upsampled LRMS image are substituted by the corresponding components of the PAN image in a specific transform domain. The spectral and spatial information are separated using a simple and fast transformation, such as IHS [5], the principal components transform [22], and the GSA transform [6]. Moreover, Dou et al. [23] proposed a general framework that implements these CS-based methods systematically. Such methods effectively achieve high spatial resolution but may cause spectral distortions in the pansharpened results. The second category is multiresolution-analysis-based methods, which apply multiscale decomposition techniques to inject the high-frequency information of the PAN image into the upsampled LRMS image. The high-frequency spatial information is usually extracted by transform algorithms such as the wavelet transform [24], Laplacian pyramid transform [7], curvelet transform [25], and contourlet transform [4]. Although these methods improve spectral fidelity, they may also cause aliasing distortion and blurring of spatial details. The third type is model-based methods. For instance, Garzelli et al. [11] presented two linear injection models, the single spatial detail (SSD) model and the BDSD model, and optimized them by minimizing the squared error between the original MS image and the pansharpened result. Another pansharpening model, proposed by Wright, achieved fast image fusion with a Markov random field [26]. In addition, Guo et al. [27] adopted an online coupled dictionary learning approach to model the relation between LRMS and PAN images, reducing spectral distortion and restoring spatial details. Recently, Guo et al. [28] developed a new posterior probability model based on Bayesian theory to achieve better spectral and spatial fusion.
B. Supervised Learning Based Approaches
These deep learning methods design CNN-based networks driven by large quantities of paired training data and achieve better performance than traditional methods. Motivated by the superresolution convolutional neural network (SRCNN) model [29], Masi et al. [14] first proposed a three-layer CNN named PNN tailored to the characteristics of remote sensing images. Later, Yang et al. [15] directly added the upsampled LRMS image to the output of the network to maintain spectral consistency and used the high-pass components of the PAN and LRMS images as the network inputs to restore spatial details. However, introducing only high-frequency information and superimposing the upsampled LRMS image on the result can cause a blurring effect and make the training difficult to converge. Scarpa et al. [30] adopted a target-adaptive usage modality so that a lightweight network can be applied to different remote sensing sensors. The deep residual pansharpening neural network (DRPNN) model [31] introduced residual learning to form a very deep convolutional network, further improving pansharpening performance. He et al. [32] introduced a new detail injection strategy into CNN-based pansharpening methods. Subsequently, Deng et al. [33] exploited a detail injection-based network aided by the difference between the PAN image and the upsampled LRMS image. Recently, Liu et al. [18] incorporated residual learning into a two-stream CNN architecture to fuse the features extracted from both MS and PAN images. Zhang et al. [34] designed a triple-double network with a level-domain-based loss function to fully exploit the spatial details of the PAN image. Jin et al. [35] utilized a Laplacian pyramid network to recover crucial spatial information at multiple scales. Moreover, several generative adversarial network (GAN)-based methods employ a discriminator to distinguish the generated images from the ground-truth images. In PSGAN [16], the authors first attempted to produce high-quality pansharpened images with GANs, designing a two-stream fusion architecture as the generator and a fully convolutional network as the discriminator. In RED-cGAN [17], a residual encoder–decoder conditional GAN was proposed to produce sharper images with more details. However, as mentioned above, these methods require HRMS images for supervised learning and still suffer from spectral distortions or blurring effects.
C. Unsupervised Learning Based Approaches
To address the unreality of simulated data and bridge the gap between classic and supervised learning based approaches, several unsupervised learning based approaches have been developed. Ma et al. [20] achieved unsupervised pansharpening using one generator and two discriminators designed to distinguish the spatial and spectral characteristics of generated and real images, respectively. Zhou et al. [36] then combined a generative multiadversarial network with a no-reference loss function to improve unsupervised pansharpening. Motivated by priors about downsampling and blurring, further methods have been developed. For instance, a deep learning prior based on spatial downsampling with blurring was used to construct the loss function for image fusion in [37]; the authors embedded the semantic features extracted from the guidance PAN image by an encoder–decoder network into another deep decoder to generate the output image. Similarly, Luo et al. [38] designed an iterative network architecture with a PAN-guided strategy and a set of skip connections to continuously extract and fuse features from the input and then used a fixed unidimensional Gaussian kernel to obtain a blurred version of the fused HRMS image. However, these prior-based methods are limited by their handcrafted training data and cannot be effectively applied to real scenes.
In this article, we propose an unsupervised learning model based on a two-stream CNN network incorporated with two learnable degradation modules that can be adaptive to complex simulated and real situations. Moreover, we specifically design a hybrid spectral loss to effectively maintain spectral consistency between the output and input LRMS images.
Method
A. Problem Formulation and Framework
Unsupervised pansharpening aims to obtain the pansharpened HRMS image by fusing the LRMS image and the high-resolution (HR) PAN image without any ground truth. We denote the LRMS image by $m$, its upsampled version by $\uparrow m$, the HR PAN image by $\widetilde{P}$, the ideal HRMS image by $M$, and the pansharpened result by $\widehat{M}$.
Our proposed LDP-Net is based on a two-stream encoder–decoder fusion network. As shown in Fig. 1, the network mainly consists of several modules, including the feature extraction block (FEB), dense encoder–decoder block (DEDB), reconstruction block (REC), graying block (GB), and reblurring block (RB). First, we interpolate the LRMS image $m$ to the size of the PAN image to obtain $\uparrow m$; then, $\uparrow m$ and the PAN image $\widetilde{P}$ are fed into the two-stream network to predict the HRMS image
\begin{equation*}
\widehat{M} = f\left(\uparrow m,\widetilde{P};\Theta \right) \tag{1}
\end{equation*}
where $f(\cdot;\Theta)$ denotes the two-stream fusion network with trainable parameters $\Theta$.
Overview of the proposed LDP-Net for pansharpening. FEB denotes the feature extraction block. DEDB denotes the dense encoder–decoder block. RB and GB represent the reblurring block and graying block, respectively. REC stands for the reconstruction block.
Since HRMS images are not available as labels, to achieve unsupervised learning, we model two degradation processes: the degradation from the ideal HRMS image $M$ to the PAN image (a graying process) and the degradation from $M$ to the upsampled LRMS image (a blurring process). They can be written as
\begin{equation*}
P = \sum \limits _{i = 1}^{C} {{\alpha _{i}}{M_{i}}} \tag{2}
\end{equation*}
where $M_i$ denotes the $i$th band of the HRMS image, $\alpha_i$ is the corresponding spectral response weight, and $C$ is the number of bands, and
\begin{equation*}
\uparrow m = k * M \tag{3}
\end{equation*}
where $k$ is a blurring kernel and $*$ denotes convolution.
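To make the two degradation models concrete, the following sketch simulates (2) as a per-band weighted sum and (3) as a per-band convolution with a blur kernel. The band weights, the kernel, and the tensor sizes are illustrative assumptions, not the quantities learned by LDP-Net.

```python
import torch
import torch.nn.functional as F

def gray_degradation(hrms, alpha):
    """Eq. (2): weighted sum of the C spectral bands yields a PAN-like image.
    hrms: (B, C, H, W); alpha: (C,) assumed spectral-response weights."""
    return (hrms * alpha.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)

def blur_degradation(hrms, kernel):
    """Eq. (3): depthwise convolution of every band with a blur kernel k."""
    c = hrms.shape[1]
    weight = kernel.expand(c, 1, -1, -1).contiguous()   # one copy of k per band
    pad = kernel.shape[-1] // 2
    return F.conv2d(hrms, weight, padding=pad, groups=c)

# Toy example with assumed values.
hrms = torch.rand(1, 4, 64, 64)                          # 4-band "ideal" HRMS
alpha = torch.full((4,), 0.25)                           # assumed equal band weights
gauss = torch.tensor([[1., 2., 1.],
                      [2., 4., 2.],
                      [1., 2., 1.]]) / 16.0              # assumed 3x3 blur kernel
pan_like = gray_degradation(hrms, alpha)                 # (1, 1, 64, 64), cf. Eq. (2)
up_m_like = blur_degradation(hrms, gauss)                # (1, 4, 64, 64), cf. Eq. (3)
```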
B. Loss Function
Given the upsampled LRMS image $\uparrow m$, the PAN image $\widetilde{P}$, and the predicted HRMS image $\widehat{M}$, we apply the graying block $G(\cdot)$ and the reblurring block $B(\cdot)$ to obtain the degraded versions used in the loss terms:
\begin{align*}
{\widehat{M}_{gray}} &= G\left({\widehat{M}} \right) \tag{4}
\\
{\widehat{M}_{blur}} &= B\left({\widehat{M}} \right) \tag{5}
\\
\uparrow {m_{gray}} &= G\left({ \uparrow m} \right) \tag{6}
\\
{\widetilde{P}_{blur}} &= B\left({\widetilde{P}} \right) \tag{7}
\end{align*}
1) Spatial Loss
The degradation relationship between the MS image and PAN image can be used to restore the high-resolution spatial information of the output HRMS image. Thus, the spatial loss of our method, which can be divided into spatial constraints at both low and high resolutions, is defined as
\begin{equation*}
{L_{spatial}} = \left\Vert {{{\widetilde{P}}_{blur}} - \uparrow {m_{gray}}} \right\Vert _{2}^{2}+ \delta * \left\Vert {\widetilde{P} - {{\widehat{M}}_{gray}}} \right\Vert _{2}^{2} \tag{8}
\end{equation*}
where $\delta$ is a weighting factor balancing the constraints at the two resolutions.
2) Spectral Loss
Another degradation between the HRMS image and the upsampled LRMS image can be regarded as the blurring operation, which can be used to maintain the spectral consistency between the output HRMS image and the input upsampled LRMS image at different resolutions. Then, similar to (8), the spectral loss is defined as
\begin{equation*}
{L_{spectral}} = \left\Vert { \uparrow m - {{\widehat{M}}_{blur}}} \right\Vert _{2}^{2} + \gamma * \left\Vert {m - \downarrow \widehat{M}} \right\Vert _{2}^{2} \tag{9}
\end{equation*}
where $\downarrow \widehat{M}$ denotes the spatially downsampled version of $\widehat{M}$ and $\gamma$ is a weighting factor.
3) Spectral KL Divergence Loss
On the other hand, we consider the inverse process of graying degradation and note that the spectral information of MS images in different spectral bands should follow a specific pattern. The difference between the MS image and PAN image at different resolutions should have similar distributions. Based on this consideration, we use the softmax function to transform the residual terms into a form of probability distribution. Then, the spectral Kullback–Leibler (KL) divergence loss is added to constrain the distribution of the residual terms at different resolutions, which is formulated as follows:
\begin{equation*}
{L_{KL}} = KL\left(p\left({x_{low}}\right)\,\|\,q\left(x\right)\right) \tag{10}
\end{equation*}
where $p(x_{low})$ and $q(x)$ denote the softmax-normalized distributions of the residuals between the MS and PAN images at the low and high resolutions, respectively.
In summary, we utilize the spatial loss and spectral loss to simultaneously restore the spatial details and preserve the spectral information of the inputs. Moreover, an additional spectral KL divergence loss further adjusts the spectral quality. Finally, our proposed unsupervised model is trained by minimizing the following loss function:
\begin{equation*}
L = \alpha {L_{spatial}} + \beta {L_{spectral}} + \mu {L_{KL}} \tag{11}
\end{equation*}
where $\alpha$, $\beta$, and $\mu$ are weighting hyperparameters.
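For reference, the sketch below shows one possible PyTorch implementation of the terms in (8)–(11), using mean-squared errors and a channel-wise softmax followed by a KL divergence. The weighting factors and the exact construction of the residual terms in (10) are assumptions for illustration rather than the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def spatial_loss(pan, pan_blur, up_m_gray, hrms_gray, delta=1.0):
    """Eq. (8): spatial consistency at low and high resolution."""
    low_res = F.mse_loss(pan_blur, up_m_gray)      # || P~_blur - (up m)_gray ||^2
    high_res = F.mse_loss(pan, hrms_gray)          # || P~ - M^_gray ||^2
    return low_res + delta * high_res

def spectral_loss(up_m, hrms_blur, m, hrms_down, gamma=1.0):
    """Eq. (9): spectral consistency at high and low resolution."""
    high_res = F.mse_loss(up_m, hrms_blur)         # || up m - M^_blur ||^2
    low_res = F.mse_loss(m, hrms_down)             # || m - (down) M^ ||^2
    return high_res + gamma * low_res

def kl_loss(residual_low, residual_high):
    """Eq. (10): softmax each residual over the channel dimension to form a
    distribution, then penalise KL(p(x_low) || q(x))."""
    p = F.softmax(residual_low.flatten(2), dim=1)          # target distribution
    log_q = F.log_softmax(residual_high.flatten(2), dim=1)
    return F.kl_div(log_q, p, reduction="batchmean")

def total_loss(l_spatial, l_spectral, l_kl, alpha=1.0, beta=1.0, mu=0.1):
    """Eq. (11): weighted combination; the weights here are assumptions."""
    return alpha * l_spatial + beta * l_spectral + mu * l_kl
```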
C. Network Architecture
As mentioned in Section III-A, several CNN-based blocks are designed to implement the proposed framework, including FEB, DEDB, GB, and RB. Specifically, FEB extracts shallow features from the upsampled LRMS image and the HR PAN image for the subsequent fusion step. Thus, given $\uparrow m$ and $\widetilde{P}$, the shallow features are obtained as
\begin{equation*}
F_{m} = {f_{FEB}} (\uparrow m) \tag{12}
\end{equation*}
\begin{equation*}
F_{p} = {f_{FEB}}(\widetilde{P}) \tag{13}
\end{equation*}
Structure of (a) FEB, (b) DEDB, (c) GB, (d) RB, and (e) REC, where k3n128s1 denotes a convolution layer with a 3 × 3 kernel size, 128 channels, and stride 1.
The role of DEDB is to learn more high-level features and fuse sufficient spatial and spectral information. As shown in Fig. 2(b), we adopt four convolutional layers with dense connections to enhance the fusion and inference abilities. Then, the fused features are fed into a deconvolutional layer for upsampling before concatenation with the two residual connections. To reconstruct the output HRMS image, we use a reconstruction block (REC) that consists of two convolutional layers followed by a ReLU activation layer as demonstrated in Fig. 2(e).
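A minimal sketch of the DEDB and REC blocks described above is given below. The channel width (128, following the k3n128s1 convention of Fig. 2), the 2× upsampling factor of the deconvolution, and the intermediate widths are assumptions; the concatenation with the two residual connections is assumed to happen outside these modules.

```python
import torch
import torch.nn as nn

class DEDB(nn.Module):
    """Dense encoder-decoder block: four 3x3 convolutions with dense
    connections, followed by a deconvolution for upsampling (sketch)."""
    def __init__(self, channels=128):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels * (i + 1), channels, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(4)
        )
        self.deconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))  # dense connections
        return self.deconv(feats[-1])                     # upsample fused features

class REC(nn.Module):
    """Reconstruction block: two convolution layers followed by a ReLU,
    as described in the text; the intermediate width is an assumption."""
    def __init__(self, in_channels=128, out_channels=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1),
            nn.Conv2d(64, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)
```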
GB and RB are vital parts of our proposed unsupervised model. Taking the output HRMS image or the upsampled LRMS image as the input, GB is implemented with the aid of a channel attention mechanism, as shown in Fig. 2(c). First, we adopt two convolutional layers to transform the input into weight features and use global average pooling (GAP) and fully connected layers to obtain the channel weight vector, which is used to simulate the graying process. Finally, we obtain the stacked output by copying the grayed result along the channel dimension. RB is implemented with a single convolution layer to simulate the spatial degradation, as illustrated in Fig. 2(d). These modules are jointly optimized to adaptively learn the degradation in the training phase.
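The sketch below illustrates one way to realize GB (channel attention producing per-band graying weights, with the grayed image stacked along the channel dimension) and RB (a single convolution). The layer widths, the kernel size, and the softmax normalization of the band weights are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GrayingBlock(nn.Module):
    """GB: channel attention estimates per-band graying weights; the weighted
    bands are summed into one grayed image, which is then stacked."""
    def __init__(self, channels=4, mid=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 3, padding=1),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)                 # global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Softmax(dim=1),
        )
        self.channels = channels

    def forward(self, ms):
        w = self.fc(self.gap(self.features(ms)).flatten(1))   # (B, C) band weights
        gray = (ms * w.unsqueeze(-1).unsqueeze(-1)).sum(1, keepdim=True)
        return gray.repeat(1, self.channels, 1, 1)             # stacked output

class ReblurringBlock(nn.Module):
    """RB: a single convolution layer that learns the spatial blurring."""
    def __init__(self, channels=4, kernel_size=5):
        super().__init__()
        self.blur = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, hrms):
        return self.blur(hrms)
```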
Experiments and Evaluations
A. Experimental Setup
1) Datasets and Metrics
To evaluate the performance of the proposed method, we conduct experiments on three datasets: GaoFen-2 (GF-2), WorldView-2 (WV-2), and WorldView-3 (WV-3). The spatial resolutions of the MS and PAN images are 3.2 m and 0.8 m for GF-2, 1.84 m and 0.46 m for WV-2, and 1.2 m and 0.31 m for WV-3, respectively. GF-2 provides four MS bands, while the latter two satellites provide eight bands. We produced the training data following Wald's protocol [19], cropping the PAN and upsampled LRMS images into patch pairs of size 256 × 256 for the training phase. In addition, pairs of size 512 × 512 were selected for the reduced-resolution and full-resolution test experiments. The partitions of the datasets are listed in Table I.
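As a sketch of how reduced-resolution training pairs can be generated under Wald's protocol, the snippet below blurs and decimates the original MS and PAN images by the PAN/MS resolution ratio of 4 and keeps the original MS image as the reference; the binomial filter is an assumed stand-in for the sensors' modulation transfer functions.

```python
import torch
import torch.nn.functional as F

def wald_reduce(ms, pan, ratio=4):
    """Generate reduced-resolution training pairs following Wald's protocol:
    blur and decimate both inputs by `ratio` and keep the original MS image
    as the reference. The filter choice is an assumption."""
    def blur_down(x):
        k = torch.tensor([[1., 4., 6., 4., 1.]])
        k = (k.t() @ k) / 256.0                           # separable binomial kernel
        k = k.expand(x.shape[1], 1, -1, -1).contiguous()  # one copy per band
        x = F.conv2d(x, k, padding=2, groups=x.shape[1])
        return x[..., ::ratio, ::ratio]                   # decimation
    lrms = blur_down(ms)                                  # simulated LRMS input
    pan_lr = blur_down(pan)                               # simulated PAN input
    return lrms, pan_lr, ms                               # inputs + reference
```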
The performance of the different methods in the reduced-resolution and full-resolution experiments is evaluated with different quantitative metrics. In reduced-resolution testing, four widely used reference-based metrics are adopted, namely, the spectral angle mapper (SAM) [42], spatial correlation coefficient (SCC) [43], relative global synthesis error (ERGAS) [44], and the 4-band extension of the universal image quality index (Q4) [45], while the quality with no reference (QNR) [46] and its spectral distortion index $D_\lambda$ and spatial distortion index $D_s$ are used in full-resolution testing.
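As an example of the reference-based metrics, a compact SAM implementation is sketched below (angle in degrees, averaged over all pixels); the remaining metrics follow their cited definitions [42]–[46].

```python
import torch

def sam(pred, ref, eps=1e-8):
    """Spectral angle mapper: mean angle (in degrees) between the spectral
    vectors of the fused image and the reference at every pixel.
    pred, ref: (B, C, H, W) tensors of the same shape."""
    dot = (pred * ref).sum(dim=1)
    norms = pred.norm(dim=1) * ref.norm(dim=1)
    cos = (dot / (norms + eps)).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()
```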
2) Implementation Details
No postprocessing operations were applied to the output HRMS image. The network was trained for approximately 50 epochs. The Adam optimizer [47] was used to minimize the loss function, with an initial learning rate of 1e
3) Comparison Methods
In our experiments, we compared the proposed LDP-Net with several state-of-the-art methods, including PCA [4], IHS [5], Brovey [48], GS [49], BDSD [11], additive wavelet luminance proportional (AWLP) [50], PNN [14], DiCNN [32], PanNet [15], DMDNet [51], FusionNet [33], PGMAN [36], and Pan-GAN [20]. The first six are traditional methods; PNN, DiCNN, PanNet, DMDNet, and FusionNet are supervised learning based methods; and Pan-GAN and PGMAN are recently proposed unsupervised methods. For a fair comparison, these methods were reimplemented in the PyTorch framework according to their publicly available codes and retrained on the same training datasets at the reduced resolution.
B. Comparison at Reduced Resolution
The experiments were performed on the three datasets at reduced resolution following Wald's protocol, so the original MS image can be used as the reference. Figs. 3–5 show three examples cropped from the results on GF-2, WV-2, and WV-3 processed by different methods. In each case, one region marked by a red rectangle is magnified to visualize the differences among the results. As observed in Figs. 3–5, the traditional methods restore spatial details effectively but still exhibit some blurring effects and spectral distortions. For example, the results of BDSD suffer from severe spectral distortions and some blurring, while AWLP reduces the blurring effect but introduces some spatial artifacts. Supervised learning based methods improve the spectral quality of the pansharpened results but still exhibit spatial blurring. Among the unsupervised methods, Pan-GAN successfully achieves unsupervised pansharpening, but its results contain spatial blurring and obvious spectral distortions, especially on the WV-2 and WV-3 datasets. In Fig. 3(n), PGMAN recovers more spatial details but still shows some spectral distortions. Moreover, the hyperparameters of GAN-based pansharpening methods are difficult to tune, and these methods easily generate spatial and spectral artifacts. As shown in the magnified regions in Figs. 3(o) and 5(o), compared with the other methods, our proposed LDP-Net effectively recovers spatial details and preserves spectral information without introducing artifacts, and its fusion results are more vivid and much closer to the ground truth.
Pansharpened results from different methods on the GF-2 dataset at reduced resolution. (a) Upsampled LRMS. (b) PCA. (c) IHS. (d) Brovey. (e) GS. (f) BDSD. (g) AWLP. (h) PNN. (i) DiCNN1. (j) PanNet. (k) DMDNet. (l) FusionNet. (m) Pan-GAN. (n) PGMAN. (o) Ours. (p) Ground truth.
Pansharpened results from different methods on the WV-2 dataset at reduced resolution. (a) Upsampled LRMS. (b) PCA. (c) IHS. (d) Brovey. (e) GS. (f) BDSD. (g) AWLP. (h) PNN. (i) DiCNN1. (j) PanNet. (k) DMDNet. (l) FusionNet. (m) Pan-GAN. (n) PGMAN. (o) Ours. (p) Ground truth.
Pansharpened results from different methods on the WV-3 dataset at reduced resolution. (a) Upsampled LRMS. (b) PCA. (c) IHS. (d) Brovey. (e) GS. (f) BDSD. (g) AWLP. (h) PNN. (i) DiCNN1. (j) PanNet. (k) DMDNet. (l) FusionNet. (m) Pan-GAN. (n) PGMAN. (o) Ours. (p) Ground truth.
Tables II–IV report the average quantitative results of the different methods on the three datasets. The methods are classified into three groups (traditional, supervised, and unsupervised), and the best result in each group is highlighted in bold. Compared with Pan-GAN and PGMAN, the proposed LDP-Net achieves better scores on most metrics. Among the CNN-based methods, the proposed method approaches the performance of the supervised methods; in particular, its SCC, ERGAS, and Q4 scores are close to those of the supervised methods, which verifies that our method can effectively fuse spatial and spectral information without a reference.
C. Comparison at Full Resolution
In this section, all the methods were validated on real data. Figs. 6–8 illustrate representative results on the real GF-2, WV-2, and WV-3 data. To verify the robustness of the proposed LDP-Net, the models trained on reduced-resolution images were used directly for the full-resolution test, i.e., no new models were trained for the full-resolution datasets. In these cases, most traditional methods significantly restore the spatial information compared with the LRMS images, but most still suffer from a certain degree of spectral shift. In contrast, AWLP reduces the spectral distortion in the results while introducing noticeable spatial artifacts. Compared with these traditional methods, the CNN-based models effectively maintain spectral consistency and improve the spatial resolution across the datasets. However, PanNet and DMDNet generate perceptible blurring effects and artifacts. DiCNN1 restores the spatial details well with high spectral fidelity, but spectral distortions are still observed in some regions; as shown in Figs. 7(j) and 8(j), the light blue mark and the cyan buildings are not as vividly colored as those obtained by other methods. Compared with the other supervised methods, FusionNet further reduces the spatial blurring and spectral distortions. Pan-GAN, which achieves unsupervised learning using spatial and spectral discriminators, improves the spatial and spectral resolution but still exhibits spatial blurring and introduces spectral distortions, as shown in Figs. 7(n) and 8(n). Although PGMAN maintains spectral consistency with the upsampled LRMS image, noticeable distortions of spatial details remain in its pansharpened results. In the magnified regions indicated by red boxes, our proposed method preserves better spatial details and maintains higher spectral consistency than the other methods; as shown in Figs. 6(p) and 8(p), our pansharpened images are clearer and more vivid than those of all the other methods.
Pansharpened results from different methods on the GF-2 dataset at full resolution. (a) Upsampled LRMS. (b) PAN. (c) PCA. (d) IHS. (e) Brovey. (f) GS. (g) BDSD. (h) AWLP. (i) PNN. (j) DiCNN1. (k) PanNet. (l) DMDNet. (m) FusionNet. (n) Pan-GAN. (o) PGMAN. (p) Ours.
Pansharpened results from different methods on the WV-2 dataset at full resolution. (a) Upsampled LRMS. (b) PAN. (c) PCA. (d) IHS. (e) Brovey. (f) GS. (g) BDSD. (h) AWLP. (i) PNN. (j) DiCNN1. (k) PanNet. (l) DMDNet. (m) FusionNet. (n) Pan-GAN. (o) PGMAN. (p) Ours.
Pansharpened results from different methods on the WV-3 dataset at full resolution. (a) Upsampled LRMS. (b) PAN. (c) PCA. (d) IHS. (e) Brovey. (f) GS. (g) BDSD. (h) AWLP. (i) PNN. (j) DiCNN1. (k) PanNet. (l) DMDNet. (m) FusionNet. (n) Pan-GAN. (o) PGMAN. (p) Ours.
Due to the lack of ground truth, QNR, $D_\lambda$, and $D_s$ are adopted as no-reference metrics to quantitatively evaluate the pansharpened results at full resolution.
D. Ablation Study of Loss Function
In this section, several experiments were conducted to investigate the impact of each component of our loss function. Built on the two learnable degradation processes, the loss function plays an important role in our unsupervised training. The proposed loss function can be subdivided into five parts, namely, the spatial loss at high resolution, the spatial loss at low resolution, the spectral loss at high resolution, the spectral loss at low resolution, and the spectral KL divergence loss; different combinations of these parts, denoted as combinations I–VIII, were evaluated.
Pansharpened results from the ablation study of the loss functions. (a) Ground truth. (b) Combination I. (c) Combination II. (d) Combination III. (e) Combination IV. (f) Combination V. (g) Combination VI. (h) Combination VII. (i) Combination VIII.
E. Efficiency Study
In this section, the computational efficiency of all comparison methods is evaluated. As mentioned in Section IV-A, all deep learning based methods were implemented in PyTorch and tested on an Nvidia GeForce GTX 1080Ti GPU, while all traditional methods were implemented in MATLAB R2019b and run on the CPU. Table VII lists the running times and the numbers of parameters of the different approaches. The running times are obtained by averaging the inference time over the test set of the reduced-resolution experiment. Compared with the other methods, our model has a small number of parameters, but its running time is at a middle level, mainly because the proposed network contains two additional degradation modules and a deeper structure. Compared with GAN-based unsupervised pansharpening methods, our model is easier to tune in the training phase. Overall, in addition to its superior performance, the proposed unsupervised model makes a reasonable tradeoff between performance and computational cost.
Conclusion
In this article, we propose an unsupervised pansharpening method based on two learnable degradation processes. The method adaptively learns the degradation processes with two corresponding CNN-based modules and successfully achieves unsupervised pansharpening. Moreover, we model the degradation processes at different resolutions and present a novel hybrid loss that effectively maintains spatial and spectral consistency. This unsupervised training strategy improves the spatial details and reduces the spectral distortion in the results. Extensive experiments on different-resolution images from three datasets demonstrate the superiority of the proposed method over other state-of-the-art methods.