
Band-Wise Hyperspectral Image Pansharpening Using CNN Model Propagation



Abstract:

Hyperspectral (HS) pansharpening has received growing interest in the last few years, as testified by a large number of research papers and challenges. It consists of a pixel-level fusion between a lower resolution HS datacube and a higher resolution single-band image, the panchromatic (PAN) image, with the goal of providing an HS datacube at PAN resolution. Thanks to their powerful representational capabilities, deep learning models have achieved unprecedented results on many general-purpose image processing tasks. However, when moving to domain-specific problems, as in this case, the advantages with respect to traditional model-based approaches are much less clear-cut, due to several contextual reasons. Scarcity of training data, lack of ground truth (GT), and data shape variability are some of the factors that limit the generalization capacity of state-of-the-art deep learning networks for HS pansharpening. To cope with these limitations, in this work, we propose a new deep learning method, which inherits a simple single-band unsupervised pansharpening model nested in a sequential band-wise adaptive scheme, where each band is pansharpened by refining the model tuned on the preceding one. By doing so, a simple model is propagated along the wavelength dimension, adaptively and flexibly, with no need for a fixed number of spectral bands and no need for large, expensive, labeled training datasets. The proposed method achieves very good results on our datasets, outperforming both traditional and deep learning reference methods. The implementation of the proposed method can be found at https://github.com/giu-guarino/R-PNN.
Article Sequence Number: 5500518
Date of Publication: 04 December 2023



SECTION I.

Introduction

Earth observation from space is a hot topic, requiring the development of instruments and satellite platforms to meet the increasing need for data with worldwide coverage. Unfortunately, remote sensing optical sensors exhibit a tradeoff among spatial resolution, spectral resolution, and signal-to-noise ratio (SNR). This tradeoff cannot be overcome through hardware alone: capturing high spatiospectral representations of the Earth would strongly penalize the SNR. Thus, simultaneously acquiring more than one representation of the Earth with sensors showing different (often opposite) features is usually considered in the design of a satellite payload. Hence, in optical remote sensing, it is common to see devices having high spatial resolution but limited spectral bands [e.g., panchromatic (PAN)] working together with sensors having high spectral resolution but lower spatial resolution [e.g., multispectral (MS)/hyperspectral (HS) ones, capturing tens/hundreds of spectral bands, respectively]. Starting from the images acquired by these systems, researchers are developing software-based solutions to combine these data and get the best from each source of information. The most dated and widely used framework relies upon the fusion of PAN and MS images. Pansharpening, which stands for PAN sharpening, refers to these approaches [1], [2]. Other powerful examples of these techniques, often representing an extension of the pansharpening concept, are HS pansharpening [3], involving HS data to be enhanced by PAN images, and MS and HS image sharpening methodologies [4], which exploit MS data to provide high spatial resolution information to sharpen an HS cube.

HS images acquired by sensors onboard satellite platforms are widely used for several tasks (e.g., classification and object detection) thanks to their very appealing spectral features. However, a limitation is the spatial resolution of these data, which is rarely finer than 30 m. Thus, cutting-edge research can be found in the literature on fusing HS with PAN images to improve it (the so-called HS pansharpening). The relevance of this topic is demonstrated by the recent scientific production, summarized in review papers such as [3], and by the last HS pansharpening challenge [5], organized in conjunction with the 12th WHISPERS (Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing), which leveraged datasets from the PRecursore IperSpettrale della Missione Applicativa (PRISMA) system, owned and managed by the Italian Space Agency.

The first attempt at fusing real HS and PAN images, captured by Hyperion/Advanced Land Imager (ALI), dates back to 2007 [6], when an optimized component substitution (CS) approach inspired by the MS pansharpening literature [7], [8] was proposed. Afterward, many methods borrowed from the classical MS pansharpening literature were tested on the HS and PAN fusion problem. A pioneering work was presented in [1], where several CS, e.g., [9], [10], and multiresolution analysis (MRA), e.g., [11], [12], approaches designed for MS pansharpening were considered. That paper also included an interesting study comparing the fusion of HS and PAN images acquired by the same or different platforms. Moreover, other interesting approaches, originally designed for the fusion of high-resolution PAN and low-resolution MS data, have been adapted to the HS pansharpening case. These can be roughly cast into Bayesian [13], [14], [15], [16], matrix factorization [17], [18], [19], and variational [20] methods. In 2015, many of these methods and others were compared in an extensive review [3]. Besides, several solutions specifically conceived for the HS pansharpening problem have also been proposed. Notable examples are the use of guided filters [21], variational approaches such as [22] and [23], and the saliency analysis-based CS method proposed in [24].

Recently, the number of research papers about sharpening of optical remote sensing data has dramatically grown, in particular related to the use of deep learning [25], [26], [27], [28], [29]. This trend is confirmed even for HS pansharpening. Indeed, in 2019, a new HS pansharpening framework via spectrally predictive convolutional neural networks (CNNs) was proposed in [30] to strengthen the spectral prediction capability of a pansharpening network. Subsequently, a dual-attention residual network upsampling the HS image using a deep HS prior module was considered in [31]. In [32], a new spectral-fidelity CNN for HS pansharpening was developed to control the spectral distortion of fused products and to progressively synthesize spatial details. Furthermore, a novel CNN-based method for arbitrary resolution HS pansharpening, based on a two-step relay optimization process, was proposed in [33]. On the same research line, an arbitrary scale attention upsampling module was introduced in [34]. Then, in [35], an overcomplete residual network, which focuses on learning high-level features by constraining the receptive fields of deep layers, was designed together with a new spatial-domain constraint between the PAN and its predicted version. An unsupervised HS pansharpening method via ratio estimation and a residual attention network was described in [36]. A multistage dual-attention guided fusion network was considered in [37], employing a three-stream structure and fusing the extracted features through a dual-attention guided fusion block. In [38], a deconvolution long short-term memory network with bidirectional learning for HS upsampling and spatial–spectral reconstruction, based on a two-branch divide-and-conquer network, was proposed. A generative super-resolution network combined with a segmentation-based injection gain estimation [39], [40] is instead proposed in [41]. Finally, a deep CNN exploiting Gaussian–Laplacian pyramids for pansharpening was presented in [42]. Following the same idea of multiresolution fusion, in [43], a multiresolution spatial–spectral feature learning approach was proposed, transforming an existing deep (and complex) network into several simple and shallow subnetworks to simplify the learning process and using multiresolution 3-D convolutional autoencoder networks to learn spatial–spectral HS features.

The 2022 WHISPERS contest on HS pansharpening was held with the goal of providing a picture of the state-of-the-art on the topic, also in light of the recent advances in deep learning, to pave the way for better solutions. Unfortunately, none of the competitors achieved convincing results compared to the baseline methods, and no winner was declared by the organizing committee. Actually, a careful inspection of the outcomes reveals that a critical bottleneck was the limited capacity of the proposed solutions to generalize when moving from synthetic reduced-resolution datasets [ground truth (GT) available] to real full-resolution ones (GT unavailable). Indeed, this is one of the main problems already encountered in the case of MS image pansharpening using deep learning, motivating the development of unsupervised training solutions [44], [45], [46], [47], [48]. In fact, unsupervised learning procedures do not require GTs, with no need for synthetic (downgraded) resolution shifts of the data. An attempt to follow this same path for the HS case can be found in [36]. However, the wide variability of observed images, due to the diversity of sensors, scenes, and operating conditions, still prevents generalizing well to data not seen during training. In computer vision, this problem is usually solved by increasing the training set and by using suitable forms of augmentation [49]. Such solutions are hardly viable in remote sensing using HS images, due to the scarcity of high-quality training data (often proprietary) and the peculiarities of HS imaging, including the data volume per ground surface unit. Besides, compared to the MS case, in the HS case, the resolution ratio is typically higher and the spectral coverage is much denser and wider, exceeding the PAN bandwidth by far, causing further ill-posedness issues. Furthermore, the number of spectral bands is a specific feature of the HS sensor and can even change from one image to another of the same sensor, because of acquisition errors that can render subsets of bands useless. A solution for handling a variable number of bands [50], based on a single pretrained model, has been proposed in [51].

To cope with the above issues, in this work, we propose a new CNN-based HS pansharpening method, which regards the HS datacube as a chain of individual bands to be sequentially pansharpened. This is achieved by leveraging a lightweight single-band pansharpening network operating in adaptive mode, whose optimized parameters for a given band are used as the starting point for the self-adaptive inference step of the next spectral band. By doing so, we bridge the model parameters of adjacent bands to some extent, simplifying the adaptation task thanks to their expected correlation. Both (pre)training and tuning iterations for target adaptation are run at full resolution, thanks to a suitably defined unsupervised loss comprising both spectral and spatial consistency terms. It is worth observing that the proposed solution is not just a divide-and-conquer scheme based on splitting the HS datacube into batches of bands. In fact, the tuning-based protocol for bridging the models of adjacent bands defines a completely new framework, where any baseline single-band pansharpening network, as well as any unsupervised loss, can be straightforwardly integrated.

Specifically, the advantages of the proposed solution, hereinafter referred to as rolling HS pansharpening neural network (R-PNN), with respect to the state-of-the-art are the following. No spectral band but the first needs pretrained parameters, as each band inherits them from the previous one. In this way, the model progressively and adaptively learns exclusively on the target image, with a limited computational cost thanks to several design choices, such as the lightweight architecture, band-wise processing, model propagation, and adaptive distribution of the tuning iterations. Also, the method is not subject to cross-resolution generalization issues, as it learns at the target resolution (being unsupervised). More generally, generalization is not an issue because the network learns directly on the target image, dynamically fitting its parameters to it. Finally, the sequential structure of the method makes it possible to handle an arbitrary number of spectral bands, not necessarily uniformly sampled.

The proposed solution has provided state-of-the-art results on all the considered datasets, both full- and (surprisingly) reduced-resolution ones, consistently outperforming all the competitors. The above-discussed properties, combined with the good results obtained, make the proposed method very attractive for practical real-world applications. In this perspective, and to ensure full reproducibility of our research outcomes, the code of the proposed method is shared at https://github.com/giu-guarino/R-PNN.

In summary, the main contributions of this work are as follows:

  1. a new unsupervised CNN-based HS pansharpening approach based on a band-wise model propagation protocol;

  2. a new unsupervised spectral–spatial consistency loss for PAN–HS pairs;

  3. a target-adaptive solution for the PAN–HS fusion problem;

  4. state-of-the-art results on both reduced-resolution synthetic data and full-resolution real (PRISMA) data.

The remainder of this article is organized as follows. Section II presents the proposed solution. Section III describes the datasets, quality assessment indexes, and reference methods. Section IV presents an experimental analysis aimed at supporting and validating several design choices. Finally, Section V gathers and discusses comparative numerical and visual results, with conclusions given in Section VI.

SECTION II.

Proposed Method for HS Pansharpening: R-PNN

Compared to the more familiar case of pansharpening of MS images, in the case of HS data, the generalization of deep learning models is a more critical issue for several reasons:

  1. fewer training datasets;

  2. increased spectral information;

  3. low or no correlation between the PAN and many spectral bands to be super-resolved;

  4. variable number of bands, even for the same sensor, due to acquisition issues (different bands may be discarded for quality reasons).

Starting from the above observations, here, we propose a CNN-based band-wise pansharpening solution, which inherits the basic idea proposed in [47] for the MS case, that is, to run parameter tuning iterations on the target image using a suitably defined unsupervised loss. However, differently from [47], in the present proposal, the PAN is fused with only one band at a time, rather than with all of them, and the tuned model is passed on to the next band, where it is further adapted.

In the following, we will first detail the single-band tuning/inference block, and then, we will describe the overall high-level scheme, before providing details about the core CNN network and the loss.

A. Single-Band Pansharpening With Tuning

Fig. 1 provides a high-level description of the tuning loop for the generic $b$th spectral band, $\mathbf{M}_{b} \in \mathbb{R}^{W \times H \times 1}$, to be fused with the PAN image, $\mathbf{P} \in \mathbb{R}^{RW \times RH}$, where $W$ and $H$ are the spatial dimensions in the low-resolution domain, $B$ is the number of spectral bands, and $R$ is the resolution ratio between $\mathbf{P}$ and the HS image $\mathbf{M} \in \mathbb{R}^{W \times H \times B}$.

Fig. 1. CNN-based single-band unsupervised tuning block for pansharpening. The module takes as input the two images to fuse (the $b$th band, $\mathbf{M}_{b}$, and the PAN, $\mathbf{P}$) and the initial parameters, $\phi_{b-1}$, inherited from the previous module. It runs $N_{b}$ tuning iterations to fit on $\mathbf{M}_{b}$ and $\mathbf{P}$. The tuned parameters, $\phi_{b}$, are used to provide the final pansharpened band, $\widehat{\mathbf{M}}_{b}$, and are passed to the next module. $\mathcal{L}_{\lambda}$ and $\mathcal{L}_{S}$ are the spectral and spatial consistency loss terms, respectively.

The tuning starts from the initial network parameters $\phi_{b-1}$, fit on $(\mathbf{M}_{b-1}, \mathbf{P})$ in the previous step when $b>1$, or from pretrained weights when $b=1$. Then, a prefixed number, $N_{b}$, of training iterations is run on the same target image pair, $(\mathbf{M}_{b}, \mathbf{P})$, leveraging an unsupervised loss comprising both spectral ($\mathcal{L}_{\lambda}$) and spatial ($\mathcal{L}_{S}$) consistency terms. The tuned parameters, $\phi_{b}$, are then used for the final prediction of the pansharpened band $\widehat{\mathbf{M}}_{b} \in \mathbb{R}^{RW \times RH \times 1}$, with $\widehat{\mathbf{M}} \in \mathbb{R}^{RW \times RH \times B}$ the full set of pansharpened HS bands, and are passed to the next fusion step, involving $\mathbf{M}_{b+1}$ and $\mathbf{P}$. Details about the loss will be provided in Section II-D.
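As a concrete illustration, a minimal PyTorch sketch of this tuning block follows. It is not the authors' implementation (which is available in the linked repository): the helper interp23 (polynomial interpolation of the band to the PAN grid), the functions spectral_loss and spatial_loss (sketched in Section II-D), and the learning rate are all assumptions.

import torch

def tune_band(net, phi_prev, pan, ms_band, n_iters, beta, rho_max=1.0, lr=1e-5):
    """Fine-tune `net` on one (HS band, PAN) pair; return tuned weights and fused band."""
    net.load_state_dict(phi_prev)                    # inherit parameters phi_{b-1}
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    x = torch.cat([interp23(ms_band), pan], dim=1)   # interpolated band stacked with PAN
    for _ in range(n_iters):                         # N_b iterations on the whole target image
        opt.zero_grad()
        fused = net(x)
        loss = spectral_loss(fused, ms_band) + beta * spatial_loss(fused, pan, rho_max)
        loss.backward()
        opt.step()
    with torch.no_grad():
        fused = net(x)                               # final prediction with tuned weights phi_b
    return net.state_dict(), fused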

B. High-Level Model Propagation Scheme

The overall tuning-prediction chain involving all $B$ spectral bands is shown in Fig. 2. Since the interband spacing can vary a lot from one pair of neighboring bands to another (there can be large gaps), the number of tuning iterations, $N_{b}$, is increased for larger spectral gaps, $\Delta\lambda_{b} = \lambda_{b} - \lambda_{b-1}$, where $\lambda_{b}$ is the $b$th band wavelength. This is because the larger the spectral distance, the lower the expected interband correlation, and the poorer the fit of the model parameters $\phi_{b-1}$ to the target band $\mathbf{M}_{b}$. In particular, we have heuristically fixed such numbers as follows:
\begin{align*} N_{b} = \begin{cases} 20, & b = 1 \\ \min\left(\alpha \Delta\lambda_{b}, 80\right), & b>1 \end{cases} \tag{1}\end{align*}
where $\alpha$ (iterations/nm) is the number of iterations per nanometer of distance between bands $b$ and $b-1$. As we can roughly assume a minimum spectral sampling interval of about 10 nm for PRISMA, with $\alpha=1$, each band ($\forall b>1$) undergoes at least ten tuning iterations, up to a maximum of 80 (upper bound value).
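For reference, rule (1) translates directly into a few lines of Python; the default alpha = 1.5 anticipates the value eventually adopted in Section IV-C.

def num_iterations(b, delta_lambda_nm, alpha=1.5, n_first=20, n_max=80):
    """Tuning iterations N_b of (1); delta_lambda_nm = lambda_b - lambda_{b-1} in nm."""
    if b == 1:
        return n_first
    return min(int(alpha * delta_lambda_nm), n_max)

# Example: a 10-nm gap yields 15 iterations with alpha = 1.5,
# while a 100-nm gap saturates at the 80-iteration upper bound.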

Fig. 2. Unsupervised rolling adaptation scheme for HS image pansharpening. Each tuning module (detailed in Fig. 1) inherits the initial weights, $\phi_{b-1}$, from the previous module and passes the tuned ones, $\phi_{b}$, to the next block. $B$ is the total number of spectral bands.

The proposed scheme allows for a drastic reduction of the computational load required for parameter tuning, thanks to the stronger correlation expected between closer band pairs. Notice that, differently from more common training/tuning configurations, where minibatches of small example patches are formed, here, following the tuning scheme proposed in [52] and [47], we have a unique training batch containing the whole target image. Indeed, recent variants of this scheme [48], [53] propose sampling rules to keep the computational burden limited for very large datasets. In this work, however, the sizes of the datasets of interest were such that no sampling was needed.
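Putting the pieces together, the whole rolling chain of Fig. 2 reduces to a loop over bands, in which the weights tuned on band b seed band b+1. The sketch below reuses the hypothetical tune_band and num_iterations helpers introduced above; the tensor layout and the wavelengths array (band centers in nm) are assumptions.

def rolling_pansharpen(net, phi_0, pan, hs_cube, wavelengths, alpha=1.5, beta=0.5):
    """hs_cube: (1, B, H, W) low-resolution HS image; returns a (1, B, RH, RW) fused cube."""
    phi, fused_bands = phi_0, []                     # phi_0: pretrained weights for band 1
    for b in range(hs_cube.shape[1]):
        gap = wavelengths[b] - wavelengths[b - 1] if b > 0 else None
        n_it = num_iterations(b + 1, gap, alpha)     # rule (1)
        phi, fused = tune_band(net, phi, pan, hs_cube[:, b:b + 1], n_it, beta)
        fused_bands.append(fused)                    # phi is propagated to the next band
    return torch.cat(fused_bands, dim=1)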

C. Network

The key characteristics of the proposed approach are its adaptivity to the target image and its band-wise, chained processing modality. For these reasons, it makes perfect sense to look at lightweight network architectures to preserve the nimbleness of the proposed solution. Therefore, we decided to rely on a shallow three-layer residual model similar to the one proposed in [52] for the classical pansharpening of four- or eight-band MS images. It is composed of three sequential convolutional layers, interleaved by ReLU activations, with a parallel global skip connection that brings the spectral input band to be pansharpened (already interpolated to fit the PAN size) directly to the exit (by sum) of the third convolutional layer. In detail, the hyperparameters of the network are given in Table I. The most relevant differences compared to the CNN architecture of [52] are the number of spectral bands, just one instead of 4/8, and the resolution ratio, which is 6 for PRISMA datasets.
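As an illustration of this architecture, the following PyTorch module implements a three-layer residual network in the spirit of [52]; the layer widths and kernel sizes shown here are assumptions patterned after [52], not necessarily the exact values of Table I.

import torch.nn as nn

class RPNNBand(nn.Module):
    """Shallow residual CNN: the input stacks the interpolated band with the PAN."""
    def __init__(self, widths=(48, 32)):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2, widths[0], kernel_size=7, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(widths[0], widths[1], kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(widths[1], 1, kernel_size=3, padding=1),  # predicted detail
        )

    def forward(self, x):
        # global skip: the interpolated band (channel 0) is added back at the output
        return self.body(x) + x[:, :1]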

TABLE I Net Hyperparameters. $^{(*)}$ Concatenation of $\mathbf{P}$ With the $b$th ($6\times$) Interpolated HS Band, $\widetilde{\mathbf{M}}_{b}$. $^{(**)}$ Pansharpened Band, $\widehat{\mathbf{M}}_{b}$
D. Unsupervised Spatial–Spectral Consistency Loss

The proposed model leverages a band-wise unsupervised loss, $\mathcal{L}^{(b)}$, both in pretraining and fine-tuning (see Fig. 1), which comprises two terms, a spectral component, $\mathcal{L}_{\lambda}$, and a spatial component, $\mathcal{L}_{S}$, weighted by a hyperparameter $\beta$:
\begin{equation*} \mathcal{L}^{(b)} = \mathcal{L}_{\lambda}\left(\widehat{\mathbf{M}}_{b}, \mathbf{M}_{b}\right) + \beta \mathcal{L}_{S}\left(\widehat{\mathbf{M}}_{b}, \mathbf{P}\right) \quad \forall b. \tag{2}\end{equation*}
The spectral loss term is the usual $\ell_{1}$-norm between the original low-resolution spectral band $\mathbf{M}_{b}$ and the downgraded version $\widehat{\mathbf{M}}_{b}^{(\mathcal{D})}$ of the fused band $\widehat{\mathbf{M}}_{b}$, obtained using a modulation transfer function (MTF)-matched Gaussian low-pass filtering (LPF) followed by decimation, i.e.,
\begin{equation*} \mathcal{L}_{\lambda}\left(\widehat{\mathbf{M}}_{b}, \mathbf{M}_{b}\right) = \left\| \widehat{\mathbf{M}}_{b}^{(\mathcal{D})} - \mathbf{M}_{b} \right\|_{1} \tag{3}\end{equation*}
with
\begin{equation*} \widehat{\mathbf{M}}_{b}^{(\mathcal{D})}\left(n,m\right) \triangleq \widehat{\mathbf{M}}^{\mathrm{LPF}}_{b}\left(n_{0}+6n, m_{0}+6m\right) \tag{4}\end{equation*}
where $(n_{0},m_{0})$ is a proper spatial offset and $R=6$ is the decimation step, which equals the resolution ratio for PRISMA data. On the other hand, the linking of the super-resolved band, $\widehat{\mathbf{M}}_{b}$, to the PAN band, $\mathbf{P}$, passes through the local correlation coefficient (CC) between the two. In particular, denoting by $\rho^{\sigma}_{\mathbf{X}\mathbf{Y}}(i,j)$ the local CC between scalar images $\mathbf{X}$ and $\mathbf{Y}$, computed on a $\sigma \times \sigma$ window $w(i,j)$ centered at location $(i,j)$, i.e.,
\begin{equation*} \rho^{\sigma}_{\mathbf{X}\mathbf{Y}}\left(i,j\right) = \frac{\mathrm{Cov}\left(\mathbf{X}_{w(i,j)}, \mathbf{Y}_{w(i,j)}\right)}{\sqrt{\mathrm{Var}\left(\mathbf{X}_{w(i,j)}\right)\mathrm{Var}\left(\mathbf{Y}_{w(i,j)}\right)}} \tag{5}\end{equation*}
the spatial (or structural) loss term is defined as
\begin{equation*} \mathcal{L}_{S} = \left\langle \left| \rho^{\mathrm{max}}\left(i,j\right) - \rho^{\sigma}_{\mathbf{P}\widehat{\mathbf{M}}_{b}}\left(i,j\right) \right| \right\rangle_{i,j} \tag{6}\end{equation*}
where $\mathrm{Cov}(\cdot)$ and $\mathrm{Var}(\cdot)$ indicate the covariance and variance operators, respectively, $\langle\cdot\rangle$ denotes the average over the image, and $\rho^{\mathrm{max}}$ is a suitable upper bound CC map. $\rho^{\mathrm{max}}$ may be a constant 1, meaning that we seek the maximum possible CC at each image location; a constant less than 1, slightly relaxing the CC target; or spatially varying, e.g., estimated at each location by comparing the smoothed version of $\mathbf{P}$ with $\mathbf{M}_{b}$ (resized). Details about the setting of $\beta$, $\sigma$, and $\rho^{\mathrm{max}}$ will be provided in Section IV.
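A hedged PyTorch sketch of (2)-(6) follows; the MTF-matched filter mtf_lowpass and the decimation offset are assumed helpers (the repository contains the reference implementation), and local_cc realizes (5) with uniform-window statistics.

import torch
import torch.nn.functional as F

def spectral_loss(fused, ms_band, R=6, offset=2):
    """L1 consistency of (3)-(4) between the decimated fused band and the HS band."""
    lp = mtf_lowpass(fused)                          # MTF-matched Gaussian LPF (assumed)
    return (lp[..., offset::R, offset::R] - ms_band).abs().mean()

def local_cc(x, y, sigma=6):
    """Local correlation coefficient of (5) on sigma x sigma windows."""
    mx, my = F.avg_pool2d(x, sigma, stride=1), F.avg_pool2d(y, sigma, stride=1)
    cov = F.avg_pool2d(x * y, sigma, stride=1) - mx * my
    vx = F.avg_pool2d(x * x, sigma, stride=1) - mx ** 2
    vy = F.avg_pool2d(y * y, sigma, stride=1) - my ** 2
    return cov / torch.sqrt((vx * vy).clamp_min(1e-12))

def spatial_loss(fused, pan, rho_max, sigma=6):
    """Structural term of (6): average gap between the CC bound and the local CC."""
    return (rho_max - local_cc(pan, fused, sigma)).abs().mean()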

SECTION III.

Data, Quality Assessment, and Methods

The main goal of this work was to develop a new data-driven method for HS pansharpening that can outperform the state-of-the-art recently presented in the paper on the HS pansharpening challenge at IEEE WHISPERS 2022 [5]. In line with this goal, our experiments relied upon the datasets, quality assessment procedures, and comparative methods exploited in the abovementioned challenge, briefly described in the rest of this section. Besides, to enrich the comparative assessment, four additional deep learning solutions have also been included [30], [32].

A. Datasets

Despite the development of new approaches for HS pansharpening, most of them have been tested on simulated data, neglecting an assessment on real data at full resolution. To overcome this limitation, PRISMA data have been distributed after the end of the WHISPERS challenge. Four datasets (the ones used for the contest), both at reduced and full resolution, have been shared. Each dataset comprises a PAN component and an HS image. The spatial resolution of the PAN image is 5 m, while the HS sensor acquires about 250 spectral bands with a spatial resolution of 30 m. Before the announcement of the contest, only very few works on HS pansharpening of PRISMA images had been published; an application-oriented work using pansharpened PRISMA data had just been presented in 2021 [54]. Thus, the goal of the challenge was to boost research on HS pansharpening, pushing researchers toward using new data and thus addressing new challenges: for example, the tradeoff between computational cost (critical for images with hundreds of bands) and fusion performance, or other peculiarities of the HS pansharpening problem, such as the scale ratio different from 4 (which is instead widely used for MS pansharpening), the effects of a residual space-varying registration error between PAN and HS images, and the fusion of a large and sensor-dependent number of bands, which sometimes show low SNRs. Four teams accepted the challenge, proposing innovative solutions that relied upon machine learning and variational optimization-based methodologies. Despite the use of state-of-the-art techniques, the four participating teams did not obtain outstanding results compared to the baseline and, for this reason, the organizing committee decided to close the contest and declare it inconclusive (no winner).

In Table II, some characteristics of the images are reported, while Fig. 3 shows the data of the challenge. More specifically, four datasets were distributed, i.e., FR1 and FR2 for the assessment at full resolution, and RR1 and RR2 for the assessment at reduced resolution. In this work, a further dataset (FR0) has been exploited for validation purposes and to generate the initial weights of the proposed model.

TABLE II Datasets. FR0 Is Used for Validation Purposes Only
Fig. 3. HS pansharpening challenge datasets [5]. For each dataset (column), the PAN (top), an RGB subset using bands from the visible spectrum (middle), and a false-color subset using far apart bands (bottom) are shown. From left to right: RR1, RR2, FR1, and FR2.

B. Accuracy Indexes

Assessing the performance of image fusion products is still an open issue, given the lack of full-resolution GTs. A widespread approach relies on the so-called synthesis property of Wald's protocol [55]. The protocol is implemented by properly downsampling the available data, under the hypothesis that pansharpening performance is invariant across scales [56]. Hence, the original HS data play the role of GT, against which to measure the similarity of the fused product obtained by combining the degraded versions of the original PAN and HS images. The higher the similarity, the better the performance. This similarity degree can be evaluated through multidimensional score indexes [56].

  1. The $Q2^{n}$ index [57] is the multidimensional extension of the universal image quality index (UIQI) [58]. The upper bound of the index is one, which also represents the optimal value.

  2. The spectral angle mapper (SAM) [59] determines the spectral similarity (usually in degrees) between the fused and the reference spectra. It is measured pixel-by-pixel and averaged over the whole image. The optimal value is zero.

  3. ERGAS [60] is a French acronym that stands for Erreur Relative Globale Adimensionnelle de Synthèse (dimensionless global relative error of synthesis). It is a normalized dissimilarity index (a multidimensional extension of the root-mean-square error) that measures the radiometric distortion of the fused product with respect to the reference (GT) image. The optimal value is zero (illustrative implementations of SAM and ERGAS are sketched right after this list).

  4. PSNR, measured in decibels, stands for peak SNR and is one of the most popular quality indexes in the general image processing domain. Higher PSNR values indicate better quality.
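For concreteness, SAM and ERGAS can be sketched in a few lines of NumPy. These are illustrative versions only; the official implementations are those of the challenge toolbox referenced in [5].

import numpy as np

def sam(gt, fused, eps=1e-12):
    """Spectral angle mapper, in degrees, averaged over pixels (bands on the last axis)."""
    num = (gt * fused).sum(-1)
    den = np.linalg.norm(gt, axis=-1) * np.linalg.norm(fused, axis=-1) + eps
    return np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0))).mean()

def ergas(gt, fused, ratio=6):
    """Dimensionless global relative error of synthesis (bands on the last axis)."""
    rmse = np.sqrt(((gt - fused) ** 2).mean(axis=(0, 1)))
    mu = gt.mean(axis=(0, 1))
    return 100.0 / ratio * np.sqrt(((rmse / mu) ** 2).mean())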

It is worth pointing out that the reduced-resolution assessment relies upon the scale-invariance hypothesis, which may not hold. Furthermore, the accuracy can depend on how the original PAN and HS products are degraded [56]. As a consequence, to provide a complete assessment of pansharpening algorithms, validation at full resolution is also adopted [61], [62], [63], [64], [65], [66]. In this article, we followed the indications in [5], using the same quality indexes at full resolution. More specifically, the $Q^{\ast}$ index [67] is exploited, consisting of a spectral distortion index [64], [67], $D_{\lambda}$, based on Wald's consistency property [55], and a regression-based spatial distortion index, $D_{S}$, first proposed in [68] and then deeply investigated in [67]. In the ideal case, both the spatial and spectral distortion indexes are zero, thus yielding a $Q^{\ast}$ index equal to one.

For additional details about the PSNR, interested readers can refer to [69], while, for all other reduced- and full-resolution indexes, implementation details are given in [5] and in the related freely available toolbox.1

C. Benchmarking

Table III summarizes the techniques used in our experimental analysis. The benchmarking approaches taken from the WHISPERS challenge are described in [5]. More specifically, five methods, exploited as baseline solutions for the challenge, are borrowed from the pansharpening literature. The first two [2], [73] belong to the CS class, i.e., the Gram–Schmidt (GS) [10] approach and its adaptive version, GSA [9]. The other three baseline methods are representative of the MRA class. In more detail, the third method is the classical additive wavelet luminance proportional (AWLP) [70]; the fourth technique is the MTF-generalized Laplacian pyramid (MTF-GLP) [12] with histogram matching [71]; and the last baseline method is based on morphological filters (MFs) [72]. The other four techniques (labeled Teams 1-4) are innovative variational optimization-based and machine learning-based solutions proposed by the participants in the HS pansharpening challenge [5]. Finally, four extra-challenge deep learning solutions [30], [32] have been reimplemented and trained on PRISMA data for further comparison. In particular, for these methods, lacking extra datasets for training and given the heterogeneity (different numbers of bands) of the test datasets, we trained on a portion of the same test datasets, applying the canonical resolution downgrade protocol needed for supervised models.

TABLE III Benchmarking Approaches

SECTION IV.

Experimental Validation

In this section, we show and discuss several experimental results aimed at supporting our design choices. In particular, we analyze the relationship between the proposed loss and the accuracy indicators (see Section IV-A). We then show the impact of the model propagation mechanism (see Section IV-B) and of the tuning (see Section IV-C). Next, we provide details about the setting of the spatial loss term (see Section IV-D). Finally, we carry out an ablation study on the network architecture (see Section IV-E), before concluding with some details about the pretraining phase (see Section IV-F).

A. Loss and Accuracy

The first experiment deals with the choice of the loss. Since we propose an unsupervised one, it is not obvious, or at least not always guaranteed, that the smaller the loss, the better the accuracy according to the available quality indicators. Therefore, it is a fundamental question to understand to what extent the loss and the quality indicators agree. Toward this goal, we have selected a full-resolution (FR2) and a reduced-resolution (RR1) dataset, restricting the pansharpening to a single HS band. For both datasets, we consider two opposite conditions: an HS band highly and one weakly correlated with the PAN. Since we target a single HS band, the proposed solution reduces to traditional pansharpening, without the need for model propagation. Moreover, to avoid potential biases, we run the fine-tuning target adaptation from scratch (random initial weights). In particular, here, we are interested in monitoring the evolution of the loss in comparison with the evolution of the accuracy indicators during training. In Fig. 4, all the involved training curves are gathered for the reduced-resolution case: spectral and spatial loss terms, ERGAS, and $1-Q$. SAM is not applicable in the single-band case, and hence, it is not monitored. Blue and red curves refer to the cases of highly (#2) and weakly (#41) correlated bands, respectively. As a first observation, we notice that both loss terms, $\mathcal{L}_{\lambda}$ and $\mathcal{L}_{S}$, keep descending monotonically, even if they seem to be very close to their lower bound after about 1000 iterations. In this regard, it should be remarked that the spatial and spectral loss terms are partially conflicting [47], and it can be very hard to bring both to zero. Such a tradeoff is more noticeable for band #41, where a smaller correlation with the PAN makes it difficult to minimize $\mathcal{L}_{S}$ without sacrificing spectral consistency ($\mathcal{L}_{\lambda}$). Moving the focus to ERGAS, we can observe a behavior coherent with both loss terms, as it decreases monotonically at least in the first 600 iterations, before reaching a plateau. This occurs for both bands, and a similar behavior is registered for $1-Q$. Overall, it seems safe to say that the two chosen loss terms are generally consistent with the standard quality indicators ERGAS and $Q$. Above certain quality levels (after 600 iterations), they seem to lose their correlation with ERGAS and $Q$. However, this is somewhat expected, as they are no-reference indicators, whereas the latter are reference-based.

Fig. 4. Training curves for single-band pansharpening on RR1. (a) Spectral and (b) spatial loss terms. (c) ERGAS. (d) $1-Q$.

Let us now move the focus to the more interesting full-resolution case with the help of Fig. 5, which shows the related training curves: spectral and spatial loss terms, and the spectral distortion index $D_{\lambda}$. We do not show the spatial distortion $D_{S}$, as it loses its meaning when computed on a single spectral band. Again, blue and red curves refer to the cases of highly (#2) and weakly (#56) correlated bands, respectively. In this case, the analysis is simpler, as we can relate homogeneous no-reference quantities. Focusing on the spectral numerical figures, we can observe a perfect agreement between $D_{\lambda}$ and $\mathcal{L}_{\lambda}$, no matter which band is concerned. On the other side, we lack a numerical reference to judge the spatial consistency behavior of $\mathcal{L}_{S}$. Nonetheless, we can first observe that $\mathcal{L}_{\lambda}$ and $\mathcal{L}_{S}$ can be minimized simultaneously and, at the same time, they show different trajectories, symptomatic of a weak correlation between the two, with $D_{\lambda}$ clearly linked to $\mathcal{L}_{\lambda}$ rather than to $\mathcal{L}_{S}$. Moreover, with the help of Fig. 6, we can appreciate (subjectively) the improvement of spatial quality due to the minimization of $\mathcal{L}_{S}$, through the visual inspection of some sample results obtained along the tuning process (full-resolution case). From left to right are shown the original low-resolution HS bands, the PAN, and a series of pansharpening results progressively singled out along the training process. Below each result, the number of tuning iterations and the values of the corresponding spectral and spatial loss terms are reported. The spectral adherence of these outcomes to the corresponding low-resolution band is hard to assess visually. However, the spectral distortion $D_{\lambda}$ decreases monotonically [Fig. 5(c)], guaranteeing a progressive spectral quality enhancement. Indeed, it is worth noticing that most of the spectral distortion (see $\mathcal{L}_{\lambda}$ or $D_{\lambda}$) is removed in the first 50 iterations for band #2 (first 200 iterations for band #56), whereas the spatial loss presents a more distributed and regular decay. This allows us to easily attribute to $\mathcal{L}_{S}$ the spatial improvement clearly visible in the sequence of Fig. 6. To be more specific, we can look at the pansharpening results for band #2 at 50 and 100 iterations. In this interval, the spectral loss remains nearly constant (0.0057-0.0054), whereas the spatial one halves (0.349-0.175). In addition, the visual inspection of these partial results reveals a remarkable improvement, which cannot but depend on the reduction of $\mathcal{L}_{S}$. Similar observations can be made for band #56, moving from 400 to 1000 iterations. Encouraged by this experimental analysis, we decided to keep the proposed spatial loss [see (6)], which ensures a behavior consistent with visual assessment.

Fig. 5. Training curves for single-band pansharpening on FR2. (a) Spectral and (b) spatial loss terms. (c) $D_{\lambda}$.

Fig. 6. Single-band sample pansharpening results during training.

B. Spectral Correlation Analysis and Model Propagation

To gain insight into the interband dependence, let us look at the covariance matrix, normalized to the range [−1, 1] (CC), associated with a sample HS PRISMA image, shown in Fig. 7. The maximum values are on the diagonal and marked in black. For each given band (e.g., look at row #8), the best correlated bands looking backward and forward are marked in green and red, respectively. Focusing on the backward case, we can notice that, in the large majority of cases, the most correlated band is just the previous one. When this is not the case, the correlation value for that band is, however, very close to the maximum. In the forward search, we have nearly the same situation, with just one exception for band 18. We have carried out the same analysis on all the available datasets, and the conclusions were always the same. Based on the above considerations, it makes perfect sense to propagate the model from one band to the next one (or the previous one, if we proceed in the opposite direction).
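This interband analysis is simple to reproduce; a NumPy sketch follows, where cube is assumed to be the (H, W, B) HS array.

import numpy as np

def band_correlation(cube):
    """Return the (B, B) interband CC matrix of an (H, W, B) HS cube."""
    flat = cube.reshape(-1, cube.shape[-1])          # pixels x bands
    return np.corrcoef(flat, rowvar=False)

def best_neighbors(cc):
    """For each band, the most correlated bands looking backward and forward."""
    B = cc.shape[0]
    back = [int(np.argmax(cc[b, :b])) if b > 0 else None for b in range(B)]
    fwd = [int(b + 1 + np.argmax(cc[b, b + 1:])) if b < B - 1 else None for b in range(B)]
    return back, fwd    # typically back[b] == b-1 and fwd[b] == b+1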

Fig. 7. Covariance matrix for a sample HS PRISMA datacube. For each band (fix a row index), the most correlated previous and next bands are marked in green and red, respectively.

To further clarify the propagating tuning process, Fig. 8 shows the progress of the loss adaptation, limited to a subset of bands (1-18) for ease of viewing. The loss (Section II-D) comprises two terms responsible for the spectral (top) and spatial (bottom) consistencies. Checkpoints highlight the final loss for each band. A careful inspection of the curves reveals that, in many cases, the model tuned for a given band already fits well (for $\mathcal{L}_{\lambda}$, $\mathcal{L}_{S}$, or both) on the next band when a new tuning starts. In some cases, for example on band 8, we can notice a typical behavior, where the propagated model needs to be adjusted for the new band. In fact, coming back to Fig. 7, we can observe that the correlation between adjacent bands remains very high until band 7. Then, a drop is registered for band 8 because of a larger spectral gap, which justifies the model mismatch between bands 7 and 8.

Fig. 8. Concatenated spectral (top) and spatial (bottom) loss progress during model propagation (close-up on the first 18 bands). Each vertical stripe corresponds to a different band, whose final loss is marked with a dot.

It is also worth observing how the spectral and spatial losses contrast each other when both reach too small values (how small depends on the band and, in particular, on its correlation with the PAN). In fact, in a few cases, for example, for bands 9, 12, and 18, only one of the two losses decreases. The spatial loss descends at the price of a small increase of the spectral loss for bands 9 and 12 (notice that the spectral loss is one order of magnitude smaller than the spatial loss). For band 18, instead, likely because of a large mismatch with the preceding band, the spectral loss dominates the tuning process.

Finally, as a general remark, we observe that the balance between {\mathcal{ L}}_{\lambda} and {\mathcal{ L}}_{S} is a function of the band and of its (stronger or weaker) relationship with the PAN.

Let us now focus on a set of validation experiments dealing with the effectiveness of the model propagation scheme. Here, all spectral bands are involved, following the processing chain summarized in Fig. 2. First, we compare the proposed solution based on model propagation, where the $b$th tuning block inherits as starting parameters the ones ($\phi_{b-1}$) tuned on band $b-1$, with the case where the model is not propagated and the tuning always starts from the same set of parameters, $\phi_{0}$. In both cases, the number of tuning iterations is computed band-wise using (1). The experiment is run on the full-resolution image FR0, and the results are shown in Fig. 9 in terms of the band-wise loss achieved at the end of the tuning, separately for the spectral (top) and spatial (bottom) components. Due to the nonuniform coverage of the spectral range by the available HS datasets, the wavelength axis has been split into three intervals for better visualization. The plots show a clear gain due to the propagation of the model, on both the spectral and spatial sides and all along the wavelength axis, with quite large gaps especially in terms of $\mathcal{L}_{S}$.

Fig. 9. Band-wise spectral (top) and spatial (bottom) loss components after tuning with (blue) or without (red) model propagation.

To further validate the proposed solution, we have also compared the proposed forward model propagation with the backward option on the validation dataset FR0. The resulting accuracy indicators are reported in Table IV. The numbers show that the forward propagation option provides only slightly better scores. Experiments carried out on other datasets, however, confirm a substantial equivalence between the two solutions, and hence, we eventually opted for the causal ordering (forward) without loss of generality.

TABLE IV Numerical Comparison Between the Forward and Backward Propagation Schemes on the FR0 Dataset

C. Tuning Strength

According to the proposed empirical rule for fixing the number $N_{b}$ of tuning iterations per band [see (1)], such a number is proportional (up to a maximum saturation value) to the wavelength distance from the previous band, $\Delta\lambda_{b} = \lambda_{b} - \lambda_{b-1}$. In practice, since in most cases such a step is about 10 nm (the minimum for PRISMA), a proportionality factor $\alpha = 1$ iteration/nm would amount to running about ten iterations per band, except in those (fewer) cases where larger spectral gaps are concerned. Of course, the larger $\alpha$, the better the fitting to the target image, but also the higher the computational time. Therefore, to fix an appropriate value for $\alpha$, we have run a sequence of validation tests on the FR0 image using different values of $\alpha$ (0.2, 0.5, 1, 1.5, 2, 5). In addition, we also show the limit case where the maximum number of iterations (80) is performed for all bands, regardless of $\Delta\lambda_{b}$. On the one hand, we provide the band-wise value of the loss in Fig. 10, splitting, as usual, the spectral (top) and spatial (bottom) terms. On the other hand, we give the computation time for each configuration in Table V.

TABLE V Computation Time to Perform Target Adaptation on a $2400 \times 2400$ PRISMA Image With 73 Bands Using an NVIDIA GeForce RTX 2080 Ti GPU With 11 GB of Memory

Fig. 10. Results obtained on the FR0 image. The final value of the spectral (top) and spatial (bottom) losses against band wavelength for the different configurations of $\alpha$.

From Fig. 10, it can be observed that, with respect to the limit case (dashed) where the maximum number of iterations is run for all bands, both the spectral loss (outside the visible spectral range only) and the spatial loss (in the visible range only) register a progressive deterioration as $\alpha$ decreases. Besides, as reported in Table V, smaller $\alpha$'s provide quicker inference. Based on these observations, we decided to fix the tradeoff value $\alpha=1.5$ for testing the proposed solution, leaving the end user the possibility of changing this default setting according to specific needs and hardware (in our experiments, we used an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory).

D. Spatial Loss Configuration

The proposed loss (2) comprises two contributions: the spectral (3) and the spatial (6) consistency terms. While the former leverages a standard regression error function, the $\ell_{1}$-norm, and relates well to the spectral distortion $D_{\lambda}$, the latter is more complex, and its contribution to spatial quality is less obvious. To gain insight into its role, here, we provide additional details on its configuration. With respect to the classical pansharpening problem, the most critical difference is that the PAN image spectrally overlaps only with a subset of the HS bands. Therefore, indiscriminately forcing correlation between each pansharpened band and the PAN may be detrimental. To mitigate potential side effects, on the one hand, we halve (from 0.5 to 0.25) the weighting hyperparameter $\beta$ in (2) for those HS bands that do not overlap with the PAN; on the other hand, we use a spatially varying correlation bound, $\rho^{\mathrm{max}}$, in (6). To estimate this bound, we resort to a downgrading process (just an LPF) applied to the PAN. The HS image is upsampled to the PAN scale, and then the bound local CC is computed. However, although the involved images have the full target size, the lack of high-frequency content would make the computation of the CC on too small a local window meaningless. Hence, a 6$\times$ larger window is considered for the computation of $\rho^{\mathrm{max}}$, i.e., $6\sigma \times 6\sigma$, if $\sigma \times \sigma$ is the window size used for $\rho^{\sigma}_{\mathbf{P}\widehat{\mathbf{M}}_{b}}$.
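In pseudocode form, this estimation can be sketched as follows; lowpass_to_hs_scale and upsample are assumed helpers, while local_cc is the sketch given in Section II-D.

def estimate_rho_max(pan, ms_band, sigma=6, R=6):
    """Spatially varying CC bound of (6), computed on a 6x larger window."""
    pan_lp = lowpass_to_hs_scale(pan)                # PAN deprived of content beyond HS resolution
    band_up = upsample(ms_band, scale=R)             # HS band brought to the PAN grid
    return local_cc(pan_lp, band_up, sigma=6 * sigma)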

Let us now focus on $\sigma$. In principle, forcing correlation between the PAN and any super-resolved band would conflict with the goal of preserving the consistency of the latter with its low-resolution version. In practice, if the correlation is computed locally on an $R \times R = 6 \times 6$ window, corresponding to the size of a low-resolution HS pixel mapped into the high-resolution (PAN) space, the related conditioning will affect only the high-frequency content, which is missing in the low-resolution HS bands. In addition to this theoretical reasoning supporting the value $\sigma=R=6$, we have also run an ad hoc experiment to analyze different settings. In particular, Fig. 11 shows the pansharpening results (zoomed crops were selected for easier inspection) on a single sample band for $\sigma = 6, 12, 24$, and 48. Below each result, the spectral loss is also reported. As can be seen, the use of larger values of $\sigma$ deteriorates both the spectral (see the numbers) and the spatial (see the images) quality: larger $\sigma$'s give rise to blurring phenomena. On the other hand, it must also be observed that too small $\sigma$ values could be unsuited to the computation of the statistics involved in the CC. Eventually, in light of all the above considerations, we have fixed $\sigma=R=6$. For additional details about the spatial loss, readers can refer to [47].

Fig. 11. Impact of the scale parameter $\sigma$ (correlation scale) on spectral and spatial quality. From left to right: image crops (PAN) followed by the corresponding pansharpening results for a sample band, using increasing scale values ($\sigma = 6, 12, 24$, and 48). The two crops come from a larger tile on which the pansharpening is run. Below each result, the spectral loss obtained after 1000 tuning iterations is reported.

E. Network Configuration

To validate the network architecture, we carried out experiments on the validation dataset FR0 aimed at assessing the impact of network depth, width, and the use of a residual skip connection. The compared solutions are summarized in Table VI, together with the corresponding scores and execution times. At first glance, it can be observed that the proposed configuration (top row) achieves the best score on $Q^{\ast}$, balancing spectral and spatial accuracies. Besides, its execution time is nearly the same as that of the lighter configuration (32, 16, 1). A deeper analysis of these results reveals that deeper architectures tend to slightly favor the spectral consistency ($D_{\lambda}$) over the spatial index ($D_{S}$). The quantitative analysis also shows that the residual skip connection helps preserve spectral consistency, keeping the $D_{\lambda}$ score lower. Overall, the topological configuration (i.e., with or without residual skip connection) has a more relevant impact on the network behavior than its depth and width.

TABLE VI Network Ablation Results on Dataset FR0. The Proposed Solution Is on Top

F. Pretraining Details

At test time, the proposed network starts the tuning on the first band using pretrained initial weights. These have been determined by pretraining on the first band of the FR0 dataset. The band and the corresponding PAN have been tiled in 100 patches of (PAN) size $240 \times 240$, split into training (90) and validation (10) sets. The optimization has been carried out using ADAM in its default configuration, with a learning rate of 1e−5, run for 200 epochs using minibatches of four patches. The same pretrained weights have been used on both reduced- and full-resolution images at test time.
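A sketch of this pretraining stage is given below, reusing the loss sketches of Section II-D with the constant bound rho_max = 1; the dataset plumbing (patch extraction and interpolation) is assumed.

import torch
from torch.utils.data import DataLoader, TensorDataset

def pretrain(net, patches, n_epochs=200, lr=1e-5, beta=0.5):
    """patches: TensorDataset of (stacked input, HS band, PAN) 240x240 tiles."""
    loader = DataLoader(patches, batch_size=4, shuffle=True)   # minibatches of 4
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x, ms_band, pan in loader:
            opt.zero_grad()
            fused = net(x)
            loss = spectral_loss(fused, ms_band) + beta * spatial_loss(fused, pan, rho_max=1.0)
            loss.backward()
            opt.step()
    return net.state_dict()                      # phi_0: initial weights for band 1 at test time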

SECTION V.

Comparative Results and Discussion

The experimental analysis of the proposed solution ends with the presentation of the numerical and visual comparative results obtained on the test datasets (summarized in Table II, shown in Fig. 3) taken from the HS pansharpening challenge [5]. All numerical results, obtained on both reduced- and full-resolution datasets, are gathered in Table VII. On the one hand, the pansharpening results obtained on reduced-resolution datasets are quantitatively compared in terms of $Q2^{n}$, SAM, ERGAS, and PSNR. On the other hand, those obtained on full-resolution datasets are assessed through the spectral and spatial quality indexes, $D_{\lambda}$ and $D_{S}$, respectively, and their combination $Q^{\ast}$ (details are given in Section III). Besides, an overall accuracy (OA) index is reported in the last column of Table VII. OA is computed as the average, over all four datasets, of $Q2^{n}$ or $Q^{\ast}$, using the one that applies (RR or FR, respectively). OA is therefore a robust indicator of the consistency of the methods, accounting for their behavior at both resolutions, and it was chosen as the ultimate score in the challenge. Table VII gathers the comparative methods in three sets: challenge baselines (CS and MRA), challenge participants (teams), and recent deep learning solutions. Moreover, in the top row, we report the scores of the simple interpolator EXP, which serves as a reference for the spectral accuracy at full resolution, lacking GTs. Bold and underlined numbers highlight the top and second-best scores, respectively, excluding EXP, which does not perform any sharpening.

TABLE VII Comparative Numerical Results

A careful inspection of these numerical results reveals a surprising behavior of the proposed method on the reduced-resolution datasets: it consistently outperforms all the compared solutions, on both datasets and with respect to all indicators, with the exception of dataset RR1, where HSpeNet2 performs slightly better on ERGAS and PSNR (recall that these two metrics are highly correlated). What is worth remarking is that, while the RR indexes are based on available GTs, the proposed solution is fully unsupervised (both in training and tuning) and, nonetheless, it does not seem to suffer any performance gap for this.

Moving to the full-resolution datasets, the analysis of the results becomes less straightforward. On the spectral side (D_{\lambda} ), the proposed method always ranks first (the interpolator EXP is excluded, as it does not perform any pansharpening, as testified by its high spatial distortion D_{S} ). On the spatial consistency side, instead, the proposed solution shows some limits. Indeed, several competitors with good (small) values of D_{S} exhibit an excessive spectral distortion D_{\lambda} that rules them out. Q^{\ast} balances the two distortion indexes, providing a more reliable numerical figure. According to Q^{\ast} , the proposed method ranks first on FR2 and third on FR1, though not far from the best (Team4). The latter, however, does not provide consistent results across the two scales, performing much worse on the RR datasets. Overall, our method achieves the best OA, with a consistent gap over GSA, which ranks second. As a final remark, it is worth noticing that, at both reduced and full resolution, the proposed method consistently outperforms the competitors from the spectral point of view. In fact, it is always on top in terms of SAM, which is probably the index most sensitive to the spectral signature, and of D_{\lambda} (considering EXP as a spectral reference at full resolution). This gives an encouraging experimental answer to a basic question: whether it is viable to carry out band-wise pansharpening, as we do, ignoring the pixel spectral signature as a whole in the fusion process. Our experiments show that, even though a marginal optimization scheme has been used for computational reasons, the pixels’ spectral signatures are very well preserved.

To conclude this experimental survey, we present some sample pansharpening results for the reduced- and full-resolution datasets in Figs. 12–15 and Figs. 16 and 17, respectively. For each dataset, only a representative zoomed detail (crops within the yellow boxes in Fig. 3) is shown, and, for a more comprehensive analysis, both RGB (Figs. 12, 14, and 16) and false-color (Figs. 13, 15, and 17) band subsets are displayed. In fact, while the RGB bands are all well correlated with the PAN, bands outside the visible spectrum are less correlated, hence more critical from the fusion perspective and worth inspecting. For the reduced-resolution case, in addition to the reference GT and the expanded version (EXP) of the input HS bands, useful for a direct spectral comparison of the pansharpening results, the error images are also shown. The latter clearly show that the errors are much more severe (for both datasets) outside the visible spectrum (see false color), with both spectral and spatial distortion phenomena occurring for all methods. Among all the compared methods, however, the proposed one seems to mitigate both kinds of distortion better than the others. Some methods clearly fail, e.g., Team 1, and are reported only for the sake of completeness.

Fig. 12. Pansharpening results on RR1 (zoomed detail, see Fig. 3). (a) Target GT and all compared solutions on three bands sampled in the visible spectrum [wavelengths (nm): 660 (red channel), 588 (green), and 442 (blue)]. (b) Corresponding error maps.

Fig. 13. Pansharpening results on RR1 (zoomed detail, see Fig. 3). (a) Target GT and all compared solutions on three bands sampled outside the visible spectrum [wavelengths (nm): 2053 (red channel), 1229 (green), and 770 (blue)]. (b) Corresponding error maps.

Fig. 14. Pansharpening results on RR2 (zoomed detail, see Fig. 3). (a) Target GT and all compared solutions on three bands sampled in the visible spectrum [wavelengths (nm): 632 (red channel), 500 (green), and 434 (blue)]. (b) Corresponding error maps.

Fig. 15. Pansharpening results on RR2 (zoomed detail, see Fig. 3). (a) Target GT and all compared solutions on three bands sampled outside the visible spectrum [wavelengths (nm): 1726 (red channel), 1251 (green), and 750 (blue)]. (b) Corresponding error maps.

Fig. 16. Pansharpening results on (a) FR1 and (b) FR2 on three bands sampled in the visible spectrum; PAN image followed by EXP and all compared methods. Details on crop selection and sampled wavelengths for display are in Fig. 3.

Fig. 17. Pansharpening results on (a) FR1 and (b) FR2 on three bands sampled outside the visible spectrum; PAN image followed by EXP and all compared methods. Details on crop selection and sampled wavelengths for display are in Fig. 3.

Moving to the full-resolution results, the evaluation becomes even more difficult and subjective, lacking reference GTs. Fig. 16 shows the pansharpening results obtained on both FR1 and FR2, limited to selected bands of the visible spectrum roughly corresponding to the red, green, and blue channels. Fig. 17 gathers, instead, the same results limited to bands outside the visible range, shown in false colors. In both cases, together with the PAN, which is the spatial reference, the upscaled HS (EXP) is also shown and can serve as a spectral reference for quality assessment. As in the reduced-resolution case, we display zoomed details of the full pansharpening results obtained on the images shown in Fig. 3. Focusing on the FR1 detail, we can observe relatively good results provided by the proposed method, GSA, MTF-GLP, MF, Teams 3 and 4, HSpeNet1, and HSpeNet2, especially in the RGB space. The false-color results [Fig. 17 (top)], instead, highlight some problems occurring for spectral bands outside the visible range. For example, GSA seems unable to sharpen the bands of interest; the same holds for Teams 3 and 4 and for the HSpeNet variants. The differences among the methods are more evident for the FR2 clip, due to the presence of water. In this case, the spectral distortions are quite severe in several cases, e.g., GS, HyperPNN1, HyperPNN2, and HSpeNet1; more interestingly, some PAN patterns (on the water basin), absent from the false-color bands (see EXP on the bottom row), are injected into the pansharpened images. Aware of the subjectivity of these last considerations, we leave the final say to readers, who can add their own perspective to our observations and numbers. In this regard, for the sake of fairness, we must recall that the challenge teams had limited time to validate their design choices, which somehow explains some unsatisfactory results.

SECTION VI.

Conclusion

In this work, we have presented a novel deep-learning-based method for HS pansharpening. The proposed approach requires a baseline CNN model for single-band pansharpening that can be trained/tuned in an unsupervised manner, without resolution downgrade. To this aim, we resorted to a recently proposed four-/eight-band pansharpening model [47], suitably adapted to the single-band case. The baseline model is used sequentially for band-wise pansharpening, where each application leverages the model parameters adjusted on the previous band, running a few tuning iterations to let them fit the current (target) band. The tuning is feasible thanks to the use of an unsupervised loss, and the number of iterations is related to the “spectral” distance between the target band and the preceding one, from which the model is inherited.
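The propagation scheme summarized above can be sketched as follows. Function and variable names are illustrative, the loss is a simplified stand-in, and the iteration-count rule is indicated only qualitatively (growing with the spectral gap between consecutive bands), not as the paper's exact formula.

```python
import torch
import torch.nn as nn

def tune(model, band, pan, n_iters, lr=1e-5):
    # Target-adaptive tuning: a few unsupervised iterations on the current band.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        out = model(torch.cat([band, pan], dim=1))
        nn.functional.l1_loss(out, band).backward()  # stand-in for the actual loss
        opt.step()

def pansharpen_cube(model, hs_cube, pan, wavelengths, base_iters=50):
    # hs_cube: (B, 1, H, W) bands already upsampled to the PAN grid.
    # The model tuned on band b-1 is the starting point for band b, so the
    # parameters are propagated along the wavelength dimension.
    sharpened = []
    for b in range(hs_cube.shape[0]):
        band = hs_cube[b:b + 1]
        # More tuning iterations when the target band is spectrally farther
        # from the preceding one (qualitative rule).
        gap = abs(wavelengths[b] - wavelengths[b - 1]) if b > 0 else 0.0
        n_iters = base_iters + int(gap)
        tune(model, band, pan, n_iters)
        with torch.no_grad():
            sharpened.append(model(torch.cat([band, pan], dim=1)))
    return torch.cat(sharpened, dim=0)
```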

The advantages of the proposed method are as follows: 1) the method is fully unsupervised and does not require any training data other than the target image itself; 2) it can be applied to any PAN–HS dataset, with no need for a prefixed number of HS bands; 3) the learning process, which is interleaved with the band-wise inference steps, does not require resolution downgrade, a common but limiting option in pansharpening; 4) the method ensures good generalization properties thanks to the target-adaptive tuning; and 5) even though tuning iterations are involved, the computational complexity remains relatively limited, especially since the baseline CNN is a lightweight network.

Despite the very good results achieved by the proposed approach, there is still room for improvement. More specifically, special attention should be paid to the spatial consistency loss term used for the spectral bands that have no overlap with the PAN bandwidth. This falls within the more general problem of spatial quality assessment of pansharpened images, which is well known to be far from solved [2], [66], [67] and becomes even more challenging in the HS case. Another point worth investigating is the model propagation rule and the number of tuning iterations per band, both of which impact the computational load.

In order to ensure full reproducibility of our research outcomes, the code is made available at https://github.com/giu-guarino/R-PNN.

NOTE

Open Access provided by 'Università degli Studi di Napoli "Parthenope"' within the CRUI CARE Agreement
