Journals & Magazines >IEEE Journal of Selected Topi... >Volume: 18

Fusion-Decomposition Pan-Sharpening Network With Interactive Learning of Representation Graph

Abstract:

Deep learning (DL)-based pan-sharpening methods have become mainstream due to their exceptional performance. However, the lack of ground truth for supervised learning for...Show More

Metadata

Abstract:

Deep learning (DL)-based pan-sharpening methods have become mainstream due to their exceptional performance. However, the lack of ground truth for supervised learning forces most DL-based methods to use pseudoground truth multispectral images, limiting learning potential and constraining the model's solution space. Unsupervised methods often neglect mutual learning across modalities, leading to insufficient spatial details in pan-sharpened images. To address these issues, this study proposes a fusion-decomposition pan-sharpening model based on interactive learning of representation graphs. This model considers both the compression process from source images to fused results and the decomposition process back to the source images. It aims to leveraging feature consistency between these processes to enhance the spatial and spectral consistency learned by the fusion network in a data-driven manner. Specifically, the fusion network incorporates the meticulously designed representational graph interaction module and the graph interaction fusion module. These modules construct a representational graph structure for cross-modal feature communication, generating a global representation that guides the cross-modal semantic aggregation of multispectral and panchromatic data. In the decomposition network, the spatial structure perception module and the spectral feature extraction module, designed based on the attributes of the source image features, enable the network to better perceive and reconstruct multispectral and panchromatic data from the fused result. This, in turn, enhances the fusion network's perception of spectral information and spatial structure. Qualitative and quantitative results on the IKONOS, GaoFen-2, WorldView-2, and WorldView-3 datasets validate the effectiveness of the proposed method in comparison to other state-of-the-art methods.

Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing ( Volume: 18)

Page(s): 3812 - 3826

Date of Publication: 31 December 2024

ISSN Information:

DOI: 10.1109/JSTARS.2024.3524386

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Although multispectral images possess rich spectral information, their low resolution, due to the limitations of imaging sensors, results in the loss of spatial details, thereby limiting their application scope and accuracy. To overcome these limitations of satellite sensors, pan-sharpening technology has been proposed. This technique generates full-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) images with high spatial resolution panchromatic (PAN) images. The resulting HRMS images, which integrate both rich spectral and spatial information, are highly advantageous for various remote sensing tasks such as land cover classification, change detection, and target tracking, and have become the focus of research [1], [2], [3], [4], [5].

Traditional pan-sharpening methods can be categorized into three main approaches: Component Substitution (CS) [6], [7], multi-resolution analysis (MRA) [8], [9], and variational optimization (VO) [10], [11]. The CS methods work by transforming the LRMS and full-resolution PAN images into a new domain, using the spatial information from the PAN image to enhance the texture of the LRMS image. MRA methods employ multiscale processing techniques to fuse images at different resolutions, thereby preserving both spectral and spatial information. VO methods treat pan-sharpening as an inverse problem, introducing prior knowledge to optimize the solution. Although these methods have achieved decent results, their effectiveness is heavily reliant on handcrafted rules. This reliance often leads to spectral and spatial distortions because of improper transformations and solution conditions, thus limiting their practical applicability.

In recent years, deep learning (DL)-based pan-sharpening methods have become a research hotspot due to their powerful feature representation capabilities [12], [13], [14], [15], [16]. These methods typically involve a two-step design: first, extracting and fusing hierarchical features from source images using carefully designed networks to output the fused image; second, encouraging the model to learn precise solutions by minimizing the distance between the fusion result and the target. However, despite the advances achieved, several key challenges remain that limit the full potential of these methods. One of the significant obstacles in applying DL to pan-sharpening is the lack of high-quality ground truth (GT) for supervised learning. GT data are crucial for effectively training models, yet in remote sensing, obtaining true high-resolution fused images is impractical due to limitations in data acquisition. To address this, researchers have constructed training datasets with pseudo-GT by decomposing LRMS images, allowing models to approximate the training process. Although this approach enables the model to better learn pan-sharpening-related features, models trained with pseudo-GT are inherently constrained by the downscaled resolution of the solution space, which imposes an upper limit on their learning capacity [12], [13], [16]. Consequently, these models may struggle to generalize to high-resolution scenarios, leading to suboptimal performance when applied to real-world satellite images. To overcome these limitations, recent studies have focused on leveraging the relationships between HRMS images and source images, adopting unsupervised training methods to eliminate dependence on pseudo-GT. This shift toward unsupervised approaches aims to remove the bottleneck of limited-resolution GT while enabling models to learn directly from available high-resolution data. Thanks to adversarial training between the generator and the discriminator, the methods based on the generative adversarial network (GAN) [17], [18], [19], [20], [21] have unique advantages in this regard. However, the adversarial training mechanism is unstable, often leading to oscillations during the training process that can adversely impact pan-sharpening performance (see Fig. 1). This instability presents a significant drawback, as achieving reliable, high-quality results requires careful and cumbersome tuning of the adversarial components. As a result, many researchers have turned to convolutional neural network (CNN)-based methods, which aim to overcome some of the limitations associated with GANs by focusing on modeling the degradation process or employing frequency separation techniques. These methods to enhance the model's perception of the spatial details in PAN images and the spectral information in LRMS images [12], [16], [22], [23]. While these CNN-based methods have demonstrated promising results, they still face several challenges.

Fig. 1.

Demonstration of the fusion results from different methods. The comparative methods are the SOTA CSFNet [15], LDP [12], and PANGAN [17]. LDP and PANGAN suffer from spectral distortion while CSFNet is accompanied by loss of spatial details.

Show All

Challenge: The primary challenge with existing CNN-based methods is their inability to effectively integrate spatial and spectral features from PAN and MS images. Conventional CNN architectures typically treat these features independently, which limits their capacity for effective cross-modal feature communication. As a result, the fusion process often fails to fully capture the intricate relationships between spatial and spectral domains, leading to pan-sharpened images with either suboptimal spatial detail or compromised spectral fidelity. In addition, CNN-based methods use complex loss functions that contain multiple components to improve the fusion quality. Although these loss terms are designed to enhance the effect of fused images, they make parameter adjustment difficult, resulting in unstable training. This complexity makes it difficult to strike a balance between maintaining spatial details and spectral fidelity, thus limiting its practical application.

Solutions and Contributions: This study proposes a fusion-decomposition pan-sharpening model based on interaction learning of representation graphs, addressing the challenges of pan-sharpening in the lack of GT through the following strategies. 1) Recognizing the ability of the graph neural network (GNN) to model complex structured data and capture long-range dependencies, we leverage this strength to enhance cross-modal feature interactions, which are crucial for effectively integrating spatial details and spectral information in unsupervised pan-sharpening tasks. To this end, we introduce a GNN-based pan-sharpening network that utilizes the GNN to efficiently capture and transmit information within graph structures, guiding the model to generate high-quality fusion results. Specifically, we design a representational graph interaction module (RGIM) and a graph interaction fusion module (GIFM). RGIM constructs a representation graph structure using multilayer inputs from the encoder, promoting cross-modal semantic learning through interactions between graph nodes. Owing to the powerful capabilities of GNNs, this module effectively integrates multilayer, multimodal feature information. GIFM then uses the cross-modal semantic representations learned by RGIM to guide the feature reconstruction process, achieving semantic aggregation of multispectral and PAN data, thereby retaining more critical feature details in the fused pan-sharpened images. 2) To ensure the fused image maintains consistency with the spatial and spectral characteristics of the source images, we model not only the forward pan-sharpening process but also introduce a decomposition network during training, which decomposes the fused output back into its original components. This bidirectional process provides additional supervision without increasing model complexity during inference, significantly enhancing training stability and consistency. The designed effective spatial structure perception module (SSPM) and spectral feature extraction module (SFEM) ensure the effective implementation of the decomposition process with SSPM is designed to capture and preserve spatial structural features in the fused image, while SFEM focuses on extracting and preserving spectral features, ensuring precise reconstruction and consistency with the source images.

We further evaluate the proposed method on WorldView-2, WorldView-3, IKONOS, and Gaofen-2 datasets. Benefiting from cross-modal graph interactions and the decomposition consistency constraint, our method achieves superior performance compared to state-of-the-art (SOTA) methods. Our main contributions are summarized as follows.

We propose a novel unsupervised pan-sharpening framework that not only considers the compression process from the source images to the fused results but also strives to decompose the fused image back into approximations of the original source images. This method leverages the consistency of decomposition in a data-driven manner to guide the optimization of the pan-sharpening network, thereby enhancing the preservation of spatial details.
We design a GNN-based fusion network, incorporating efficient representation graph interaction module (RGIM) and graph interaction fusion module (GIFM), which facilitate effective cross-modal semantic learning and feature integration.
We introduce a decomposition network into the fusion process, featuring specifically designed SSPM and SFEMs, which drive the pan-sharpening network to retain more spatial and spectral details without increasing the model's complexity.

SECTION II.

Related Work

This section introduces existing DL-based pan-sharpening works, as well as GNN-based research works, which are most relevant to the proposed method.

A. Supervised Methods

The impressive representational capabilities of CNN have made them a focal point in the field of remote sensing, leading to successful applications in this domain [12], [13], [16], [23], [24]. Consequently, constructing supervised networks for hyperspectral pan-sharpening has become a research hotspot. Masi et al. [25] were the pioneers in introducing CNN into pan-sharpening, achieving results that outperformed traditional methods. This remarkable performance has driven researchers to explore DL-based methods to tackle the challenges of pan-sharpening. Yang et al. [26] proposed PanNet, which enhances fusion results by extracting high-pass domain features to provide rich spatial information. This method also employs residual connections to inject detailed information, thereby preserving complete spectral information. Wang et al. [27] employed a stepwise refinement approach, performing two consecutive 2× up-sampling operations on pan-sharpened frames, which allows the model to better capture spectral information and spatial structures. Zhang et al. [13] introduced a progressive pan-sharpening architecture based on deep spectral transformation, balancing performance across different resolutions. Similarly, Li et al. [28] proposed a multilevel progressive enhancement pan-sharpening network to alleviate the spatial detail distortion and spectral information loss caused by upsampling. Liu et al. [24] proposed a strict tensor-based pan-sharpening model, guided by a dual nonconvex tensor low-rank before fuse source images effectively. Tan et al. [29] proposed a hierarchical frequency integration network, which uses local Fourier information to achieve hierarchical space-frequency information integration. Wang et al. [30] introduced haze-line prior to the joint haze correction of LRMS and PAN images, and designed a low-rank tensor completion pan-sharpening method based on haze correction. Although these methods have achieved decent results, the supervised learning that relies on pseudo-GT formation often limits their solution domain, significantly reducing their performance on full-resolution data.

B. Unsupervised Pan-Sharpening

Unsupervised pan-sharpening methods, by operating independently of GT, effectively mitigate the limitations inherent in supervised methods. GANs, as generative models that learn to generate data through adversarial training with a discriminator, represent a prominent unsupervised method. Ma et al. [17] first proposed using GAN for remote sensing image pan-sharpening, achieving the retention of spectral information and spatial structure by establishing adversarial relationships between the fusion results and source images. Shang et al. [31] combined transformer networks with multiscale interactions to propose MFTGAN, which learns long-range spectral and spatial correlations. Zhou et al. [32] designed UCGAN, which trains on full-size images for pan-sharpening, ensuring that the fusion results contain rich spatial and spectral information. Similarly, He et al. [33] introduced a dual-cycle consistency GAN that enables deep integration of spatial and spectral features through cross-domain information interaction. Despite these successes, the instability of adversarial training often makes it challenging to strike a balance between preserving spectral and spatial information. In contrast to GAN-based methods, CNN-based approaches typically rely on modeling the relationship between spectral and spatial details to guide the network in generating HRMS images. For instance, Xiong et al. [34] used low-pass filters to ensure spectral and spatial consistency between the fusion results and source images. Li et al. [15] enhanced spatial structure retention by maximizing local texture similarity between the fusion results and PAN images. Ciotola et al. [35] assumed known degradation operators for LRMS images to ensure spectral fidelity, while Ni et al. [12] optimized the pan-sharpening network by learning the degradation process. Similarly, Lin et al. [16] treated pan-sharpening as a blind image deblurring task, learning the blur kernel to ensure spectral consistency between the fusion results and LRMS images. However, modeling the relationship between spectral and spatial details is often unknown and challenging, often requiring the design of complex constraints. In addition, many models do not adequately consider the internal design of message passing, which can prevent the fusion results from reflecting some critical details of the source images.

In contrast, our method leverages the interaction of cross-modal graph features to learn semantic representations, which more effectively guide the reconstruction of fusion results. Furthermore, leveraging the decomposition consistency, our method constructs a decomposition network to reconstruct the source images from the fusion results. This data-driven approach implicitly ensures the preservation of spectral information and spatial structure, while also reducing the complexity of the model design.

C. Graph Neural Network

The ability of GNN to effectively capture and transmit information within graph structures has led to their increasing popularity in the field of computer vision. Particularly in feature tasks, by treating images as structured graphs, GNN can better understand and represent the complex relationships and dependencies within images, excelling in tasks such as object detection [36], [37], image segmentation [38], [39], and multimodal image fusion [40]. For instance, Xie et al. [38] introduced a GNN-based data scale-aware network for few-shot semantic segmentation. Similarly, Knyazev et al. [41] utilized image superpixels as graph nodes to form multiple graphs for image classification. Zhao et al. [36] explored the intrinsic image structure by treating superpixels as graphs to investigate the internal structure of input images. Lu et al. [39] proposed a network that addresses semantic segmentation as a graph node classification task. Li et al. [40] constructed graphs using multilevel features of source images to guide multimodal image fusion. Qiao et al. [37] incorporated geometric and semantic information to segment objects and constructed initial graphs to predict the saliency ranking of salient objects in a context-aware manner. Pan et al. [42] developed a cascade hierarchical graph convolutional network to comprehensively learn the feature correlations between manipulated and nonmanipulated regions at different scales.

In summary, unlike CNN-based methods that are constrained by local interactions on fixed grids, GNNs represent spatial and spectral features as graph nodes, enabling effective message passing that captures both global dependencies and local contextual relationships. This makes GNNs particularly advantageous for cross-modal feature fusion in pan-sharpening, where integrating the high spatial resolution of PAN images with the spectral richness of MS images requires a nuanced understanding of nonuniform and heterogeneous relationships. Leveraging this concept, the RGIM and GIFM are designed to enhance cross-modal feature fusion through graph-based interactions, effectively capturing complex global and local contextual relationships. As a result, our method provides a deeper and more coherent fusion outcome compared to CNN-based methods, offering a superior exploration of the pan-sharpening task.

SECTION III.

Method

This method aims to address two major issues in current unsupervised pan-sharpening methods: the feature loss caused by insufficient design for cross-modal communication within the network and the challenges posed by complex loss functions to model parameters. To this end, we propose a squeeze-decomposition pan-sharpening model based on interactive learning of representation graphs, as illustrated in Fig. 2, which consists primarily of a fusion network and a decomposition network. The fusion network, functioning as an end-to-end model, serves as the target network for image fusion, aiming to integrate source images into a single image. Once training is complete, the fusion network can generate the fused image directly from the input without requiring the decomposition network's involvement. Conversely, the decomposition network operates as an auxiliary network, decomposing the fused result to produce images consistent with the source images. This data-driven approach ensures that the fusion network can better capture the spatial and spectral details of the source images, ultimately producing high-quality fused images.

$Fig. 2. - Overall framework of the proposed method. PAN and LRMS images are input into the fusion network to generate the fused image $F$, which is then reconstructed through the decomposition network. Under the constraint of $\text{L}_{\text{dc}}$, the fusion network can effectively perceive the spectral and spatial information of the source images during the fusion-decomposition process.$

Fig. 2.

Overall framework of the proposed method. PAN and LRMS images are input into the fusion network to generate the fused image $F$, which is then reconstructed through the decomposition network. Under the constraint of $\text{L}_{\text{dc}}$, the fusion network can effectively perceive the spectral and spatial information of the source images during the fusion-decomposition process.

Show All

A. Fusion Network

Fig. 3 illustrates the framework of the fusion network. Given the PAN image $\operatorname{PAN}\in \mathbb {R}^{H \times W \times 1}$ and the LRMS image $\operatorname{LRMS} \in \mathbb {R}^{H \times W \times B }$, the encoder in the fusion network extracts feature map $\mathcal {F}_{\text{pan}}^{1:\mathfrak {m}} /\mathcal {F}_{\text{lrms}}^{1:\mathfrak {m}}=\lbrace \mathcal {F}_{\text{pan}}^{j} /\mathcal {F}_{\text{lrms}}^{j}\rbrace _{j=1}^{\mathfrak {m}}$ of the source image, in which $\mathfrak {m}$ denotes the $j$th encoding module. $H \times W$ and $B$ represent the spatial dimension and spectral bands. These feature maps serve two main purposes: 1) the high-level feature maps $\mathcal {F}_{\text{pan}}^{3}$ and $\mathcal {F}_{\text{lrms}}^{3}$ are concatenated and input into the decoder to reconstruct the fused result; 2) $\mathcal {F}_{\text{pan}}^{2 \& 3}$ and $\mathcal {F}_{\text{lrms}}^{2 \& 3}$ are flattened into $\mathcal {A}_{p}^{1 \& 2}$ and $\mathcal {A}_{m}^{1 \& 2}$, which are then input into RGIM to construct a representational graph structure and generate a global representation $\mathcal {R}$. In designing the encoders, we selected WaveUnet [43] for the PAN image and residual blocks (ResB) for the LRMS image. The ResB consists of two convolutional layers with kernel sizes of 3, strides of 1, and padding of 1 (3×3 Conv; unless otherwise specified, all subsequent uses use this parameter), combined with ReLU activation functions, and includes a skip connection with a 3×3 Conv. This heterogeneous encoder design accounts for the differences between source images, offering the following advantages: 1) it ensures the spatial matching of high-level semantic information $\mathcal {F}_{\text{pan}}^{3}$ and $\mathcal {F}_{\text{lrms}}^{3}$ from the source images; 2) WaveUnet effectively retains spatial structural features by separating high-frequency features (LH, HL, HH) during the encoding stage and reusing them during the decoding stage. To enhance feature extraction from the source images, we initially employ a 3×3 Conv to boost the channel dimension of features. For the decoder, we adopted the WaveUnet design, with the modification of adding the GIFM after each wavelet upsampling block. The GIFM module, using the global representation $\mathcal {R}$ as a reference, learns cross-modal feature representations, thereby guiding the integration of high-frequency and low-frequency features. This ensures that the fused results effectively preserve both spectral and spatial information.

$Fig. 3. - Illustration of the fusion network. Specifically, we input the multiscale representation features extracted from the source images ($\mathcal {A}_{M}^{1}$, $\mathcal {A}_{M}^{2}$, $\mathcal {A}_{P}^{1}$, $\mathcal {A}_{P}^{2}$) into the RGIM to construct a cross-modal representational graph structure. Under the guidance of the lead node $\mathcal {A}_{M}^{2}$, cross-modal feature interactions are effectively facilitated. Subsequently, the global representation $\mathcal {R}$, generated from the feature interactions, is used in the GIFM to guide the efficient aggregation of cross-modal features, ensuring the effective retention of both spatial and spectral information.$

Fig. 3.

Illustration of the fusion network. Specifically, we input the multiscale representation features extracted from the source images ($\mathcal {A}_{M}^{1}$, $\mathcal {A}_{M}^{2}$, $\mathcal {A}_{P}^{1}$, $\mathcal {A}_{P}^{2}$) into the RGIM to construct a cross-modal representational graph structure. Under the guidance of the lead node $\mathcal {A}_{M}^{2}$, cross-modal feature interactions are effectively facilitated. Subsequently, the global representation $\mathcal {R}$, generated from the feature interactions, is used in the GIFM to guide the efficient aggregation of cross-modal features, ensuring the effective retention of both spatial and spectral information.

Show All

RGIM and GIFM: To enhance the model's sensitivity to spatial structure and spectral information, we designed a representational graph structure for cross-modal feature interaction. Compared to the traditional pixel-level feature graph, this representational graph more efficiently aggregates and transmits semantic information, thereby enabling the model to better capture the complex relationships and dependencies within the source images. The representation graph interaction module (RGIM), as illustrated in Fig. 4, aims to enhance the model's ability to capture spatial and spectral information by constructing a graph structure where multilevel representation features from heterogeneous encoders serve as nodes. The construction of this graph structure involves both node formation and edge definition, which are detailed as follows.

Fig. 4.

Architecture of RGIM and GIFM. (a) The structure of RGIM. (b) The structure of GIFM.

Show All

1) Node construction:

To build the graph structure, multilevel representational features are first extracted from the heterogeneous encoders and serve as the initial nodes. Before constructing the graph, these initial feature representations are refined using the dynamic node refinement module. This refinement process is designed to enhance the discriminative power of the features, ensuring that the subsequent graph construction is based on optimized and information-rich representations. Furthermore, to ensure that the nodes drive the graph structure towards an optimal balance between preserving spatial details and spectral features, we designate $\mathcal {A}_{m}^{2}$ as the guiding node. This guiding node, enriched with spectral information, directs other nodes to focus on critical regions requiring feature refinement, highlighting complementary content while reducing redundancy. This ensures that the refined representations effectively capture complex spatial–spectral dependencies during graph interaction, leading to a more comprehensive fusion of multiscale information. The entire refinement process can be represented as follows. \begin{equation*} \hat{\mathcal {A}_{k}^{i}}= \mathit{\text{Sig}}(\mathit{\text{Conv}}(\mathcal {A}_{M}^{2}))\cdot \mathcal {A}_{k}^{i} + \mathcal {A}_{k}^{i} \tag{1} \end{equation*} View Sourcewhere $k \in \lbrace p, m\rbrace$, $i \in \lbrace 1, 2\rbrace$. $\mathit{\text{Sig}}(\cdot)$ and $\mathit{\text{Conv}}(\cdot)$ represent Sigmoid and convolution operations, respectively.

2) Edge Definition:

In RGIM, we construct implicit graph edges through dynamic interactions between nodes, learning richer semantic relationships by chaining nodes of different modalities. The edge construction mechanism is described as follows.

Node Feature Concatenation and Message Construction
During each update iteration, the feature vector of the current node is concatenated with feature vectors from other nodes, forming an implicit “message” that represents the interaction pathways between nodes. Let $\mathcal {G} = [ \hat{\mathcal {A}_{p}^{1}}, \hat{\mathcal {A}_{p}^{2}}, \hat{\mathcal {A}_{m}^{1}}, \hat{\mathcal {A}_{m}^{2}}]$ denote the set of node feature vectors. This concatenation effectively simulates the graph edges, enabling efficient feature propagation and establishing dependencies among nodes.
Dynamic Edge Updates via Convolutional Gated Recurrent Units (ConvGRU)
Each node's features are updated iteratively with a ConvGRU, which dynamically constructs edges through recursive updates based on neighboring nodes' aggregated messages. The recursive update for node features can be expressed as \begin{equation*} \mathcal {G}^{(z+1)} = \mathcal {\psi } \cdot \mathcal {G}^{(z)} \tag{2} \end{equation*} View Sourcewhere $\mathcal {G}^{(z)}$ denotes the feature matrix at iteration $z$, and $\psi$ represents the implicit adjacency relation established by ConvGRU. This recursive updating mechanism dynamically adjusts the edges, allowing the model to capture complex feature dependencies adaptively, effectively fusion of spatial and spectral information.
Learnable Edge Weight Control
The strength of edges is modulated by a learnable parameter $\gamma$, which adjusts the influence of neighboring nodes during each update. The weighted combination of features for node $i$ can be expressed as: \begin{equation*} \text{pred}_{i} = h_{t} \cdot \gamma + \text{base}_{i} \tag{3} \end{equation*} View Sourcewhere $h_{t}$ represents the updated feature for the current node, and $\text{base}_{i}$ is the original feature before interaction. This learnable weight mechanism enables selective integration of crucial information, enhancing feature dependency representation by dynamically regulating information flow across feature levels.

Once the nodes in the GNN have completed the information delivery process, they are aggregated through a summation operation to produce the global representation $\mathcal {R}$. This global representation $\mathcal {R}$ is then utilized in the graph interaction fusion module (GIFM) to direct the integration of high-frequency and low-frequency features. As illustrated in Fig. 4(b), the GIFM framework utilizes $\mathcal {R}$ for two primary purposes: 1) to learn deep convolutional kernels that guide image reconstruction; 2) to generate modulation coefficients based on cross-modal representations to perform channel feature adaptation. The former adjusts the distribution of the data flow to balance spectral information and spatial details, while the latter refines the data flow to ensure the retention of salient information. These processes, combined with data flow reuse to prevent gradient explosion, output integrated feature information for subsequent feature reconstruction.

B. Decomposition Network

The overall architecture of the decomposition network, as illustrated in Fig. 5, is designed to reconstruct the PAN image $\widetilde{\operatorname{PAN}}$ and the upsampled version of the LRMS image $\operatorname{LRMS}_{\uparrow 4}$ from the fused image $F$. The core of this design lies in treating the pan-sharpening process as a reversible transformation. Despite the inevitable information loss during fusion, ensuring consistency between the decomposition results and the source images helps minimize this loss. To achieve this, we introduce two core modules: the SSPM and the spectral feature extraction module (SFEM), each tailored to the unique characteristics of PAN and LRMS images.

Fig. 5.

Illustration of the decomposition network.

Show All

In the PAN reconstruction branch, the feature flow passes through the SSPM, which extracts spatial structure features from $F$. These features are then concatenated with the original fused image $F$ to reconstruct the $\widetilde{\operatorname{PAN}}$ image. The SSPM utilizes a Sobel operator combined with a 3×3 convolution to capture the spatial structures of the data, and these features are added back into the data flow in a deep residual manner to improve the extraction of spatial information, thus enhancing the spatial detail retention during the reconstruction. In the multispectral reconstruction branch, the data flow undergoes a similar process, passing through the SFEM to extract spectral information from $F$. These features are concatenated with $F$ to reconstruct the $\operatorname{LRMS}_{\uparrow 4}$. The SFEM uses learnable parameters $\Theta$ to construct deep-wise convolutions with ReLU activations, enabling the model to capture more complex and abstract patterns in both the depth and breadth of spectral features. This allows the network to more effectively extract spectral information and ensure the multispectral image reconstruction is both accurate and representative.

By incorporating learnable parameters $\Theta$ in the SFEM and utilizing deep residual structures in both the SSPM and SFEM, the network is able to adapt to varying satellite characteristics while mitigating potential issues such as gradient explosion. This modular architecture enables each module to focus on specific aspects of the image reconstruction, effectively reducing the complexity of individual components and minimizing the risk of overfitting. Consequently, the network demonstrates strong adaptability to different types of multispectral and PAN data.

C. Loss Function

In our method, we not only consider the forward fusion process but also the reverse process of decomposing the fusion results back into the source images. Therefore, we can drive the fusion network to generate high-quality fusion results by constraining the decomposition effectiveness of the decomposition network, which greatly simplifies the complexity of the loss function design. Consequently, the fusion network and the decomposition network can complete parameter learning under the constraint of the decomposition consistency loss $L_{\text{dc}}$, which can be represented as follows: \begin{equation*} L_{\text{dc}}=\Vert \widetilde{\text{PAN}}-\text{PAN}\Vert _{1}+\left\Vert \text{LRMS}_{\uparrow 4}-(\text{LRMS})_{\uparrow 4}\right\Vert _{1} \tag{4} \end{equation*} View Sourcewhere $\Vert \cdot \Vert _{1}$ denote the $\ell _{1}$-norm, $(\text{LRMS})_{\uparrow 4}$ represents the LRMS after bilateral upsampling by 4×.

In practice, there is an inherent challenge in that the optimization process of network models, often treated as a closed system, relies solely on the objective function as a constraint, rendering their internal optimization uncontrollable. Therefore, in our two-stage parameter optimization approach, relying solely on $L_{\text{dc}}$ may lead to ambiguous fusion results. To address this issue, we pretrained the decomposition network based on [13] to establish the initial mapping relationship. Subsequently, fine-tuning is employed to enhance the robustness of the fusion network. Specifically, we refer to [12] to train the PAN reconstruction branch using LRMS to learn the mapping from LRMS to PAN, while training the multispectral reconstruction branch using PAN to learn the mapping from PAN to $\operatorname{LRMS}_{\uparrow 4}$. It is worth noting that we use concatenation to ensure the consistency of channel numbers between PAN and $F$. Given that this process closely resembles autoencoder-based multimodal image fusion, we draw on the approach in [43] and [44], utilizing the reconstruction loss $L_{\text{re}}$ to pretrain the decomposition network. This pretraining step helps in stabilizing the subsequent parameter optimization and ensures more reliable fusion outcomes. The reconstruction loss can be formulated as follows: \begin{equation*} L_{\text{re}}=\Vert O-I\Vert _{1} + \lambda (1-\text{SSIM}(O,I)) \tag{5} \end{equation*} View Sourcewhere $O$ and $I$ denote indicate the output and input images, respectively. $\text{SSIM}(\cdot)$ denotes the structural similarity [45] between the output and input images.

SECTION IV.

Experiments and Results

A. Experimental Settings

Benchmark datasets: We use four satellite datasets: 4-band IKONOS and GaoFen-2 and 8-band WorldView-2 and WorldView-3 to evaluate the effectiveness of our method. The experiments are conducted on both reduced-resolution and full-resolution data. The reduced-resolution data is synthesized according to Wald's protocol, where the LRMS and PAN images are downsampled by a factor of four, with the original LRMS images serving as GT. The basic information of these four datasets is summarized in Table I, where 10% of the data in the training dataset is used as the validation dataset.

TABLE I Details of the Benchmark Dataset

Implementation details: The experiments were conducted on a computer running Windows 11, with the following hardware configuration: Intel Core i9-12900 CPU and NVIDIA GeForce RTX4090 GPU. We implemented the proposed method using PyTorch and employed two Adam optimizers to separately train the fusion network and the decomposition network. The optimizer for the fusion network starts with an initial learning rate of $\text{10}^{-3}$, which is reduced by 10% every 100 epochs, with a batch size set to 4. For the decomposition network, the initial learning rate was set differently at $\text{10}^{-4}$, while all other parameters remained the same. The model converged after 700, 1000, 1000, and 300 epochs on the IKONOS, GaoFen-2, WorldView-2, and WorldView-3 datasets, respectively. Prior to the main training phase, the decomposition network underwent a pretraining phase for 200 epochs, with the hyperparameter $\lambda$ in $L_{\text{re}}$ empirically set to 5, as recommended by [44].

Baseline: We compared the proposed method with eight SOTA pan-sharpening methods, including three traditional methods: Brovey [46], IHS [13], and GS [47]; four unsupervised methods: CSFNet [15], LDP [12], PANGAN [17], and ZeroSharpen [48]; and a semisupervised method, ZS-PAN [49]. All of the aforementioned methods were provided by the corresponding authors. For fairness, we retrained these methods on our datasets.

To evaluate our method, we utilized three popular quality assessment metrics on full-resolution data: the spectral distortion index ($D_{\lambda }$), the spatial distortion index ($D_{S}$), and the no-reference quality index (QNR) [50]. Specifically, $D_{\lambda }$ measures spectral distortion by using interpolated LRMS images as the spectral reference. $D_{S}$ evaluates spatial distortion by selecting a low-pass version of the PAN image as the spatial reference. QNR assesses the quality of HRMS by measuring the similarity between the bands and between each band and the PAN image, with higher QNR values indicating better quality and lower $D_{\lambda }$ and $D_{S}$ values indicating less distortion. For the reduced-resolution data, we employed four metrics to quantitatively assess the image quality: Erreur Relative Globale Adimensionnelle de Synthese (ERGAS), the spectral angle mapper (SAM), the spatial correlation coefficient (SCC), and the image quality index (Q). These metrics evaluate the quality of the fusion by assessing the relationships between the reference image and the test image. Specifically, higher values of Q and SCC indicate better fusion quality, while lower values of SAM and ERGAS denote less distortion and better spectral fidelity. Detailed settings and deployment information for these metrics can be found in [51].

B. Results on 4-Band Dataset

We conducted qualitative and quantitative comparisons of the proposed method on IKONOS and GaoFen-2 datasets to assess its performance on 4-band data.

Qualitative experiments: Figs. 6 and 7 display the qualitative results of various methods using full-resolution and reduced-resolution data, respectively. Intuitively, compared to eight other SOTA methods, our method better balances spectral information and spatial features, thus providing superior visual perception. Traditional methods guided by prior knowledge have achieved some success on IKONOS data. However, the fixed nature of prior information limits their adaptability when dealing with different satellite datasets, resulting in ineffective constraints for spatial and spectral information fusion. Consequently, these methods exhibit spectral distortions on the GaoFen-2 dataset, particularly evident in the reduced-resolution data. The error map compared to the GT demonstrates these issues. Unsupervised DL-based methods have achieved decent performance by exploring content correlations between source images. Although PANGAN captures rich spatial details, it is frequently affected by spatial artifacts and spectral distortions due to the instability of GAN training. On the other hand, CSFNet leverages spectral correlation mapping between LRMS and PAN image features to guide the generation of HRMS images, effectively preserving spectral information but lacking sufficient spatial structure awareness. Zero-Sharpen utilizes variational models to guide the preservation of spectral and spatial information, which mitigates data dependency; however, its strong adaptability results in inadequate retention of spatial details. In contrast to these methods, our method leverages multilevel features to construct a representation graph structure, effectively guiding the feature reconstruction process through cross-modal communication, thereby better preserving both spectral information and spatial structure compared to SOTA methods. Although LDP also preserves source features effectively through degradation estimation, it falls slightly short compared to our approach due to insufficient interaction between source features. In addition, the bidirectional fusion-decomposition process enables the proposed method to more effectively perceive the spectral and spatial information from the source images, leading to superior performance compared to the semisupervised ZS-PAN. Despite the superior performance of our method, the lack of supervision from GT constraints may still introduce some distortion.

Fig. 6.

Visualization of the fusion results of different methods on IKONOS and GaoFen-2 full-resolution data. The areas highlighted and enlarged within the red frame, as well as those indicated by the white arrow, clearly demonstrate the superiority of the proposed method.

Show All

Fig. 7.

Visualization of the fusion results of different methods on IKONOS and GaoFen-2 reduced-resolution data. The areas highlighted and enlarged within the red frame clearly illustrate the superiority of the proposed method.

Show All

Quantitative experiments: Tables II and III present the quantitative results of different methods on the IKONOS and GaoFen-2 datasets, including both full-resolution and reduced-resolution results. Despite variations in individual scores across the two datasets, our method consistently ranks first overall, demonstrating its robustness and effectiveness. Specifically, on the IKONOS dataset, our model excels in the reduced-resolution dataset, whereas on the GaoFen-2 dataset, it achieves superior results at full-resolution. We attribute these differences to the varying properties of the satellite images. Nonetheless, the consistent top ranking across datasets underscores the robustness and reliability of our proposed method. The success of our method can be attributed to the cross-modal feature interactions facilitated by representation graph structures, as well as the advantage of decomposing consistency constraints in the pan-sharpening task. As a result, our method is more effective in perceiving and extracting spectral information and spatial details from the source images.

TABLE II Quantitative Comparison on the IKONOS Dataset

TABLE III Quantitative Comparison on the GaoFen-2 Dataset

C. Results on 8-Band Dataset

Qualitative experiments: We also conducted qualitative experiments on the 8-band data, with results for the full-resolution dataset and reduced-resolution shown in Figs. 8 and 9, respectively. A similar 4-band situation is also seen here, and the increase in the amount of source feature information challenges the robustness of these methods. Brovey, GS, and IHS, constrained by prior knowledge, exhibit notable deficiencies in capturing spatial details across the 8-band data. In addition, Brovey and IHS exhibit greater fusion bias in the reduced-resolution data, as illustrated by the error maps in Fig. 9. Moreover, the substantial information contained in the 8-band data poses challenges for PANGAN and LDP, both of which show noticeable spectral distortions in the reduced-resolution data. A similar issue is observed in ZS-PAN, where a decrease in performance is evident in the full-resolution data. In contrast, ZeroSharpen, benefiting from variational priors, achieves superior spectral preservation, albeit with some loss of spatial details. Compared to these methods, our method effectively balances both forward information fusion and backward decomposition, demonstrating superior qualitative performance that further validates the effectiveness and robustness of the proposed method.

Fig. 8.

Visualization of the fusion results of different methods on WorldView-2 and WorldView-3 full-resolution data. The areas highlighted and enlarged within the red frame, as well as those indicated by the white arrow, clearly demonstrate the superiority of the proposed method.

Show All

Fig. 9.

Visualization of the fusion results of different methods on WorldView-2 and WorldView-3 reduced-resolution data. The areas highlighted and enlarged within the red frame clearly illustrate the superiority of the proposed method.

Show All

Quantitative experiments: Tables IV and V present the quantitative results of different methods on the WorldView-2 and WorldView-3 datasets. It can be observed that our method exhibits some degradation as the data complexity increases, particularly on the WorldView-3 dataset, where overall performance falls below that of ZeroSharpen. We attribute this primarily to our method's relatively lower performance in chromatic fidelity compared to ZeroSharpen, as the variational prior gives ZeroSharpen an advantage in preserving spectral information. However, this also results in poorer scores for ZeroSharpen in terms of $D_{S}$, as the emphasis on spectral fidelity comes at the cost of spatial imbalance, which is also evident in the qualitative results. Furthermore, ZS-PAN's performance on the WorldView-2 dataset demonstrates the effectiveness of the semisupervised approach, especially with the eight-band data containing richer information. Overall, while our method exhibits a certain degree of distortion, both qualitative and quantitative assessments indicate that this level of degradation is acceptable. This is because our model training does not rely on direct GT supervision. Moreover, the relatively lower degree of degradation compared to other SOTA methods further validates the robustness of our method.

TABLE IV Quantitative Comparison on the WorldView-2 Dataset

TABLE V Quantitative Comparison on the WorldView-3 Dataset

D. Fusion Performance Analysis

1) Ablation Study

In our method, we designed a fusion network and a decomposition network, aiming to leverage their synergistic optimization to ensure that the fused images effectively retain the spatial structure and spectral information of the source images. To further investigate the rationality and effectiveness of this design, we conducted ablation experiments on the internal design of the fusion and decomposition networks using the IKONOS dataset. In the fusion network, we first removed the RGIM ($\text{w/o} \operatorname{RGIM}$), using only the output features from the last layer of the encoder to guide the GIFM in reconstructing the fused image. This was done to explore the role of the representation of graph structure in cross-modal feature interaction. Next, we removed the GIFM ($\text{w/o} \operatorname{GIFM}$) and used $\mathcal {R}$ as a guide to build an attention module to promote the feature reconstruction process to examine the guiding role of GIFM in cross-modal semantic information aggregation. Finally, we removed both the RGIM and GIFM ($\text{w/o} \operatorname{RGIM \& GIFM}$) to assess their effectiveness in cross-modal semantic learning and feature integration. In the decomposition network, we replaced SSPM and SFEM ($\text{w/o} \operatorname{SSPM \& SFEM}$) with regular convolutions to explore the effectiveness of these designs in enhancing the fusion network's perception of spatial details and spectral information of the source images.

Figs. 10 and 11 show the qualitative results of the ablation experiments on full- and reduced-resolution data. Intuitively, removing both GIFM&RGIM and SSPM&SFEM resulted in a severe distortion in the fused images. Combined with the quantitative results in Table VI, it can be seen that these design combinations are effective. Although removing GIFM&RGIM achieved the highest score on $D_{S}$, the spectral degradation resulting from prioritizing spatial structure is intolerable. Furthermore, both qualitative and quantitative results demonstrate the positive impact of GIFM and RGIM on model performance. RGIM, by constructing graphical structures using representation graphs, effectively promotes cross-modal semantic learning. Consequently, removing RGIM leads to spectral information degradation in the fused results, as shown by the buildings circled in blue and houses in red in Fig. 10. GIFM primarily adjusts the distribution of data flow using $\mathcal {R}$ to balance spatial structure and spectral information. Therefore, removing GIFM causes the model's control over these aspects to deteriorate, which is most evident in the error maps of reduced-resolution data in Fig. 11, where significant distortion appears at street corners.

TABLE VI Quantitative Comparison of Ablation Experiments on IKONOS Dataset

Fig. 10.

Visualization of ablation experiments on IKONOS full-resolution data. The regions highlighted and enlarged within the red and blue frames demonstrate the superiority of the proposed method. (a) PAN. (b) MS. (c) Ours. (d) w/o RGIM. (e) w/o GIFM. (f) w/o GIFM&RGIM. (g) w/o SSPM&SFEM.

Show All

Fig. 11.

Visualization of ablation experiments on IKONOS reduced-resolution data. The highlighted and enlarged areas within the red frame demonstrate the superiority of the proposed method. In addition, the orange frame, which is error map of the red frame regions, further illustrates the differences across various modules. (a) PAN. (b) MS. (c) GT. (d) Ours. (e) w/o RGIM. (f) w/o GIFM. (g) w/o GIFM&RGIM. (h) w/o SSPM&SFEM.

Show All

The ablation studies validate the effectiveness of the proposed method. Whether applied to full-resolution or low-resolution data, the collaborative operation of each module enhances the model's fusion performance. In addition, Fig. 12 presents the training and validation loss curves on the IKONOS dataset, indirectly demonstrating the stability of the proposed method.

Fig. 12.

Visualization of training and validation losses on IKONOS demonstrates the learning effect and generalization ability of the proposed method.

Show All

2) Number of SSPM and SFEM

SSFM and SEFM drive the fusion network in a data-driven manner within the decomposition network to better perceive the spectral information and spatial structure of the source images. Therefore, we further investigated the impact of the number of SSFM and SEFM on the fusion performance to optimize their quantity in the decomposition network. Table VII presents the quantitative results of different quantities of SSFM and SEFM. It can be observed that the model performs best on full-resolution data when the quantity is 3, whereas it achieves the highest score on low-resolution data when the quantity is 1. Therefore, after considering the performance on both full-resolution and low-resolution data, we set their quantity to 1.

TABLE VII Quantitative Comparison of IKONOS Datasets With Different Numbers of SSPMs and SFEMs

E. Efficiency Study

Table VIII presents the average runtime of various methods across different datasets, providing an evaluation of the efficiency of our proposed approach. By employing representational graphs to facilitate cross-modal feature interactions and utilizing these graphs to guide image reconstruction, our method effectively aggregates source image features at a relatively fast pace. Although it still has limitations compared to traditional methods, both qualitative and quantitative results demonstrate that our approach achieves robust and generalized fusion performance at a lower computational cost.

TABLE VIII Average Running Time of Different Methods (Unit: Second)

SECTION V.

Conclusion

This study addresses the limitations of current pan-sharpening methods by proposing a fusion-decomposition pan-sharpening model based on interactive learning of representation graphs. The model accounts for both the compression process from source images to fused results and the decomposition process from fused results back to source images. Within the fusion network, we designed an RGIM and a GIFM. The RGIM utilizes multilayer inputs from the encoder to construct a representational feature structure, facilitating cross-modal semantic learning. The GIFM leverages these cross-modal semantic representations to guide the feature reconstruction process, promoting the semantic aggregation of multispectral and PAN data, and thereby preserving important feature expressions in the pan-sharpened images. In the decomposition network, we incorporated an SSPM and an SFEM to reconstruct multispectral and PAN data from the pan-sharpened images. This data-driven approach enables the fusion network to better learn spatial and spectral consistency across different resolutions. Notably, the decomposition network serves as an auxiliary component and does not increase the model's complexity, thereby simplifying the overall training design. Our extensive experiments demonstrate the superiority of the proposed method over other SOTA methods.

References is not available for this document.

MIT Libraries

MIT Libraries

Fusion-Decomposition Pan-Sharpening Network With Interactive Learning of Representation Graph

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

Introduction