Introduction
Although multispectral images possess rich spectral information, their low spatial resolution, imposed by the limitations of imaging sensors, results in the loss of spatial details, thereby limiting their application scope and accuracy. To overcome these sensor limitations, pan-sharpening technology has been proposed. This technique generates high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) images with high-spatial-resolution panchromatic (PAN) images. The resulting HRMS images, which integrate both rich spectral and spatial information, are highly advantageous for various remote sensing tasks such as land cover classification, change detection, and target tracking, and have become a focus of research [1], [2], [3], [4], [5].
Traditional pan-sharpening methods can be categorized into three main approaches: component substitution (CS) [6], [7], multiresolution analysis (MRA) [8], [9], and variational optimization (VO) [10], [11]. CS methods transform the LRMS and full-resolution PAN images into a new domain and use the spatial information of the PAN image to enhance the texture of the LRMS image. MRA methods employ multiscale processing techniques to fuse images at different resolutions, thereby preserving both spectral and spatial information. VO methods treat pan-sharpening as an inverse problem and introduce prior knowledge to optimize the solution. Although these methods have achieved decent results, their effectiveness relies heavily on handcrafted rules. This reliance often leads to spectral and spatial distortions caused by improper transformations and solution conditions, limiting their practical applicability.
In recent years, deep learning (DL)-based pan-sharpening methods have become a research hotspot due to their powerful feature representation capabilities [12], [13], [14], [15], [16]. These methods typically involve a two-step design: first, extracting and fusing hierarchical features from the source images using carefully designed networks to produce the fused image; second, encouraging the model to learn precise solutions by minimizing the distance between the fusion result and the target. Despite the advances achieved, several key challenges remain that limit the full potential of these methods. A significant obstacle in applying DL to pan-sharpening is the lack of high-quality ground truth (GT) for supervised learning. GT data are crucial for effectively training models, yet in remote sensing, obtaining true high-resolution fused images is impractical due to limitations in data acquisition. To address this, researchers have constructed training datasets with pseudo-GT by downscaling the source images so that the original LRMS images can serve as references, allowing supervised training to proceed at reduced resolution. Although this approach enables models to learn pan-sharpening-related features, models trained with pseudo-GT are inherently constrained by the downscaled resolution of the solution space, which imposes an upper limit on their learning capacity [12], [13], [16]. Consequently, these models may struggle to generalize to high-resolution scenarios, leading to suboptimal performance when applied to real-world satellite images. To overcome these limitations, recent studies have leveraged the relationships between HRMS images and the source images, adopting unsupervised training to eliminate dependence on pseudo-GT. This shift toward unsupervised approaches removes the bottleneck of limited-resolution GT while enabling models to learn directly from available full-resolution data. Thanks to adversarial training between a generator and a discriminator, methods based on the generative adversarial network (GAN) [17], [18], [19], [20], [21] have unique advantages in this regard. However, the adversarial training mechanism is unstable, often producing oscillations during training that adversely impact pan-sharpening performance (see Fig. 1). This instability is a significant drawback, as achieving reliable, high-quality results requires careful and cumbersome tuning of the adversarial components. As a result, many researchers have turned to convolutional neural network (CNN)-based methods, which overcome some of the limitations of GANs by modeling the degradation process or employing frequency separation techniques to enhance the model's perception of the spatial details in PAN images and the spectral information in LRMS images [12], [16], [22], [23]. While these CNN-based methods have demonstrated promising results, they still face several challenges.
Challenge: The primary challenge with existing CNN-based methods is their inability to effectively integrate spatial and spectral features from PAN and MS images. Conventional CNN architectures typically treat these features independently, which limits their capacity for effective cross-modal feature communication. As a result, the fusion process often fails to fully capture the intricate relationships between the spatial and spectral domains, yielding pan-sharpened images with either suboptimal spatial detail or compromised spectral fidelity. In addition, CNN-based methods often rely on complex loss functions containing multiple components to improve fusion quality. Although these loss terms are designed to enhance the fused images, they make hyperparameter tuning difficult and training unstable. This complexity makes it hard to strike a balance between maintaining spatial details and spectral fidelity, thus limiting the practical applicability of these methods.
Solutions and Contributions: This study proposes a fusion-decomposition pan-sharpening model based on interaction learning of representation graphs, addressing the challenges of pan-sharpening in the absence of GT through the following strategies. 1) Recognizing the ability of the graph neural network (GNN) to model complex structured data and capture long-range dependencies, we leverage this strength to enhance cross-modal feature interactions, which are crucial for effectively integrating spatial details and spectral information in unsupervised pan-sharpening tasks. To this end, we introduce a GNN-based pan-sharpening network that utilizes the GNN to efficiently capture and transmit information within graph structures, guiding the model to generate high-quality fusion results. Specifically, we design a representational graph interaction module (RGIM) and a graph interaction fusion module (GIFM). RGIM constructs a representation graph structure using multilayer inputs from the encoder, promoting cross-modal semantic learning through interactions between graph nodes. Owing to the powerful capabilities of GNNs, this module effectively integrates multilayer, multimodal feature information. GIFM then uses the cross-modal semantic representations learned by RGIM to guide the feature reconstruction process, achieving semantic aggregation of multispectral and PAN data, thereby retaining more critical feature details in the fused pan-sharpened images. 2) To ensure the fused image maintains consistency with the spatial and spectral characteristics of the source images, we model not only the forward pan-sharpening process but also introduce a decomposition network during training, which decomposes the fused output back into its original components. This bidirectional process provides additional supervision without increasing model complexity during inference, significantly enhancing training stability and consistency. Two dedicated modules ensure the effective implementation of the decomposition process: the spatial structure perception module (SSPM) captures and preserves spatial structural features in the fused image, while the spectral feature extraction module (SFEM) extracts and preserves spectral features, ensuring precise reconstruction and consistency with the source images.
We further evaluate the proposed method on WorldView-2, WorldView-3, IKONOS, and Gaofen-2 datasets. Benefiting from cross-modal graph interactions and the decomposition consistency constraint, our method achieves superior performance compared to state-of-the-art (SOTA) methods. Our main contributions are summarized as follows.
We propose a novel unsupervised pan-sharpening framework that not only considers the compression process from the source images to the fused results but also strives to decompose the fused image back into approximations of the original source images. This method leverages the consistency of decomposition in a data-driven manner to guide the optimization of the pan-sharpening network, thereby enhancing the preservation of spatial details.
We design a GNN-based fusion network, incorporating an efficient representation graph interaction module (RGIM) and a graph interaction fusion module (GIFM), which facilitate effective cross-modal semantic learning and feature integration.
We introduce a decomposition network into the fusion process, featuring the specifically designed SSPM and SFEM, which drive the pan-sharpening network to retain more spatial and spectral details without increasing the model's complexity.
Related Work
This section introduces existing DL-based pan-sharpening works, as well as GNN-based research works, which are most relevant to the proposed method.
A. Supervised Methods
The impressive representational capabilities of CNNs have made them a focal point in the field of remote sensing, leading to successful applications in this domain [12], [13], [16], [23], [24]. Consequently, constructing supervised networks for pan-sharpening has become a research hotspot. Masi et al. [25] were the pioneers in introducing CNNs into pan-sharpening, achieving results that outperformed traditional methods. This remarkable performance has driven researchers to explore DL-based methods to tackle the challenges of pan-sharpening. Yang et al. [26] proposed PanNet, which enhances fusion results by extracting high-pass domain features to provide rich spatial information; the method also employs residual connections to inject detailed information, thereby preserving complete spectral information. Wang et al. [27] employed a stepwise refinement approach, performing two consecutive 2× upsampling operations on pan-sharpened frames, which allows the model to better capture spectral information and spatial structures. Zhang et al. [13] introduced a progressive pan-sharpening architecture based on deep spectral transformation, balancing performance across different resolutions. Similarly, Li et al. [28] proposed a multilevel progressive enhancement pan-sharpening network to alleviate the spatial detail distortion and spectral information loss caused by upsampling. Liu et al. [24] proposed a strict tensor-based pan-sharpening model, guided by a dual nonconvex tensor low-rank prior, to fuse source images effectively. Tan et al. [29] proposed a hierarchical frequency integration network, which uses local Fourier information to achieve hierarchical space-frequency information integration. Wang et al. [30] introduced a haze-line prior for the joint haze correction of LRMS and PAN images and designed a low-rank tensor completion pan-sharpening method based on haze correction. Although these methods have achieved decent results, supervised learning that relies on pseudo-GT often confines their solution domain, significantly reducing their performance on full-resolution data.
B. Unsupervised Pan-Sharpening
Unsupervised pan-sharpening methods, by operating independently of GT, effectively mitigate the limitations inherent in supervised methods. GANs, as generative models that learn to generate data through adversarial training with a discriminator, represent a prominent unsupervised approach. Ma et al. [17] first proposed using a GAN for remote sensing image pan-sharpening, achieving the retention of spectral information and spatial structure by establishing adversarial relationships between the fusion results and source images. Shang et al. [31] combined transformer networks with multiscale interactions to propose MFTGAN, which learns long-range spectral and spatial correlations. Zhou et al. [32] designed UCGAN, which trains on full-size images for pan-sharpening, ensuring that the fusion results contain rich spatial and spectral information. Similarly, He et al. [33] introduced a dual-cycle consistency GAN that enables deep integration of spatial and spectral features through cross-domain information interaction. Despite these successes, the instability of adversarial training often makes it challenging to balance the preservation of spectral and spatial information. In contrast to GAN-based methods, CNN-based approaches typically model the relationship between spectral and spatial details to guide the network in generating HRMS images. For instance, Xiong et al. [34] used low-pass filters to ensure spectral and spatial consistency between the fusion results and source images. Li et al. [15] enhanced spatial structure retention by maximizing local texture similarity between the fusion results and PAN images. Ciotola et al. [35] assumed known degradation operators for LRMS images to ensure spectral fidelity, while Ni et al. [12] optimized the pan-sharpening network by learning the degradation process. Similarly, Lin et al. [16] treated pan-sharpening as a blind image deblurring task, learning the blur kernel to ensure spectral consistency between the fusion results and LRMS images. However, the relationship between spectral and spatial details is often unknown and difficult to model, typically requiring the design of complex constraints. In addition, many models do not adequately consider the internal design of message passing, which can prevent the fusion results from reflecting critical details of the source images.
In contrast, our method leverages the interaction of cross-modal graph features to learn semantic representations, which more effectively guide the reconstruction of fusion results. Furthermore, leveraging the decomposition consistency, our method constructs a decomposition network to reconstruct the source images from the fusion results. This data-driven approach implicitly ensures the preservation of spectral information and spatial structure, while also reducing the complexity of the model design.
C. Graph Neural Network
The ability of GNNs to effectively capture and transmit information within graph structures has led to their increasing popularity in computer vision. In particular, by treating images as structured graphs, GNNs can better understand and represent the complex relationships and dependencies within images, excelling in tasks such as object detection [36], [37], image segmentation [38], [39], and multimodal image fusion [40]. For instance, Xie et al. [38] introduced a GNN-based data scale-aware network for few-shot semantic segmentation. Similarly, Knyazev et al. [41] utilized image superpixels as graph nodes to form multiple graphs for image classification. Zhao et al. [36] treated superpixels as graphs to investigate the internal structure of input images. Lu et al. [39] proposed a network that addresses semantic segmentation as a graph node classification task. Li et al. [40] constructed graphs using multilevel features of source images to guide multimodal image fusion. Qiao et al. [37] incorporated geometric and semantic information to segment objects and constructed initial graphs to predict the saliency ranking of salient objects in a context-aware manner. Pan et al. [42] developed a cascade hierarchical graph convolutional network to comprehensively learn the feature correlations between manipulated and nonmanipulated regions at different scales.
In summary, unlike CNN-based methods that are constrained by local interactions on fixed grids, GNNs represent spatial and spectral features as graph nodes, enabling effective message passing that captures both global dependencies and local contextual relationships. This makes GNNs particularly advantageous for cross-modal feature fusion in pan-sharpening, where integrating the high spatial resolution of PAN images with the spectral richness of MS images requires a nuanced understanding of nonuniform and heterogeneous relationships. Leveraging this concept, the RGIM and GIFM are designed to enhance cross-modal feature fusion through graph-based interactions, effectively capturing complex global and local contextual relationships. As a result, our method provides a deeper and more coherent fusion outcome compared to CNN-based methods, offering a superior exploration of the pan-sharpening task.
Method
This method aims to address two major issues in current unsupervised pan-sharpening methods: the feature loss caused by insufficient design for cross-modal communication within the network and the instability that complex multiterm loss functions impose on parameter optimization. To this end, we propose a fusion-decomposition pan-sharpening model based on interactive learning of representation graphs, as illustrated in Fig. 2, which consists primarily of a fusion network and a decomposition network. The fusion network, functioning as an end-to-end model, serves as the target network for image fusion, integrating the source images into a single image. Once training is complete, the fusion network generates the fused image directly from the inputs without involving the decomposition network. Conversely, the decomposition network operates as an auxiliary network, decomposing the fused result to produce images consistent with the source images. This data-driven approach ensures that the fusion network better captures the spatial and spectral details of the source images, ultimately producing high-quality fused images.
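Before detailing each component, the following minimal PyTorch sketch illustrates one bidirectional training step. `FusionNet` and `DecompositionNet` here are toy placeholders for the architectures of Figs. 3 and 5, the learning rates are illustrative only, and the loss corresponds to the decomposition consistency term defined later in (4).

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Toy placeholder for the fusion network (Fig. 3): PAN + LRMS -> HRMS."""
    def __init__(self, bands=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(bands + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, bands, 3, padding=1))

    def forward(self, pan, lrms_up):
        return self.body(torch.cat([pan, lrms_up], dim=1))

class DecompositionNet(nn.Module):
    """Toy placeholder for the decomposition network (Fig. 5):
    HRMS -> (reconstructed PAN, reconstructed upsampled LRMS)."""
    def __init__(self, bands=4):
        super().__init__()
        self.to_pan = nn.Conv2d(bands, 1, 3, padding=1)
        self.to_ms = nn.Conv2d(bands, bands, 3, padding=1)

    def forward(self, hrms):
        return self.to_pan(hrms), self.to_ms(hrms)

fusion_net, decomp_net = FusionNet(), DecompositionNet()
# Two separate Adam optimizers, as in the implementation details
# (the learning rates here are illustrative, not the paper's values).
opt_f = torch.optim.Adam(fusion_net.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(decomp_net.parameters(), lr=1e-4)

pan = torch.rand(2, 1, 256, 256)       # toy PAN batch
lrms_up = torch.rand(2, 4, 256, 256)   # toy LRMS batch, upsampled x4

hrms = fusion_net(pan, lrms_up)                  # forward fusion
pan_hat, lrms_hat = decomp_net(hrms)             # reverse decomposition
l_dc = (pan_hat - pan).abs().mean() + \
       (lrms_hat - lrms_up).abs().mean()         # Eq. (4)
opt_f.zero_grad(); opt_d.zero_grad()
l_dc.backward()
opt_f.step(); opt_d.step()
```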
Overall framework of the proposed method. PAN and LRMS images are input into the fusion network to generate the fused image, which the decomposition network then decomposes back into approximations of the source images during training.
A. Fusion Network
Fig. 3 illustrates the framework of the fusion network. Given the PAN image and the upsampled LRMS image, heterogeneous encoder branches extract multiscale representation features from each modality; these features serve as the inputs to the RGIM for cross-modal graph interaction, after which the GIFM uses the learned semantics to guide the reconstruction of the fused image.
Illustration of the fusion network. Specifically, we input the multiscale representation features extracted from the source images into the RGIM to construct the representation graph; the GIFM then uses the learned cross-modal semantics to guide feature reconstruction.
RGIM and GIFM: To enhance the model's sensitivity to spatial structure and spectral information, we designed a representational graph structure for cross-modal feature interaction. Compared to the traditional pixel-level feature graph, this representational graph more efficiently aggregates and transmits semantic information, thereby enabling the model to better capture the complex relationships and dependencies within the source images. The representation graph interaction module (RGIM), as illustrated in Fig. 4, aims to enhance the model's ability to capture spatial and spectral information by constructing a graph structure where multilevel representation features from heterogeneous encoders serve as nodes. The construction of this graph structure involves both node formation and edge definition, which are detailed as follows.
Architecture of RGIM and GIFM. (a) The structure of RGIM. (b) The structure of GIFM.
1) Node Construction:
To build the graph structure, multilevel representational features are first extracted from the heterogeneous encoders and serve as the initial nodes. Before constructing the graph, these initial feature representations are refined using the dynamic node refinement module. This refinement process is designed to enhance the discriminative power of the features, ensuring that the subsequent graph construction is based on optimized and information-rich representations. Furthermore, to ensure that the nodes drive the graph structure toward an optimal balance between preserving spatial details and spectral features, we designate the second-level multispectral feature $\mathcal {A}_{M}^{2}$ as a guiding node whose response gates each initial node, yielding the refined nodes
\begin{equation*} \hat{\mathcal {A}}_{k}^{i}= \text{Sig}(\text{Conv}(\mathcal {A}_{M}^{2}))\cdot \mathcal {A}_{k}^{i} + \mathcal {A}_{k}^{i} \tag{1} \end{equation*}
where $\text{Sig}(\cdot)$ denotes the sigmoid function and $\text{Conv}(\cdot)$ a convolutional layer.
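Read literally, (1) amounts to a sigmoid gate computed from $\mathcal {A}_{M}^{2}$ applied residually to every node. Below is a minimal PyTorch sketch, assuming all node features share a common shape (the excerpt does not fix tensor sizes):

```python
import torch
import torch.nn as nn

class NodeRefinement(nn.Module):
    """Sketch of Eq. (1): each initial node A_k^i is gated by a sigmoid
    attention map computed from the guiding node A_M^2, with a residual
    connection preserving the original feature."""
    def __init__(self, ch=32):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, node, guide):
        gate = torch.sigmoid(self.conv(guide))  # Sig(Conv(A_M^2))
        return gate * node + node               # Eq. (1)

refine = NodeRefinement(ch=32)
a_m2 = torch.rand(1, 32, 64, 64)   # guiding node A_M^2
a_p1 = torch.rand(1, 32, 64, 64)   # an initial node, e.g., A_P^1
a_p1_hat = refine(a_p1, a_m2)
```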
2) Edge Definition:
In RGIM, we construct implicit graph edges through dynamic interactions between nodes, learning richer semantic relationships by chaining nodes of different modalities. The edge construction mechanism is described as follows.
Node Feature Concatenation and Message Construction: During each update iteration, the feature vector of the current node is concatenated with feature vectors from other nodes, forming an implicit "message" that represents the interaction pathways between nodes. Let $\mathcal {G} = [ \hat{\mathcal {A}}_{P}^{1}, \hat{\mathcal {A}}_{P}^{2}, \hat{\mathcal {A}}_{M}^{1}, \hat{\mathcal {A}}_{M}^{2}]$ denote the set of node feature vectors. This concatenation effectively simulates the graph edges, enabling efficient feature propagation and establishing dependencies among nodes.

Dynamic Edge Updates via Convolutional Gated Recurrent Units (ConvGRU): Each node's features are updated iteratively with a ConvGRU, which dynamically constructs edges through recursive updates based on the aggregated messages of neighboring nodes. The recursive update for the node features can be expressed as
\begin{equation*} \mathcal {G}^{(z+1)} = \psi \cdot \mathcal {G}^{(z)} \tag{2} \end{equation*}
where $\mathcal {G}^{(z)}$ denotes the feature matrix at iteration $z$, and $\psi$ represents the implicit adjacency relation established by the ConvGRU. This recursive updating mechanism dynamically adjusts the edges, allowing the model to capture complex feature dependencies adaptively and fuse spatial and spectral information effectively.

Learnable Edge Weight Control: The strength of edges is modulated by a learnable parameter $\gamma$, which adjusts the influence of neighboring nodes during each update. The weighted combination of features for node $i$ can be expressed as
\begin{equation*} \text{pred}_{i} = h_{t} \cdot \gamma + \text{base}_{i} \tag{3} \end{equation*}
where $h_{t}$ represents the updated feature for the current node and $\text{base}_{i}$ is the original feature before interaction. This learnable weight mechanism enables selective integration of crucial information, enhancing feature dependency representation by dynamically regulating information flow across feature levels.
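To make this mechanism concrete, the following PyTorch sketch performs one interaction step under our reading of (2) and (3); the ConvGRU cell design, channel width, and message aggregation are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal ConvGRU cell, standing in for the implicit adjacency
    operator psi of Eq. (2); gates are computed with 3x3 convolutions."""
    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)  # update + reset gates
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)       # candidate state

    def forward(self, x, h):
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], 1))), 2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

class GraphNodeUpdate(nn.Module):
    """One RGIM interaction step: the concatenated 'message' from the other
    nodes drives a ConvGRU update of the current node (Eq. (2)), and a
    learnable gamma blends the update with the original feature (Eq. (3))."""
    def __init__(self, ch, num_neighbors=3):
        super().__init__()
        self.reduce = nn.Conv2d(num_neighbors * ch, ch, 1)  # aggregate the message
        self.gru = ConvGRUCell(ch)
        self.gamma = nn.Parameter(torch.zeros(1))           # learnable edge weight

    def forward(self, node, neighbors):
        msg = self.reduce(torch.cat(neighbors, dim=1))      # implicit graph edges
        h_t = self.gru(msg, node)                           # dynamic edge update
        return h_t * self.gamma + node                      # pred_i = h_t*gamma + base_i

# Toy usage with the four nodes of G; summing the updated nodes afterwards
# yields the global representation described next.
update = GraphNodeUpdate(ch=32)
nodes = [torch.rand(1, 32, 64, 64) for _ in range(4)]
updated = [update(n, [m for m in nodes if m is not n]) for n in nodes]
global_repr = torch.stack(updated).sum(dim=0)
```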
Once the nodes in the GNN have completed the message passing process, they are aggregated through a summation operation to produce the global representation, which the GIFM then uses to guide the reconstruction of the fused image.
B. Decomposition Network
The overall architecture of the decomposition network, as illustrated in Fig. 5, is designed to reconstruct the PAN image and the upsampled LRMS image from the fused result, thereby providing the supervision signal for the decomposition consistency constraint.
In the PAN reconstruction branch, the feature flow passes through the SSPM, which extracts spatial structure features from the fused image to recover the PAN image; symmetrically, the LRMS reconstruction branch passes through the SFEM, which extracts spectral features to recover the upsampled LRMS image.
By incorporating learnable parameters, the decomposition network adaptively balances the spatial and spectral reconstruction branches during training, keeping the decomposed outputs consistent with the source images.
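The excerpt does not spell out the internal layers of SSPM and SFEM, so the following is only a plausible stand-in (slightly richer than the toy placeholder sketched earlier): a residual spatial block for the PAN branch, channel attention for the MS branch, and learnable scalars echoing the learnable parameters mentioned above.

```python
import torch
import torch.nn as nn

class SSPM(nn.Module):
    """Hypothetical spatial structure perception module: a residual
    convolutional block intended to emphasize spatial detail."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class SFEM(nn.Module):
    """Hypothetical spectral feature extraction module: channel attention
    emphasizing inter-band (spectral) relationships."""
    def __init__(self, ch):
        super().__init__()
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.att(x)

class DecompositionSketch(nn.Module):
    """PAN branch via SSPM, MS branch via SFEM; learnable scalars alpha
    and beta weight the two branches (our reading of the learnable
    parameters mentioned in the text)."""
    def __init__(self, bands=4, ch=32):
        super().__init__()
        self.stem = nn.Conv2d(bands, ch, 3, padding=1)
        self.sspm, self.sfem = SSPM(ch), SFEM(ch)
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.to_pan = nn.Conv2d(ch, 1, 3, padding=1)
        self.to_ms = nn.Conv2d(ch, bands, 3, padding=1)

    def forward(self, hrms):
        f = self.stem(hrms)
        pan_hat = self.to_pan(self.alpha * self.sspm(f))
        lrms_hat = self.to_ms(self.beta * self.sfem(f))
        return pan_hat, lrms_hat

net = DecompositionSketch()
pan_hat, lrms_hat = net(torch.rand(1, 4, 128, 128))
```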
C. Loss Function
In our method, we consider not only the forward fusion process but also the reverse process of decomposing the fusion results back into the source images. Therefore, we can drive the fusion network to generate high-quality fusion results by constraining the decomposition effectiveness of the decomposition network, which greatly simplifies the design of the loss function. Consequently, the fusion network and the decomposition network complete parameter learning under the constraint of the decomposition consistency loss $L_{\text{dc}}$, defined as
\begin{equation*} L_{\text{dc}}=\Vert \widetilde{\text{PAN}}-\text{PAN}\Vert _{1}+\left\Vert \widetilde{\text{LRMS}}_{\uparrow 4}-\text{LRMS}_{\uparrow 4}\right\Vert _{1} \tag{4} \end{equation*}
where $\widetilde{\text{PAN}}$ and $\widetilde{\text{LRMS}}_{\uparrow 4}$ denote the outputs of the decomposition network, and $\text{LRMS}_{\uparrow 4}$ the 4× upsampled LRMS image.
In practice, there is an inherent challenge: the optimization process of network models, often treated as a closed system, relies solely on the objective function as a constraint, rendering their internal optimization uncontrollable. Therefore, in our two-stage parameter optimization approach, relying solely on $L_{\text{dc}}$ cannot guarantee that the decomposition network itself reconstructs faithfully; we additionally constrain it with a reconstruction loss $L_{\text{re}}$, defined as
\begin{equation*} L_{\text{re}}=\Vert O-I\Vert _{1} + \lambda (1-\text{SSIM}(O,I)) \tag{5} \end{equation*}
where $O$ denotes a reconstructed output, $I$ its corresponding target, $\text{SSIM}(\cdot,\cdot)$ the structural similarity index, and $\lambda$ a balancing weight.
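Both losses translate directly into PyTorch. The sketch below assumes the third-party `pytorch-msssim` package for the SSIM term and an illustrative weight `lam`; the paper's actual value of $\lambda$ is not given in this excerpt.

```python
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed SSIM implementation (pip install pytorch-msssim)

def decomposition_consistency_loss(pan_hat, pan, lrms_hat, lrms_up):
    """L_dc of Eq. (4): L1 distances between the decomposed outputs
    and the PAN / 4x-upsampled LRMS sources."""
    return F.l1_loss(pan_hat, pan) + F.l1_loss(lrms_hat, lrms_up)

def reconstruction_loss(o, i, lam=0.5):
    """L_re of Eq. (5): L1 term plus an SSIM term weighted by lam
    (illustrative value; images assumed normalized to [0, 1])."""
    return F.l1_loss(o, i) + lam * (1.0 - ssim(o, i, data_range=1.0))
```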
Experiments and Results
A. Experimental Settings
Benchmark datasets: We use four satellite datasets, the 4-band IKONOS and GaoFen-2 and the 8-band WorldView-2 and WorldView-3, to evaluate the effectiveness of our method. The experiments are conducted on both reduced-resolution and full-resolution data. The reduced-resolution data are synthesized according to Wald's protocol, where the LRMS and PAN images are downsampled by a factor of four and the original LRMS images serve as the GT. The basic information of these four datasets is summarized in Table I; 10% of the training data is used for validation.
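A minimal sketch of the reduced-resolution synthesis under Wald's protocol, assuming bicubic resampling (the paper does not state its exact downsampling filter; in practice a sensor-matched low-pass filter is often applied before decimation):

```python
import torch
import torch.nn.functional as F

def walds_protocol(pan, lrms, scale=4):
    """Both sources are downsampled by `scale`; the original LRMS
    then serves as the GT for the reduced-resolution pair."""
    down = lambda x: F.interpolate(x, scale_factor=1 / scale,
                                   mode="bicubic", align_corners=False)
    return down(pan), down(lrms), lrms  # reduced PAN, reduced LRMS, GT

pan = torch.rand(1, 1, 1024, 1024)   # toy PAN (4x the MS resolution)
lrms = torch.rand(1, 4, 256, 256)    # toy LRMS
pan_lr, lrms_lr, gt = walds_protocol(pan, lrms)
```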
Implementation details: The experiments were conducted on a computer running Windows 11, with the following hardware configuration: Intel Core i9-12900 CPU and NVIDIA GeForce RTX4090 GPU. We implemented the proposed method using PyTorch and employed two Adam optimizers to separately train the fusion network and the decomposition network. The optimizer for the fusion network starts with an initial learning rate of
Baseline: We compared the proposed method with eight SOTA pan-sharpening methods: three traditional methods, Brovey [46], IHS [13], and GS [47]; four unsupervised methods, CSFNet [15], LDP [12], PANGAN [17], and ZeroSharpen [48]; and a semisupervised method, ZS-PAN [49]. Implementations of all these methods were provided by the corresponding authors, and for fairness, we retrained them on our datasets.
To evaluate our method, we utilized three popular quality assessment metrics on full-resolution data: the spectral distortion index ($D_{\lambda}$), the spatial distortion index ($D_{s}$), and the quality with no reference (QNR) index; on reduced-resolution data, reference-based metrics are computed against the GT.
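For reference, these no-reference indices are conventionally combined into the QNR score as follows (the standard definition from the literature, with $\alpha =\beta =1$ by default; not restated in this excerpt):
\begin{equation*} \text{QNR} = (1 - D_{\lambda })^{\alpha }\,(1 - D_{s})^{\beta } \end{equation*}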
B. Results on 4-Band Dataset
We conducted qualitative and quantitative comparisons of the proposed method on IKONOS and GaoFen-2 datasets to assess its performance on 4-band data.
Qualitative experiments: Figs. 6 and 7 display the qualitative results of various methods on full-resolution and reduced-resolution data, respectively. Intuitively, compared to the eight other SOTA methods, our method better balances spectral information and spatial features, providing superior visual perception. Traditional methods guided by prior knowledge achieve some success on the IKONOS data. However, the fixed nature of their prior information limits their adaptability across different satellite datasets, resulting in ineffective constraints for spatial and spectral information fusion. Consequently, these methods exhibit spectral distortions on the GaoFen-2 dataset, particularly evident in the reduced-resolution data; the error maps against the GT illustrate these issues. Unsupervised DL-based methods achieve decent performance by exploring content correlations between the source images. Although PANGAN captures rich spatial details, it is frequently affected by spatial artifacts and spectral distortions due to the instability of GAN training. CSFNet, on the other hand, leverages spectral correlation mapping between LRMS and PAN image features to guide the generation of HRMS images, effectively preserving spectral information but lacking sufficient spatial structure awareness. ZeroSharpen utilizes variational models to guide the preservation of spectral and spatial information, which mitigates data dependency; however, this adaptability comes at the cost of spatial detail retention. In contrast to these methods, ours leverages multilevel features to construct a representation graph structure, effectively guiding the feature reconstruction process through cross-modal communication and thereby better preserving both spectral information and spatial structure. Although LDP also preserves source features effectively through degradation estimation, it falls slightly short of our approach due to insufficient interaction between source features. In addition, the bidirectional fusion-decomposition process enables the proposed method to more effectively perceive the spectral and spatial information of the source images, leading to superior performance compared to the semisupervised ZS-PAN. Despite the superior performance of our method, the lack of supervision from GT constraints may still introduce some distortion.
Visualization of the fusion results of different methods on IKONOS and GaoFen-2 full-resolution data. The areas highlighted and enlarged within the red frame, as well as those indicated by the white arrow, clearly demonstrate the superiority of the proposed method.
Visualization of the fusion results of different methods on IKONOS and GaoFen-2 reduced-resolution data. The areas highlighted and enlarged within the red frame clearly illustrate the superiority of the proposed method.
Quantitative experiments: Tables II and III present the quantitative results of different methods on the IKONOS and GaoFen-2 datasets, including both full-resolution and reduced-resolution results. Despite variations in individual scores across the two datasets, our method consistently ranks first overall, demonstrating its robustness and effectiveness. Specifically, on the IKONOS dataset, our model excels on the reduced-resolution data, whereas on the GaoFen-2 dataset, it achieves superior results at full resolution. We attribute these differences to the varying properties of the satellite images. Nonetheless, the consistent top ranking across datasets underscores the robustness and reliability of the proposed method. Its success can be attributed to the cross-modal feature interactions facilitated by the representation graph structures, as well as the benefit of the decomposition consistency constraint in the pan-sharpening task. As a result, our method is more effective in perceiving and extracting spectral information and spatial details from the source images.
C. Results on 8-Band Dataset
Qualitative experiments: We also conducted qualitative experiments on the 8-band data, with results for the full-resolution and reduced-resolution datasets shown in Figs. 8 and 9, respectively. A situation similar to the 4-band case is observed here, and the increased amount of source feature information challenges the robustness of these methods. Brovey, GS, and IHS, constrained by prior knowledge, exhibit notable deficiencies in capturing spatial details across the 8-band data. In addition, Brovey and IHS exhibit greater fusion bias in the reduced-resolution data, as illustrated by the error maps in Fig. 9. Moreover, the substantial information contained in the 8-band data poses challenges for PANGAN and LDP, both of which show noticeable spectral distortions in the reduced-resolution data. A similar issue is observed in ZS-PAN, where a decrease in performance is evident in the full-resolution data. In contrast, ZeroSharpen, benefiting from variational priors, achieves superior spectral preservation, albeit with some loss of spatial details. Compared to these methods, our method effectively balances forward information fusion and backward decomposition, demonstrating superior qualitative performance that further validates the effectiveness and robustness of the proposed method.
Visualization of the fusion results of different methods on WorldView-2 and WorldView-3 full-resolution data. The areas highlighted and enlarged within the red frame, as well as those indicated by the white arrow, clearly demonstrate the superiority of the proposed method.
Visualization of the fusion results of different methods on WorldView-2 and WorldView-3 reduced-resolution data. The areas highlighted and enlarged within the red frame clearly illustrate the superiority of the proposed method.
Quantitative experiments: Tables IV and V present the quantitative results of different methods on the WorldView-2 and WorldView-3 datasets. It can be observed that our method exhibits some degradation as the data complexity increases, particularly on the WorldView-3 dataset, where overall performance falls below that of ZeroSharpen. We attribute this primarily to our method's relatively lower chromatic fidelity compared to ZeroSharpen, as the variational prior gives ZeroSharpen an advantage in preserving spectral information. However, this also results in poorer scores for ZeroSharpen on the spatial-detail-oriented metrics, whereas our method maintains a more balanced trade-off between spectral fidelity and spatial quality.
D. Fusion Performance Analysis
1) Ablation Study
In our method, we designed a fusion network and a decomposition network, aiming to leverage their synergistic optimization to ensure that the fused images effectively retain the spatial structure and spectral information of the source images. To further investigate the rationality and effectiveness of this design, we conducted ablation experiments on the internal design of the fusion and decomposition networks using the IKONOS dataset. In the fusion network, we removed the RGIM (w/o RGIM), the GIFM (w/o GIFM), and both together (w/o GIFM&RGIM); in the decomposition network, we removed the SSPM and SFEM (w/o SSPM&SFEM).
Figs. 10 and 11 show the qualitative results of the ablation experiments on full- and reduced-resolution data. Intuitively, removing either GIFM&RGIM or SSPM&SFEM resulted in severe distortion in the fused images. Combined with the quantitative results in Table VI, it can be seen that these design combinations are effective. Although removing GIFM&RGIM achieved the highest score on a single metric, its overall performance is clearly inferior, confirming the necessity of the graph interaction design.
Visualization of ablation experiments on IKONOS full-resolution data. The regions highlighted and enlarged within the red and blue frames demonstrate the superiority of the proposed method. (a) PAN. (b) MS. (c) Ours. (d) w/o RGIM. (e) w/o GIFM. (f) w/o GIFM&RGIM. (g) w/o SSPM&SFEM.
Visualization of ablation experiments on IKONOS reduced-resolution data. The highlighted and enlarged areas within the red frame demonstrate the superiority of the proposed method. In addition, the orange frame, which is error map of the red frame regions, further illustrates the differences across various modules. (a) PAN. (b) MS. (c) GT. (d) Ours. (e) w/o RGIM. (f) w/o GIFM. (g) w/o GIFM&RGIM. (h) w/o SSPM&SFEM.
The ablation studies validate the effectiveness of the proposed method. Whether applied to full-resolution or reduced-resolution data, the collaborative operation of each module enhances the model's fusion performance. In addition, Fig. 12 presents the training and validation loss curves on the IKONOS dataset, indirectly demonstrating the stability of the proposed method.
Visualization of training and validation losses on IKONOS demonstrates the learning effect and generalization ability of the proposed method.
2) Number of SSPM and SFEM
SSPM and SFEM drive the fusion network, in a data-driven manner within the decomposition network, to better perceive the spectral information and spatial structure of the source images. We therefore investigated the impact of the number of SSPM and SFEM modules on fusion performance to optimize their quantity in the decomposition network. Table VII presents the quantitative results for different quantities of SSPM and SFEM. The model performs best on full-resolution data when the quantity is 3, whereas it achieves the highest score on reduced-resolution data when the quantity is 1. After weighing the performance on both full-resolution and reduced-resolution data, we set their quantity to 1.
E. Efficiency Study
Table VIII presents the average runtime of various methods across different datasets, providing an evaluation of the efficiency of our proposed approach. By employing representational graphs to facilitate cross-modal feature interactions and using these graphs to guide image reconstruction, our method aggregates source image features at relatively high speed. Although it remains slower than traditional methods, both qualitative and quantitative results demonstrate that our approach achieves robust and generalized fusion performance at a comparatively low computational cost.
Conclusion
This study addresses the limitations of current pan-sharpening methods by proposing a fusion-decomposition pan-sharpening model based on interactive learning of representation graphs. The model accounts for both the compression process from source images to fused results and the decomposition process from fused results back to source images. Within the fusion network, we designed an RGIM and a GIFM. The RGIM utilizes multilayer inputs from the encoder to construct a representational feature structure, facilitating cross-modal semantic learning. The GIFM leverages these cross-modal semantic representations to guide the feature reconstruction process, promoting the semantic aggregation of multispectral and PAN data, and thereby preserving important feature expressions in the pan-sharpened images. In the decomposition network, we incorporated an SSPM and an SFEM to reconstruct multispectral and PAN data from the pan-sharpened images. This data-driven approach enables the fusion network to better learn spatial and spectral consistency across different resolutions. Notably, the decomposition network serves as an auxiliary component and does not increase the model's complexity, thereby simplifying the overall training design. Our extensive experiments demonstrate the superiority of the proposed method over other SOTA methods.