Introduction
Remote sensing image change captioning (RSICC) has become an active research topic [1], [2], [3], [4]. This emerging area seeks to provide meaningful descriptions of alterations within scenes, offering a valuable tool for understanding changes in land cover over time. The dynamic, multitemporal nature of remote sensing data presents challenges and opportunities for change captioning that differ from the traditional change detection task, as shown in Fig. 1: change captioning outputs text descriptions, whereas change detection outputs a map of the changed regions. The growing availability of multitemporal remote sensing data has sparked increasing interest in using change captioning to study changes in land cover [5], [6], [7]. In recent years, a series of RSICC methods have been proposed, and they can be grouped into three types: the first relies primarily on traditional machine learning algorithms, such as support vector machines (SVMs), to generate descriptions; the second employs convolutional neural networks (CNNs) or recurrent neural networks (RNNs) to generate change descriptions; and the third comprises transformer-based methods built on attention mechanisms.
Fig. 1. Comparison between change detection and change captioning for remote sensing images. The former (top of the figure) represents the detected change areas in image form, while the latter (bottom of the figure) expresses changes in remote sensing images through human-readable language.
The first two categories of methods [8], [9], based on SVMs and RNNs, perform poorly in terms of captioning accuracy and precision, rendering them of limited practical value. Conversely, the third category [10], [11], based on attention mechanisms, has significantly improved the accuracy of change descriptions, demonstrating suitability for practical applications. Nonetheless, owing to the high complexity and parameter count of attention mechanisms, particularly in transformer architectures, deployment in industrial settings with limited computational resources remains challenging. Against this backdrop, there is an urgent need for a lightweight algorithm that incorporates attention mechanisms while ensuring both high accuracy and practical deployability. To address these issues, this article proposes a sparse focus transformer (SFT) network for remote sensing change captioning. As illustrated in Fig. 2, the proposed SFT not only ensures high-precision outputs, characterized by accurate descriptive text, but also significantly reduces the parameter count compared with previous approaches.
Fig. 2. Illustration of algorithmic evaluation: computational efficiency (parameter count) and predictive accuracy.
Our method is primarily inspired by the work of [12], which explores sparse factorizations of the attention matrix, reducing computational complexity from $O(n^{2})$ to $O(n\sqrt{n})$. The main contributions of this article are summarized as follows.
1) We adapt the sparse factorization of the attention matrix from generating long sequential text to the task of remote sensing image change detection, aiming to establish a sparse attention mechanism for locating change regions.
2) We construct an SFT tailored to the task of change captioning in remote sensing images, significantly reducing redundancy in multimodal models and thereby achieving better modality representation and subsequent fusion.
3) Extensive validation on benchmark datasets shows that our method not only produces high-accuracy outputs but also has the lowest computational complexity and parameter count among current methods in this field.
The subsequent sections of the article are structured as follows. Section II offers an overview of the related work. The proposed method is expounded upon in Section III. Section IV delves into the experiments, providing detailed descriptions and discussing the results obtained. Finally, a concise conclusion is drawn in Section V.
Related Work
A. Efficient Attention Architecture
The attention mechanism [13], along with the transformer architecture [14] it underpins, has emerged as a pivotal advancement in deep learning in recent years. Operating within the encoder–decoder paradigm, it has found widespread application across various tasks. In the following, we review work dedicated to improving the efficiency of attention mechanisms and vision transformers (ViTs).
Enhancing Locality: The sparse transformer [12] decomposes the computation of full attention into several faster attention operations that, when combined, approximate dense attention. Longformer [15] introduces a self-attention whose time and memory complexity grow linearly with the sequence length, allowing the model to handle long documents efficiently. LeViT [16] and MobileViT [17] adopted hybrid architectures featuring stacked convolution layers, effectively reducing the number of features in the initial layers. Twins [18] alternated between local and global attention layers to enhance performance. RegionViT [19] introduced regional tokens and local tokens, enriching local context with global information. Huang et al. [20] and Ho et al. [21] proposed axial transformers, self-attention-based models for images and other data organized as high-dimensional tensors. CrossViT [22] processed small-patch and large-patch tokens separately, integrating them via multiple attention mechanisms. Pan et al. [23] introduced the HiLo attention method to separate high- and low-frequency patterns within an attention layer, dividing the heads into two groups, each equipped with operations tailored to local window focus.
Faster Attention: Swin [24] and CSWin [25] computed local attention within windows and introduced a shifted window partitioning method to enable cross-window connections. Shuffle Transformer [26] and MSG-Transformer [27] employed spatial shuffle operations as alternatives to shifted window partitioning, facilitating cross-window connections. FasterViT [28] introduced hierarchical attention, breaking global self-attention into multilevel attention components. FLatten Transformer [29] combined depth-wise convolution with linear attention to maintain diversity in output features across different positions.
In summary of this work on efficient attention, our sparse focus attention not only enhances local information in images but also accumulates long-distance context, while remaining simple and efficient in both implementation and speed.
B. Remote Sensing Image Change Captioning
Hoxha et al. [8] proposed early and late feature fusion strategies to integrate bitemporal visual features, employing an RNN and a multiclass SVM decoder to generate change captions. Chouaf et al. [9] pioneered the RSICC task by employing a CNN as a visual encoder to capture temporal scene changes, while adopting an RNN as a decoder to generate change descriptions. Liu et al. [10] recently introduced a transformer-based encoder–decoder framework for the RSICC task. Their approach involves utilizing a dual-branch transformer encoder to detect scene changes and proposing a multistage fusion module to merge multilayer features for change description generation. Furthermore, Liu et al. [30] enhanced their method by incorporating progressive difference perception transformer layers to capture high-level and low-level semantic change information. In addition, Liu et al. [31] proposed a prompt-based approach leveraging pretrained large language models (LLMs) for RSICC tasks. They employed visual features, change classes, and language representations as input prompts to a frozen LLM for the generation of change captions. Chang [11] proposed an attentive network for remote sensing change captioning, called Chg2Cap, which utilizes the power of transformer models in NLP.
Through an investigation of the aforementioned RSICC efforts, it was observed that prior endeavors have attempted various methods to address the task. However, they did not consider practical industrial applications; either the accuracy was insufficient or the complexity too high. Consequently, we propose a lightweight algorithm that not only achieves higher accuracy but also meets the requirements for real-world application scenarios.
Methodology
In this section, we will introduce our proposed lightweight transformer model for the RSICC task.
A. Overview
First, we provide an overview of our model. As shown in Fig. 3, the proposed model contains two phases, i.e., a training phase and a testing phase. In the following, we introduce the two phases separately.
Fig. 3. Overall framework of the proposed SFT method, comprising three components. (a) Feature extractor: a CNN-based, weight-shared feature extractor, implemented with ResNet101 in this study, that extracts coarse change features from bitemporal remote sensing images. (b) Sparse focus encoder: an encoder designed to finely capture and localize change features in remote sensing images, based on the proposed sparse focus attention mechanism. (c) Change caption generator: a decoder that generates the final change caption for remote sensing images from the change feature embeddings and word embeddings.
Training phase: The SFT, as illustrated in Fig. 3, consists primarily of three components: a CNN feature extractor, for which we employ ResNet101, responsible for extracting generic representations from the bitemporal images; an image encoder based on sparse focus attention, designed to extract features of the changed regions using the sparse attention mechanism proposed in this article; and a caption decoder that uses both image embeddings and word embeddings to generate the predicted change caption, capturing the inter-relationship between them. The training phase proceeds as follows: the bitemporal images are first passed through the weight-shared extractor to obtain coarse features, these features are refined by the sparse focus encoder, and the caption decoder then combines the resulting change embeddings with the word embeddings of the reference caption to predict the change description.
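As a rough sketch of how these three components are wired together during training, the following pseudo-implementation is illustrative only; the module interfaces, the ResNet101 truncation point, and the teacher-forcing cross-entropy objective are assumptions rather than the authors' released code.

```python
import torch.nn as nn
import torchvision

class SFTPipelineSketch(nn.Module):
    """Weight-shared CNN extractor -> sparse focus encoder -> caption decoder."""

    def __init__(self, encoder, caption_decoder):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="DEFAULT")  # ImageNet weights (torchvision >= 0.13)
        self.extractor = nn.Sequential(*list(backbone.children())[:-2])  # keep conv feature maps only
        self.encoder = encoder                  # e.g., stacked sparse focus attention blocks
        self.caption_decoder = caption_decoder  # transformer-based caption generator

    def forward(self, img_t1, img_t2, caption_tokens):
        f1 = self.extractor(img_t1)             # shared weights for both acquisitions
        f2 = self.extractor(img_t2)
        change_embed = self.encoder(f1, f2)     # fused change representation
        return self.caption_decoder(caption_tokens, change_embed)  # per-token word logits

# Training uses teacher forcing: the logits at step t are compared (cross-entropy)
# with the ground-truth token at step t + 1 of the reference caption.
```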
Test phase: The predicted captions are generated through an autoregressive process solely based on the input image pairs. Autoregression in the context of image captioning entails generating a caption incrementally, with each word predicted based on the preceding words. To elaborate, the caption generation process commences with the initialization of the “START” token, and subsequent predictions are contingent upon the previously generated words.
B. Theory of Sparse Attention
Analysis of Complexity: The transformer architecture primarily consists of modules such as feedforward neural networks, activation functions, self-attention mechanisms, and others. Among these, the primary attention module is described by the following formulation:
\begin{equation*}
\text{Attention}(\bm {Q, K, V}) = \text{Softmax}\left(\frac{\bm {QK}^{T}}{\sqrt{d}}\right)\bm {V} \tag{1}
\end{equation*}
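For reference, (1) is the standard scaled dot-product attention. The minimal PyTorch sketch below (illustrative, not the paper's implementation) makes explicit why dense attention scales quadratically with the number of tokens, which is the cost the sparse focus design aims to avoid.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Eq. (1): Attention(Q, K, V) = Softmax(QK^T / sqrt(d)) V.

    q, k, v: tensors of shape (batch, n_tokens, d). The n_tokens x n_tokens score
    matrix is what makes dense attention quadratic in n_tokens (n = H * W when a
    feature map is flattened into tokens).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, n_tokens, n_tokens)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                               # (batch, n_tokens, d)
```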
C. Sparse Focus Network
Through the theoretical analysis, we visualize the sparse factorizations of the attention kernels. Fig. 4(a) illustrates the computation of a conventional attention kernel, where the prediction of the current pixel (depicted in deep red) involves the utilization of pixels within a certain range (light-colored region). Fig. 4(b) demonstrates a rowwise attention kernel, where the prediction of the current pixel (depicted in deep red) exclusively considers pixels within the same row (light-colored region). Similarly, Fig. 4(c) follows the same rationale, illustrating a columnwise attention kernel where the prediction of the current pixel only involves pixels within the same column.
Fig. 4. Visualization of attention kernels. (a) Conventional attention kernel. (b) Row attention kernel. (c) Column attention kernel.
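To make the kernels of Fig. 4 concrete, the short sketch below builds the corresponding boolean attention masks for a flattened H × W feature map; it is purely illustrative and not taken from the authors' code.

```python
import torch

def attention_kernel_masks(height: int, width: int):
    """Boolean masks for the kernels of Fig. 4 over a flattened H x W feature map.

    Positions are indexed as p = i * width + j; mask[p, q] is True when position p
    may attend to position q.
    """
    idx = torch.arange(height * width)
    rows, cols = idx // width, idx % width
    full_mask = torch.ones(height * width, height * width, dtype=torch.bool)  # Fig. 4(a)
    row_mask = rows.unsqueeze(1) == rows.unsqueeze(0)                          # Fig. 4(b)
    col_mask = cols.unsqueeze(1) == cols.unsqueeze(0)                          # Fig. 4(c)
    return full_mask, row_mask, col_mask

# Combining the row and column kernels leaves H + W - 1 attended positions per pixel,
# versus H * W for the conventional kernel.
_, row_mask, col_mask = attention_kernel_masks(8, 8)
print((row_mask | col_mask).sum(dim=1)[0].item())  # 15 = 8 + 8 - 1
```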
Based on the analysis of attention kernels presented above, we have developed Sparse Focus Attention. This module integrates horizontal and vertical attention kernels into a novel attention kernel. In Sparse Focus Attention, we devise two distinct types of attention kernels, as depicted in Figs. 5 and 6, distinguished by how they compute the lengths of the rows and columns that influence the prediction of a pixel. They are termed "full length" and "fixed length," respectively. The former sets the row and column lengths affecting the current pixel to the full height and width of the feature map, whereas the latter restricts them to a fixed length.
Fig. 5. Sparse focus full attention: both the rowwise attention and columnwise attention lengths for each point are equal to the entire length of the feature map.
Fig. 6. Sparse focus fixed attention: both the rowwise attention and columnwise attention lengths for each point are of fixed length.
In the "full length" approach, the initial phase involves projecting the feature map $\bm {F}$ into query, key, and value representations, with each position attending to all positions in its own row and column.
Utilizing the sparse focus method on $\bm {F}$, the attention weights of each position $p$ over the $H+W-1$ positions in its row and column are computed as
\begin{equation*}
\mathrm{\mathbf {A}} = \text{Softmax} (\mathrm{\mathbf {Q}}_{p} \cdot \mathrm{\mathbf {K}}_{i,p}^{T}) \tag{2}
\end{equation*}
\begin{equation*}
\bm {F^{\prime }} = \sum _{i=0}^{H+W-1} \mathrm{\mathbf {A}} \cdot \mathrm{\mathbf {V}}_{p} + \bm {F} \tag{3}
\end{equation*}
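A minimal sketch of the "full length" row-plus-column attention consistent with (2) and (3) is given below. The 1 × 1 convolution projections, the single attention head, and the fact that the center pixel contributes through both its row and its column are simplifying assumptions; the authors' implementation may differ.

```python
import torch
import torch.nn as nn

class SparseFocusFullAttention(nn.Module):
    """Row-wise plus column-wise attention over a (B, C, H, W) feature map.

    Each position attends only to the H + W - 1 positions sharing its row or
    column ("full length" variant), followed by the residual addition of (3).
    """

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Energies between a pixel and every pixel in its row / its column.
        e_row = torch.einsum("bcij,bcik->bijk", q, k)   # (B, H, W, W)
        e_col = torch.einsum("bcij,bckj->bijk", q, k)   # (B, H, W, H)
        attn = torch.softmax(torch.cat([e_row, e_col], dim=-1), dim=-1)  # cf. eq. (2)
        a_row, a_col = attn[..., :w], attn[..., w:]

        # Aggregate values along the row and the column, then add the residual.
        out_row = torch.einsum("bijk,bcik->bcij", a_row, v)
        out_col = torch.einsum("bijk,bckj->bcij", a_col, v)
        return out_row + out_col + x                    # cf. eq. (3)
```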
Integrating sparse focus attention with other convolution operations into an SFT network involves using the sparse attention mechanism within the transformer architecture. During the sparse encoder operation, the feature maps $\bm {F}_{\bm {1}}$ and $\bm {F}_{\bm {2}}$ extracted from the bitemporal images are processed as
\begin{align*}
\bm {F}_{\bm {1}}^{\prime } =& \text{SFT}(\bm {F}_{\bm {1}}) \tag{4}\\
\bm {F}_{\bm {2}}^{\prime } =& \text{SFT}(\bm {F}_{\bm {2}}) \tag{5}
\end{align*}
After processing, the resulting feature maps are concatenated to form the final output
\begin{equation*}
\bm {I_{1,2}} = \text{Concat}(\bm {F}_{\bm {1}}^{\prime }, \bm {F}_{\bm {2}}^{\prime }). \tag{6}
\end{equation*}
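Reusing the SparseFocusFullAttention sketch above, the bitemporal encoding of (4)–(6) can be written as follows; the channel width and the concatenation along the channel dimension are assumptions.

```python
import torch

# F1, F2: (B, C, H, W) features of the two acquisitions from the shared extractor.
sft = SparseFocusFullAttention(channels=256)          # channel width is an assumption

def encode_bitemporal(f1, f2):
    f1_prime = sft(f1)                                # eq. (4)
    f2_prime = sft(f2)                                # eq. (5)
    return torch.cat([f1_prime, f2_prime], dim=1)     # eq. (6); channel-wise concat assumed
```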
D. Change Caption Generator
The techniques for generating descriptions are mainly divided into three categories: template-based, retrieval-based, and sequence generation-based approaches. To precisely describe and characterize the differences between image pairs, we adopt the sequence generation-based approach, which employs a transformer-based decoder, a method that has gained significant traction in many contemporary NLP tasks.
The decoder consists of several layers of transformers, each featuring a masked multihead attention sublayer and a feed-forward network. To ensure the continuity of information propagation and bolster the model's robustness, these sublayers are enhanced with residual connections and layer normalization techniques. Ultimately, the output embedding is produced through a linear layer (LN), followed by the application of a softmax activation function. A visual depiction of this architecture is provided in Fig. 7.
During the training phase, to prepare change descriptions for the caption decoder, the text tokens $\bm {t}$ are first mapped to word embeddings via an embedding layer $E_{\text{embed}}$ and combined with a positional embedding $E_{\text{pos}}$
\begin{equation*}
\bm {T}_{\text{embed}} = E_{\text{embed}}(\bm {t}) + E_{\text{pos}}. \tag{7}
\end{equation*}
\begin{equation*}
\bm {Head}_{l} = \text{Attention} (\bm {T}_{\text{embed}}^{i-1} \bm {W}_{l}^{Q}, \bm {T}_{\text{embed}}^{i-1}\bm {W}_{l}^{K}, \bm {T}_{\text{embed}}^{i-1}\bm {W}_{l}^{V}) \tag{8}
\end{equation*}
\begin{equation*}
\text{Attention}(\bm {Q, K, V}) = \text{Softmax}\left(\frac{\bm {QK}^{T}}{\sqrt{d}}\right)\bm {V} \tag{9}
\end{equation*}
\begin{align*}
\bm {T}_{\text{img}} & = \text{MHA}(\bm {T}_{\text{embed}}^{i-1}, \bm {T}_{\text{embed}}^{i-1}, \bm {T}_{\text{embed}}^{i-1}) \\
& = \text{Concat}(\bm {Head}_{1},\ldots, \bm {Head}_{h}) \cdot \bm {W}^{O} \tag{10}
\end{align*}
\begin{equation*}
\bm {T}_{\text{text}} = \text{FN}(\text{Caption}_{\text{text-img}}) + \bm {T}_{\text{embed}}^{i-1}. \tag{11}
\end{equation*}
\begin{equation*}
\mathrm{Caption_{T}} = \text{Softmax}(\text{LN}(\bm {T}_{\text{text}}))
\end{equation*}
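Equations (7)–(11) amount to a standard transformer decoder over the word embeddings with attention to the encoded image features. A generic PyTorch sketch of such a caption head is shown below; the vocabulary size, model width, layer count, and maximum length are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Masked self-attention over word embeddings plus cross-attention to the
    encoded image features, followed by a linear head over the vocabulary."""

    def __init__(self, vocab_size=5000, d_model=512, n_heads=8, n_layers=3, max_len=40):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)             # E_embed in (7)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # E_pos in (7)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)                 # linear layer before softmax

    def forward(self, tokens, image_feats):
        # tokens: (B, T) word ids; image_feats: (B, N, d_model) flattened encoder output.
        t = self.embed(tokens) + self.pos[:, : tokens.size(1)]     # eq. (7)
        t_len = tokens.size(1)
        causal = torch.triu(torch.full((t_len, t_len), float("-inf")), diagonal=1).to(tokens.device)
        h = self.decoder(t, image_feats, tgt_mask=causal)          # masked MHA + cross-attention
        return self.head(h)                                        # logits; softmax yields the caption distribution
```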
In the validation and testing stages, the SFT network employs an autoregressive method to generate captions from input image pairs. The decoding process begins with the "START" token and utilizes the encoder's image features to generate the next token; logits are then produced through a linear layer (LN) and converted to probabilities with softmax. Throughout this process, the decoder integrates the encoder's output and the previously generated tokens to predict subsequent tokens until the "END" token is reached.
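A minimal greedy version of this autoregressive loop is sketched below; the token ids for "START" and "END" and the maximum caption length are assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, image_feats, start_id=1, end_id=2, max_len=40):
    """Generate one token at a time from "START" until "END" or max_len is reached."""
    tokens = torch.full((image_feats.size(0), 1), start_id,
                        dtype=torch.long, device=image_feats.device)
    for _ in range(max_len - 1):
        logits = decoder(tokens, image_feats)             # (B, T, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
        if (next_id == end_id).all():                     # every caption in the batch has ended
            break
    return tokens
```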
Experiments
To perform a thorough evaluation of the proposed SFT approach, we conducted comparisons with existing remote sensing change captioning methods on two benchmark remote sensing change captioning (RSCC) datasets. Furthermore, we provide evidence to demonstrate that our proposed method not only enables rapid predictions with low complexity but also exhibits high accuracy.
A. Datasets
LEVIR-CC Dataset: The LEVIR-CC dataset originates from a building change detection dataset consisting of 637 very high-resolution (0.5 m/pixel) bitemporal images sized 1024 × 1024 pixels [10]. To adapt it for use in RSCC, the LEVIR-CC dataset was curated by segmenting 10 077 small bitemporal tiles sized 256 × 256 pixels, with each tile annotated as containing changes or no changes. The dataset comprises 5 038 image pairs depicting changes and 5 039 pairs without changes, with each pair accompanied by five distinct sentence descriptions delineating the nature of changes between the two acquisitions. The maximum sentence length is 39 words, with an average of 7.99 words.

Dubai-CC Dataset: The Dubai-CC dataset offers a detailed portrayal of urbanization changes within the Dubai region. To ensure precise identification and description of changes, the original images are partitioned into 500 tiles sized 50 × 50 pixels, with five change descriptions annotated for each small bitemporal tile, referencing Google Maps and publicly available documents. The dataset comprises 2 500 distinct descriptions, with a maximum length of 23 words and an average length of 7.35 words. Experimental configurations detailed in [8] were adopted, with the dataset divided into training, validation, and testing sets containing 300, 50, and 150 bitemporal tiles, respectively.
B. Experimental Setup
Evaluation Metrics: The efficacy of the captioning model hinges on its ability to produce descriptive sentences that align well with human judgments regarding differences between bitemporal images. To gauge this alignment, automatic evaluation metrics are employed to quantify the accuracy of the generated sentences against annotated reference sentences. In our study, we utilized four standard metrics prevalent in both image captioning [33], [34] and change captioning [35], [36] domains:
BLEU-N (N = 1, 2, 3, 4): Papineni et al. [37] measured the precision of n-gram overlap between the generated caption and the reference captions. It calculates the precision for each n-gram size (up to N) and combines them using a geometric mean, penalized by a brevity penalty to account for shorter generated captions
\begin{equation*}
\text{BLEU-N} = \text{BP} \times \exp \left(\sum _{n=1}^{N} \frac{1}{N} \log p_{n}\right) \tag{12}
\end{equation*}
where $p_{n}$ is the precision of n-grams and BP is the brevity penalty.

ROUGE-L: ROUGE [38] computed the longest common subsequence (LCS) between the generated caption and the reference captions. It normalizes this by the length of the longer of the two sequences, providing a measure of how well the generated caption captures the content of the reference captions
\begin{equation*}
\text{ROUGE-L} = \frac{\text{LCS}(g, r)}{\max (\text{len}(g), \text{len}(r))} \tag{13}
\end{equation*}
where LCS(g, r) is the length of the LCS between the generated caption $(g)$ and the reference caption $(r)$.

METEOR: Banerjee et al. [39] calculated the harmonic mean of precision and recall, incorporating stemming and synonymy in its evaluation. It penalizes fragmentation by rewarding contiguous matches and accounts for recall through the harmonic mean
\begin{equation*}
\text{METEOR} = (1 - \alpha) \cdot \text{P} \cdot \text{R} \cdot \frac{\text{P} + \beta \cdot \text{R}}{\text{P} + \text{R}} \tag{14}
\end{equation*}
where P is precision, R is recall, and $\alpha$ and $\beta$ are tunable parameters.

CIDEr-D: Vedantam et al. [40] computed the consensus between the generated caption and the reference captions based on TF-IDF weighted n-grams. It emphasizes capturing diverse and descriptive phrases in the generated caption, giving higher scores for more informative and varied descriptions
\begin{equation*}
\text{CIDEr-D} = \text{TF-IDF}(\text{gen}, \text{ref}) \tag{15}
\end{equation*}
where TF-IDF is the term frequency-inverse document frequency weighted similarity between the generated caption and the reference captions.
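As a concrete illustration of (12), the toy function below computes BLEU-N for a single hypothesis against a single reference; real evaluation should use standard toolkits, which also handle multiple references and precision clipping in full.

```python
import math
from collections import Counter

def bleu_n(hypothesis: str, reference: str, n_max: int = 4) -> float:
    """Toy BLEU-N per (12): geometric mean of n-gram precisions times the brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, n_max + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())        # clipped n-gram matches
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / n_max)

print(round(bleu_n("a road is built in the forest", "a road is built in the woods"), 3))
```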
Experimental Details: The deep learning methods in this study are implemented in the PyTorch framework and executed on a single NVIDIA A5000 GPU with 24 GB of memory. Training uses the Adam optimizer [41] with an initial learning rate of 0.0001 and a weight decay of 0.5. The training regimen spans about 40 epochs with a batch size of 32 for computational efficiency. After each epoch, the model is evaluated on the validation set.
C. Results Analysis
This study presents a comprehensive evaluation of the proposed algorithm on two distinct datasets, LEVIR_CC and Dubai_CC.
Accuracy Analysis: We examine the models' performance in detail, focusing on their accuracy under different metrics and how well they perform on the given datasets.
On the LEVIR_CC dataset, our SFT method demonstrates remarkable performance, outperforming several existing methods such as DUDA, MCCFormers-S, MCCFormers-D, RSICCformer, ATTENTIVE-S (without positional embedding initialization), and ATTENTIVE (with positional embedding initialization), as illustrated in Table I. Specifically, it achieves a BLEU-4 score of 62.87%, underscoring its effectiveness in accurately capturing and describing changes between bitemporal images. Furthermore, our approach surpasses the current state-of-the-art (SOTA) methods on various other evaluation metrics, highlighting its superiority in change captioning tasks and validating our earlier analysis of the proposed algorithm. In the comparison presented in Table II, our SFT method is evaluated against several other methods on the Dubai-CC dataset. While our method performs reasonably well, it ranks second in overall performance. The leading method, ATTENTIVE, achieves the highest scores across all evaluation metrics, indicating a stronger ability to capture the nuances of change between bitemporal images. Nevertheless, our sparse focus method demonstrates competitive performance, closely trailing the leading approach, with solid scores on metrics such as BLEU-1, BLEU-2, ROUGE-L, and CIDEr-D, suggesting its potential in change captioning tasks. Despite securing second position, this performance underscores its efficacy and potential as a robust solution for change detection and description in remote sensing images.
Parameter and Complexity Analysis: This section provides an in-depth analysis of model parameters and computational complexity, elucidating the balance between model size and computational efficiency. By examining these factors, we gain valuable insight into the efficiency and effectiveness of the proposed approach.
Parameter and Inference Analysis: Table III compares different methods based on their total parameters, encoder module parameters, and inference time for images sized 256 × 256 pixels. Sparse Focus is prominently featured in this comparison.
While SparseFocus ranks second in terms of accuracy, it stands out as the top performer in terms of parameter efficiency and computational complexity. For example, the best model from our work, Sparse Focus Full (depicted by the gray line), has only about half the total parameters compared to the highest accuracy model, ATTENTIVE. This advantage is especially pronounced in the image encoder module, where our encoder has over 90% fewer parameters compared to the 648M parameters of the ATTENTIVE model's encoder. This considerable reduction in model size underscores its efficiency in terms of both memory occupation and resource utilization, positioning it as the most lightweight method among the approaches considered, especially given comparable accuracy.
Moreover, SparseFocus exhibits competitive performance in inference time, demonstrating its potential for fast and efficient inference. Although its inference time is marginally longer than the baseline, it still falls within a moderate range, ensuring that it is practical for real-world applications.
In summary, SparseFocus emerges as a promising solution for change detection tasks, excelling in parameter efficiency and computational complexity while maintaining competitive accuracy. Its lightweight nature and reasonable inference time make it a suitable option for applications with strict resource constraints or real-time processing requirements. This combination of efficiency and practicality underscores its potential for broader adoption in environments where hardware limitations and processing speed are critical considerations.
GPU Memory & MACs Analysis: Table IV provides a comparative analysis of various models, focusing on their memory usage and computational efficiency, specifically regarding training GPU memory, inference GPU memory, and multiply-accumulate operations (MACs) for images sized 256 × 256 pixels.
SparseFocus exhibits notable advantages in terms of GPU memory utilization for both training and inference. For example, the best model from our study (the gray line) requires approximately 9 GB for training, whereas the ATTENTIVE model requires more than 12 GB. In addition, the inference GPU memory usage for SparseFocus is around 0.6 GB, significantly lower than other models, which can exceed 1 GB or even 2 GB. This considerable reduction in memory usage demonstrates the efficiency of SparseFocus in leveraging computational resources during both model training and inference.
Furthermore, SparseFocus requires significantly fewer multiply-accumulate operations (MACs), indicating superior computational efficiency during model execution. Its MACs count is only 21.47 G, far less than the RSICCformer and ATTENTIVE models, which require about 150 G and 464 G, respectively. This reduction in computational overhead indicates that SparseFocus achieves comparable or even better performance with substantially lower resource requirements.
Overall, these results suggest that SparseFocus is not only efficient in terms of memory usage but also demonstrates superior computational efficiency. This makes it a highly attractive option for applications where resources are limited, or where computational efficiency is a critical factor. By maintaining high performance with reduced computational overhead, SparseFocus emerges as a compelling choice for a wide range of scenarios, particularly in environments with strict resource constraints or where speed and efficiency are paramount.
D. Ablation Studies
LEVIR_CC Ablation: In our study, we implemented the SparseFocus mechanism with attention kernel length
In our subsequent experiments, we evaluated the performance of the proposed methods, Sparse Focus Full and Sparse Focus Fixed, across LEVIR_CC datasets. In this context,
In summary, the results from these experiments emphasize the importance of carefully considering model architecture, especially when utilizing attention mechanisms. The success of Sparse Focus Full with a single stack suggests that simpler configurations can often yield superior outcomes, while excessive complexity may adversely impact performance. These insights will guide future research into optimal configurations for attention-based models, ensuring a balance between precision and efficiency.
The Table V illustrates the change accuracy of Sparse Focus methods on the LEVIR-CC dataset. Notably, Sparse Focus Full consistently outperforms Sparse Focus Fixed, emphasizing the effectiveness of utilizing the full-length attention mechanism. Specifically, Sparse Focus Full achieves superior change accuracy and total accuracy compared to Sparse Focus Fixed. Overall, Sparse Focus Full demonstrates superior performance in accurately detecting changes while maintaining high classification accuracy.
DUBAI_CC Ablation: The performance evaluation on the DUBAI-CC dataset highlights the effectiveness of Sparse Focus methods across various evaluation metrics, as depicted in Table VI. Specifically, Sparse Focus Full with
Based on the results presented in Table VI, Sparse Focus methods demonstrate superior performance in change accuracy on the dubai-CC dataset. Specifically, Sparse Focus Full(
However, Tables V and VI also reveal that in the Sparse Focus Full series, stacking a single attention layer yields better accuracy in the final captioning result, whereas stacking two layers results in higher performance in image transformation detection. This observation suggests that, despite a consistent decoder, the optimal output of the image encoder does not necessarily contribute to a more accurate captioning by the decoder. There might be several reasons for this inconsistency. First, conventional similarity computation methods, such as cosine similarity, might not effectively integrate the image and text information accurately. This limitation could hinder the decoder's ability to generate captions that accurately reflect the content of the image. Second, the text decoder might require further design considerations to effectively decode the image information output by the image encoder, indicating that a more refined approach in integrating the encoder's output with the decoder's architecture could lead to improved captioning accuracy. Overall, this analysis underscores the need to re-examine the relationship between image encoding and text decoding to ensure that the information flow between these two components is optimized. By addressing these challenges, future work can focus on developing more robust methodologies to achieve a seamless fusion of image and text information, thereby enhancing the accuracy of captioning.
Model Architecture Ablation: We conducted ablation experiments by employing different networks for the extractor, image encoder, and caption decoder components of the model to further validate the effectiveness of the method proposed in this article. As shown in Table VII, when different backbone networks were utilized as the Extractor for feature extraction, a decrease in accuracy was observed. Furthermore, we employed the “vit_base_patch16_224” model as the Image Encoder and made slight adjustments to the ViT model to better align it with our method. As indicated in the fourth row of the table, the accuracy of the experimental results is second only to our proposed method. However, the model's parameter count reached approximately 1.4G, with training GPU memory exceeding 15 GB. In addition, we integrated the Sparse Focus module into the caption generator. The results, as seen in the fourth row of the table, reveal that the accuracy was the lowest, likely due to the sequential regression and prediction of each word during caption generation. If the final decoder is too sparse, it may struggle to accurately predict the next word, thus negatively impacting overall experimental accuracy.
E. Qualitative Visualization
To assess the effectiveness of the change captions produced by our proposed SFT method, we performed a qualitative evaluation by selecting representative scenes from both the Dubai-CC and LEVIR-CC datasets. These scenes are depicted in Figs. 8 and 9. In each figure, TIME1 and TIME2 correspond to remote sensing images captured at different times in the same area. The captions on the right represent the predictions generated by our model, while the captions below labeled as “reference” serve as the ground truth labels.
F. Limitation and Future Work
The research presented in this article points to several directions for future work on change captioning of remote sensing images. First, multimodal fusion should be addressed so that representations of the individual modalities are extracted more accurately and effectively; only when each modality is sufficiently clear and accurate can the modalities be effectively fused and aligned. Second, for RSICC, novel solution architectures beyond the existing three-stage paradigm should be explored. Third, additional multimodal fusion techniques, such as similarity calculation methods, should be investigated to facilitate accurate matching between the image and text modalities.
Conclusion
In summary, our SFT network offers a compelling solution to the task of change captioning in remote sensing imagery. Through rigorous experimentation, we demonstrate its effectiveness in accurately describing changes while significantly reducing computational complexity and parameter count compared to existing methods. Our approach not only advances change captioning technology but also contributes to efficient multimodal modeling, with potential applications in various industries and domains requiring high-dimensional data interaction.