Introduction
Remote sensing image change captioning (RSICC) has become an active research topic [1], [2], [3], [4]. This emerging area seeks to provide meaningful descriptions of alterations within scenes, offering a valuable tool for understanding changes in land cover over time. The dynamic, multitemporal nature of remote sensing data presents challenges and opportunities for change captioning that differ from the traditional change detection task, as shown in Fig. 1: change captioning outputs text descriptions, whereas change detection outputs a map of the changed regions. The growing availability of multitemporal remote sensing data has sparked increasing interest in using change captioning to study changes in land cover [5], [6], [7]. In recent years, a series of RSICC methods have been proposed, and they can be grouped into three types: the first relies primarily on traditional machine learning algorithms, such as support vector machines (SVMs), to generate descriptions; the second employs convolutional neural networks (CNNs) or recurrent neural networks (RNNs) to generate change descriptions; and the third comprises transformer-based methods built on attention mechanisms.
Fig. 1. Comparison between change detection and change captioning for remote sensing images. The former (top of the figure) represents the detected change areas in image form, while the latter (bottom of the figure) expresses changes in remote sensing images through human-readable language.
The first two categories of methods [8], [9], based on SVMs and RNNs, perform poorly in terms of captioning accuracy and precision, rendering them of limited practical value. Conversely, the third category [10], [11], based on attention mechanisms, has significantly improved the accuracy of change descriptions, demonstrating suitability for practical applications. Nonetheless, owing to the high complexity and parameter count of attention mechanisms, particularly in transformer architectures, deployment in industrial settings with limited computational resources remains challenging. Against this backdrop, there is an urgent need for a lightweight algorithm that incorporates attention mechanisms while ensuring both high accuracy and practical deployability. To address these issues, this article proposes a sparse focus transformer (SFT) network for remote sensing change captioning. As illustrated in Fig. 2, the proposed SFT not only ensures high-precision outputs, characterized by accurate descriptive text, but also significantly reduces the parameter count compared with previous approaches.
Fig. 2. Illustration of algorithmic evaluation: computational efficiency (parameter count) and predictive accuracy.
Our method is primarily inspired by the work of [12], which explores sparse factorizations of the attention matrix, reducing computational complexity from $O(n^{2})$ to $O(n\sqrt{n})$. The main contributions of this article are summarized as follows.
1) We adapt the sparse factorization of the attention matrix from generating long sequential text to the task of remote sensing image change detection, aiming to establish a sparse attention mechanism for locating change regions.
2) We construct an SFT tailored to the task of change captioning in remote sensing images, significantly reducing redundancy in multimodal models and thereby achieving better modality representation and subsequent fusion.
3) Extensive validation on benchmark datasets shows that our method not only produces high-accuracy outputs but also has the lowest computational complexity and parameter count among current methods in this field.
The subsequent sections of the article are structured as follows. Section II offers an overview of the related work. The proposed method is expounded upon in Section III. Section IV delves into the experiments, providing detailed descriptions and discussing the results obtained. Finally, a concise conclusion is drawn in Section V.
Related Work
A. Efficient Attention Architecture
The attention mechanism [13], along with the transformer architecture [14] it underpins, has emerged as a pivotal advancement in deep learning in recent years. Operating within the encoder–decoder paradigm, it has found widespread application across various tasks. In the following, we review work dedicated to improving the efficiency of attention mechanisms and vision transformers (ViTs).
Enhancing Locality: The sparse transformer [12] decomposes the computation of full attention into several faster attention operations that, when combined, approximate dense attention. Longformer [15] introduces a self-attention whose time and memory complexity grow linearly with the sequence length, allowing the model to handle long documents efficiently. LeViT [16] and MobileViT [17] adopted hybrid architectures featuring stacked convolution layers, effectively reducing the number of features in the initial layers. Twins [18] alternated between local and global attention layers to enhance performance. RegionViT [19] introduced regional tokens and local tokens, enriching local context with global information. Huang et al. [20] and Ho et al. [21] proposed axial transformers, self-attention-based models for images and other data organized as high-dimensional tensors. CrossViT [22] processed small-patch and large-patch tokens separately, integrating them via multiple attention mechanisms. Pan et al. [23] introduced the HiLo attention method to separate high- and low-frequency patterns within an attention layer, dividing the heads into two groups, each equipped with operations tailored to local window focus.
Faster Attention: Swin [24] and CSWin [25] computed local attention within windows and introduced a shifted window partitioning method to enable cross-window connections. Shuffle Transformer [26] and MSG-Transformer [27] employed spatial shuffle operations as alternatives to shifted window partitioning, facilitating cross-window connections. FasterViT [28] introduced hierarchical attention, breaking global self-attention into multilevel attention components. FLatten Transformer [29] combined depth-wise convolution with linear attention to maintain diversity in output features across different positions.
In summary of this work on efficient attention, our sparse focus attention not only enhances local information in images but also accumulates long-distance context, while remaining simple and efficient in both implementation and speed.
B. Remote Sensing Image Change Captioning
Hoxha et al. [8] proposed early and late feature fusion strategies to integrate bitemporal visual features, employing an RNN and a multiclass SVM decoder to generate change captions. Chouaf et al. [9] pioneered the RSICC task by employing a CNN as a visual encoder to capture temporal scene changes, while adopting an RNN as a decoder to generate change descriptions. Liu et al. [10] recently introduced a transformer-based encoder–decoder framework for the RSICC task. Their approach involves utilizing a dual-branch transformer encoder to detect scene changes and proposing a multistage fusion module to merge multilayer features for change description generation. Furthermore, Liu et al. [30] enhanced their method by incorporating progressive difference perception transformer layers to capture high-level and low-level semantic change information. In addition, Liu et al. [31] proposed a prompt-based approach leveraging pretrained large language models (LLMs) for RSICC tasks. They employed visual features, change classes, and language representations as input prompts to a frozen LLM for the generation of change captions. Chang [11] proposed an attentive network for remote sensing change captioning, called Chg2Cap, which utilizes the power of transformer models in NLP.
Through an investigation of the aforementioned RSICC efforts, it was observed that prior endeavors have attempted various methods to address the task. However, they did not consider practical industrial applications; either the accuracy was insufficient or the complexity too high. Consequently, we propose a lightweight algorithm that not only achieves higher accuracy but also meets the requirements for real-world application scenarios.
Methodology
In this section, we will introduce our proposed lightweight transformer model for the RSICC task.
A. Overview
First, we provide an overview of our model. As shown in Fig. 3, the proposed model contains two phases, i.e., a training phase and a testing phase. In the following, we introduce the two phases separately.
Fig. 3. Overall framework of the proposed SFT method, comprising three components. (a) Feature extractor: a CNN-based, weight-shared feature extractor, implemented with ResNet101 in this study, that extracts coarse change features from bitemporal remote sensing images. (b) Sparse focus encoder: an encoder designed to finely capture and localize change features in remote sensing images, based on the proposed sparse focus attention mechanism. (c) Change caption generator: a decoder that generates the final change caption for remote sensing images from the change feature embeddings and word embeddings.
Training phase: The SFT, as illustrated in Fig. 3, consists primarily of three components: a CNN feature extractor, for which we employ ResNet101, responsible for extracting generic representations from the bitemporal images; an image encoder based on sparse focus attention, designed to extract features of the changed regions using the sparse attention mechanism proposed in this article; and a caption decoder that uses both image embeddings and word embeddings to generate the predicted change caption, capturing the inter-relationship between them. The training phase proceeds as follows: the bitemporal images are first passed through the weight-shared extractor to obtain coarse features, these features are refined by the sparse focus encoder, and the caption decoder then combines the resulting change embeddings with the word embeddings of the reference caption to predict the change description.
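As a rough sketch of how these three components are wired together during training, the following pseudo-implementation is illustrative only; the module interfaces, the ResNet101 truncation point, and the teacher-forcing cross-entropy objective are assumptions rather than the authors' released code.

```python
import torch.nn as nn
import torchvision

class SFTPipelineSketch(nn.Module):
    """Weight-shared CNN extractor -> sparse focus encoder -> caption decoder."""

    def __init__(self, encoder, caption_decoder):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="DEFAULT")  # ImageNet weights (torchvision >= 0.13)
        self.extractor = nn.Sequential(*list(backbone.children())[:-2])  # keep conv feature maps only
        self.encoder = encoder                  # e.g., stacked sparse focus attention blocks
        self.caption_decoder = caption_decoder  # transformer-based caption generator

    def forward(self, img_t1, img_t2, caption_tokens):
        f1 = self.extractor(img_t1)             # shared weights for both acquisitions
        f2 = self.extractor(img_t2)
        change_embed = self.encoder(f1, f2)     # fused change representation
        return self.caption_decoder(caption_tokens, change_embed)  # per-token word logits

# Training uses teacher forcing: the logits at step t are compared (cross-entropy)
# with the ground-truth token at step t + 1 of the reference caption.
```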
Test phase: The predicted captions are generated through an autoregressive process solely based on the input image pairs. Autoregression in the context of image captioning entails generating a caption incrementally, with each word predicted based on the preceding words. To elaborate, the caption generation process commences with the initialization of the “START” token, and subsequent predictions are contingent upon the previously generated words.
B. Theory of Sparse Attention
Analysis of Complexity: The transformer architecture primarily consists of modules such as feedforward neural networks, activation functions, self-attention mechanisms, and others. Among these, the primary attention module is described by the following formulation:
\begin{equation*}
\text{Attention}(\bm {Q, K, V}) = \text{Softmax}\left(\frac{\bm {QK}^{T}}{\sqrt{d}}\right)\bm {V} \tag{1}
\end{equation*}
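For reference, (1) is the standard scaled dot-product attention. The minimal PyTorch sketch below (illustrative, not the paper's implementation) makes explicit why dense attention scales quadratically with the number of tokens, which is the cost the sparse focus design aims to avoid.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Eq. (1): Attention(Q, K, V) = Softmax(QK^T / sqrt(d)) V.

    q, k, v: tensors of shape (batch, n_tokens, d). The n_tokens x n_tokens score
    matrix is what makes dense attention quadratic in n_tokens (n = H * W when a
    feature map is flattened into tokens).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, n_tokens, n_tokens)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                               # (batch, n_tokens, d)
```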
C. Sparse Focus Network
Through the theoretical analysis, we visualize the sparse factorizations of the attention kernels. Fig. 4(a) illustrates the computation of a conventional attention kernel, where the prediction of the current pixel (depicted in deep red) involves the utilization of pixels within a certain range (light-colored region). Fig. 4(b) demonstrates a rowwise attention kernel, where the prediction of the current pixel (depicted in deep red) exclusively considers pixels within the same row (light-colored region). Similarly, Fig. 4(c) follows the same rationale, illustrating a columnwise attention kernel where the prediction of the current pixel only involves pixels within the same column.
Fig. 4. Visualization of attention kernels. (a) Conventional attention kernel. (b) Row attention kernel. (c) Column attention kernel.
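To make the kernels of Fig. 4 concrete, the short sketch below builds the corresponding boolean attention masks for a flattened H × W feature map; it is purely illustrative and not taken from the authors' code.

```python
import torch

def attention_kernel_masks(height: int, width: int):
    """Boolean masks for the kernels of Fig. 4 over a flattened H x W feature map.

    Positions are indexed as p = i * width + j; mask[p, q] is True when position p
    may attend to position q.
    """
    idx = torch.arange(height * width)
    rows, cols = idx // width, idx % width
    full_mask = torch.ones(height * width, height * width, dtype=torch.bool)  # Fig. 4(a)
    row_mask = rows.unsqueeze(1) == rows.unsqueeze(0)                          # Fig. 4(b)
    col_mask = cols.unsqueeze(1) == cols.unsqueeze(0)                          # Fig. 4(c)
    return full_mask, row_mask, col_mask

# Combining the row and column kernels leaves H + W - 1 attended positions per pixel,
# versus H * W for the conventional kernel.
_, row_mask, col_mask = attention_kernel_masks(8, 8)
print((row_mask | col_mask).sum(dim=1)[0].item())  # 15 = 8 + 8 - 1
```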
Based on the analysis of attention kernels presented above, we have developed Sparse Focus Attention. This module integrates horizontal and vertical attention kernels into a novel attention kernel. In Sparse Focus Attention, we devise two distinct types of attention kernels, as depicted in Figs. 5 and 6, distinguished by how they compute the lengths of the rows and columns that influence the prediction of a pixel. They are termed "full length" and "fixed length," respectively. The former sets the row and column lengths affecting the current pixel to the full height and width of the feature map, whereas the latter restricts them to a fixed length.
Fig. 5. Sparse focus full attention: both the rowwise attention and columnwise attention lengths for each point are equal to the entire length of the feature map.
Fig. 6. Sparse focus fixed attention: both the rowwise attention and columnwise attention lengths for each point are of fixed length.
In the "full length" approach, the initial phase involves projecting the feature map $\bm {F}$ into query, key, and value representations, with each position attending to all positions in its own row and column.
Utilizing the sparse focus method on $\bm {F}$, the attention weights of each position $p$ over the $H+W-1$ positions in its row and column are computed as
\begin{equation*}
\mathrm{\mathbf {A}} = \text{Softmax} (\mathrm{\mathbf {Q}}_{p} \cdot \mathrm{\mathbf {K}}_{i,p}^{T}) \tag{2}
\end{equation*}
\begin{equation*}
\bm {F^{\prime }} = \sum _{i=0}^{H+W-1} \mathrm{\mathbf {A}} \cdot \mathrm{\mathbf {V}}_{p} + \bm {F} \tag{3}
\end{equation*}
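A minimal sketch of the "full length" row-plus-column attention consistent with (2) and (3) is given below. The 1 × 1 convolution projections, the single attention head, and the fact that the center pixel contributes through both its row and its column are simplifying assumptions; the authors' implementation may differ.

```python
import torch
import torch.nn as nn

class SparseFocusFullAttention(nn.Module):
    """Row-wise plus column-wise attention over a (B, C, H, W) feature map.

    Each position attends only to the H + W - 1 positions sharing its row or
    column ("full length" variant), followed by the residual addition of (3).
    """

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Energies between a pixel and every pixel in its row / its column.
        e_row = torch.einsum("bcij,bcik->bijk", q, k)   # (B, H, W, W)
        e_col = torch.einsum("bcij,bckj->bijk", q, k)   # (B, H, W, H)
        attn = torch.softmax(torch.cat([e_row, e_col], dim=-1), dim=-1)  # cf. eq. (2)
        a_row, a_col = attn[..., :w], attn[..., w:]

        # Aggregate values along the row and the column, then add the residual.
        out_row = torch.einsum("bijk,bcik->bcij", a_row, v)
        out_col = torch.einsum("bijk,bckj->bcij", a_col, v)
        return out_row + out_col + x                    # cf. eq. (3)
```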
Integrating sparse focus attention with other convolution operations into an SFT network involves using the sparse attention mechanism within the transformer architecture. During the sparse encoder operation, the feature maps $\bm {F}_{\bm {1}}$ and $\bm {F}_{\bm {2}}$ extracted from the bitemporal images are processed as
\begin{align*}
\bm {F}_{\bm {1}}^{\prime } =& \text{SFT}(\bm {F}_{\bm {1}}) \tag{4}\\
\bm {F}_{\bm {2}}^{\prime } =& \text{SFT}(\bm {F}_{\bm {2}}) \tag{5}
\end{align*}
After processing, the resulting feature maps are concatenated to form the final output
\begin{equation*}
\bm {I_{1,2}} = \text{Concat}(\bm {F}_{\bm {1}}^{\prime }, \bm {F}_{\bm {2}}^{\prime }). \tag{6}
\end{equation*}
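Reusing the SparseFocusFullAttention sketch above, the bitemporal encoding of (4)–(6) can be written as follows; the channel width and the concatenation along the channel dimension are assumptions.

```python
import torch

# F1, F2: (B, C, H, W) features of the two acquisitions from the shared extractor.
sft = SparseFocusFullAttention(channels=256)          # channel width is an assumption

def encode_bitemporal(f1, f2):
    f1_prime = sft(f1)                                # eq. (4)
    f2_prime = sft(f2)                                # eq. (5)
    return torch.cat([f1_prime, f2_prime], dim=1)     # eq. (6); channel-wise concat assumed
```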
D. Change Caption Generator
The techniques for generating descriptions are mainly divided into three categories: template-based, retrieval-based, and sequence generation-based approaches. To precisely describe and characterize the differences between image pairs, we adopt the sequence generation-based approach, which employs a transformer-based decoder, a method that has gained significant traction in many contemporary NLP tasks.
The decoder consists of several layers of transformers, each featuring a masked multihead attention sublayer and a feed-forward network. To ensure the continuity of information propagation and bolster the model's robustness, these sublayers are enhanced with residual connections and layer normalization techniques. Ultimately, the output embedding is produced through a linear layer (LN), followed by the application of a softmax activation function. A visual depiction of this architecture is provided in Fig. 7.
During the training phase, to prepare change descriptions for the caption decoder, the text tokens $\bm {t}$ are first mapped to word embeddings via an embedding layer $E_{\text{embed}}$ and combined with a positional embedding $E_{\text{pos}}$
\begin{equation*}
\bm {T}_{\text{embed}} = E_{\text{embed}}(\bm {t}) + E_{\text{pos}}. \tag{7}
\end{equation*}
\begin{equation*}
\bm {Head}_{l} = \text{Attention} (\bm {T}_{\text{embed}}^{i-1} \bm {W}_{l}^{Q}, \bm {T}_{\text{embed}}^{i-1}\bm {W}_{l}^{K}, \bm {T}_{\text{embed}}^{i-1}\bm {W}_{l}^{V}) \tag{8}
\end{equation*}
\begin{equation*}
\text{Attention}(\bm {Q, K, V}) = \text{Softmax}\left(\frac{\bm {QK}^{T}}{\sqrt{d}}\right)\bm {V} \tag{9}
\end{equation*}
\begin{align*}
\bm {T}_{\text{img}} & = \text{MHA}(\bm {T}_{\text{embed}}^{i-1}, \bm {T}_{\text{embed}}^{i-1}, \bm {T}_{\text{embed}}^{i-1}) \\
& = \text{Concat}(\bm {Head}_{1},\ldots, \bm {Head}_{h}) \cdot \bm {W}^{O} \tag{10}
\end{align*}
\begin{equation*}
\bm {T}_{\text{text}} = \text{FN}(\text{Caption}_{\text{text-img}}) + \bm {T}_{\text{embed}}^{i-1}. \tag{11}
\end{equation*}
\begin{equation*}
\mathrm{Caption_{T}} = \text{Softmax}(\text{LN}(\bm {T}_{\text{text}}))
\end{equation*}
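Equations (7)–(11) amount to a standard transformer decoder over the word embeddings with attention to the encoded image features. A generic PyTorch sketch of such a caption head is shown below; the vocabulary size, model width, layer count, and maximum length are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Masked self-attention over word embeddings plus cross-attention to the
    encoded image features, followed by a linear head over the vocabulary."""

    def __init__(self, vocab_size=5000, d_model=512, n_heads=8, n_layers=3, max_len=40):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)             # E_embed in (7)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # E_pos in (7)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)                 # linear layer before softmax

    def forward(self, tokens, image_feats):
        # tokens: (B, T) word ids; image_feats: (B, N, d_model) flattened encoder output.
        t = self.embed(tokens) + self.pos[:, : tokens.size(1)]     # eq. (7)
        t_len = tokens.size(1)
        causal = torch.triu(torch.full((t_len, t_len), float("-inf")), diagonal=1).to(tokens.device)
        h = self.decoder(t, image_feats, tgt_mask=causal)          # masked MHA + cross-attention
        return self.head(h)                                        # logits; softmax yields the caption distribution
```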
In the validation and testing stages, the SFT network employs an autoregressive method to generate captions from input image pairs. The decoding process begins with the "START" token and utilizes the encoder's image features to generate the next token; logits are then produced through a linear layer (LN) and converted to probabilities with softmax. Throughout this process, the decoder integrates the encoder's output and the previously generated tokens to predict subsequent tokens until the "END" token is reached.
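A minimal greedy version of this autoregressive loop is sketched below; the token ids for "START" and "END" and the maximum caption length are assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, image_feats, start_id=1, end_id=2, max_len=40):
    """Generate one token at a time from "START" until "END" or max_len is reached."""
    tokens = torch.full((image_feats.size(0), 1), start_id,
                        dtype=torch.long, device=image_feats.device)
    for _ in range(max_len - 1):
        logits = decoder(tokens, image_feats)             # (B, T, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
        if (next_id == end_id).all():                     # every caption in the batch has ended
            break
    return tokens
```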
Experiments
To perform a thorough evaluation of the proposed SFT approach, we conducted comparisons with existing remote sensing change captioning methods on two benchmark remote sensing change captioning (RSCC) datasets. Furthermore, we provide evidence to demonstrate that our proposed method not only enables rapid predictions with low complexity but also exhibits high accuracy.
A. Datasets
LEVIR-CC Dataset: The LEVIR-CC dataset originates from a building change detection dataset consisting of 637 very high-resolution (0.5 m/pixel) bitemporal images sized 1024 × 1024 pixels [10]. To adapt it for use in RSCC, the LEVIR-CC dataset was curated by segmenting 10 077 small bitemporal tiles sized 256 × 256 pixels, with each tile annotated as containing changes or no changes. The dataset comprises 5 038 image pairs depicting changes and 5 039 pairs without changes, with each pair accompanied by five distinct sentence descriptions delineating the nature of changes between the two acquisitions. The maximum sentence length is 39 words, with an average of 7.99 words.

Dubai-CC Dataset: The Dubai-CC dataset offers a detailed portrayal of urbanization changes within the Dubai region. To ensure precise identification and description of changes, the original images are partitioned into 500 tiles sized 50 × 50 pixels, with five change descriptions annotated for each small bitemporal tile, referencing Google Maps and publicly available documents. The dataset comprises 2 500 distinct descriptions, with a maximum length of 23 words and an average length of 7.35 words. Experimental configurations detailed in [8] were adopted, with the dataset divided into training, validation, and testing sets containing 300, 50, and 150 bitemporal tiles, respectively.
B. Experimental Setup
Evaluation Metrics: The efficacy of the captioning model hinges on its ability to produce descriptive sentences that align well with human judgments regarding differences between bitemporal images. To gauge this alignment, automatic evaluation metrics are employed to quantify the accuracy of the generated sentences against annotated reference sentences. In our study, we utilized four standard metrics prevalent in both image captioning [33], [34] and change captioning [35], [36] domains:
BLEU-N (N = 1, 2, 3, 4): Papineni et al. [37] measured the precision of n-gram overlap between the generated caption and the reference captions. It calculates the precision for each n-gram size (up to N) and combines them using a geometric mean, penalized by a brevity penalty to account for shorter generated captions
\begin{equation*}
\text{BLEU-N} = \text{BP} \times \exp \left(\sum _{n=1}^{N} \frac{1}{N} \log p_{n}\right) \tag{12}
\end{equation*}
where $p_{n}$ is the precision of n-grams and BP is the brevity penalty.

ROUGE-L: ROUGE [38] computed the longest common subsequence (LCS) between the generated caption and the reference captions. It normalizes this by the length of the longer of the two sequences, providing a measure of how well the generated caption captures the content of the reference captions
\begin{equation*}
\text{ROUGE-L} = \frac{\text{LCS}(g, r)}{\max (\text{len}(g), \text{len}(r))} \tag{13}
\end{equation*}
where LCS(g, r) is the length of the LCS between the generated caption $(g)$ and the reference caption $(r)$.

METEOR: Banerjee et al. [39] calculated the harmonic mean of precision and recall, incorporating stemming and synonymy in its evaluation. It penalizes fragmentation by rewarding contiguous matches and accounts for recall through the harmonic mean
\begin{equation*}
\text{METEOR} = (1 - \alpha) \cdot \text{P} \cdot \text{R} \cdot \frac{\text{P} + \beta \cdot \text{R}}{\text{P} + \text{R}} \tag{14}
\end{equation*}
where P is precision, R is recall, and $\alpha$ and $\beta$ are tunable parameters.

CIDEr-D: Vedantam et al. [40] computed the consensus between the generated caption and the reference captions based on TF-IDF weighted n-grams. It emphasizes capturing diverse and descriptive phrases in the generated caption, giving higher scores for more informative and varied descriptions
\begin{equation*}
\text{CIDEr-D} = \text{TF-IDF}(\text{gen}, \text{ref}) \tag{15}
\end{equation*}
where TF-IDF is the term frequency-inverse document frequency weighted similarity between the generated caption and the reference captions.
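As a concrete illustration of (12), the toy function below computes BLEU-N for a single hypothesis against a single reference; real evaluation should use standard toolkits, which also handle multiple references and precision clipping in full.

```python
import math
from collections import Counter

def bleu_n(hypothesis: str, reference: str, n_max: int = 4) -> float:
    """Toy BLEU-N per (12): geometric mean of n-gram precisions times the brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, n_max + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())        # clipped n-gram matches
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / n_max)

print(round(bleu_n("a road is built in the forest", "a road is built in the woods"), 3))
```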
Experimental Details: The deep learning methods in this study are implemented in the PyTorch framework and executed on a single NVIDIA A5000 GPU with 24 GB of memory. Training uses the Adam optimizer [41] with an initial learning rate of 0.0001 and a weight decay of 0.5. The training regimen spans about 40 epochs with a batch size of 32 for computational efficiency. After each epoch, the model is evaluated on the validation set.
C. Results Analysis
This study presents a comprehensive evaluation of the proposed algorithm on two distinct datasets, LEVIR_CC and Dubai_CC.
Accuracy Analysis: We examine the models' performance in detail, focusing on their accuracy under different metrics and how well they perform on the given datasets.
On the LEVIR_CC dataset, our SFT method demonstrates remarkable performance, outperforming several existing methods such as DUDA, MCCFormers-S, MCCFormers-D, RSICCformer, ATTENTIVE-S (without positional embedding initialization), and ATTENTIVE (with positional embedding initialization), as illustrated in Table I. Specifically, it achieves a BLEU-4 score of 62.87%, underscoring its effectiveness in accurately capturing and describing changes between bitemporal images. Furthermore, our approach surpasses the current state-of-the-art (SOTA) methods on various other evaluation metrics, highlighting its superiority in change captioning tasks and validating our earlier analysis of the proposed algorithm. In the comparison presented in Table II, our SFT method is evaluated against several other methods on the Dubai-CC dataset. While our method performs reasonably well, it ranks second in overall performance. The leading method, ATTENTIVE, achieves the highest scores across all evaluation metrics, indicating a stronger ability to capture the nuances of change between bitemporal images. Nevertheless, our sparse focus method demonstrates competitive performance, closely trailing the leading approach, with solid scores on metrics such as BLEU-1, BLEU-2, ROUGE-L, and CIDEr-D, suggesting its potential in change captioning tasks. Despite securing second position, this performance underscores its efficacy and potential as a robust solution for change detection and description in remote sensing images.
Parameter and Complexity Analysis: This section provides an in-depth analysis of model parameters and computational complexity, elucidating the balance between model size and computational efficiency. By examining these factors, we gain valuable insight into the efficiency and effectiveness of the proposed approach.
Parameter and Inference Analysis: Table III compares different methods based on their total parameters, encoder module parameters, and inference time for images sized 256 × 256 pixels. Sparse Focus is prominently featured in this comparison.
While SparseFocus ranks second in terms of accuracy, it stands out as the top performer in terms of parameter efficiency and computational complexity. For example, the best model from our work, Sparse Focus Full (depicted by the gray line), has only about half the total parameters compared to the highest accuracy model, ATTENTIVE. This advantage is especially pronounced in the image encoder module, where our encoder has over 90% fewer parameters compared to the 648M parameters of the ATTENTIVE model's encoder. This considerable reduction in model size underscores its efficiency in terms of both memory occupation and resource utilization, positioning it as the most lightweight method among the approaches considered, especially given comparable accuracy.
Moreover, SparseFocus exhibits competitive performance in inference time, demonstrating its potential for fast and efficient inference. Although its inference time is marginally longer than the baseline, it still falls within a moderate range, ensuring that it is practical for real-world applications.
In summary, SparseFocus emerges as a promising solution for change detection tasks, excelling in parameter efficiency and computational complexity while maintaining competitive accuracy. Its lightweight nature and reasonable inference time make it a suitable option for applications with strict resource constraints or real-time processing requirements. This combination of efficiency and practicality underscores its potential for broader adoption in environments where hardware limitations and processing speed are critical considerations.
GPU Memory & MACs Analysis: Table IV provides a comparative analysis of various models, focusing on their memory usage and computational efficiency, specifically regarding training GPU memory, inference GPU memory, and multiply-accumulate operations (MACs) for images sized 256 × 256 pixels.
SparseFocus exhibits notable advantages in terms of GPU memory utilization for both training and inference. For example, the best model from our study (the gray line) requires approximately 9 GB for training, whereas the ATTENTIVE model requires more than 12 GB. In addition, the inference GPU memory usage for SparseFocus is around 0.6 GB, significantly lower than other models, which can exceed 1 GB or even 2 GB. This considerable reduction in memory usage demonstrates the efficiency of SparseFocus in leveraging computational resources during both model training and inference.
Furthermore, SparseFocus requires significantly fewer multiply-accumulate operations (MACs), indicating superior computational efficiency during model execution. Its MACs count is only 21.47 G, far less than the RSICCformer and ATTENTIVE models, which require about 150 G and 464 G, respectively. This reduction in computational overhead indicates that SparseFocus achieves comparable or even better performance with substantially lower resource requirements.
Overall, these results suggest that SparseFocus is not only efficient in terms of memory usage but also demonstrates superior computational efficiency. This makes it a highly attractive option for applications where resources are limited, or where computational efficiency is a critical factor. By maintaining high performance with reduced computational overhead, SparseFocus emerges as a compelling choice for a wide range of scenarios, particularly in environments with strict resource constraints or where speed and efficiency are paramount.
D. Ablation Studies
LEVIR_CC Ablation: In our study, we implemented the SparseFocus mechanism with attention kernel length
In our subsequent experiments, we evaluated the performance of the proposed methods, Sparse Focus Full and Sparse Focus Fixed, across LEVIR_CC datasets. In this context,
In summary, the results from these experiments emphasize the importance of carefully considering model architecture, especially when utilizing attention mechanisms. The success of Sparse Focus Full with a single stack suggests that simpler configurations can often yield superior outcomes, while excessive complexity may adversely impact performance. These insights will guide future research into optimal configurations for attention-based models, ensuring a balance between precision and efficiency.
The Table V illustrates the change accuracy of Sparse Focus methods on the LEVIR-CC dataset. Notably, Sparse Focus Full consistently outperforms Sparse Focus Fixed, emphasizing the effectiveness of utilizing the full-length attention mechanism. Specifically, Sparse Focus Full achieves superior change accuracy and total accuracy compared to Sparse Focus Fixed. Overall, Sparse Focus Full demonstrates superior performance in accurately detecting changes while maintaining high classification accuracy.
DUBAI_CC Ablation: The performance evaluation on the DUBAI-CC dataset highlights the effectiveness of Sparse Focus methods across various evaluation metrics, as depicted in Table VI. Specifically, Sparse Focus Full with
Based on the results presented in Table VI, Sparse Focus methods demonstrate superior performance in change accuracy on the dubai-CC dataset. Specifically, Sparse Focus Full(
However, Tables V and VI also reveal that in the Sparse Focus Full series, stacking a single attention layer yields better accuracy in the final captioning result, whereas stacking two layers results in higher performance in image transformation detection. This observation suggests that, despite a consistent decoder, the optimal output of the image encoder does not necessarily contribute to a more accurate captioning by the decoder. There might be several reasons for this inconsistency. First, conventional similarity computation methods, such as cosine similarity, might not effectively integrate the image and text information accurately. This limitation could hinder the decoder's ability to generate captions that accurately reflect the content of the image. Second, the text decoder might require further design considerations to effectively decode the image information output by the image encoder, indicating that a more refined approach in integrating the encoder's output with the decoder's architecture could lead to improved captioning accuracy. Overall, this analysis underscores the need to re-examine the relationship between image encoding and text decoding to ensure that the information flow between these two components is optimized. By addressing these challenges, future work can focus on developing more robust methodologies to achieve a seamless fusion of image and text information, thereby enhancing the accuracy of captioning.
Model Architecture Ablation: We conducted ablation experiments by employing different networks for the extractor, image encoder, and caption decoder components of the model to further validate the effectiveness of the method proposed in this article. As shown in Table VII, when different backbone networks were utilized as the Extractor for feature extraction, a decrease in accuracy was observed. Furthermore, we employed the “vit_base_patch16_224” model as the Image Encoder and made slight adjustments to the ViT model to better align it with our method. As indicated in the fourth row of the table, the accuracy of the experimental results is second only to our proposed method. However, the model's parameter count reached approximately 1.4G, with training GPU memory exceeding 15 GB. In addition, we integrated the Sparse Focus module into the caption generator. The results, as seen in the fourth row of the table, reveal that the accuracy was the lowest, likely due to the sequential regression and prediction of each word during caption generation. If the final decoder is too sparse, it may struggle to accurately predict the next word, thus negatively impacting overall experimental accuracy.
E. Qualitative Visualization
To assess the effectiveness of the change captions produced by our proposed SFT method, we performed a qualitative evaluation by selecting representative scenes from both the Dubai-CC and LEVIR-CC datasets. These scenes are depicted in Figs. 8 and 9. In each figure, TIME1 and TIME2 correspond to remote sensing images captured at different times in the same area. The captions on the right represent the predictions generated by our model, while the captions below labeled as “reference” serve as the ground truth labels.
F. Limitation and Future Work
The research presented in this article points to several directions for future work on change captioning of remote sensing images. First, multimodal fusion should be addressed so that representations of the individual modalities are extracted more accurately and effectively; only when each modality is sufficiently clear and accurate can the modalities be effectively fused and aligned. Second, for RSICC, novel solution architectures beyond the existing three-stage paradigm should be explored. Third, additional multimodal fusion techniques, such as similarity calculation methods, should be investigated to facilitate accurate matching between the image and text modalities.
Conclusion
In summary, our SFT network offers a compelling solution to the task of change captioning in remote sensing imagery. Through rigorous experimentation, we demonstrate its effectiveness in accurately describing changes while significantly reducing computational complexity and parameter count compared to existing methods. Our approach not only advances change captioning technology but also contributes to efficient multimodal modeling, with potential applications in various industries and domains requiring high-dimensional data interaction.