Introduction
Change detection (CD) is the process of identifying differences in the state of an object or phenomenon by observing it at different times [1]. Precise identification of alterations in Earth's surface attributes is essential for comprehending the interplay between humans and natural phenomena [2]. Hence, this task has long attracted the interest of the remote sensing community. CD in remote sensing images plays an important role in updating geographic data, evaluating disasters, predicting disaster development trends, monitoring land use, and other tasks.
According to the type of semantic label information desired in the output change map, CD tasks can be divided into two categories: binary CD and semantic change detection (SCD) [3], [4]. Binary CD can only tell us where changes have taken place. However, in practical applications, we are interested not only in the location of changes but also in the type of changes. To address this deficiency, SCD methods have emerged [5], [6]. SCD further identifies the change category within the changed regions to provide detailed “from-to” change information in practical applications [7], [8]. Therefore, SCD provides a detailed and information-rich perspective for monitoring land cover changes in the remote sensing context.
CD methods can be roughly divided into two categories: traditional and artificial intelligence based [9]. With the increasing application of automatic CD in remote sensing images, the limitations of traditional methods are becoming more obvious. The increase in spatial resolution of remote sensing images has resulted in richer image details. However, traditional methods face limitations in feature extraction, leading to a decrease in the accuracy of CD. Moreover, these methods are also prone to the influence of factors such as seasonal changes and lighting conditions [9]. Recently, deep learning-based (DL-based) methods have aroused the interest of many researchers due to the explosive development of artificial intelligence technology. Owing to their excellent nonlinear feature extraction and learning ability, DL-based methods can better understand complex scenes, and their performance is far superior to that of traditional methods [10], [11]. For DL-based SCD algorithms, leveraging large-volume remote sensing data, especially high-resolution data, to detect finer changes [12] has become an urgent problem that needs to be addressed.
SCD is an intricate task, which includes two subtasks: 1) semantic segmentation, and 2) binary CD. In the existing research, some approaches have tackled semantic segmentation and binary CD separately using independent branches [13], while others have relied on post-classification CD methods [14]. However, the inherent interdependency between these two subtasks has not been fully exploited. Specifically, the information gained from the land cover semantic segmentation tasks can potentially enhance the accuracy of CD [5]. Recognizing this, recent studies have proposed the use of Siamese networks to extract image features and feed them into different task heads in order to improve the accuracy of SCD [15], [16], [17]. However, these networks have yet to fully exploit the inherent correlations between subtasks, leading to inconsistent detection results.
Despite some progress, existing research on SCD methods still faces the following challenges:
Impact of Missing Details: Existing methods may result in false positives and negatives due to the lack of detailed spatial information.
Contradictory Results from Different Subtasks: The relationship between binary CD and semantic segmentation often leads to conflicting outcomes, where detected change areas do not align with segmentation results.
Class Imbalance Challenge: The scarcity of positive samples in datasets containing both changed and unchanged regions hampers the accuracy of semantic segmentation techniques.
To address the deficiencies in SCD research in the field of remote sensing, this article proposes a semantic information collaboration network (SIC-Net) that effectively integrates detailed information and maximizes the synergistic relationship between the two subtasks. Moreover, we exploit a pseudo-label growth algorithm (PLGA) to increase the number of annotated pixels, alleviating class imbalance issues and thereby improving the accuracy of SCD. The major contributions of this article are as follows.
We propose a novel SIC-Net to improve SCD performance of remote sensing images. The network incorporates a dual-branch backbone (DBB), seamlessly integrating spatial details with contextual information, achieving adaptive alignment and significantly enhancing semantic awareness. The experimental results indicate that our SIC-Net achieves state-of-the-art performance on benchmark SCD datasets.
We design a spatial-temporal semantic coordination module (STSCM) in SIC-Net. The STSCM employs an attention mechanism to facilitate information exchange between the two subtasks, thereby harnessing the collaborative potential between them and further enhancing the network's robustness.
We develop a PLGA, which generates high-quality pseudo-labels based on a semantic segmentation network to alleviate the issue of class imbalance between positive and negative samples.
The rest of this article is organized as follows. Section II reviews the related work. Section III introduces our methodology in detail. Section IV describes the datasets and the experimental settings. Section V presents the results of our experiments. The details extraction strategy and model efficiency are discussed in Section VI. Finally, we conclude our work in Section VII.
Related Work
A. Binary Change Detection
In recent years, binary CD techniques have gradually become an important means of acquiring dynamic land cover change information in the field of remote sensing. Traditional methods can be divided into visual analysis, algebra-based methods, transformation-based methods, classification-based methods, advanced models, and other hybrid approaches [18]. Algebra-based methods comprise image differencing, image regression [19], [20], image ratioing [21], and change vector analysis [22]. Transformation-based methods involve principal component analysis [23], Tasseled Cap [24], Gram–Schmidt, and others. Classification-based methods primarily employ post-classification comparison techniques to identify changes.
However, the emergence of deep learning-based CD networks has captured significant attention. These networks exhibit robust feature learning capabilities and flexible model architectures, greatly enhancing the performance of binary CD [25], [26], [27]. In recent years, many binary CD networks based on deep learning have been proposed. Deep learning binary CD architectures can be roughly divided into single-stream networks and double-stream networks.
Single-stream networks typically refer to semantic segmentation networks. They utilize various data fusion techniques to integrate multiple temporal cycles of remote sensing images, generating intermediate data for input [5], [28], [29], [30].
Double-stream networks are commonly composed of two weight-sharing feature extraction streams that directly take bi-temporal images as the input [31]. In recent years, researchers have proposed various innovative network architectures and methods in the field of CD. Daudt et al. [32] introduced a fully convolutional network that includes a part for extracting features from dual-temporal remote sensing images using Siamese networks. Zhang et al. [33] designed a super-pixel sampling network for feature extraction and super-pixel segmentation in dual-temporal images. Additionally, there are methods like the super-resolution-based change detection network method [34] and a local–global pyramid network for building change detection [35]. The dual-task constrained deep Siamese convolutional network [36] and the semantic feature-constrained change detection network [37] both utilize two Siamese networks to constrain the binary CD network. Other innovative approaches include the SNUNet-CD network [38], dual-attention fully convolutional Siamese networks with weighted bilateral edge contrast loss [39], methods combining transformer [40], [41], [42], [43], [44], [45], and the SAM-CD method [46] based on the segment anything model [47]. While these methods have shown gradual improvements in tasks such as identifying change areas, addressing pseudo changes, small target CD, and change area boundary recognition, traditional binary CD methods often provide insufficient information. Using these methods to accurately identify change types still requires further improvement in practical applications.
B. Semantic Change Detection
SCD stands as an informative pixel-level CD method that concurrently identifies change regions in dual-temporal images and their corresponding land cover categories. It offers semantic information overlooked by binary CD methods and finds widespread applications in diverse domains, including urban planning, farmland conversion, and disaster monitoring.
In the realm of early SCD methods, several approaches have been explored. These encompass the most intuitive post-classification change detection [7], [48], as well as direct classification methods, among others. Due to the application and advancement of deep learning in remote sensing research, tasks related to SCD based on deep learning have been gradually investigated.
Existing deep learning methods for SCD can be categorized based on the number of encoders into three types: single-encoder methods, Siamese-encoder methods, and triple-encoder methods.
Single-encoder methods: These methods treat SCD as a multiclass semantic segmentation task. In such approaches, a single encoder is employed to extract features from fused imagery of two different periods, and the network is trained to classify semantic changes. HRSCD-str.2 [5] is a notable example.
Siamese-encoder methods: Siamese-encoder methods encompass three variations. The first variation involves Siamese encoders combined with a single decoder, treating SCD as a classification-after-change task. The second variation utilizes Siamese encoders with dual decoders. For example, Peng et al. [50], based on the Siamese U-Net network, introduced an SCD method. They further incorporated metric learning and deep supervision strategies to enhance the network's performance. Xia et al. [14] proposed a deep Siamese classification-after-fusion network.
The third and widely adopted variation employs Siamese encoders with three decoders. This architecture is designed to better capture the spatial and temporal features in the data, making it suitable for complex SCD tasks. This work highlighted the synergy between binary CD and semantic segmentation tasks, showcasing the potential of multitask deep learning in SCD. Therefore, most of the existing network structures for SCD use Siamese networks for feature extraction and then employ different heads to handle subtasks, addressing the neglect of temporal correlation in previous methods [5]. For instance, Zhao et al. [4] proposed a spatially and semantically enhanced Siamese network, which aggregates the rich spatial and semantic information in remote sensing images through a designed spatial and semantic feature aggregation module. Zhu et al. [49] proposed the Siamese global learning framework, which alleviates the issue of class imbalance by improving the sampling mechanism. Zheng et al. [16] proposed a multitask architecture named ChangeMask, which decouples SCD into temporal-wise semantic segmentation and binary CD, and designs a temporal-symmetric transformer to guarantee temporal symmetry. Chen et al. [51] proposed a feature constraint change detection network and showed that bi-temporal semantic segmentation branches can improve the precision of the CD task. Ding et al. [15] compared several base architectures for SCD and proposed a bi-temporal semantic reasoning network (Bi-SRNet) on this basis. The authors of [52] introduced MTSCD-Net, which simultaneously extracts multiscale features from dual-temporal data and leverages the spatial attention weight map from the binary CD subtask to enhance the semantic segmentation subtask. Jiang et al. [53] developed a temporal-transform network, which captures temporal changes across dimensions through a designed temporal-transform module. Chen et al. [54] proposed MambaSCD based on the Mamba structure, comprehensively learning global spatial context information from input images to achieve spatiotemporal interaction of multitemporal features.
Triple-encoder methods: Triple-encoder methods involve networks with three independent encoders paired with separate decoders. These methods separate the processes of semantic segmentation and binary CD, offering greater flexibility in modeling and training. However, they overlook the temporal correlation between the two time-period images and the intrinsic correlation between the two subtasks. Representative architectures of this kind include HRSCD-str.3 and HRSCD-str.4 as described in [5]. Building upon this foundation, Ding et al. [55] introduced SCanNet, where comprehensive learning of spatiotemporal dependencies is conducted at both encoding and decoding stages. Wang et al. [56] proposed DESNet, which effectively improves the robustness to large-scale changes and the integrity of change objects. Chang et al. [57] proposed JFRNet, which transforms independent learning of multiple tasks into joint refinement of dual temporal features. These approaches significantly improved the performance of the triple-encoder methods.
However, there are still problems in the current research of SCD. On one hand, the existing SCD models pay more attention to the extraction of semantic information and ignore the importance of spatial details. On the other hand, the number of positive samples of land cover category labels in the existing SCD dataset is too small, resulting in poor semantic segmentation accuracy, which is also a common problem in SCD datasets. To alleviate the above issues, this article proposes a SIC-Net method for SCD tasks.
Methodology
In this section, we offer a detailed description of the proposed SIC-Net. The overall architecture is illustrated in Fig. 1(a), consisting of two shared-weight dual-branch backbones and a spatial–temporal semantic coordination module. The integration of two shared-weight DBBs provides the network with robust feature extraction capabilities. These backbones, specifically designed with the detail capture path (DCP) and the semantic context path (SCP), collaboratively capture information at different hierarchical levels within the image. To enhance information exchange and fusion between these two paths, we introduce a detail guidance module (DGM). During the decoding phase, SIC-Net incorporates the STSCM to facilitate information exchange between the two subtasks, thereby enabling collaborative training for both tasks. The inputs of CD_head1 and CD_head2 are derived from different nodes in STSCM's change features, achieving the goal of deep supervision. The output of CD_head1 serves as the final binary CD result. Furthermore, before training, we leverage PLGA to predict semantic class labels for unchanged regions, subsequently generating pseudo-labels.
Architecture of the proposed SIC-Net. (a) Overview of the proposed SIC-Net. DGM denotes detail guidance module. SCP denotes semantic context path. DCP denotes detail capture path. DBB refers to a dual-branch backbone composed of DCP and SCP. (b), (c), and (d) respectively denote the structures of the conv block, seg_head1 and seg_head2, and cd_head1 and cd_head2.
A. Dual-Branch Backbones
To alleviate the omission and misclassification of small patches caused by the lack of detailed information, it is crucial to balance the requirements for spatial details and a larger receptive field. Both factors are vital for achieving high segmentation accuracy [58], [59], [60], and they can mitigate the challenge of frequent false positives near changing boundaries [61], [62].
Therefore, building upon the Siamese encoder structure, SIC-Net employs DBB for extracting and aligning spatial details with contextual information. In this architecture, ResNet [63] serves as the SCP for extracting deep contextual information. The DCP architecture designed in this article is depicted in Fig. 1(a). It comprises four layers with spatial resolutions of 1/2, 1/4, 1/4, and 1/4 of the input image, respectively. After the initial extraction of fine-grained spatial details by the first two convolutional layers, the input is fed into the DGM, which adaptively integrates detailed information X1 with contextual features X2, dynamically adjusting the weighting between these two feature types. The fused feature X is then fed into the corresponding layer of the ResNet, encouraging the network to focus on fine-grained details while extracting deep semantic information, thereby reducing the loss of detailed information. Through the DGM, the DCP and the SCP can mutually influence each other, establishing a stronger collaborative relationship within the network.
The designed DGM is illustrated in Fig. 2. In the channel dimension, the fine-grained feature X1 and the contextual feature X2 extract global information through adaptive average pooling, followed by concatenation to generate the attention matrix Wc. This weight matrix is then used to adjust the channel-wise feature distribution of X2. In the spatial dimension, the fine-grained feature X1 generates two direction-specific attention maps, namely Wh and Ww, guiding the network to concentrate more on learning the target region. Guiding the network's attention in this way enhances the model's perception of details and its ability to identify specific areas, particularly when handling directional features such as edges or textures. X2 passes through a 1×1 convolutional layer that adjusts its channel dimension to match that of X1
\begin{equation*}
{{X^{\prime}}_2} = \text{Conv}_{1 \times 1}\left({{{X}_2}} \right). \tag{1}
\end{equation*}
Subsequently, Wc is applied as channel attention, while Wh and Ww serve as spatial attention features to recover missing fine-grained features, ensuring that the network adequately focuses on spatial details and receives accurate guidance from contextual information
\begin{align*}
{{\bar{X}}_2} =& {{X^{\prime}}_2} * {{W}_c} \tag{2}\\
\tilde{X} =& {{\bar{X}}_2} * {{W}_h} + {{\bar{X}}_2} * {{W}_w} + {{X}_1}. \tag{3}
\end{align*}
Finally, following the residual structure paradigm, the fused features
B. Spatial–Temporal Semantic Coordination Module
To alleviate conflicts between the results of the two subtasks, we introduce an attention-driven STSCM in this article. This module aims to delve into the intrinsic correlations between the two subtasks and leverage this correlation to optimize the overall model performance. By incorporating an attention mechanism, this module can selectively focus on specific regions of interest for the two subtasks, promoting more effective information exchange between them. In this way, the STSCM can fully exploit the potential causal relationships between spatial semantic information and temporal change features, leveraging the collaborative effects of the two subtasks and effectively alleviating conflicts in the results.
As shown in Fig. 3, STSCM concatenates the features f1 and f2, and then generates the temporal feature fcd through four Resblocks (as illustrated in Fig. 4) to accurately model the temporal changes in features. Simultaneously, a convolutional block composed of two residual blocks, a convolutional layer, batch normalization, and ReLU activation is employed to process f1 and f2, resulting in new features f1' and f2'. To quantify the similarity between f1' and f2', we compute the Euclidean distance between them
\begin{equation*}
W\ = \ \text{Sigmoid}\left({\text{dist}\left({{{{f^{\prime}}}_1}, {{{f^{\prime}}}_2}} \right)} \right). \tag{4}
\end{equation*}
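As a rough illustration of the similarity weighting in (4), the sketch below computes a per-pixel Euclidean distance between two feature maps and passes it through a sigmoid. It assumes the distance is taken across the channel dimension at each pixel and represents features as plain nested lists of shape [C][H][W]; the function name `change_attention` is hypothetical and not part of SIC-Net.

```python
import math

def change_attention(f1, f2):
    """Sketch of W = Sigmoid(dist(f1', f2')) for features shaped [C][H][W].

    Assumption: dist is the Euclidean (L2) distance over channels at each pixel.
    """
    channels, height, width = len(f1), len(f1[0]), len(f1[0][0])
    weights = []
    for i in range(height):
        row = []
        for j in range(width):
            # L2 distance between the two channel vectors at pixel (i, j).
            dist = math.sqrt(sum((f1[c][i][j] - f2[c][i][j]) ** 2
                                 for c in range(channels)))
            # Sigmoid squashes the distance into a weight.
            row.append(1.0 / (1.0 + math.exp(-dist)))
        weights.append(row)
    return weights
```

Because the distance is nonnegative, the resulting weights lie in [0.5, 1): identical features give 0.5, and strongly differing features approach 1.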
Structures of the (a) STSCM, (b) SAM, (c) L2 distance. SAM denotes spatial attention modules in [64].
Furthermore, after generating the feature fcd in the binary CD branch, spatial attention is employed to provide prior positional information about changing regions. This information is subsequently propagated to the semantic segmentation branch, aiding in further optimizing semantic segmentation features. This design allows the segmentation network to focus more on the changed regions, significantly enhancing its sensitivity to change information in category recognition tasks. Overall, the STSCM strengthens the network's focus on changed regions, fully leveraging the inherent connection between temporal features and spatial semantics, thereby significantly improving the network's performance in modeling spatiotemporal information.
C. Pseudo-Label Growth Algorithm
The existing binary CD datasets lack sufficient descriptions of land cover categories, making them inadequate for SCD requirements. Additionally, in the available SCD datasets, there is a shortage of positive samples for land cover category labels (SECOND [6]: 19.87%, Landsat-SCD [3]: 18.89%, MSSCD [65]: 2.7%), resulting in poor semantic segmentation accuracy. To address this issue, this study proposes a novel method for generating high-quality pseudo-labels, aiming to significantly enhance the performance of SCD.
Drawing inspiration from [66] and considering our specific requirements, we introduce a PLGA, as illustrated in Fig. 5. We use a small number of land cover categories from existing labels as seed clues, and employ the predicted probability map from the HRNet [67] semantic segmentation network as the judgment basis. Given the complexity of land cover in remote sensing images, we set fairly strict criteria for seed region growth. Specifically, we set the growth threshold (threshold1) to 0.99. This ensures that only regions with a very high likelihood of being representative of the given land cover category are considered for further expansion. To enhance the reliability of the generated labels, an additional constraint is introduced. This constraint is based on the ratio of the maximum probability to the second maximum probability in the probability distribution map derived from the prediction results. We set this ratio as another threshold, denoted as threshold2, with a substantial value of 5. The primary objective of this constraint is to prioritize regions where the predicted land cover category is significantly more dominant than alternative possibilities. By doing so, we aim to minimize potential misclassifications and enhance the overall accuracy of the generated labels. During training, the seed loss is computed from the loss over the labeled regions.
The pseudocode of the PLGA is shown in Algorithm 1.
The pixel annotation counts for each category in the training samples of the SECOND dataset exhibit varying degrees of improvement before and after PLGA, as illustrated in Fig. 6.
Furthermore, we fine-tuned the loss function to incorporate consistency constraints when integrating pseudo-labeled samples into the training process. In this refinement, the loss is calculated only when the land cover types of pseudo-labeled samples (representing unchanged areas) are consistent in both temporal images. This adjustment aims to expand the category annotation pool without compromising accuracy, ultimately preserving the authenticity of the network's performance.
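The consistency constraint described above can be sketched as a mask over the two pseudo-label maps. This is a minimal illustration, assuming pseudo-labels are stored as 2-D integer grids and that unlabeled pixels carry a hypothetical ignore value of -1; neither the function name nor the ignore value comes from the original method.

```python
IGNORE = -1  # hypothetical marker for pixels without a pseudo-label

def consistent_pseudo_labels(pl_t1, pl_t2, ignore=IGNORE):
    """Keep a pseudo-label only where both dates are labeled and agree.

    Pixels that fail the check are set to `ignore`, so a loss that skips
    `ignore` pixels would only see temporally consistent pseudo-labels.
    """
    kept = []
    for row1, row2 in zip(pl_t1, pl_t2):
        kept.append([a if (a != ignore and a == b) else ignore
                     for a, b in zip(row1, row2)])
    return kept
```

For instance, a pixel labeled as class 2 at the first date and class 3 at the second would be dropped from the pseudo-label loss, since an unchanged area should carry the same land cover type at both dates.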
Algorithm 1: PLGA.
Input: Image I, seed clues S, threshold1 th1, threshold2 th2.
Predict the probability map P, P = HRNet(I).
Set predicted probabilities of types not present in the S to 0.
Obtain predicted result R, R = argmax(P).
Filter the set of pixels A that satisfy the conditions based on the predicted probabilities.
\begin{align*}
{{A}_1} =& \left\{ {\left({i,j} \right) \;\middle|\; P\left({i,j} \right) > t{{h}_1}} \right\}\\
{{A}_2} =& \left\{ {\left({i,j} \right) \;\middle|\; \frac{{\text{max}\left({P\left({i,j} \right)} \right)}}{{\text{max}_2\left({P\left({i,j} \right)} \right)}} > t{{h}_2}} \right\}\\
A =& {{A}_1} \cap {{A}_2}
\end{align*}
Neighborhood growth is an iterative process for each class. For the iteration process of class c:
Obtain the predicted region
Identify the categories within the growth range
Iterate over each element
If
Output: Pseudo label PL.
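The confidence filtering step of Algorithm 1 (the sets A1, A2, and A) can be sketched as follows. This is only an illustration of the two threshold tests, not the full PLGA: the neighborhood growth step is not reproduced, the probability map is assumed to be a nested list of shape [C][H][W], and the function name `confident_pixels` is hypothetical.

```python
def confident_pixels(prob, seed_classes, th1=0.99, th2=5.0):
    """Return {(i, j): class} for pixels in A = A1 ∩ A2.

    prob[c][i][j] is the predicted probability of class c at pixel (i, j).
    Classes not present in the seed clues are zeroed before the tests.
    Assumes at least two classes.
    """
    n_cls, height, width = len(prob), len(prob[0]), len(prob[0][0])
    kept = {}
    for i in range(height):
        for j in range(width):
            # Zero out classes that do not appear in the seed clues S.
            p = [prob[c][i][j] if c in seed_classes else 0.0
                 for c in range(n_cls)]
            ranked = sorted(range(n_cls), key=lambda c: p[c], reverse=True)
            best, second = p[ranked[0]], p[ranked[1]]
            in_a1 = best > th1                                  # probability test
            in_a2 = second == 0.0 or best / second > th2        # dominance test
            if in_a1 and in_a2:
                kept[(i, j)] = ranked[0]
    return kept
```

The defaults mirror the thresholds stated in the text (0.99 and 5). A zero runner-up probability is treated here as passing the ratio test, which is a choice made for this sketch.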
D. Loss Function
In order to ensure that each sub-task of SIC-Net receives proper optimization, as illustrated in Fig. 7, we designed a multitask loss function L as follows:
\begin{equation*}
L\ = {{l}_{\text{change}}}\ + {{l}_{\text{seg}}} + {{l}_{\text{uc}}} \tag{5}
\end{equation*}
The
\begin{equation*}
{\ {{l}_{\text{change}}} = \ - {{y}_c}\log \left({{{p}_c}} \right) - \left({1 - {{y}_c}} \right)\log \left({1 - {{p}_c}} \right)} \tag{6}
\end{equation*}
The
\begin{equation*}
{\ {{l}_{\text{seg}}} = \ 0.5*{{l}_{\mathrm{seg1}}} + 0.5*{{l}_{\mathrm{seg2}}}} \tag{7}
\end{equation*}
\begin{align*}
{{l}_{\mathrm{uc1}}} =& - \frac{1}{N}\mathop \sum \limits_{i = 1}^N {{y}_i}\log \left({{{p}_i}} \right) \tag{8}\\
{{l}_{\text{uc}}} =& 0.5\;*\ {{l}_{\mathrm{uc1}}} + \ 0.5{\rm{*}}{{l}_{\mathrm{uc2}}} \tag{9}
\end{align*}
By employing the aforementioned trio of loss functions, we can directly train two semantic segmentation subtasks and binary CD subtasks. This comprehensive approach ensures a robust learning process for all targeted tasks.
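The way the three terms combine in (5), (7), and (9) can be sketched in plain Python. The sketch assumes that the two segmentation terms use the same cross-entropy form as (8) and that pixels without a label are marked with a hypothetical ignore value of -1; which pixels feed each term is left to the caller. All function names are illustrative.

```python
import math

EPS = 1e-12  # numerical guard against log(0)

def binary_cross_entropy(probs, targets):
    """Mean of the per-pixel binary cross-entropy in (6)."""
    terms = [-(y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS))
             for p, y in zip(probs, targets)]
    return sum(terms) / len(terms)

def cross_entropy(class_probs, labels, ignore=-1):
    """Form of (8): mean of -log p(true class) over labeled pixels."""
    terms = [-math.log(p[y] + EPS)
             for p, y in zip(class_probs, labels) if y != ignore]
    return sum(terms) / len(terms) if terms else 0.0

def total_loss(cd, seg1, seg2, uc1, uc2):
    """Each argument is a (predictions, targets) pair for one loss term."""
    l_change = binary_cross_entropy(*cd)
    l_seg = 0.5 * cross_entropy(*seg1) + 0.5 * cross_entropy(*seg2)  # (7)
    l_uc = 0.5 * cross_entropy(*uc1) + 0.5 * cross_entropy(*uc2)     # (9)
    return l_change + l_seg + l_uc                                   # (5)
```

In a real training loop these terms would be computed on tensors by the deep learning framework; the list-based version only makes the weighting explicit.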
Dataset Description and Experiment Setting
A. Datasets
1) SECOND Dataset
The SECOND dataset is an SCD dataset [6] that collects 4662 pairs of aerial images from several platforms and sensors. Each image has a size of 512 × 512, contains RGB channels, and is annotated at the pixel level in changed regions. The spatial resolution varies from 0.5 m to 3 m (per pixel). The changed regions account for 19.87% of the total image. The land cover categories of change regions in the previous and subsequent images are provided, including no-change, nonvegetated ground surface, tree, low vegetation, water, buildings, and playgrounds.
Among the 4662 pairs of temporal images, 2968 are openly available. There is no standard split for this dataset, so we randomly split it into a training set and a testing set with the ratio train:test = 4:1.
2) Landsat-SCD Dataset
The Landsat-SCD dataset is a recently published, well-annotated dataset for SCD [3]. Its images were collected from Landsat-like images captured between 1990 and 2020 in Tumushuke (39°39′N – 40°4′N, 78°53′E – 79°19′E), Xinjiang. The dataset consists of 8468 image pairs; each image has a size of 416 × 416 and contains RGB channels. The spatial resolution is 30 m per pixel. The changed regions account for 18.89% of the total image. The Landsat-SCD dataset provides 10 change types, each of which is a separate class representing land-cover transitions. To align with the SECOND dataset, we establish land cover categories of change regions in the previous and subsequent images. These categories encompass: no change, farmland, desert, building, and water. We adhere to the data split proposed by the authors [3], with 6053 randomly sampled pairs allocated for training, 1729 pairs for validation, and 686 pairs for testing.
B. Evaluation Metrics
In this work, we use three types of evaluation metrics to evaluate the accuracy of the total task of SCD and two subtasks (binary CD and semantic segmentation). These include binary CD metrics: mean intersection over union (
\begin{align*}
\ \text{Io}{{\mathrm{U}}_1} =& {{q}_{00}}\ /\left({\mathop \sum \limits_{i= 0}^N {{q}_{i0}} + \mathop \sum \limits_{j= 0}^N {{q}_{0j}} - {{q}_{00}}} \right) \tag{10}\\
\ \text{Io}{{\mathrm{U}}_2} =& \mathop \sum \limits_{i = 1}^N \mathop \sum \limits_{j = 1}^N {{q}_{ij}}\ /\left({\mathop \sum \limits_{i= 0}^N \mathop \sum \limits_{j= 0}^N {{q}_{ij}} - {{q}_{00}}} \right) \tag{11}\\
{\rm{mIoU\ = }}&\left({\text{Io}{{\mathrm{U}}_{1}}{\rm{ + Io}}{{\mathrm{U}}_{2}}} \right){\rm{\ /2}} \tag{12}\\
R =& \mathop \sum \limits_{i = 1}^N \mathop \sum \limits_{j = 1}^N {{q}_{ij}}\ /\mathop \sum \limits_{i= 0}^N \mathop \sum \limits_{j= 1}^N {{q}_{ij}} \tag{13}\\
P =& \mathop \sum \limits_{i = 1}^N \mathop \sum \limits_{j = 1}^N {{q}_{ij}}\ /\mathop \sum \limits_{i= 1}^N \mathop \sum \limits_{j= 0}^N {{q}_{ij}} \tag{14}\\
F1 =& \frac{{2*\text{P}*\text{R}}}{{{\rm{P + R}}}}\ . \tag{15}
\end{align*}
\begin{align*}
{{p}_0} =& \mathop \sum \limits_{i = 1}^N {{q}_{ii}}\ /\mathop \sum \limits_{i = 1}^N \mathop \sum \limits_{j = 1}^N {{q}_{ij}} \tag{16}\\
{{p}_e} =& \mathop \sum \limits_{i = 1}^N \left({\mathop \sum \limits_{j = 1}^N {{q}_{ij}}*\mathop \sum \limits_{j = 1}^N {{q}_{ji}}} \right)\ /{{\left({\mathop \sum \limits_{i = 1}^N \mathop \sum \limits_{j = 1}^N {{q}_{ij}}} \right)}^2} \tag{17}\\
\text{Kappa} =& \frac{{{{p}_0} - {{p}_e}}}{{1 - {{p}_e}}}\ . \tag{18}
\end{align*}
The
\begin{align*}
\rho =& \mathop \sum \limits_{i = 1}^N {{q}_{ii}}\ /\left({\mathop \sum \limits_{i = 0}^N \mathop \sum \limits_{j = 0}^N {{q}_{ij}} - {{q}_{00}}} \right) \tag{19}\\
\eta =& \left(\mathop \sum \limits_{i = 1}^N \left({\mathop \sum \limits_{j = 0}^N {{q}_{ij}}*\mathop \sum \limits_{j = 0}^N {{q}_{ji}}} \right) \right.\\ &\left.+ \mathop \sum \limits_{j = 1}^N {{q}_{0j}}*\mathop \sum \limits_{j = 1}^N {{q}_{j0}} \right)\ /{{\left({\mathop \sum \limits_{i = 0}^N \mathop \sum \limits_{j = 0}^N {{q}_{ij}} - {{q}_{00}}} \right)}^2} \tag{20}\\
\text{SeK} = &{{e}^{\text{Io}{{\mathrm{U}}_2} - 1}}\ *\left({\rho - \eta } \right)/\left({1 - \eta } \right). \tag{21}
\end{align*}
\begin{align*}
{{P}_{\text{scd}}} =& \mathop \sum \limits_{i = 1}^N {{q}_{ii}}\ /\mathop \sum \limits_{i = 1}^N \mathop \sum \limits_{j = 0}^N {{q}_{ij}} \tag{22}\\
{{R}_{\text{scd}}} =& \mathop \sum \limits_{i = 1}^N {{q}_{ii}}\ /\mathop \sum \limits_{i = 0}^N \mathop \sum \limits_{j = 1}^N {{q}_{ij}} \tag{23}\\
{{F}_{\text{scd}}} =& \frac{{2*{{P}_{\text{scd}}}*{{R}_{\text{scd}}}}}{{{{P}_{\text{scd}}} + {{R}_{\text{scd}}}}}\ . \tag{24}
\end{align*}
Through the above three types of evaluation metrics, the accuracy of SCD tasks can be comprehensively evaluated.
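To make the metric definitions concrete, the sketch below evaluates mIoU, F1, SeK, and Fscd from a small confusion matrix. It assumes that q[i][j] counts pixels predicted as class i whose reference class is j, with class 0 denoting no-change, and it follows the SeK convention commonly used with the SECOND benchmark (q00 excluded from the total, and the changed-class IoU in the exponent). The function name `scd_metrics` is hypothetical, and no guard against empty classes is included.

```python
import math

def scd_metrics(q):
    """Compute mIoU, F1, SeK and Fscd from a confusion matrix q.

    Assumption: q[i][j] = pixels predicted as class i with reference class j,
    and class 0 is the no-change class.
    """
    n = len(q)
    total = sum(map(sum, q))
    pred = [sum(q[i]) for i in range(n)]                       # row sums
    ref = [sum(q[i][j] for i in range(n)) for j in range(n)]   # column sums
    q00 = q[0][0]
    changed = total - pred[0] - ref[0] + q00   # predicted changed and truly changed
    diag = sum(q[i][i] for i in range(1, n))   # changed pixels with the correct class

    # Binary CD metrics.
    iou1 = q00 / (pred[0] + ref[0] - q00)
    iou2 = changed / (total - q00)
    recall = changed / (total - ref[0])
    precision = changed / (total - pred[0])
    f1 = 2 * precision * recall / (precision + recall)

    # Separated kappa: q00 is removed before computing agreement.
    n_hat = total - q00
    rho = diag / n_hat
    eta = (sum(pred[i] * ref[i] for i in range(1, n))
           + (pred[0] - q00) * (ref[0] - q00)) / n_hat ** 2
    sek = math.exp(iou2 - 1) * (rho - eta) / (1 - eta)

    # SCD precision, recall and F-score over the changed classes.
    p_scd = diag / (total - pred[0])
    r_scd = diag / (total - ref[0])
    f_scd = 2 * p_scd * r_scd / (p_scd + r_scd)
    return {"mIoU": (iou1 + iou2) / 2, "F1": f1, "SeK": sek, "Fscd": f_scd}
```

For the three-class matrix [[50, 5, 5], [4, 20, 2], [6, 3, 5]], this gives IoU1 = 50/70, IoU2 = 0.6, F1 = 0.75, and Fscd = 0.625.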
Finally, to comprehensively evaluate the computational efficiency of the model, we adopted two key metrics: parameter count (Params) and floating-point operations (FLOPs). FLOPs represent the number of floating-point operations required for the model to perform a single forward pass. A higher FLOPs value and a larger Params indicate a more complex model that requires more computational resources for inference. To compute these two metrics, we provide two input images of size 1×3×512×512 each, taken at two different times.
C. Experimental Settings
The experiments in this paper were run on a desktop workstation equipped with an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. All programs are implemented on the PyTorch platform.
During data preprocessing and augmentation for each dataset, we applied normalization to images and employed random flipping and rotating techniques to process both image and label data. For the proposed SIC-Net, consistent experimental parameters are employed on different datasets, including batch size = 8, running epochs = 50, and initial learning rate = 0.1. Additionally, we adopted the stochastic gradient descent method to optimize the weights.
D. Comparative Methods
To comprehensively evaluate the performance of the proposed SIC-Net, we further compare it with several state-of-the-art methods in SCD tasks. The compared methods include the following.
The SSCD-l [15]: This network uses two Siamese ResNets to extract semantic information, which is then fed into the binary CD head and two semantic segmentation heads.
The L-UNet [68]: This method is a UNet-like network, which can simultaneously handle binary CD and semantic segmentation by using fully convolutional long short-term memory networks.
The HRNet: This is the champion scheme of the SCD competition hosted by SenseTime in 2020. The network uses a Siamese HRNet as the backbone to extract multiscale features. Then, two semantic segmentation heads and one detection head are used. Its solution is open source.1
The SCDNet [50]: This method is based on a Siamese U-Net architecture, utilizing an attention mechanism and deep supervision strategy to improve performance.
The Bi-SRNet [15]: This method improves on the SSCD-l network by using two types of semantic reasoning blocks to reason about both single-temporal and cross-temporal semantic correlations.
The MTSCD-Net [52]: This method employs a Siamese encoder based on the Swin transformer along with a feature aggregation module to extract multiscale features. Subsequently, it explores the correlations between subtasks through designed modules.
MambaSCD [54]: This method, based on the Mamba architecture, aims to learn global spatial context information from input images and to learn spatiotemporal features through a mechanism for modeling spatiotemporal relationships.
SCanNet [55]: This network proposes a semantic change Transformer (SCanFormer) to explicitly model the change information between dual-temporal images, while utilizing dual-temporal consistency as additional supervision to guide the learning of semantic changes.
DEFO-MTLSCD [17]: This method facilitates feature interaction between the binary CD and semantic segmentation subtasks through the design of two modules, thus generating more representative encoding features.
Results
To comprehensively evaluate the performance of the proposed SIC-Net, we conducted ablation experiments on the SECOND dataset. This aimed to scrutinize the influence of each component of the model on the overall performance. Additionally, we compared SIC-Net with other state-of-the-art methods on two distinct datasets, thereby validating its superiority.
A. Ablation Study
To investigate the impact of DCP, STSCM, and PLGA on the performance of our proposed SIC-Net, we conducted a series of ablation experiments. These experiments encompassed the baseline, baseline-PLGA, baseline-PLGA-DCP, baseline-PLGA-STSCM, and SIC-Net, and aimed to clarify the function of each component and its contribution to overall performance. The quantitative results are shown in Table I.
First, we test the effectiveness of the PLGA by using it to train the baseline. The PLGA significantly improves the performance of the baseline, increasing the accuracy by around 0.78% in SeK, 0.97% in Fscd, and 1.07% in Kappa. This result indicates that training with pseudo-labels generated by PLGA enhances the network's capability in land cover classification. The improvement in land cover classification accuracy also helps the binary CD task, with an increase of approximately 0.38% in F1. Therefore, it contributes to enhancing the overall accuracy of SCD.
Building upon the baseline-PLGA, we further validate the effectiveness of DCP and STSCM. With the introduction of DCP, we observed an increase of 0.70% in SeK, 0.68% in Fscd, and 0.88% in Kappa coefficient. Following the incorporation of STSCM, we saw an improvement of 1.04% in SeK, 1.07% in Fscd, and 0.66% in F1. The results indicate that both detailed information and collaborative training contribute to improving the accuracy of the SCD task. Finally, we evaluated the SIC-Net incorporating all these auxiliary designs. Compared to the baseline, the improvements were approximately 2.32% in SeK and 2.41% in Fscd. In summary, the integration of DCP, STSCM, and other auxiliary designs significantly improves the performance of SIC-Net, demonstrating its effectiveness in SCD tasks.
We present several inference results on the SECOND dataset in Fig. 8. It can be observed that there are a considerable number of misclassifications in the baseline results. However, training with pseudo-labels generated by PLGA significantly alleviates this issue. With the inclusion of DCP, the results of Baseline-PLGA-DCP come closer to the GT in identifying change regions (indicated by red dashed boxes in Fig. 8), with more accurate boundaries. The incorporation of STSCM and DCP further improves the prediction of land cover categories (yellow dashed box in Fig. 8). Furthermore, it effectively alleviates the incomplete detection of changed target areas [as shown in Fig. 8(c1), (c2)]. In summary, compared to the baseline method, by progressively integrating different design components, the changed regions and land cover categories determined by SIC-Net are much closer to the GT (as shown by the blue dashed box in Fig. 8).
To further understand the impact of spatial details on the SCD task, we selected two pairs of images as test samples and input them into the network. We then performed a visualization analysis of the three branches' features [i.e., f1, f2, and fcd in Fig. 3(a)] before and after adding DCP. The features were visualized by calculating the mean value of the feature maps, and the results are shown in Fig. 9. From the visual results, it is evident that with the introduction of DCP, the features of the semantic segmentation branch become more complete, and the contours are clearer (black dashed box in Fig. 9). Additionally, the features of the binary CD branch exhibit significant differences. After integrating the detailed information, the boundaries of the change areas become more accurate (red dashed box in Fig. 9). This demonstrates the importance of detailed information in the SCD task, significantly influencing the accurate detection of both subtasks.
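The visualization described above (averaging a feature tensor over its channels) can be sketched as follows. The feature is assumed to be stored as a C×H×W nested list, and the final min-max rescaling to [0, 1] is an added assumption made only so that the map can be rendered as an image; the function name `channel_mean_map` is hypothetical.

```python
def channel_mean_map(feat):
    """Collapse a C x H x W feature tensor into one H x W heat map.

    Each output pixel is the mean over channels, rescaled to [0, 1]
    (the rescaling is an assumption, used only for display).
    """
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    mean = [
        [sum(feat[k][y][x] for k in range(c)) / c for x in range(w)]
        for y in range(h)
    ]
    lo = min(min(row) for row in mean)
    hi = max(max(row) for row in mean)
    if hi == lo:  # constant map: nothing to contrast
        return [[0.0] * w for _ in range(h)]
    return [[(v - lo) / (hi - lo) for v in row] for row in mean]


# Toy feature with C = 2 channels and a 2 x 2 spatial grid.
toy_feature = [
    [[0, 2], [4, 6]],
    [[2, 4], [6, 8]],
]
heat_map = channel_mean_map(toy_feature)
```

Applying the same function to f1, f2, and fcd before and after adding DCP would give maps that can be compared side by side.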
B. SECOND Dataset
In this section, we present the comparison results of SIC-Net and other SCD methods on the SECOND dataset. The quantitative results are reported in Table II. L-UNet was originally designed for binary CD and exhibits inferior performance in SCD tasks; its Kappa is noticeably lower than that of the other methods. In terms of quantitative results, recently published methods such as SCanNet, DEFO-MTLSCD, and our proposed SIC-Net all demonstrate significant advantages. It is worth mentioning that SIC-Net achieves the highest SCD accuracy (SeK: 23.96%, Fscd: 63.26%). Compared with the second-best DEFO-MTLSCD, the SeK metric improves by nearly 0.5% and semantic segmentation accuracy (Kappa) increases by nearly 1.23%, but the improvement in binary CD accuracy (F1, mIoU) is relatively low. Compared with the MambaSCD method based on the Mamba structure, SIC-Net improves SeK by 1.79%, F1 by 1.56%, and Kappa by 0.46%. This result indicates that SIC-Net balances the two subtasks well, resulting in higher overall performance.
Fig. 10 illustrates the comparison of prediction results between SIC-Net and other methods on the SECOND dataset. The first two columns of the figure show the dual-temporal images and their corresponding ground truth land cover labels. In the comparison, HRNet, MTSCD-Net, MambaSCD, and our SIC-Net demonstrate more comprehensive and accurate recognition capabilities, especially in identifying the “playground” class, which has fewer training samples, as indicated by the yellow dashed boxes in the figure. In contrast, other methods exhibit more noticeable misclassifications when facing such uncommon classes. Despite integrating spatial details in their unique ways, SCDNet and HRNet demonstrate instability across multiple scenarios, as depicted by the green dashed boxes, where they fail to precisely detect subtle changes. In this regard, our SIC-Net outperforms the DEFO-MTLSCD method. Moreover, although DEFO-MTLSCD shows certain capabilities in change detection, it produces inconsistent results in some cases, where a region is reported as changed even though the semantic segmentation results of the two periods are identical, as shown in Fig. 10(b) and (c). This further demonstrates the superior performance of SIC-Net in recognizing land cover categories and maintaining consistency in subtask results compared to the second-best method DEFO-MTLSCD. Furthermore, compared to SCanNet and MambaSCD, SIC-Net exhibits fewer false detections and provides more complete representations of land objects in its prediction results. This demonstrates the advantages of our method in identifying small target changes and variations in land cover categories. These results not only highlight the effectiveness of our approach in handling SCD tasks but also underscore the benefits of integrating contextual information with detailed information and coordinating training among subtasks.
C. Landsat-SCD Dataset
In this section, we compared the accuracy of SIC-Net and other SCD methods on the Landsat-SCD dataset, which consists of medium-resolution images. This evaluation validates the performance of our network and its compatibility with images of different resolutions. Additionally, to ensure fairness in the comparative experiments and solely validate the superiority of our proposed network architecture, we did not employ the PLGA in this scenario.
We use the same comparison methods as for the SECOND dataset. As reflected in Table III, SIC-Net has achieved the best accuracy on the test set of Landsat-SCD, with an SeK of 61.29% and Fscd of 86.18%. Compared to the results on the SECOND dataset, the accuracy of detection results for various methods on the Landsat-SCD dataset is significantly higher. Bi-SRNet and MTSCD-Net, which performed well on the previous dataset, lag far behind our method, with lower binary CD accuracy (F1) at 85.98% and lower semantic segmentation accuracy (Kappa) at 86.37%. Although SCDNet achieves a relatively high Kappa coefficient, its binary CD accuracy is significantly lower than that of all other comparison methods when faced with medium-resolution images. Compared to these, DEFO-MTLSCD, MambaSCD, SCanNet, and SIC-Net demonstrate notably higher accuracy. Notably, since the Kappa in this paper evaluates the classification accuracy of the overlapping area between the predicted and actual change regions, a considerably higher F1 score results in more pixels being involved in the Kappa calculation. This partly explains why SIC-Net slightly lags behind SCanNet and DEFO-MTLSCD in terms of semantic segmentation accuracy (Kappa). However, its significant advantage in binary CD accuracy, with an F1 score of 92.48%, leads to a notable overall accuracy improvement, with the SeK metric increasing by approximately 3%. The experimental results further demonstrate that SIC-Net, through the design of STSCM, effectively facilitates information exchange between the two subtasks, achieving a good balance and improving SCD accuracy. Additionally, the results indicate the robustness of our method in handling images of different resolutions.
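The remark above about Kappa being computed only on the overlap between predicted and actual change regions can be illustrated with a small sketch. It computes Cohen's kappa over the pixels that are labeled as changed in both the prediction and the reference, so a method that detects more of the true change region contributes more (and often harder) pixels to the score. The flattened 1-D label lists, the use of label 0 for "no change", and the function name `overlap_kappa` are assumptions made for illustration; the exact definition used in this paper is given with the evaluation metrics.

```python
from collections import Counter


def overlap_kappa(pred, ref, no_change=0):
    """Cohen's kappa restricted to pixels changed in both pred and ref."""
    pairs = [(p, r) for p, r in zip(pred, ref)
             if p != no_change and r != no_change]
    n = len(pairs)
    if n == 0:
        return 0.0
    # Observed agreement on the overlapping change region.
    p_o = sum(p == r for p, r in pairs) / n
    # Chance agreement from the two marginal class distributions.
    pred_counts = Counter(p for p, _ in pairs)
    ref_counts = Counter(r for _, r in pairs)
    p_e = sum(pred_counts[c] * ref_counts[c] for c in pred_counts) / (n * n)
    if p_e == 1.0:  # a single class everywhere: agreement is perfect
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: 8 pixels, classes 1 and 2 are change classes.
pred_labels = [0, 1, 1, 2, 2, 1, 0, 2]
ref_labels = [1, 1, 1, 2, 1, 0, 0, 2]
kappa = overlap_kappa(pred_labels, ref_labels)
```

Here only five of the eight pixels enter the calculation; missed or falsely detected change pixels are ignored by this score and are instead penalized by F1.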
Fig. 11 visually displays the SCD results of SIC-Net and the comparison methods on the Landsat-SCD dataset. Different scenarios of the Landsat-SCD dataset are selected. It can be observed that the performance of most methods is quite satisfactory, with no significant differences. However, due to the limitations of medium-resolution images, and unlike in the SECOND dataset, the GT labels contain numerous linear features (one pixel wide). Most comparative methods struggle to accurately predict such features. As shown in Fig. 11, SIC-Net outperforms other methods in many aspects, while the results of SSCD-l, SCDNet, and L-UNet are notably inferior to the rest. Notably, SIC-Net excels in accurately identifying fine-scale features, as highlighted by the red dashed box in Fig. 11, and can more comprehensively detect such small targets, which are easily overlooked by other methods. Additionally, SIC-Net demonstrates outstanding performance in accurately identifying change boundaries (as shown by the black dashed box in Fig. 11), while also outperforming comparative methods in identifying land cover categories (as indicated by the purple dashed box in the same figure). In the case of Fig. 11(a1) and (a2), other methods either produce false positives or exhibit partial omissions, resulting in incomplete target shapes, whereas SIC-Net demonstrates good stability under such circumstances. Compared with the second-best DEFO-MTLSCD, the prediction results show that SIC-Net significantly alleviates the misidentification of land cover types (as indicated by the blue dashed box in Fig. 11). Experimental results demonstrate that our method performs better in identifying the types and extent of difficult samples in complex environments. Although inaccurate identification of change regions still exists, achieving precise recognition in such scenarios remains challenging even for humans.
Discussion
A. Comparison of Detail Extraction Strategies
To further validate the capability of the DBB, composed of DCP and SCP, in capturing detailed information, we conducted a series of comparative experiments based on the SSCD-l architecture. We selected HRNet and U-Net as comparison objects, as both are renowned for capturing complex features. To ensure the fairness of the experiments, similar to other methods, U-Net initially downsampled the input image to 1/4 size through convolution and pooling. Through this series of comparative experiments, we aimed to elucidate the differences between DCP, multiscale learning, and skip connections, and to more accurately assess the ability of DBB in capturing image details.
As illustrated in Table IV, the experimental results unequivocally indicate that SSCD-l-DCP achieves the highest accuracy in both binary CD (F1) and semantic segmentation (Kappa). This not only highlights the applicability of features extracted through the proposed DBB in the SCD task but also corroborates the effectiveness of integrating detailed information extracted through DCP with contextual features in enhancing the accuracy of both subtasks.
In Fig. 12, we randomly selected images from the SECOND validation set, showcasing features obtained using different backbone networks. The visual results indicate significant differences in features obtained with the addition of DCP, further confirming its capability to improve feature extraction. Comparatively, HRNet exhibits more noise points in its features. U-Net demonstrates lower stability and offers poorer feature quality in specific scenarios (first row of Fig. 12). Our method combines detailed information with contextual features, resulting in clearer object outlines extracted by SSCD-l-DCP. The results indicate that the designed DBB composed of SCP and DCP achieves a good balance between detailed information and contextual features, demonstrating the superiority and wide applicability of DCP in the SCD task.
B. Model Efficiency
To comprehensively assess the performance of our proposed algorithm, we conducted a detailed comparison in terms of computational complexity. Table V presents a comparison of SIC-Net with several benchmark methods regarding computational complexity. This comparison focuses on two metrics: Params and FLOPs.
Although the DBB structure used in this article inevitably increases the number of network parameters, the parameter count of SIC-Net is still lower than that of methods such as MTSCD-Net and MambaSCD. Moreover, compared to advanced methods such as DEFO-MTLSCD and MTSCD-Net, SIC-Net also exhibits certain advantages in terms of FLOPs, indicating its relatively lower computational complexity. This result demonstrates that the proposed method maintains high accuracy without introducing excessive computational complexity, achieving a good balance between accuracy and efficiency, thus showing significant performance advantages overall.
Conclusion
In this article, we propose a novel SIC-Net for SCD of remote sensing images. By deeply integrating the DBB with the STSCM, the network significantly enhances the model's capability in fine-grained feature extraction, promotes synergistic learning between the two subtasks, and alleviates issues such as detail loss and result inconsistencies. Furthermore, we introduce the PLGA to address the issue of class imbalance in SCD tasks by increasing the annotated pixel count in the labels.
Experimental results suggest that SIC-Net achieved an improvement of 2.32% in SeK and 2.41% in Fscd over the baseline and obtained the highest SCD accuracy on both datasets. This demonstrates the strong performance of SIC-Net in the SCD of remote sensing images.
Additionally, we believe that DBB and STSCM have not fully exploited the potential of spatial detail and spatial–temporal dependency modeling. Therefore, we encourage exploring different architectures. Simultaneously, considering the complexity of SCD datasets, future research should consider incorporating semi-supervised or unsupervised learning to alleviate the limitations of feature extraction caused by limited data and extreme class imbalances in real-world remote sensing scenarios.