Introduction
Change detection is the process of identifying differences in the state of an object or phenomenon by comparing two images of the same geographical area acquired at different times. It can reveal dynamic changes on the Earth's surface and is one of the most important techniques in remote sensing interpretation [1]. As the Earth's surface is constantly evolving, timely and accurate access to surface changes is important for understanding human activities, ecosystems, and their interactions. Recently, change detection based on VHR remote sensing images has been widely applied to land use analysis [2], disaster monitoring [3], urban environmental investigation [4], etc. In change detection tasks, factors such as anthropogenic behavior, atmospheric conditions, and illumination may lead to falsely detected regions [1], [4], and manual change detection is time-consuming and tedious. Under these circumstances, a large number of change detection approaches for remote sensing images have been proposed in recent years.
Existing change detection methods can be roughly categorized into traditional methods and deep learning-based methods. Traditional change detection methods can be further divided into pixel-based approaches [5], [6], [7] and object-based approaches [8], [9], [10]. Pixel-based approaches usually generate difference images by comparing the spectral or texture information of pixels and obtain results through pixel classification. Compared with pixel-based approaches, object-based approaches work in units of objects and capture image contextual information by processing the homogeneous pixels of the same object. However, these methods usually depend on hand-crafted features and exhibit several bottlenecks [7], [9], [10]. Specifically, it is difficult to design effective feature extraction operators for traditional methods, since remote sensing images are usually more complex than natural images [11]. By contrast, deep learning methods [12], [13], [14], especially convolutional neural networks (CNNs), have been widely used in various fields due to their strong feature discrimination abilities [15], [16], [17]. As a result, a large number of deep learning-based approaches [17], [18], [19], [20] have been reported. Although they achieve better change detection results by employing various improved CNNs, they still face some challenges. On the one hand, the prevailing pooling operation in CNNs makes it difficult to extract detailed features. On the other hand, the large number of parameters in CNNs may cause overfitting and other unpredictable problems [21]. Therefore, current CNN-based change detection methods still have much room for improvement.
The U-Net [22] is a very popular network in medical image segmentation, since it is specially designed for training with small samples. Similar to medical images, remote sensing images are also difficult to acquire and annotate [23], [24]. However, compared with medical images, remote sensing images often involve higher resolution, more complex content, and more severe noise interference. Therefore, it is difficult to obtain good change detection results by applying the U-Net directly to remote sensing images [25]. The U-Net treats all feature maps equally and thus ignores the fact that different feature maps attend to different object regions. To address this problem, various improved U-Nets have been put forward that employ attention mechanisms to improve network performance. The receptive fields of image context at different convolution layers are diverse. However, existing attention mechanisms usually adopt a fixed convolutional kernel scale at different convolutional layers, which is disadvantageous for representing the image details of changed targets. Furthermore, U-Nets utilize skip-connections to fuse low-level details with high-level semantics. Although this improves feature discrimination, the symmetric fusion ignores the association between shallow-layer and deep-layer features. Consequently, many improved networks have been proposed, such as UNet++ [26] and UNet3+ [27]. However, these networks perform the connection using pixel-by-pixel fusion, which ignores the integration of local and global information in an image.
To solve the above problems, a local and global feature learning with kernel scale-adaptive attention network (LGSAA-Net) is proposed for VHR remote sensing change detection in this article. On the one hand, we introduce a scale-adaptive convolution kernel strategy to solve the difficulty of extracting image detail features caused by single-scale convolution kernels. On the other hand, to address the large semantic gap between low-level details and high-level semantics caused by the conventional skip-connection, we adopt the fusion of local and global features. The proposed LGSAA-Net achieves a good overall balance between model complexity and change detection accuracy. The main contributions of this article are summarized as follows.
To boost the feature learning effect on object details of VHR remote sensing images, a scale-adaptive attention (SAA) module is designed according to the change of feature map scales at different layers. The SAA module can establish the internal correlation between feature maps and convolution kernel scales.
To enhance effectively the local and global feature discrimination abilities of the proposed LGSAA-Net, a multilayer perceptron based on patches embedding (MLPPE) module is proposed. The MLPPE module uses a multilayer perceptron (MLP) to facilitate the global association learning of pixels, while employing the attention mechanism to learn the local correlation of different patches.
Related Work
A. Attention Mechanism on Change Detection
The attention mechanism in visual perception refers to the process of selectively concentrating on the most informative features while suppressing useless ones [28]. Previous studies [29], [30] show that the attention mechanism can help CNNs achieve better image classification and semantic segmentation. The most popular attention module is squeeze-and-excitation (SE) [31], which simply squeezes each feature map to model cross-channel relationships and efficiently build interdependencies among channels. To simplify the structure of channel attention, the efficient channel attention network (ECA-Net) [32] adopts a 1-D convolution filter to compute channel weights. However, channel attention only encodes interchannel information and ignores the spatial details of feature maps. To capture the spatial details of feature maps and aggregate image contextual information, the gather-excite network (GENet) [33] and the pointwise spatial attention network (PSA-Net) [34] extend the attention mechanism by adopting different spatial attention schemes or designing advanced attention blocks in the spatial dimension. Moreover, from the perspective of interpretability, a hybrid model combining channel attention and spatial attention is more conducive to improving network performance. Therefore, the bottleneck attention module (BAM) [35], the convolutional block attention module (CBAM) [36], and the global context network (GCNet) [37] refine convolutional features independently in the channel and spatial dimensions by cascading these two attentions. In particular, both BAM and CBAM exploit position information by reducing the channel dimension of the input tensor and then computing spatial attention using convolutions.
Compared with the attention modules mentioned above, the self-attention mechanism can effectively model long-range dependencies by relating different positions within a single data sample. As a result, self-attention [38] has obvious advantages in modeling long-range dependencies and building spatial- or channelwise attention. Owing to this superiority, scholars have proposed many improved attention networks, including the nonlocal neural network (NLNet) [39], the criss-cross attention network (CCNet) [40], the dual attention network (DANet) [41], and the segmentation transformer (SETR) [42]. These networks aim to overcome the limitation of convolutional operators, which only capture local relationships and fail to model long-range dependencies in vision tasks. Unlike the models in [39], [40], [41], the SETR [42] adopts a vision transformer (ViT) [43] encoder and two decoders designed based on progressive upsampling and multilevel feature aggregation. Although the SETR has stronger reasoning and modeling abilities due to self-attention, the parallel computation increases model complexity, and direct upsampling or deconvolution is not conducive to global feature learning.
In recent years, attention mechanisms have also been widely used in change detection tasks [44], [45]. The Siamese CNN (Siam-Net) [44] incorporates the CBAM into a Siamese network to adaptively extract spectral-spatial features from bitemporal remote sensing images. To mitigate the problem of class imbalance in change detection, the dual-task constrained deep Siamese convolutional network [45] constructs dual CBAMs for each bitemporal feature to emphasize change information. The CBAM is also commonly used to refine bitemporal features; for instance, Shi et al. [46] proposed a deeply supervised metric method that utilizes the CBAM to make the deep features of different phases more discriminative. To extend the advantage of self-attention in capturing long-range dependencies to remote sensing change detection, a series of studies have appeared [47], [48], [49]. However, the above studies adopt fixed receptive fields at different layers, which easily leads to poor feature learning on the spatial details of changed targets. To tackle this problem, we propose a scale-adaptive attention (SAA) module in this article. The SAA module establishes the relationship between feature maps and convolution kernel scales and implements an adaptive scale operation on top of channel attention for change detection, thereby achieving better feature learning.
B. Skip-Connection on Change Detection
In image semantic segmentation, feature fusion strategies are used to alleviate the problems of missed details, rough segmentation results, and low precision [50], [51], [52]. To achieve feature fusion, the skip-connection is one of the most important factors behind the success of encoder-decoder networks in image semantic segmentation [53], [54]. The U-Net [22] applies multiple skip-connections to construct a contracting path and a symmetric expanding path. Similar to the U-Net, the SegNet [55] utilizes a small network structure and skip-connections to achieve better visual semantics as well as detailed contextual information. Although skip-connections help the U-Net and the SegNet achieve high segmentation accuracy, the symmetric fusion they employ neglects the association between shallow- and deep-layer features.
In light of the above problem, some improved models that can be considered extensions of the U-Net based on skip-connections have been proposed, such as the UNet++ [26] and the UNet3+ [27]. The UNet++ [26] uses a series of nested convolutional structures before feature fusion to capture contextual information, while the UNet3+ [27] applies full-scale skip-connections to capture fine-grained detail information and coarse-grained semantic information. However, since these networks achieve feature fusion in a pixel-by-pixel manner, they cannot effectively bridge the semantic gap between the feature maps of the encoding and decoding stages. To alleviate this issue, some strategies have been designed and applied to the skip-path to improve network performance, such as the modified U-Net (mU-Net) [56] and the MultiResUNet [57]. They add additional convolution operations before feature fusion, which reduces the difference between the encoder and decoder feature maps and leads to better feature discrimination. In addition, before concatenating the features at each resolution of the encoder with the corresponding features in the decoder, both the attention gate U-Net (Attention U-Net) [58] and the nonlocal U-Net [59] rescale the output features of the encoder by using an attention module. Furthermore, they utilize higher-level semantic features to guide the current features for attention selection, but this kind of strategy does not consider the integration of local and global information in an image.
Due to the simplicity and superior performance of the skip-connection based on the U-shaped structure, popular networks for change detection [25], [60], [61], [62] still depend on the U-shaped architecture. Based on UNet++, Peng et al. [25] emphasized the learning of difference information by using skip-connections inside convolution units. Furthermore, Peng et al. [60] designed an improved UNet++ architecture to integrate low-level details and high-level semantics. In addition, the end-to-end LU-Net [61] was designed to leverage spatial and temporal characteristics simultaneously. Since change detection networks tend to focus on the extraction of semantic information and ignore the importance of shallow features, Fang et al. [62] proposed a densely connected U-Net, which reduces the loss of shallow location information through compact information transmission. The networks mentioned above improve change detection accuracy by fusing low-level details and high-level semantics. However, due to the large semantic gap between high-level and low-level features, the existing skip-connection methods may provide limited feature discrimination. Therefore, to narrow the semantic gap, we further adopt a multilayer perceptron to learn the association of global pixels and the relationship between different patches, thereby exploiting more useful discriminative information.
Methods
An overview of the proposed LGSAA-Net is shown in Fig. 1. First, feature extraction is performed on the VHR remote sensing images in the first encoding branch. Second, the raw difference image, obtained by subtracting the bitemporal images, is fed into the second encoding branch to extract difference information. Third, the result of each feature extraction layer in the second encoding branch is fused with the output of the corresponding layer in the first encoding branch. Fourth, a subtraction operation is performed on the feature maps from the corresponding bitemporal paths. Finally, the fused features are fed into the next encoding layer.
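As a rough illustration of this data flow, the following PyTorch-style sketch assumes shared-weight (Siamese) encoders in the first branch and element-wise addition as the fusion operator; function and argument names are illustrative and do not reflect the released implementation.

```python
import torch

def forward_encoding(x_t1, x_t2, bitemporal_encoders, difference_encoders):
    """Sketch of the two-branch encoding flow described above (names illustrative).

    bitemporal_encoders / difference_encoders: lists of per-layer modules for
    the first (bitemporal) and second (difference) encoding branches.
    """
    d = torch.abs(x_t1 - x_t2)                      # raw difference image for the second branch
    fused_features = []
    for enc_bi, enc_diff in zip(bitemporal_encoders, difference_encoders):
        x_t1, x_t2 = enc_bi(x_t1), enc_bi(x_t2)     # shared-weight (Siamese) feature extraction
        d = enc_diff(d)                             # difference-information extraction
        diff_bi = torch.abs(x_t1 - x_t2)            # subtraction on the bitemporal feature maps
        d = d + diff_bi                             # fuse the two branches (addition assumed)
        fused_features.append(d)                    # kept for the decoder / skip-paths
    return fused_features
```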
Proposed LGSAA-Net for change detection. The backbone of LGSAA-Net is borrowed from the U-Net. Our architecture consists of an encoding stage and a decoding stage. In order to make the network lighter, stacked depthwise separable convolution (Depth-conv) operators, denoted by blue boxes, are employed to extract features in the encoding stage. The two encoding branches convert the bitemporal images and the difference image into feature maps, and the number of channels is denoted on top of each box. For change detection, the SAA module, denoted by a yellow box, is attached to each stacked Depth-conv operator, which achieves better feature discrimination by establishing the internal correlation between feature maps and convolution kernel scales. In addition, the MLPPE module, denoted by a green box, is placed on each skip-path, which facilitates the global association learning of pixels and learns the local correlation of different patches.
In order to refine the target contours, the SAA module is proposed to establish the relationship between feature maps and convolution kernel scales, which is described in detail in Fig. 3. We also present the structure of the MLPPE, as shown in Fig. 5, which learns the local correlation of different patches and facilitates the global association learning of pixels. In general, the proposed LGSAA-Net can effectively improve its feature discrimination abilities and provide excellent change detection results.
Comparison of single-scale and multiscale convolution kernels on multiresolution images. The latter provides better change detection results than the former due to the employment of multiscale convolution kernels.
Comparison of change detection using different networks. The first column: (a) raw difference image and ground truth image. From the second to the fourth column, the top row shows heatmaps and the bottom row shows change detection results from: (b) U-Net; (c) Attention U-Net; and (d) U-Net + multilayer perceptron.
A. The SAA Module for Change Detection
Multiscale Convolution Kernels: The utilization of multiscale information is an important strategy in image segmentation, since multiscale convolutional kernels can learn richer features. Generally, fine-grained sampling can obtain richer detail information, while coarse-grained sampling can extract richer contextual information; the latter helps capture the overall trend of the image content. In addition, the existing spatial attention networks for change detection often utilize convolution kernels of fixed size to harvest the correlation of image spatial position information, which limits the performance of target contour detection. Fig. 2 compares single-scale and multiscale convolution kernels on multiresolution images. It is clear that the latter provides better change detection results than the former due to the employment of multiscale convolution kernels.
Design of SAA Module: In light of the above discussion, the SAA module is designed based on the scale changes of feature maps in the encoding stage. Specifically, let the output of the previous layer of the network be the feature map $\boldsymbol{F}$ with $C$ channels and spatial size $H\times W$. The channel attention is computed as
\begin{align*}
A^{c}=&s\left(\text{conv}1D\left(\text{GAP}\left(\boldsymbol{F}\right),k^{c} \right) \right) \tag{1}
\\
T\left(\cdot \right) =&\text{ROUND}\left|\frac{\log _{\beta }\left(C\right)+\alpha }{2}\right| \tag{2}
\end{align*}
\begin{equation*}
\boldsymbol{F}^{\boldsymbol {A}^{\boldsymbol c}} =\wedge _{mc} \left(\boldsymbol{F}, A^{c}\right) \tag{3}
\end{equation*}
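A minimal PyTorch sketch of the channel part of the SAA module described by (1)-(3) is given below. It reads $\wedge_{mc}$ as channel-wise multiplication and forces the 1-D kernel size to be odd for symmetric padding; the values of alpha and beta are placeholders rather than the settings used in the paper.

```python
import math
import torch
import torch.nn as nn

class ChannelSAA(nn.Module):
    """Channel attention part of the SAA module, following (1)-(3)."""

    def __init__(self, channels, alpha=1.0, beta=2.0):
        super().__init__()
        k_c = round(abs((math.log(channels, beta) + alpha) / 2))   # kernel size from (2)
        k_c = k_c if k_c % 2 == 1 else k_c + 1                     # keep the kernel odd (assumption)
        self.gap = nn.AdaptiveAvgPool2d(1)                         # GAP(F)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k_c, padding=k_c // 2, bias=False)
        self.sigmoid = nn.Sigmoid()                                # s(.)

    def forward(self, f):                               # f: (B, C, H, W)
        a_c = self.gap(f).squeeze(-1).transpose(1, 2)   # (B, 1, C) for the 1-D convolution
        a_c = self.sigmoid(self.conv1d(a_c))            # channel attention A^c, (1)
        a_c = a_c.transpose(1, 2).unsqueeze(-1)         # back to (B, C, 1, 1)
        return f * a_c                                  # channel-wise multiplication, (3)
```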
Furthermore, the average value and maximum value along the channel dimension of the feature maps $\boldsymbol{F}^{\boldsymbol{A}^{\boldsymbol c}}$ are computed and concatenated to generate the spatial attention, as follows:
\begin{align*}
A^{s} =& s\left(\text{conv}\left(\text{concat}\left(\phi _{\text{ave}}\left({\boldsymbol{F}^{\boldsymbol{A}^{\boldsymbol c}}}\right), \phi _{\text{max}}\left({\boldsymbol{F}^{\boldsymbol{A}^{\boldsymbol c}}} \right)\right), k^{s} \right) \right) \tag{4}
\\
k^{s}=&\text{ROUND}\left(\gamma \times \left| \log _{\varepsilon }\left(H\times W\right) \right| \right) \tag{5}
\end{align*}
\begin{equation*}
\boldsymbol{F}^{\boldsymbol{s}}= \wedge _{A}\left(\boldsymbol{F},\wedge _{ms}\left(\boldsymbol{F}^{\boldsymbol{A}^{\boldsymbol c}}, A^{s} \right) \right) \tag{6}
\end{equation*}
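Similarly, the spatial part of the SAA module in (4)-(6) can be sketched as follows, reading $\wedge_{ms}$ as spatial-wise multiplication and $\wedge_{A}$ as element-wise addition (our interpretation of the notation); gamma and epsilon are placeholder values.

```python
import math
import torch
import torch.nn as nn

class SpatialSAA(nn.Module):
    """Spatial attention part of the SAA module, following (4)-(6)."""

    def __init__(self, height, width, gamma=1.0, epsilon=2.0):
        super().__init__()
        k_s = round(gamma * abs(math.log(height * width, epsilon)))  # kernel size from (5)
        k_s = k_s if k_s % 2 == 1 else k_s + 1                       # keep the kernel odd (assumption)
        self.conv = nn.Conv2d(2, 1, kernel_size=k_s, padding=k_s // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f, f_ac):                          # f: input features, f_ac: output of (3)
        avg = torch.mean(f_ac, dim=1, keepdim=True)      # phi_ave, (B, 1, H, W)
        mx, _ = torch.max(f_ac, dim=1, keepdim=True)     # phi_max, (B, 1, H, W)
        a_s = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))   # spatial attention A^s, (4)
        return f + f_ac * a_s                            # (6), residual addition assumed for ^_A
```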
B. MLPPE Module for Change Detection
Multilayer Perceptron: Current networks [60], [61], [62] improve change detection accuracy by fusing low- and high-level features. However, these methods mainly adopt a pixel-by-pixel fusion strategy, which ignores the integration of local and global information. Many networks therefore employ fully connected layers at the middle and high levels to summarize features globally and help the network learn global information effectively. In light of this, we employ fully connected layers in the feature fusion stage for the same purpose. As shown in Fig. 4, compared with the U-Net and the Attention U-Net [58], the network with a multilayer perceptron [63], named U-Net + multilayer perceptron, obtains more intuitive heatmaps by fusing low- and high-level features. At the same time, it provides better detection results than the U-Net and the Attention U-Net, which shows that the multilayer perceptron is helpful for improving change detection results. To further improve network performance, a patch strategy on the original feature maps is utilized in the MLPPE module.
Global Features Based on Multilayer Perceptron: In this section, we present the MLPPE module, as shown in Fig. 5. Let the feature map $\boldsymbol{F}$ be the input of the MLPPE module; it is first flattened and processed by fully connected layers as follows:
\begin{align*}
\boldsymbol{M}&= \mathcal {L} \left(\mathcal {R} \left(\boldsymbol{F}\right) \right) \tag{7}
\\
\boldsymbol{F}^{\boldsymbol{g}}&=\delta \left\lbrace \widehat{\mathcal {R}} \left(\mathcal {L} \left(\sigma \left(\boldsymbol{M}\right) \right) \right) \right\rbrace \tag{8}
\end{align*}
\begin{align*}
\widehat{\boldsymbol{M}_{i,j}}=& \frac{\exp \left(\boldsymbol{M}_{\boldsymbol{i,j}}\right)}{\sum _{k}\exp \left(\boldsymbol{M}_{\boldsymbol{k,j}}\right) } \tag{9}
\\
\widehat{\boldsymbol{M}_{i,j}^{\prime}}=& \frac{\exp \left(\widehat{\boldsymbol{M}_{i,j}}\right)}{\sum _{k}\exp \left(\widehat{\boldsymbol{M}_{i,k}}\right) }. \tag{10}
\end{align*}
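The global branch defined by (7)-(10) can be read as an MLP that mixes the flattened pixels and normalizes the mixing matrix with a double softmax. A minimal sketch under this reading is shown below; the hidden dimension and the activation function are assumptions.

```python
import torch
import torch.nn as nn

class GlobalMLP(nn.Module):
    """Global branch of the MLPPE module, following (7)-(10) (our reading of the notation)."""

    def __init__(self, num_pixels, hidden_dim):
        super().__init__()
        self.fc_in = nn.Linear(num_pixels, hidden_dim)    # L(.) in (7)
        self.fc_out = nn.Linear(hidden_dim, num_pixels)   # L(.) in (8)
        self.act = nn.GELU()                              # delta(.), activation assumed

    def forward(self, f):                                  # f: (B, C, H, W), with H*W == num_pixels
        b, c, h, w = f.shape
        m = self.fc_in(f.reshape(b, c, h * w))             # R(.) then L(.), (7)
        m = torch.softmax(m, dim=1)                        # (9): normalization over the first index
        m = torch.softmax(m, dim=2)                        # (10): normalization over the second index
        f_g = self.act(self.fc_out(m)).reshape(b, c, h, w) # (8), inverse reshape R_hat
        return f_g
```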
Local Features Based on Patch Channel Information: For VHR images containing rich spatial information and complex context, the local spatial details of targets play a vital role in change detection, and modeling the channel correlation is also beneficial for improving feature discrimination. Therefore, the MLPPE module captures spatial attention information by learning both the patch channel association and the image local spatial details, which further boosts the change detection accuracy for VHR remote sensing images. As shown in Fig. 5, the feature map is divided into patches $\boldsymbol{P}_{\boldsymbol n}$, and each patch is processed as follows:
\begin{align*}
\boldsymbol{U}_{\boldsymbol{n}}=& W_{n}\ast \boldsymbol{P}_{\boldsymbol{n}} \tag{11}
\\
v_{n}^{m}=&\frac{1}{H^{\prime}\times W^{\prime}} \sum _{i = 1}^{H^{\prime}} \sum _{j = 1}^{W^{\prime}} u_{n}^{m}(i,j) \tag{12}
\end{align*}
\begin{equation*}
\widehat{\boldsymbol{P}_{n} }=s\left(\mathcal {L}\left(\delta \left(\mathcal {L}\left(\boldsymbol{V}_{\boldsymbol{n}}\right) \right) \right) \right) \tag{13}
\end{equation*}
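A possible realization of the patch-wise channel attention in (11)-(13) is an SE-style bottleneck applied to each patch, as sketched below; the kernel size of $W_n$, the reduction ratio, and the sharing of the convolution across patches are assumptions.

```python
import torch
import torch.nn as nn

class PatchChannelAttention(nn.Module):
    """Local branch of the MLPPE module, following (11)-(13)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)  # W_n in (11)
        self.fc1 = nn.Linear(channels, channels // reduction)    # first L(.) in (13)
        self.fc2 = nn.Linear(channels // reduction, channels)    # second L(.) in (13)
        self.act = nn.ReLU(inplace=True)                         # delta(.), activation assumed
        self.sigmoid = nn.Sigmoid()                              # s(.)

    def forward(self, patch):                        # patch P_n: (B, C, H', W')
        u = self.conv(patch)                         # (11)
        v = u.mean(dim=(2, 3))                       # (12): GAP over the patch, (B, C)
        w = self.sigmoid(self.fc2(self.act(self.fc1(v))))  # (13)
        return w                                     # per-patch channel weights P_hat_n
```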
In this section, the globally contextual information of VHR remote sensing images is obtained by integrating the patch embedding result into the multilayer perceptron. Then, the weights obtained by patch embedding are applied to the corresponding patches of the global feature maps produced by the multilayer perceptron. Similar to the weighting operations in (3) and (6), the weighted patches are combined and fused with the global features as follows:
\begin{align*}
\boldsymbol{L}_{\boldsymbol{n}} =\wedge _{A}\left(\boldsymbol{P}_{\boldsymbol n}^{\prime},\wedge _{mc}\left(\boldsymbol{P}_{\boldsymbol n}^{\prime},\widehat{\boldsymbol{P}_{n} }\right) \right) \tag{14}
\\
\left\lbrace \boldsymbol{L}_{\boldsymbol{1}},\boldsymbol{L}_{\boldsymbol{2}},{\ldots },\boldsymbol{L}_{\boldsymbol{n}}\right\rbrace \Rightarrow \boldsymbol{F}^{\boldsymbol{l}} \tag{15}
\\
\boldsymbol{F}^{\prime}=\wedge _{A}\left(\boldsymbol{F}^{\boldsymbol{g}},\boldsymbol{F}^{\boldsymbol{l}}\right) \tag{16}
\end{align*}
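The fusion in (14)-(16) can then be sketched as follows, again reading $\wedge_{mc}$ as channel-wise multiplication and $\wedge_{A}$ as element-wise addition; the row-major patch layout and an evenly divisible patch grid are assumptions.

```python
import torch

def fuse_local_global(f_g, patch_weights, patch_size):
    """Fuse local and global features following (14)-(16).

    f_g: global feature map F^g from the MLP branch, shape (B, C, H, W).
    patch_weights: list of per-patch channel weights P_hat_n, each of shape (B, C),
    ordered row-major over the patch grid (assumed layout).
    """
    b, c, h, w = f_g.shape
    f_l = torch.zeros_like(f_g)
    n = 0
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            p = f_g[:, :, i:i + patch_size, j:j + patch_size]             # patch P'_n of F^g
            w_n = patch_weights[n].view(b, c, 1, 1)                       # P_hat_n
            f_l[:, :, i:i + patch_size, j:j + patch_size] = p + p * w_n   # (14): L_n
            n += 1
    return f_g + f_l                                                      # (15)-(16): recombine and add
```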
Experiments and Analysis
In order to evaluate the proposed method, several state-of-the-art methods, including FC-EF [65], FC-di [65], FC-conc [65], FCN-PP [66], FDCNN [67], DSIFN [68], SRCD-Net [18], and Trans-CD [19], are considered as comparative methods in our experiments. We conducted the comparisons using the released model codes. Furthermore, we carried out ablation studies to verify the validity of each component.
A. Experimental Setup
Datasets: In this article, three benchmark datasets, namely LEVIR, WHU, and GZ, are used to assess the proposed method. All of these datasets contain raw bitemporal images and ground truths.
LEVIR Dataset [69] is a building change detection dataset with a spatial resolution of 0.55 m. It contains 637 pairs of bitemporal images with a size of 1024 × 1024. These bitemporal images span 5 to 14 years and exhibit significant land changes, especially building growth, covering various types of buildings, such as villas, tall apartments, small garages, and large warehouses. The fully annotated LEVIR dataset contains a total of 31 333 individual change instances. We applied overlapping and nonoverlapping cropping to obtain image patches of size 224 × 224, resulting in 11 083 training samples, 2880 validation samples, and 2048 testing samples.
WHU Dataset [70] is a building change detection dataset with a spatial resolution of 0.075 m. It contains one pair of bitemporal images with a size of 32 507 × 15 354. We first divided the bitemporal images into four smaller images without overlapping, with sizes of 32 507 × 12 610, 18 361 × 2744, 7634 × 2744, and 6511 × 2744. We used the first image as the training set, the second and third images as the validation set, and the fourth image as the testing set. Then, we cropped these data into image patches of size 224 × 224 (see the cropping sketch after the dataset descriptions), obtaining 9637 training samples, 2494 validation samples, and 1600 testing samples.
GZ Dataset [71] was acquired between 2006 and 2019 and covers the suburban areas of Guangzhou City, China. To align the image pairs, 20 pairs of season-varying bitemporal images were collected using the BIGEMAP software with Google Earth imagery. These 20 pairs of bitemporal images, which have a spatial resolution of 0.55 m and sizes ranging from 1006 × 1168 pixels to 4936 × 5224 pixels, are divided into three parts: the training set (14 pairs), the validation set, and the testing set.
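For reference, a minimal sketch of the tile cropping used to build the training, validation, and testing samples is given below; the handling of image borders is an assumption, as the paper does not state it.

```python
import numpy as np

def crop_to_patches(image, patch=224, stride=224):
    """Crop a large image (H, W, ...) into patch x patch tiles.

    stride == patch gives non-overlapping tiles; a smaller stride gives the
    overlapping sampling mentioned for the LEVIR dataset. Border remainders
    are simply discarded here (the paper's exact handling is not stated).
    """
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            tiles.append(image[y:y + patch, x:x + patch])
    return np.stack(tiles) if tiles else np.empty((0, patch, patch) + image.shape[2:])
```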
Implementation Details: We implemented the proposed LGSAA-Net with PyTorch and trained it on an NVIDIA GeForce RTX 2080Ti GPU with 11 GB of RAM. In this article, the parameter settings of the comparative approaches follow the original papers. For the proposed LGSAA-Net, we set
Evaluation Metrics: To evaluate the performance of the proposed LGSAA-Net, five popular metrics are adopted, namely precision (Pre), recall (Rec), overall error (OE), overall accuracy (OA), and F1-score (F1), which are defined as follows:
\begin{align*}
Pre=&\frac{TP}{TP+FP} \tag{17}
\\
Rec=&\frac{TP}{TP+FN} \tag{18}
\\
OE=&\frac{FP+FN}{TP+TN+FP+FN} \tag{19}
\\
OA=&\frac{TP+TN}{TP+TN+FP+FN} \tag{20}
\\
F1=&2\times \frac{\text{Pre}\times \text{Rec}}{\text{Pre}+\text{Rec}} \tag{21}
\end{align*}
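These metrics can be computed directly from the pixel-wise confusion-matrix counts, as in the small helper below (it assumes nonzero denominators).

```python
def change_detection_metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics (17)-(21) from confusion-matrix counts."""
    pre = tp / (tp + fp)                     # (17) precision
    rec = tp / (tp + fn)                     # (18) recall
    oe = (fp + fn) / (tp + tn + fp + fn)     # (19) overall error
    oa = (tp + tn) / (tp + tn + fp + fn)     # (20) overall accuracy
    f1 = 2 * pre * rec / (pre + rec)         # (21) F1-score
    return {"Pre": pre, "Rec": rec, "OE": oe, "OA": oa, "F1": f1}
```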
B. Comparison With State-of-the-Art Methods
Comparison on LEVIR Dataset: Fig. 6 shows the change detection results on the LEVIR dataset, where Fig. 6(a)-(c) are the bitemporal images and the ground truths, respectively. In Fig. 6(d)-(f), the first three comparative methods are based on fully convolutional networks and feature fusion. It can be seen that the change detection results provided by FC-di are better than those of FC-EF and FC-conc, which shows that the Siamese encoder can slightly improve model accuracy. Also, the results provided by FC-EF are inferior to those of FC-di but better than those of FC-conc, which indicates that FC-EF can extract more discriminative features from bitemporal images than FC-conc. In addition, as shown in Fig. 6(g) and (h), although FCN-PP and FDCNN miss some truly changed regions (cyan color) in sample_2, they achieve better change detection results in sample_1 and sample_3, since the Gaussian pyramid module of FCN-PP possesses strong feature discrimination, and the multiscale and multidepth feature difference maps generated by FDCNN are beneficial for change detection. Thus, FCN-PP and FDCNN provide better change detection results than the first three comparative methods. In contrast, the missed regions (cyan color) in Fig. 6(i)-(k) are greatly reduced, and the internal compactness of objects is improved compared with the results in Fig. 6(d)-(h). Fig. 6(l) shows that the proposed LGSAA-Net achieves the best change detection results with complete boundaries and high internal compactness, since it uses patch embedding and a multilayer perceptron to learn local and global pixel associations, and the SAA module makes the network learn the feature map information more reasonably.
Experimental results on the LEVIR dataset. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) FC-EF. (e) FC-di. (f) FC-conc. (g) FCN-PP. (h) FDCNN. (i) DSIFN. (j) SRCD-Net. (k) Trans-CD. (l) LGSAA-Net. Note that the black color represents unchanged regions, the white color represents changed regions, the pink color denotes falsely detected regions, and the cyan color denotes missed truly changed regions.
The quantitative evaluation results on LEVIR dataset are summarized in Table I. It can be seen that FC-di obtains higher value of
Comparison on WHU Dataset: Fig. 7 shows the change detection results on the WHU dataset, where Fig. 7(a)-(c) correspond to the bitemporal images and the ground truths, respectively. The changed targets are mainly concentrated on buildings and suburban houses. In Fig. 7(a) and (b), the contrast of the changed targets in the bitemporal images is quite low, which may affect the accuracy of change detection. Fig. 7(d)-(g) contain obvious falsely changed regions (pink color). In contrast, the results in Fig. 7(h)-(l) provided by FDCNN, DSIFN, SRCD-Net, Trans-CD, and the proposed LGSAA-Net are better than those in Fig. 7(d)-(g). Notably, the proposed LGSAA-Net detects the contour information of small changed targets more accurately and obtains change detection results in Fig. 7(l) that are close to the ground truths.
Experimental results on the WHU dataset. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) FC-EF. (e) FC-di. (f) FC-conc. (g) FCN-PP. (h) FDCNN. (i) DSIFN. (j) SRCD-Net. (k) Trans-CD. (l) LGSAA-Net. Note that the black color represents unchanged regions, the white color represents changed regions, the pink color denotes falsely detected regions, and the cyan color denotes missed truly changed regions.
Table II shows the quantitative evaluation results on the WHU dataset. Compared with the LEVIR dataset, the first three comparative methods show similar performance on the WHU dataset. FC-di provides higher value of
Comparison on GZ Dataset: Fig. 8 shows the change detection results on the GZ dataset to further demonstrate the superiority and generalizability of the proposed LGSAA-Net, where Fig. 8(a)-(c) show the bitemporal images and the ground truths, respectively. Fig. 8(a) and (b) show that the images from the GZ dataset contain more noise than those of the LEVIR and WHU datasets. Therefore, some falsely changed regions (pink color) are apparent in Fig. 8(d)-(g). Compared with the first four comparative methods, the results provided by FDCNN, DSIFN, SRCD-Net, and Trans-CD are improved, as shown in Fig. 8(h)-(k). Also, it can be seen from Fig. 8(d)-(l) that the proposed LGSAA-Net provides better change detection results than the comparative methods, which further verifies its advantages for change detection.
Experimental results on the GZ dataset. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) FC-EF. (e) FC-di. (f) FC-conc. (g) FCN-PP. (h) FDCNN. (i) DSIFN. (j) SRCD-Net. (k) Trans-CD. (l) LGSAA-Net. Note that the black color represents unchanged regions, the white color represents changed regions, the pink color denotes falsely detected regions, and the cyan color denotes missed truly changed regions.
As can be seen from Table III, the proposed method also significantly outperforms all comparative methods on GZ dataset, achieving the highest value of
C. Ablation Studies
To further illustrate the effectiveness of the different modules in the proposed network, experiments with various combinations of modules are conducted on the LEVIR, WHU, and GZ datasets. Fig. 9(a)-(c) are the bitemporal images and ground truths, respectively. The modules involved in Fig. 9(d)-(k) are abbreviated as U-Net (Base) [22], Siamese U-Net (Siam) [65], vision transformer (ViT) [43], MLPPE, multibranch encoding (MB) [65], efficient channel attention (ECA) [32], the convolutional block attention module (CBAM) [36], and SAA, respectively. The ablation schemes include: U-Net based on difference images (Base+DI), Siamese U-Net based on bitemporal images (Base+Siam), Siamese U-Net based on bitemporal images and ViT (Base+Siam+ViT), Siamese U-Net based on MLPPE (Base+Siam+MLPPE), Siamese U-Net and MLPPE with multibranch encoding (Base+Siam+MLPPE+MB), Siamese U-Net and MLPPE based on MB and ECA (Base+Siam+MLPPE+MB+ECA), Siamese U-Net and MLPPE based on MB and CBAM (Base+Siam+MLPPE+MB+CBAM), and LGSAA-Net.
Comparison of ablation experiments on the LEVIR, WHU, and GZ datasets. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) Base+DI. (e) Base+Siam. (f) Base+Siam+ViT. (g) Base+Siam+MLPPE. (h) Base+Siam+MLPPE+MB. (i) Base+Siam+MLPPE+MB+ECA. (j) Base+Siam+MLPPE+MB+CBAM. (k) LGSAA-Net. Note that the black color represents unchanged regions, the white color represents changed regions, the pink color denotes falsely detected regions, and the cyan color denotes missed truly changed regions.
As shown in Fig. 9(d)-(f), the feature extraction methods based on bitemporal images obtain better change detection results than the method based on difference images. The ViT module does improve the accuracy of change detection, as shown in Fig. 9(f), but the sequence of image features completely replaces the feature maps, which ignores the contextual structure information of the original CNN feature maps and leads to false regions (pink color). In addition, it can be seen from Fig. 9(f) and (g) that the MLPPE module is beneficial for improving the change detection results. On this basis, we added the multibranch encoding strategy to further enhance the feature discrimination capability for bitemporal images and difference images, as shown in Fig. 9(h). Furthermore, the comparison in Fig. 9(h)-(k) shows that the results obtained by adding the CBAM module are better than those obtained with the ECA module but inferior to those with the SAA module, which indicates that the SAA module responds better to feature maps with different resolutions. In conclusion, for the change detection task, the proposed LGSAA-Net obtains clearer changed regions with more complete boundaries and maintains high internal compactness in truly changed regions. Table IV shows the quantitative evaluation results of our ablation experiments on the LEVIR, WHU, and GZ datasets. It can be seen that the change detection results are improved to different degrees by adding these modules. Obviously, the incorporation of both the MLPPE and the SAA modules improves the performance of the network on the three datasets, which indicates that the proposed modules have a positive impact on change detection.
Discussion
In this section, we discuss the effectiveness of the SAA and MLPPE modules, the sensitivity experiments on the MLPPE module, and the model complexity to further demonstrate the contributions of our study.
A. Discussion on the Effectiveness of the SAA and the MLPPE
To show the feature extraction process of the deep model, we interpret what the network learns by visualizing the heatmaps of feature maps. The color of a heatmap reflects the correlation between a specific location and the whole image, and the different colors represent the degree to which the network contributes to the predicted category. In Fig. 10, red denotes higher attention values and blue denotes lower values, where Fig. 10(a)-(c) are the bitemporal images and ground truths, respectively. By comparing Fig. 10(d)-(f), it can be clearly seen that the SAA and the MLPPE modules help the proposed network focus on the truly changed targets. Thus, the LGSAA-Net can obtain more discriminative features to guide the network toward accurate predictions.
Heatmap visualization on the LEVIR, WHU, and GZ datasets. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) Base. (e) SAA. (f) SAA+MLPPE. Red denotes higher attention values and blue denotes lower values.
B. Discussion on the Sensitivity Experiments of the MLPPE
As described in Section III-B, we introduce the MLPPE module into the skip-path to effectively fuse low-level details and high-level semantic features and narrow the semantic gap. Here, the patch strategy of the MLPPE module is evaluated with respect to the change detection results, in which the patch scale parameter s plays a decisive role in the model performance and accuracy. To explore the influence of different values of s on the change detection results, we conducted comparative experiments on the three datasets with different patch scale parameters. As the number of network layers increases, the resolution of the feature maps decreases, and the minimum patch size is set to 7 × 7. Therefore, we set the maximum s = 1, 2, 4, 8, 16 at the convolutional layers of the encoding stage, and the settings of s at the different layers are shown in Table V.
Fig. 11 presents the visual change detection results on several samples of the three datasets. It can be seen that all values of s can detect the truly changed regions, albeit with some falsely changed regions. The change detection results are more satisfactory when s = 2 or 4. More specifically, on the LEVIR and GZ datasets, the highest values of F1 are achieved when s = 4, representing improvements of 1.14% and 0.80% compared with s = 1. However, it achieves the highest values of
(a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. Change detection results with the maximum (d) s = 1, (e) s = 2, (f) s = 4, (g) s = 8, and (h) s = 16 in the MLPPE module.
C. Discussion on the Model Complexity
In practical applications, it is also necessary to consider factors such as model complexity, in addition to high-precision detection results, so as to facilitate subsequent model deployment. Therefore, we evaluated the model complexity by comparing several methods with the proposed LGSAA-Net using four evaluation metrics, including floating point operations (FLOPs), number of parameters (Params), model size (Model), and Mean-
Model complexity and change detection accuracy comparison of comparative methods and the proposed LGSAA-Net.
Conclusion
In this work, we proposed the LGSAA-Net for change detection in bitemporal VHR remote sensing images. Different from popular change detection networks, the proposed LGSAA-Net realizes adaptive spatial attention by establishing the relationship between feature maps and convolution kernel scales. Moreover, it effectively fuses low-level details and high-level semantics to improve feature discrimination by utilizing a multilayer perceptron combined with a patch attention mechanism. Experimental results on three change detection datasets demonstrated that the proposed LGSAA-Net produces more accurate boundaries and higher internal compactness for changed regions than state-of-the-art methods. Overall, the proposed LGSAA-Net achieves a favorable tradeoff between model complexity and change detection accuracy.