Introduction
Change detection is the process of identifying differences in the state of an object or phenomenon by comparing two images of the same geographical area acquired at different times. It can reveal dynamic changes on the Earth's surface and is one of the most important techniques in remote sensing interpretation [1]. As the Earth's surface is constantly evolving, timely and accurate access to surface changes is important for understanding human activities, ecosystems, and their interactions. Recently, change detection based on VHR remote sensing images has been widely applied to land use analysis [2], disaster monitoring [3], urban environmental investigation [4], etc. In change detection tasks, factors such as anthropogenic behavior, atmospheric conditions, and illumination may lead to falsely detected regions [1], [4], and manual change detection is time-consuming and tedious. Under these circumstances, a large number of change detection approaches for remote sensing images have been proposed in recent years.
Existing change detection methods can be roughly categorized into traditional methods and deep learning-based methods. Traditional change detection methods can be further divided into pixel-based approaches [5], [6], [7] and object-based approaches [8], [9], [10]. Pixel-based approaches usually generate difference images by comparing the spectral or texture information of pixels and obtain results through pixel classification. Compared with pixel-based approaches, object-based approaches work in units of objects and capture image contextual information by processing the homogeneous pixels of the same object. However, these methods usually depend on hand-crafted features and exhibit several bottlenecks [7], [9], [10]. Specifically, it is difficult to design effective feature extraction operators for traditional methods, since remote sensing images are usually more complex than natural images [11]. By contrast, deep learning methods [12], [13], [14], especially convolutional neural networks (CNNs), have been widely used in various fields due to their strong feature discrimination abilities [15], [16], [17]. As a result, a large number of deep learning-based approaches [17], [18], [19], [20] have been reported. Although they achieve better change detection results by employing various improved CNNs, they still face some challenges. On the one hand, the prevailing pooling operation in CNNs makes it difficult to extract detailed features. On the other hand, the large number of parameters in CNNs may cause overfitting and other unpredictable problems [21]. Therefore, current CNN-based change detection methods still have much room for improvement.
The U-Net [22] is a very popular network in medical image segmentation, since it is specially designed for training with small samples. Similar to medical images, remote sensing images are also difficult to acquire and annotate [23], [24]. However, compared with medical images, remote sensing images often involve higher resolution, more complex content, and more severe noise interference. Therefore, it is difficult to obtain good change detection results by applying the U-Net directly to remote sensing images [25]. The U-Net treats all feature maps equally and thus ignores the fact that different feature maps attend to different object regions. To address this problem, various improved U-Nets have been put forward that employ attention mechanisms to improve network performance. The receptive fields of image context at different convolution layers are diverse. However, existing attention mechanisms usually adopt a fixed convolutional kernel scale at different convolutional layers, which is disadvantageous for representing the image details of changed targets. Furthermore, U-Nets utilize skip-connections to fuse low-level details with high-level semantics. Although this improves feature discrimination, the symmetric fusion ignores the association between shallow-layer and deep-layer features. Consequently, many improved networks have been proposed, such as UNet++ [26] and UNet3+ [27]. However, these networks perform the connection using pixel-by-pixel fusion, which ignores the integration of local and global information in an image.
To solve the above problems, a local and global feature learning with kernel scale-adaptive attention network (LGSAA-Net) is proposed for VHR remote sensing change detection in this article. On the one hand, we introduce a scale-adaptive convolution kernel strategy to solve the difficulty of extracting image detail features caused by single-scale convolution kernels. On the other hand, to address the large semantic gap between low-level details and high-level semantics caused by the conventional skip-connection, we adopt the fusion of local and global features. The proposed LGSAA-Net achieves a good overall balance between model complexity and change detection accuracy. The main contributions of this article are summarized as follows.
To boost the feature learning effect on object details of VHR remote sensing images, a scale-adaptive attention (SAA) module is designed according to the change of feature map scales at different layers. The SAA module can establish the internal correlation between feature maps and convolution kernel scales.
To enhance effectively the local and global feature discrimination abilities of the proposed LGSAA-Net, a multilayer perceptron based on patches embedding (MLPPE) module is proposed. The MLPPE module uses a multilayer perceptron (MLP) to facilitate the global association learning of pixels, while employing the attention mechanism to learn the local correlation of different patches.
Related Work
A. Attention Mechanism on Change Detection
The attention mechanism in visual perception refers to the process of selectively concentrating on the most informative features while suppressing useless ones [28]. Previous studies [29], [30] show that the attention mechanism can help CNNs achieve better image classification and semantic segmentation. The most popular attention module is squeeze-and-excitation (SE) [31], which simply squeezes each feature map to model cross-channel relationships and efficiently build interdependencies among channels. To simplify the structure of channel attention, the efficient channel attention network (ECA-Net) [32] adopts a 1-D convolution filter to compute channel weights. However, channel attention only encodes interchannel information and ignores the spatial details of feature maps. To capture the spatial details of feature maps and aggregate image contextual information, the gather-excite network (GENet) [33] and the pointwise spatial attention network (PSA-Net) [34] extend the attention mechanism by adopting different spatial attention schemes or designing advanced attention blocks in the spatial dimension. Moreover, from the perspective of interpretability, a hybrid model combining channel attention and spatial attention is more conducive to improving network performance. Therefore, the bottleneck attention module (BAM) [35], the convolutional block attention module (CBAM) [36], and the global context network (GCNet) [37] refine convolutional features independently in the channel and spatial dimensions by cascading these two attentions. In particular, both BAM and CBAM exploit position information by reducing the channel dimension of the input tensor and then computing spatial attention using convolutions.
Compared with the attention modules mentioned above, the self-attention mechanism can effectively model long-range dependencies by relating different positions within a single data sample. As a result, self-attention [38] has obvious advantages in modeling long-range dependencies and building spatial- or channelwise attention. Owing to this superiority, scholars have proposed many improved attention networks, including the nonlocal neural network (NLNet) [39], the criss-cross attention network (CCNet) [40], the dual attention network (DANet) [41], and the segmentation transformer (SETR) [42]. These networks aim to overcome the limitation of convolutional operators, which only capture local relationships and fail to model long-range dependencies in vision tasks. Unlike the models in [39], [40], [41], the SETR [42] adopts a vision transformer (ViT) [43] encoder and two decoders designed based on progressive upsampling and multilevel feature aggregation. Although the SETR has stronger reasoning and modeling abilities due to self-attention, the parallel computation increases model complexity, and direct upsampling or deconvolution is not conducive to global feature learning.
In recent years, attention mechanisms have also been widely used in change detection tasks [44], [45]. The Siamese CNN (Siam-Net) [44] incorporates the CBAM into a Siamese network to adaptively extract spectral-spatial features from bitemporal remote sensing images. To mitigate the problem of class imbalance in change detection, the dual-task constrained deep Siamese convolutional network [45] constructs dual CBAMs for each bitemporal feature to emphasize change information. The CBAM is also commonly used to refine bitemporal features; for instance, Shi et al. [46] proposed a deeply supervised metric method that utilizes the CBAM to make the deep features of different phases more discriminative. To extend the advantage of self-attention in capturing long-range dependencies to remote sensing change detection, a series of studies have appeared [47], [48], [49]. However, the above studies adopt fixed receptive fields at different layers, which easily leads to poor feature learning on the spatial details of changed targets. To tackle this problem, we propose a scale-adaptive attention (SAA) module in this article. The SAA module establishes the relationship between feature maps and convolution kernel scales and implements an adaptive scale operation on top of channel attention for change detection, thereby achieving better feature learning.
B. Skip-Connection on Change Detection
In image semantic segmentation, feature fusion strategies are used to alleviate the problems of missed details, rough segmentation results, and low precision [50], [51], [52]. To achieve feature fusion, the skip-connection is one of the most important factors behind the success of encoder-decoder networks in image semantic segmentation [53], [54]. The U-Net [22] applies multiple skip-connections to construct a contracting path and a symmetric expanding path. Similar to the U-Net, the SegNet [55] utilizes a small network structure and skip-connections to achieve better visual semantics as well as detailed contextual information. Although skip-connections help the U-Net and the SegNet achieve high segmentation accuracy, the symmetric fusion they employ neglects the association between shallow- and deep-layer features.
In light of the above problem, some improved models that can be considered extensions of the U-Net based on skip-connections have been proposed, such as the UNet++ [26] and the UNet3+ [27]. The UNet++ [26] uses a series of nested convolutional structures before feature fusion to capture contextual information, while the UNet3+ [27] applies full-scale skip-connections to capture fine-grained detail information and coarse-grained semantic information. However, since these networks achieve feature fusion in a pixel-by-pixel manner, they cannot effectively bridge the semantic gap between the feature maps of the encoding and decoding stages. To alleviate this issue, some strategies have been designed and applied to the skip-path to improve network performance, such as the modified U-Net (mU-Net) [56] and the MultiResUNet [57]. They add additional convolution operations before feature fusion, which reduces the difference between the encoder and decoder feature maps and leads to better feature discrimination. In addition, before concatenating the features at each resolution of the encoder with the corresponding features in the decoder, both the attention gate U-Net (Attention U-Net) [58] and the nonlocal U-Net [59] rescale the output features of the encoder by using an attention module. Furthermore, they utilize higher-level semantic features to guide the current features for attention selection, but this kind of strategy does not consider the integration of local and global information in an image.
Due to the simplicity and superior performance of the skip-connection based on the U-shaped structure, popular networks for change detection [25], [60], [61], [62] still depend on the U-shaped architecture. Based on UNet++, Peng et al. [25] emphasized the learning of difference information by using skip-connections inside convolution units. Furthermore, Peng et al. [60] designed an improved UNet++ architecture to integrate low-level details and high-level semantics. In addition, the end-to-end LU-Net [61] was designed to leverage spatial and temporal characteristics simultaneously. Since change detection networks tend to focus on the extraction of semantic information and ignore the importance of shallow features, Fang et al. [62] proposed a densely connected U-Net, which reduces the loss of shallow location information through compact information transmission. The networks mentioned above improve change detection accuracy by fusing low-level details and high-level semantics. However, due to the large semantic gap between high-level and low-level features, the existing skip-connection methods may provide limited feature discrimination. Therefore, to narrow the semantic gap, we further adopt a multilayer perceptron to learn the association of global pixels and the relationship between different patches, thereby exploiting more useful discriminative information.
Methods
An overview of the proposed LGSAA-Net is shown in Fig. 1. First, feature extraction is performed on the VHR remote sensing images in the first encoding branch. Second, the raw difference image, obtained by subtracting the bitemporal images, is fed into the second encoding branch to extract difference information. Third, the result of each feature extraction layer in the second encoding branch is fused with the output of the corresponding layer in the first encoding branch. Fourth, a subtraction operation is performed on the feature maps from the corresponding bitemporal paths. Finally, the fused features are fed into the next encoding layer.
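As a rough illustration of this data flow, the following PyTorch-style sketch assumes shared-weight (Siamese) encoders in the first branch and element-wise addition as the fusion operator; function and argument names are illustrative and do not reflect the released implementation.

```python
import torch

def forward_encoding(x_t1, x_t2, bitemporal_encoders, difference_encoders):
    """Sketch of the two-branch encoding flow described above (names illustrative).

    bitemporal_encoders / difference_encoders: lists of per-layer modules for
    the first (bitemporal) and second (difference) encoding branches.
    """
    d = torch.abs(x_t1 - x_t2)                      # raw difference image for the second branch
    fused_features = []
    for enc_bi, enc_diff in zip(bitemporal_encoders, difference_encoders):
        x_t1, x_t2 = enc_bi(x_t1), enc_bi(x_t2)     # shared-weight (Siamese) feature extraction
        d = enc_diff(d)                             # difference-information extraction
        diff_bi = torch.abs(x_t1 - x_t2)            # subtraction on the bitemporal feature maps
        d = d + diff_bi                             # fuse the two branches (addition assumed)
        fused_features.append(d)                    # kept for the decoder / skip-paths
    return fused_features
```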
Proposed LGSAA-Net for change detection. The backbone of LGSAA-Net is borrowed from the U-Net. Our architecture consists of an encoding stage and a decoding stage. In order to make the network lighter, stacked depthwise separable convolution (Depth-conv) operators, denoted by blue boxes, are employed to extract features in the encoding stage. The two encoding branches convert the bitemporal images and the difference image into feature maps, and the number of channels is denoted on top of each box. For change detection, the SAA module, denoted by a yellow box, is attached to each stacked Depth-conv operator, which achieves better feature discrimination by establishing the internal correlation between feature maps and convolution kernel scales. In addition, the MLPPE module, denoted by a green box, is placed on each skip-path, which facilitates the global association learning of pixels and learns the local correlation of different patches.
In order to refine the target contours, the SAA module is proposed to establish the relationship between feature maps and convolution kernel scales, which is described in detail in Fig. 3. We also present the structure of the MLPPE, as shown in Fig. 5, which learns the local correlation of different patches and facilitates the global association learning of pixels. In general, the proposed LGSAA-Net can effectively improve its feature discrimination abilities and provide excellent change detection results.
Comparison of single-scale and multiscale convolution kernels on multiresolution images. The latter provides better change detection results than the former due to the employment of multiscale convolution kernels.
Comparison of change detection using different networks. The first column: (a) raw difference image and ground truth image. From the second to the fourth column, the top row shows heatmaps and the bottom row shows change detection results from: (b) U-Net; (c) Attention U-Net; and (d) U-Net + multilayer perceptron.
A. The SAA Module for Change Detection
Multiscale Convolution Kernels: The utilization of multiscale information is an important strategy in image segmentation, since multiscale convolutional kernels can learn richer features. Generally, fine-grained sampling can obtain richer detail information, while coarse-grained sampling can extract richer contextual information; the latter helps capture the overall trend of the image content. In addition, the existing spatial attention networks for change detection often utilize convolution kernels of fixed size to harvest the correlation of image spatial position information, which limits the performance of target contour detection. Fig. 2 compares single-scale and multiscale convolution kernels on multiresolution images. It is clear that the latter provides better change detection results than the former due to the employment of multiscale convolution kernels.
Design of SAA Module: In light of the above discussion, the SAA module is designed based on the scale changes of feature maps in the encoding stage. Specifically, let the output of the previous layer of the network be the feature map $\boldsymbol{F}$ with $C$ channels and spatial size $H\times W$. The channel attention is computed as
\begin{align*}
A^{c}=&s\left(\text{conv}1D\left(\text{GAP}\left(\boldsymbol{F}\right),k^{c} \right) \right) \tag{1}
\\
T\left(\cdot \right) =&\text{ROUND}\left|\frac{\log _{\beta }\left(C\right)+\alpha }{2}\right| \tag{2}
\end{align*}
\begin{equation*}
\boldsymbol{F}^{\boldsymbol {A}^{\boldsymbol c}} =\wedge _{mc} \left(\boldsymbol{F}, A^{c}\right) \tag{3}
\end{equation*}
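A minimal PyTorch sketch of the channel part of the SAA module described by (1)-(3) is given below. It reads $\wedge_{mc}$ as channel-wise multiplication and forces the 1-D kernel size to be odd for symmetric padding; the values of alpha and beta are placeholders rather than the settings used in the paper.

```python
import math
import torch
import torch.nn as nn

class ChannelSAA(nn.Module):
    """Channel attention part of the SAA module, following (1)-(3)."""

    def __init__(self, channels, alpha=1.0, beta=2.0):
        super().__init__()
        k_c = round(abs((math.log(channels, beta) + alpha) / 2))   # kernel size from (2)
        k_c = k_c if k_c % 2 == 1 else k_c + 1                     # keep the kernel odd (assumption)
        self.gap = nn.AdaptiveAvgPool2d(1)                         # GAP(F)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k_c, padding=k_c // 2, bias=False)
        self.sigmoid = nn.Sigmoid()                                # s(.)

    def forward(self, f):                               # f: (B, C, H, W)
        a_c = self.gap(f).squeeze(-1).transpose(1, 2)   # (B, 1, C) for the 1-D convolution
        a_c = self.sigmoid(self.conv1d(a_c))            # channel attention A^c, (1)
        a_c = a_c.transpose(1, 2).unsqueeze(-1)         # back to (B, C, 1, 1)
        return f * a_c                                  # channel-wise multiplication, (3)
```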
Furthermore, the average value and maximum value along the channel dimension of the feature maps $\boldsymbol{F}^{\boldsymbol{A}^{\boldsymbol c}}$ are computed and concatenated to generate the spatial attention, as follows:
\begin{align*}
A^{s} =& s\left(\text{conv}\left(\text{concat}\left(\phi _{\text{ave}}\left({\boldsymbol{F}^{\boldsymbol{A}^{\boldsymbol c}}}\right), \phi _{\text{max}}\left({\boldsymbol{F}^{\boldsymbol{A}^{\boldsymbol c}}} \right)\right), k^{s} \right) \right) \tag{4}
\\
k^{s}=&\text{ROUND}\left(\gamma \times \left| \log _{\varepsilon }\left(H\times W\right) \right| \right) \tag{5}
\end{align*}
\begin{equation*}
\boldsymbol{F}^{\boldsymbol{s}}= \wedge _{A}\left(\boldsymbol{F},\wedge _{ms}\left(\boldsymbol{F}^{\boldsymbol{A}^{\boldsymbol c}}, A^{s} \right) \right) \tag{6}
\end{equation*}
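Similarly, the spatial part of the SAA module in (4)-(6) can be sketched as follows, reading $\wedge_{ms}$ as spatial-wise multiplication and $\wedge_{A}$ as element-wise addition (our interpretation of the notation); gamma and epsilon are placeholder values.

```python
import math
import torch
import torch.nn as nn

class SpatialSAA(nn.Module):
    """Spatial attention part of the SAA module, following (4)-(6)."""

    def __init__(self, height, width, gamma=1.0, epsilon=2.0):
        super().__init__()
        k_s = round(gamma * abs(math.log(height * width, epsilon)))  # kernel size from (5)
        k_s = k_s if k_s % 2 == 1 else k_s + 1                       # keep the kernel odd (assumption)
        self.conv = nn.Conv2d(2, 1, kernel_size=k_s, padding=k_s // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f, f_ac):                          # f: input features, f_ac: output of (3)
        avg = torch.mean(f_ac, dim=1, keepdim=True)      # phi_ave, (B, 1, H, W)
        mx, _ = torch.max(f_ac, dim=1, keepdim=True)     # phi_max, (B, 1, H, W)
        a_s = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))   # spatial attention A^s, (4)
        return f + f_ac * a_s                            # (6), residual addition assumed for ^_A
```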
B. MLPPE Module for Change Detection
Multilayer Perceptron: Current networks [60], [61], [62] improve change detection accuracy by fusing low- and high-level features. However, these methods mainly adopt a pixel-by-pixel fusion strategy, which ignores the integration of local and global information. Many networks therefore employ fully connected layers at the middle and high levels to summarize features globally and help the network learn global information effectively. In light of this, we employ fully connected layers in the feature fusion stage for the same purpose. As shown in Fig. 4, compared with the U-Net and the Attention U-Net [58], the network with a multilayer perceptron [63], named U-Net + multilayer perceptron, obtains more intuitive heatmaps by fusing low- and high-level features. At the same time, it provides better detection results than the U-Net and the Attention U-Net, which shows that the multilayer perceptron is helpful for improving change detection results. To further improve network performance, a patch strategy on the original feature maps is utilized in the MLPPE module.
Global Features Based on Multilayer Perceptron: In this section, we present the MLPPE module, as shown in Fig. 5. Let the feature map $\boldsymbol{F}$ be the input of the MLPPE module; it is first flattened and processed by fully connected layers as follows:
\begin{align*}
\boldsymbol{M}&= \mathcal {L} \left(\mathcal {R} \left(\boldsymbol{F}\right) \right) \tag{7}
\\
\boldsymbol{F}^{\boldsymbol{g}}&=\delta \left\lbrace \widehat{\mathcal {R}} \left(\mathcal {L} \left(\sigma \left(\boldsymbol{M}\right) \right) \right) \right\rbrace \tag{8}
\end{align*}
\begin{align*}
\widehat{\boldsymbol{M}_{i,j}}=& \frac{\exp \left(\boldsymbol{M}_{\boldsymbol{i,j}}\right)}{\sum _{k}\exp \left(\boldsymbol{M}_{\boldsymbol{k,j}}\right) } \tag{9}
\\
\widehat{\boldsymbol{M}_{i,j}^{\prime}}=& \frac{\exp \left(\widehat{\boldsymbol{M}_{i,j}}\right)}{\sum _{k}\exp \left(\widehat{\boldsymbol{M}_{i,k}}\right) }. \tag{10}
\end{align*}
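The global branch defined by (7)-(10) can be read as an MLP that mixes the flattened pixels and normalizes the mixing matrix with a double softmax. A minimal sketch under this reading is shown below; the hidden dimension and the activation function are assumptions.

```python
import torch
import torch.nn as nn

class GlobalMLP(nn.Module):
    """Global branch of the MLPPE module, following (7)-(10) (our reading of the notation)."""

    def __init__(self, num_pixels, hidden_dim):
        super().__init__()
        self.fc_in = nn.Linear(num_pixels, hidden_dim)    # L(.) in (7)
        self.fc_out = nn.Linear(hidden_dim, num_pixels)   # L(.) in (8)
        self.act = nn.GELU()                              # delta(.), activation assumed

    def forward(self, f):                                  # f: (B, C, H, W), with H*W == num_pixels
        b, c, h, w = f.shape
        m = self.fc_in(f.reshape(b, c, h * w))             # R(.) then L(.), (7)
        m = torch.softmax(m, dim=1)                        # (9): normalization over the first index
        m = torch.softmax(m, dim=2)                        # (10): normalization over the second index
        f_g = self.act(self.fc_out(m)).reshape(b, c, h, w) # (8), inverse reshape R_hat
        return f_g
```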
Local Features Based on Patch Channel Information: For VHR images containing rich spatial information and complex context, the local spatial details of targets play a vital role in change detection, and modeling the channel correlation is also beneficial for improving feature discrimination. Therefore, the MLPPE module captures spatial attention information by learning both the patch channel association and the image local spatial details, which further boosts the change detection accuracy for VHR remote sensing images. As shown in Fig. 5, the feature map is divided into patches $\boldsymbol{P}_{\boldsymbol n}$, and each patch is processed as follows:
\begin{align*}
\boldsymbol{U}_{\boldsymbol{n}}=& W_{n}\ast \boldsymbol{P}_{\boldsymbol{n}} \tag{11}
\\
v_{n}^{m}=&\frac{1}{H^{\prime}\times W^{\prime}} \sum _{i = 1}^{H^{\prime}} \sum _{j = 1}^{W^{\prime}} u_{n}^{m}(i,j) \tag{12}
\end{align*}
\begin{equation*}
\widehat{\boldsymbol{P}_{n} }=s\left(\mathcal {L}\left(\delta \left(\mathcal {L}\left(\boldsymbol{V}_{\boldsymbol{n}}\right) \right) \right) \right) \tag{13}
\end{equation*}
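A possible realization of the patch-wise channel attention in (11)-(13) is an SE-style bottleneck applied to each patch, as sketched below; the kernel size of $W_n$, the reduction ratio, and the sharing of the convolution across patches are assumptions.

```python
import torch
import torch.nn as nn

class PatchChannelAttention(nn.Module):
    """Local branch of the MLPPE module, following (11)-(13)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)  # W_n in (11)
        self.fc1 = nn.Linear(channels, channels // reduction)    # first L(.) in (13)
        self.fc2 = nn.Linear(channels // reduction, channels)    # second L(.) in (13)
        self.act = nn.ReLU(inplace=True)                         # delta(.), activation assumed
        self.sigmoid = nn.Sigmoid()                              # s(.)

    def forward(self, patch):                        # patch P_n: (B, C, H', W')
        u = self.conv(patch)                         # (11)
        v = u.mean(dim=(2, 3))                       # (12): GAP over the patch, (B, C)
        w = self.sigmoid(self.fc2(self.act(self.fc1(v))))  # (13)
        return w                                     # per-patch channel weights P_hat_n
```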
In this section, the globally contextual information of VHR remote sensing images is obtained by integrating the patch embedding result into the multilayer perceptron. Then, the weights obtained by patch embedding are applied to the corresponding patches of the global feature maps produced by the multilayer perceptron. Similar to the weighting operations in (3) and (6), the weighted patches are combined and fused with the global features as follows:
\begin{align*}
\boldsymbol{L}_{\boldsymbol{n}} =\wedge _{A}\left(\boldsymbol{P}_{\boldsymbol n}^{\prime},\wedge _{mc}\left(\boldsymbol{P}_{\boldsymbol n}^{\prime},\widehat{\boldsymbol{P}_{n} }\right) \right) \tag{14}
\\
\left\lbrace \boldsymbol{L}_{\boldsymbol{1}},\boldsymbol{L}_{\boldsymbol{2}},{\ldots },\boldsymbol{L}_{\boldsymbol{n}}\right\rbrace \Rightarrow \boldsymbol{F}^{\boldsymbol{l}} \tag{15}
\\
\boldsymbol{F}^{\prime}=\wedge _{A}\left(\boldsymbol{F}^{\boldsymbol{g}},\boldsymbol{F}^{\boldsymbol{l}}\right) \tag{16}
\end{align*}
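The fusion in (14)-(16) can then be sketched as follows, again reading $\wedge_{mc}$ as channel-wise multiplication and $\wedge_{A}$ as element-wise addition; the row-major patch layout and an evenly divisible patch grid are assumptions.

```python
import torch

def fuse_local_global(f_g, patch_weights, patch_size):
    """Fuse local and global features following (14)-(16).

    f_g: global feature map F^g from the MLP branch, shape (B, C, H, W).
    patch_weights: list of per-patch channel weights P_hat_n, each of shape (B, C),
    ordered row-major over the patch grid (assumed layout).
    """
    b, c, h, w = f_g.shape
    f_l = torch.zeros_like(f_g)
    n = 0
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            p = f_g[:, :, i:i + patch_size, j:j + patch_size]             # patch P'_n of F^g
            w_n = patch_weights[n].view(b, c, 1, 1)                       # P_hat_n
            f_l[:, :, i:i + patch_size, j:j + patch_size] = p + p * w_n   # (14): L_n
            n += 1
    return f_g + f_l                                                      # (15)-(16): recombine and add
```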
Experiments and Analysis
In order to evaluate the proposed method, several state-of-the-art methods, including FC-EF [65], FC-di [65], FC-conc [65], FCN-PP [66], FDCNN [67], DSIFN [68], SRCD-Net [18], and Trans-CD [19], are considered as comparative methods in our experiments. We conducted the comparisons using the released model codes. Furthermore, we carried out ablation studies to verify the validity of each component.
A. Experimental Setup
Datasets: In this article, three benchmark datasets, namely LEVIR, WHU, and GZ, are used to assess the proposed method. All of these datasets contain raw bitemporal images and ground truths.
LEVIR Dataset [69] is a building change detection dataset with a spatial resolution of 0.55 m. It contains 637 pairs of bitemporal images with a size of 1024 × 1024. These bitemporal images span 5 to 14 years and exhibit significant land changes, especially building growth, covering various types of buildings, such as villas, tall apartments, small garages, and large warehouses. The fully annotated LEVIR dataset contains a total of 31 333 individual change instances. We applied overlapping and nonoverlapping cropping to obtain image patches of size 224 × 224, resulting in 11 083 training samples, 2880 validation samples, and 2048 testing samples.
WHU Dataset [70] is a building change detection dataset with a spatial resolution of 0.075 m. It contains one pair of bitemporal images with a size of 32 507 × 15 354. We first divided the bitemporal images into four smaller images without overlapping, with sizes of 32 507 × 12 610, 18 361 × 2744, 7634 × 2744, and 6511 × 2744. We used the first image as the training set, the second and third images as the validation set, and the fourth image as the testing set. Then, we cropped these data into image patches of size 224 × 224 (see the cropping sketch after the dataset descriptions), obtaining 9637 training samples, 2494 validation samples, and 1600 testing samples.
GZ Dataset [71] was acquired between 2006 and 2019 and covers the suburban areas of Guangzhou City, China. To align the image pairs, 20 pairs of season-varying bitemporal images were collected using the BIGEMAP software with Google Earth imagery. These 20 pairs of bitemporal images, which have a spatial resolution of 0.55 m and sizes ranging from 1006 × 1168 pixels to 4936 × 5224 pixels, are divided into three parts: the training set (14 pairs), the validation set, and the testing set.
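For reference, a minimal sketch of the tile cropping used to build the training, validation, and testing samples is given below; the handling of image borders is an assumption, as the paper does not state it.

```python
import numpy as np

def crop_to_patches(image, patch=224, stride=224):
    """Crop a large image (H, W, ...) into patch x patch tiles.

    stride == patch gives non-overlapping tiles; a smaller stride gives the
    overlapping sampling mentioned for the LEVIR dataset. Border remainders
    are simply discarded here (the paper's exact handling is not stated).
    """
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            tiles.append(image[y:y + patch, x:x + patch])
    return np.stack(tiles) if tiles else np.empty((0, patch, patch) + image.shape[2:])
```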
Implementation Details: We implemented the proposed LGSAA-Net with PyTorch and trained it on an NVIDIA GeForce RTX 2080Ti GPU with 11 GB of RAM. In this article, the parameter settings of the comparative approaches follow the original papers. For the proposed LGSAA-Net, we set
Evaluation Metrics: To evaluate the performance of the proposed LGSAA-Net, five popular metrics are adopted, namely precision (Pre), recall (Rec), overall error (OE), overall accuracy (OA), and F1-score (F1), which are defined as follows:
\begin{align*}
Pre=&\frac{TP}{TP+FP} \tag{17}
\\
Rec=&\frac{TP}{TP+FN} \tag{18}
\\
OE=&\frac{FP+FN}{TP+TN+FP+FN} \tag{19}
\\
OA=&\frac{TP+TN}{TP+TN+FP+FN} \tag{20}
\\
F1=&2\times \frac{\text{Pre}\times \text{Rec}}{\text{Pre}+\text{Rec}} \tag{21}
\end{align*}
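These metrics can be computed directly from the pixel-wise confusion-matrix counts, as in the small helper below (it assumes nonzero denominators).

```python
def change_detection_metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics (17)-(21) from confusion-matrix counts."""
    pre = tp / (tp + fp)                     # (17) precision
    rec = tp / (tp + fn)                     # (18) recall
    oe = (fp + fn) / (tp + tn + fp + fn)     # (19) overall error
    oa = (tp + tn) / (tp + tn + fp + fn)     # (20) overall accuracy
    f1 = 2 * pre * rec / (pre + rec)         # (21) F1-score
    return {"Pre": pre, "Rec": rec, "OE": oe, "OA": oa, "F1": f1}
```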
B. Comparison With State-of-the-Art Methods
Comparison on LEVIR Dataset: Fig. 6 shows the change detection results on the LEVIR dataset, where Fig. 6(a)-(c) are the bitemporal images and the ground truths, respectively. In Fig. 6(d)-(f), the first three comparative methods are based on fully convolutional networks and feature fusion. It can be seen that the change detection results provided by FC-di are better than those of FC-EF and FC-conc, which shows that the Siamese encoder can slightly improve model accuracy. Also, the results provided by FC-EF are inferior to those of FC-di but better than those of FC-conc, which indicates that FC-EF can extract more discriminative features from bitemporal images than FC-conc. In addition, as shown in Fig. 6(g) and (h), although FCN-PP and FDCNN miss some truly changed regions (cyan color) in sample_2, they achieve better change detection results in sample_1 and sample_3, since the Gaussian pyramid module of FCN-PP possesses strong feature discrimination, and the multiscale and multidepth feature difference maps generated by FDCNN are beneficial for change detection. Thus, FCN-PP and FDCNN provide better change detection results than the first three comparative methods. In contrast, the missed regions (cyan color) in Fig. 6(i)-(k) are greatly reduced, and the internal compactness of objects is improved compared with the results in Fig. 6(d)-(h). Fig. 6(l) shows that the proposed LGSAA-Net achieves the best change detection results with complete boundaries and high internal compactness, since it uses patch embedding and a multilayer perceptron to learn local and global pixel associations, and the SAA module makes the network learn the feature map information more reasonably.
Experimental results on the LEVIR dataset. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) FC-EF. (e) FC-di. (f) FC-conc. (g) FCN-PP. (h) FDCNN. (i) DSIFN. (j) SRCD-Net. (k) Trans-CD. (l) LGSAA-Net. Note that the black color represents unchanged regions, the white color represents changed regions, the pink color denotes falsely detected regions, and the cyan color denotes missed truly changed regions.
The quantitative evaluation results on LEVIR dataset are summarized in Table I. It can be seen that FC-di obtains higher value of
Comparison on WHU Dataset: Fig. 7 shows the change detection results on the WHU dataset, where Fig. 7(a)-(c) correspond to the bitemporal images and the ground truths, respectively. The changed targets are mainly concentrated on buildings and suburban houses. In Fig. 7(a) and (b), the contrast of the changed targets in the bitemporal images is quite low, which may affect the accuracy of change detection. Fig. 7(d)-(g) contain obvious falsely changed regions (pink color). In contrast, the results in Fig. 7(h)-(l) provided by FDCNN, DSIFN, SRCD-Net, Trans-CD, and the proposed LGSAA-Net are better than those in Fig. 7(d)-(g). Notably, the proposed LGSAA-Net detects the contour information of small changed targets more accurately and obtains change detection results in Fig. 7(l) that are close to the ground truths.
Experimental results on the WHU dataset. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) FC-EF. (e) FC-di. (f) FC-conc. (g) FCN-PP. (h) FDCNN. (i) DSIFN. (j) SRCD-Net. (k) Trans-CD. (l) LGSAA-Net. Note that the black color represents unchanged regions, the white color represents changed regions, the pink color denotes falsely detected regions, and the cyan color denotes missed truly changed regions.
Table II shows the quantitative evaluation results on the WHU dataset. Compared with the LEVIR dataset, the first three comparative methods show similar performance on the WHU dataset. FC-di provides higher value of
Comparison on GZ Dataset: Fig. 8 shows the change detection results on the GZ dataset to further demonstrate the superiority and generalizability of the proposed LGSAA-Net, where Fig. 8(a)-(c) show the bitemporal images and the ground truths, respectively. Fig. 8(a) and (b) show that the images from the GZ dataset contain more noise than those of the LEVIR and WHU datasets. Therefore, some falsely changed regions (pink color) are apparent in Fig. 8(d)-(g). Compared with the first four comparative methods, the results provided by FDCNN, DSIFN, SRCD-Net, and Trans-CD are improved, as shown in Fig. 8(h)-(k). Also, it can be seen from Fig. 8(d)-(l) that the proposed LGSAA-Net provides better change detection results than the comparative methods, which further verifies its advantages for change detection.
Experimental results on the GZ dataset. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) FC-EF. (e) FC-di. (f) FC-conc. (g) FCN-PP. (h) FDCNN. (i) DSIFN. (j) SRCD-Net. (k) Trans-CD. (l) LGSAA-Net. Note that the black color represents unchanged regions, the white color represents changed regions, the pink color denotes falsely detected regions, and the cyan color denotes missed truly changed regions.
As can be seen from Table III, the proposed method also significantly outperforms all comparative methods on GZ dataset, achieving the highest value of
C. Ablation Studies
To further illustrate the effectiveness of the different modules in the proposed network, experiments with various combinations of modules are conducted on the LEVIR, WHU, and GZ datasets. Fig. 9(a)-(c) are the bitemporal images and ground truths, respectively. The modules involved in Fig. 9(d)-(k) are abbreviated as U-Net (Base) [22], Siamese U-Net (Siam) [65], vision transformer (ViT) [43], MLPPE, multibranch encoding (MB) [65], efficient channel attention (ECA) [32], the convolutional block attention module (CBAM) [36], and SAA, respectively. The ablation schemes include: U-Net based on difference images (Base+DI), Siamese U-Net based on bitemporal images (Base+Siam), Siamese U-Net based on bitemporal images and ViT (Base+Siam+ViT), Siamese U-Net based on MLPPE (Base+Siam+MLPPE), Siamese U-Net and MLPPE with multibranch encoding (Base+Siam+MLPPE+MB), Siamese U-Net and MLPPE based on MB and ECA (Base+Siam+MLPPE+MB+ECA), Siamese U-Net and MLPPE based on MB and CBAM (Base+Siam+MLPPE+MB+CBAM), and LGSAA-Net.
Comparison of ablation experiments on the LEVIR, WHU, and GZ datasets. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) Base+DI. (e) Base+Siam. (f) Base+Siam+ViT. (g) Base+Siam+MLPPE. (h) Base+Siam+MLPPE+MB. (i) Base+Siam+MLPPE+MB+ECA. (j) Base+Siam+MLPPE+MB+CBAM. (k) LGSAA-Net. Note that the black color represents unchanged regions, the white color represents changed regions, the pink color denotes falsely detected regions, and the cyan color denotes missed truly changed regions.
As shown in Fig. 9(d)-(f), the feature extraction methods based on bitemporal images obtain better change detection results than the method based on difference images. The ViT module does improve the accuracy of change detection, as shown in Fig. 9(f), but the sequence of image features completely replaces the feature maps, which ignores the contextual structure information of the original CNN feature maps and leads to false regions (pink color). In addition, it can be seen from Fig. 9(f) and (g) that the MLPPE module is beneficial for improving the change detection results. On this basis, we added the multibranch encoding strategy to further enhance the feature discrimination capability for bitemporal images and difference images, as shown in Fig. 9(h). Furthermore, the comparison in Fig. 9(h)-(k) shows that the results obtained by adding the CBAM module are better than those obtained with the ECA module but inferior to those with the SAA module, which indicates that the SAA module responds better to feature maps with different resolutions. In conclusion, for the change detection task, the proposed LGSAA-Net obtains clearer changed regions with more complete boundaries and maintains high internal compactness in truly changed regions. Table IV shows the quantitative evaluation results of our ablation experiments on the LEVIR, WHU, and GZ datasets. It can be seen that the change detection results are improved to different degrees by adding these modules. Obviously, the incorporation of both the MLPPE and the SAA modules improves the performance of the network on the three datasets, which indicates that the proposed modules have a positive impact on change detection.
Discussion
In this section, we discuss the effectiveness of the SAA and MLPPE modules, the sensitivity experiments on the MLPPE module, and the model complexity to further demonstrate the contributions of our study.
A. Discussion on the Effectiveness of the SAA and the MLPPE
To show the feature extraction process of the deep model, we interpret what the network learns by visualizing the heatmaps of feature maps. The color of a heatmap reflects the correlation between a specific location and the whole image, and the different colors represent the degree to which the network contributes to the predicted category. In Fig. 10, red denotes higher attention values and blue denotes lower values, where Fig. 10(a)-(c) are the bitemporal images and ground truths, respectively. By comparing Fig. 10(d)-(f), it can be clearly seen that the SAA and the MLPPE modules help the proposed network focus on the truly changed targets. Thus, the LGSAA-Net can obtain more discriminative features to guide the network toward accurate predictions.
Heatmap visualization on the LEVIR, WHU, and GZ datasets. (a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. (d) Base. (e) SAA. (f) SAA+MLPPE. Red denotes higher attention values and blue denotes lower values.
B. Discussion on the Sensitivity Experiments of the MLPPE
As described in Section III-B, we introduce the MLPPE module into the skip-path to effectively fuse low-level details and high-level semantic features and narrow the semantic gap. Here, the patch strategy of the MLPPE module is evaluated with respect to the change detection results, in which the patch scale parameter s plays a decisive role in the model performance and accuracy. To explore the influence of different values of s on the change detection results, we conducted comparative experiments on the three datasets with different patch scale parameters. As the number of network layers increases, the resolution of the feature maps decreases, and the minimum patch size is set to 7 × 7. Therefore, we set the maximum s = 1, 2, 4, 8, 16 at the convolutional layers of the encoding stage, and the settings of s at the different layers are shown in Table V.
Fig. 11 presents the visual change detection results on several samples of the three datasets. It can be seen that all values of s can detect the truly changed regions, albeit with some falsely changed regions. The change detection results are more satisfactory when s = 2 or 4. More specifically, on the LEVIR and GZ datasets, the highest values of F1 are achieved when s = 4, representing improvements of 1.14% and 0.80% compared with s = 1. However, it achieves the highest values of
(a) Pretemporal images. (b) Posttemporal images. (c) Ground truths. Change detection results with the maximum (d) s = 1, (e) s = 2, (f) s = 4, (g) s = 8, and (h) s = 16 in the MLPPE module.
C. Discussion on the Model Complexity
In practical applications, it is also necessary to consider factors such as model complexity, in addition to high-precision detection results, so as to facilitate subsequent model deployment. Therefore, we evaluated the model complexity by comparing several methods with the proposed LGSAA-Net using four evaluation metrics, including floating point operations (FLOPs), number of parameters (Params), model size (Model), and Mean-
Model complexity and change detection accuracy comparison of comparative methods and the proposed LGSAA-Net.
Conclusion
In this work, we proposed the LGSAA-Net for change detection in bitemporal VHR remote sensing images. Different from popular change detection networks, the proposed LGSAA-Net realizes adaptive spatial attention by establishing the relationship between feature maps and convolution kernel scales. Moreover, it effectively fuses low-level details and high-level semantics to improve feature discrimination by utilizing a multilayer perceptron combined with a patch attention mechanism. Experimental results on three change detection datasets demonstrated that the proposed LGSAA-Net produces more accurate boundaries and higher internal compactness for changed regions than state-of-the-art methods. Overall, the proposed LGSAA-Net achieves a favorable tradeoff between model complexity and change detection accuracy.