
Local–Global Multiscale Fusion Network for Semantic Segmentation of Buildings in SAR Imagery



Abstract:

The extraction of buildings from synthetic aperture radar (SAR) images poses a challenging task in the realm of remote sensing (RS). In recent years, convolutional neural networks (CNNs) have rapidly advanced and found application in the field of RS. Researchers have investigated the potential of CNNs for the semantic segmentation of SAR images, yielding notable improvements. However, the semantic segmentation of buildings in SAR images still encounters challenges due to the high similarity between the features of ground objects and buildings in SAR images, as well as the variability of building structures. In this article, we propose the local–global multiscale fusion network (LGMFNet), based on a dual encoder–decoder structure, for the semantic segmentation of buildings in SAR images. The proposed LGMFNet introduces an auxiliary encoder with a transformer structure to address the limitation of the CNN-based main encoder in global modeling. To embed global dependencies hierarchically into the CNN, we design the global–local semantic aggregation module (GLSM). The GLSM serves as a bridge between the dual encoders to achieve semantic guidance and coupling from the local to the global level. Furthermore, to bridge the semantic gap between different scales, we design the multiscale semantic fusion network (MSFN) as the decoder. The MSFN achieves the interactive fusion of semantic information between various scales by constructing the multiscale semantic fusion module (MSFM). Experimental results demonstrate that the proposed LGMFNet achieves an mIoU of 91.17% on the BIGSARDATA 2023 AISAR competition dataset, outperforming the second-best method by a margin of 0.78%, which evidences the superiority of LGMFNet in comparison with other state-of-the-art methods.
Page(s): 7410 - 7421
Date of Publication: 19 March 2024

SECTION I.

Introduction

Acquiring information on built-up areas is crucial for evaluating the effects of human activities. Buildings represent a significant topographic object class in urban areas and constitute a crucial data layer in geographic information systems (GIS). The extraction of information about buildings from remote sensing (RS) images is a key aspect of GIS and a challenging task in the RS field. Currently, building extraction is extensively utilized in various domains, including building damage monitoring [1], [2], ground object mapping [3], [4], and urban development planning [5], [6].

Synthetic aperture radar (SAR) imagery is widely used in earth observation because it can acquire data regardless of sunlight conditions and is insensitive to weather conditions [7]. In principle, the distinctive scattering characteristics of buildings should enable effective segmentation of SAR imagery. However, interpreting scenes from SAR imagery is exceptionally challenging due to its inherent complexity induced by the speckle effect, as well as radiometric distortions mostly resulting from the side-looking geometry [8]. Therefore, semantic segmentation techniques are necessary to enhance the readability of SAR imagery [9].

Traditional methods commonly employed in SAR image semantic segmentation include support vector machines (SVMs) [10] and conditional random fields (CRFs) [11]. Nevertheless, traditional methods rely on manually designed feature extractors [12], which require specialized knowledge and a complex parameter tuning process, often leading to poor generalization ability and robustness. In recent years, the rapid advancement of convolutional neural networks (CNNs) has offered technical support for the semantic segmentation of SAR images. CNNs exhibit high tolerance to geometric distortions in images and possess powerful capabilities for local feature extraction. In particular, the semantic segmentation method based on the fully convolutional network (FCN) [13] achieves end-to-end pixel-level segmentation. Subsequently, the U-Net [14] structure, based on an encoder–decoder framework, became a popular configuration for semantic segmentation owing to its outstanding performance.

While CNN-based methods have shown substantial advancements in semantic segmentation tasks, they still struggle to effectively address the complexity inherent in built-up areas in SAR images [15], [16]. On the one hand, in intricate geographic environments, certain ground objects and buildings exhibit similar characteristics, displaying high backscatter values and comparable local textures, which usually lead to false alarms. On the other hand, the extraction capability can be significantly affected by severe multiscale feature problems due to the complex diversity of building structures.

Integrating more global contexts as cues for semantic reasoning is essential for mitigating similar feature interference in the semantic segmentation of SAR image buildings [17]. CNNs inherently excel in spatial location representation owing to the characteristics of convolution. However, the locality of convolution limits CNNs from effectively capturing global contextual information. This constraint arises because each convolutional kernel exclusively attends to local pixels within its receptive field, thus being incapable of capturing long-distance dependencies. Existing approaches tackle this issue by employing the attention mechanism [18], [19]. Despite the ability of CNNs to capture global relationships through the attention mechanism, this approach relies on aggregating global semantics from local features obtained by CNNs instead of directly encoding global dependencies. Consequently, obtaining comprehensive global semantic information from SAR images using CNNs alone is challenging. Recently, the widespread use of transformers in computer vision tasks has provided a novel solution for global modeling [20], [21], [22]. A multihead self-attention mechanism [23] is utilized by the transformer to effectively gather global semantics and model global dependencies. This enables the transformer to completely dissociate from convolution, thereby encoding global relationships directly.

Establishing an efficient method for multiscale fusion proves effective in addressing the multiscale segmentation challenges posed by SAR image buildings [24]. Several related studies have demonstrated the effectiveness of the spatial pyramid structure as a powerful tool for investigating multiscale feature challenges [25], [26]. Within the spatial pyramid structure, dilated convolution expands the limited receptive field of convolution, allowing multiscale feature information to be extracted with various dilation rates. Nevertheless, the gridding effect induced by dilated convolution in the spatial pyramid structure causes a loss of semantic information, which in turn degrades the overall segmentation performance. In addition, the multilayer feature fusion strategy is commonly employed to address multiscale problems [27]. This strategy typically establishes skip connections and residual connections between feature maps of distinct layers to achieve semantic interaction, while restoring the resolution of the feature maps layer by layer. However, the majority of these methods establish only bottom-up unidirectional information flow, neglecting bidirectional semantic interaction, which leads to inefficient exploitation of multiscale semantic information.

In this article, we propose a dual encoder–decoder framework for the semantic segmentation of SAR image buildings, called the local–global multiscale fusion network (LGMFNet). In the dual encoder, the main encoder utilizes a CNN branch to extract local semantics, while the auxiliary encoder employs a transformer branch to extract global semantics. For the decoder, a multiscale semantic fusion network (MSFN) is designed to optimize the utilization of the multiscale features extracted by the dual encoder.

The primary contributions of the article are as follows:

  1. A novel dual encoder–decoder semantic segmentation framework named LGMFNet is designed for achieving precise segmentation of buildings in SAR imagery. The transformer-based auxiliary encoder effectively addresses the lack of global modeling capability exhibited by the CNN-based main encoder.

  2. To capture more discriminative features, the global–local semantic aggregation module (GLSM) is designed to extract semantic correlations from global semantics as a global clue to guide local semantics. This method mitigates segmentation errors resulting from features with high similarity in the building segmentation of SAR imagery.

  3. To achieve effective multiscale feature interaction, we construct the MSFN as the decoder of the network. Within the MSFN, we employ the multiscale semantic fusion module (MSFM) to improve the capacity for precise identification and localization of multiscale target regions. This is accomplished by establishing information interactions between feature maps at different levels. The MSFN significantly alleviates the multiscale segmentation challenge of buildings in SAR imagery, particularly for smaller buildings.

The following sections of this article are structured as follows. Section II offers a brief overview of the work related to LGMFNet. Section III describes the specific structure of LGMFNet. Section IV discusses the details and results of the experiments. Finally, the article is summarized in Section V.

SECTION II.

Related Work

A. Semantic Segmentation Methods for SAR Images Based on Deep Learning

In recent years, driven by the rapid advancement of deep learning, an increasing number of studies have leveraged deep learning for the interpretation of RS scenes [28], [29]. Deep learning methods are progressively supplanting traditional approaches, offering robust technical support for SAR image segmentation. Researchers have attempted to distinguish different categories of ground objects in SAR images using deep learning-based semantic segmentation methods.

Henry et al. [30] conducted a thorough assessment of the capability of CNNs to segment roads in SAR images. Their findings indicate that although CNNs are not particularly efficient at road segmentation, they can yield satisfactory results with suitable modifications. Feng et al. [31] proposed a network that improves the semantic segmentation of SAR vehicle images by adding a fully convolutional backbone network and two decoupled head branches. For the segmentation of ships in SAR images, Zhang et al. [32] introduced a network called MAI-SE-Net to address the challenge of multiscale segmentation. In response to the limitations posed by geometric distortion of buildings in SAR images, Peng et al. [33] introduced LRFFNet. To address the challenges posed by scattering noise and complex scattering phenomena in SAR image segmentation, Wu et al. [15] designed MS-FCN, which employs two distinct structures, FCN and U-Net, for the semantic segmentation of high-resolution SAR images. In contrast, Ding et al. [16] introduced a network named MP-ResNet, which establishes parallel multiscale branches to learn relationships between semantics, thereby enhancing the embedding of local discriminative features. These deep learning-based semantic segmentation methods represent a significant leap forward in the fine segmentation of SAR images, making substantial contributions to SAR image interpretation.

B. Semantic Segmentation Methods Based on Transformer

Initially utilized for machine translation, the transformer has subsequently found applications in various domains within natural language processing (NLP) [34]. The standard transformer module comprises three components, namely multihead self-attention (MSA), multilayer perceptron (MLP), and layer normalization. MSA is essential for establishing long-distance dependencies between sequences of input and output.

Recent research highlights the adaptability of transformers to computer vision tasks. Dosovitskiy et al. [35] introduced ViT, which pioneered the use of a pure transformer structure as a feature extractor for image classification. The success of ViT in classification tasks has prompted researchers to investigate its applicability to semantic segmentation. Building upon ViT, Zheng et al. [20] introduced SETR for semantic segmentation tasks, proving the potential of ViT in dense prediction tasks. Considering the huge computational complexity of SETR, Xie et al. [21] eliminated positional encoding, improved the self-attention mechanism, and proposed the efficient and lightweight SegFormer. Liu et al. [22] introduced the Swin transformer with a CNN-like hierarchical structure; this architecture confines the self-attention mechanism to nonoverlapping windows across distinct layers and achieves information interaction through sliding window operations. Since then, the Swin transformer has garnered significant attention in RS image segmentation and medical image segmentation [36], [37], owing to its exceptional performance in segmentation tasks. These works show that the transformer can be effectively applied to image semantic segmentation with satisfactory results.

However, most existing transformer-based semantic segmentation methods focus on optical images, and transformers remain underexplored for semantic segmentation in SAR images. Therefore, in this work, we investigate the potential of transformers for the semantic segmentation of buildings in SAR images.

SECTION III.

Method

In this section, we initially present the overall architecture of the proposed LGMFNet. Then, we sequentially introduce the dual encoder backbone, comprising the CNN-based main encoder and the transformer-based auxiliary encoder. Subsequently, GLSM and MSFN are introduced.

A. Overall Architecture

The proposed LGMFNet follows the dual encoder–decoder framework shown in Fig. 1. The dual encoder comprises a CNN-based main encoder branch and a transformer-based auxiliary encoder branch. Specifically, the semantic features output by each stage of the main encoder and the auxiliary encoder are denoted $\mathbf{M}_{n}$ and $\mathbf{A}_{n}$, respectively. The GLSM then aggregates the semantic features output by the dual-branch encoder, and the results are denoted $\mathbf{F}_{n}$. The aggregated semantics $\mathbf{F}_{n}$ at different scales are then fed into the MSFN decoder, which performs multiscale feature fusion with the assistance of the MSFM. Through layer-by-layer application of the MSFM, the fusion results at different levels (denoted $\mathbf{D}_{n}$) are generated while the feature map resolution is recovered. Finally, a 3 × 3 convolutional layer and linear-interpolation upsampling are applied to obtain the final semantic segmentation prediction mask.

Fig. 1. Overall architecture of the proposed LGMFNet.

B. Dual Encoder Backbone

1) CNN-Based Main Encoder

In order to extract local semantics, we employ the MSCAN encoder from SegNeXt [38] as the main encoder of the proposed network. MSCAN adopts a CNN structure with the traditional pyramid architecture. Fig. 2 illustrates the specific structure of MSCAN, which resembles ViT in overall layout but distinguishes itself by incorporating a multiscale convolutional attention (MSCA) module. As shown in Fig. 2(a), the main process of MSCA can be divided into three parts. First, a depthwise convolution captures localized semantics. Next, multibranch depthwise strip convolutions capture multiscale contextual information. Finally, a 1 × 1 convolution models interchannel relationships. Notably, the output of this 1 × 1 convolution serves directly as the attention weights, reweighting the input of the MSCA module. The whole process of MSCA can be expressed as follows:
\begin{align*} \text{Attention}&=\text{Conv}_{1\times 1}\left(\sum_{i=0}^{3}\text{Scale}_{i}(\text{DWConv}(F))\right) \tag{1}\\ \text{Output}&=\text{Attention}\odot F \tag{2} \end{align*}

where $\text{DWConv}$ stands for depthwise convolution, and $F$ and $\odot$ denote the input feature and the elementwise multiplication operation, respectively. $\text{Scale}_{i}$ ($i\in \{0, 1, 2, 3\}$) represents the $i$th depthwise strip convolution branch in Fig. 2(a). Within each depthwise strip convolution branch, two depthwise strip convolutions are employed to mimic a standard depthwise convolution with a large kernel. Specifically, a standard convolution with a kernel size of $k \times k$ can be approximated by a pair of strip convolutions, one of size $k \times 1$ and another of size $1 \times k$. Furthermore, the majority of buildings depicted in SAR images appear as stripe-like structures, so strip convolution serves as a complement to grid convolution and aids the extraction of striped built-up area characteristics.
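To make the MSCA computation in (1) and (2) concrete, the following PyTorch sketch implements the depthwise convolution, the multibranch depthwise strip convolutions, and the 1 × 1 channel-mixing convolution. It is a minimal sketch rather than the authors' implementation: the strip-convolution kernel sizes (7, 11, 21) follow the SegNeXt design and are assumptions here, and $\text{Scale}_0$ is taken as the identity branch applied to the depthwise-convolved feature.

```python
import torch
import torch.nn as nn


class MSCASketch(nn.Module):
    """Minimal sketch of multiscale convolutional attention, Eqs. (1)-(2)."""

    def __init__(self, channels: int):
        super().__init__()
        # Depthwise convolution capturing localized semantics.
        self.dwconv = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Multibranch depthwise strip convolutions (kernel sizes are assumptions).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            )
            for k in (7, 11, 21)
        ])
        # 1 x 1 convolution modeling interchannel relationships.
        self.channel_mix = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = self.dwconv(x)                                        # DWConv(F)
        attn = base + sum(branch(base) for branch in self.branches)  # sum of Scale_i terms
        attn = self.channel_mix(attn)                                # attention weights, Eq. (1)
        return attn * x                                              # elementwise reweighting, Eq. (2)
```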

Fig. 2. Structure of the main encoder core module. (a) MSCA structure diagram. (b) Structure of the MSCAN encoder. The depthwise strip convolution is denoted in the figure as $d, a \times b$, employing a convolution kernel of size $a \times b$.

By stacking a series of building blocks, the CNN-based encoder known as MSCAN is produced. MSCAN adopts a widely used hierarchical framework consisting of four stages, with the feature resolution halved at each stage in the order $\frac{H}{4}\times \frac{W}{4}$, $\frac{H}{8}\times \frac{W}{8}$, $\frac{H}{16}\times \frac{W}{16}$, and $\frac{H}{32}\times \frac{W}{32}$. As illustrated in Fig. 1, the main encoder first processes the input image with a stem convolution operation. Each CNN stage of the main encoder in Fig. 1 comprises a downsampling block followed by a stack of the building blocks described above. The downsampling block consists of a convolutional layer with a stride of 2 and a kernel size of $3 \times 3$, followed by a batch normalization (BN) layer. The proposed method utilizes the MSCAN-L configuration in the main encoder, with the parameters $L$ of the four stages in Fig. 2 set to $\lbrace 3, 5, 27, 3\rbrace$ and the channels $C_{i}$ ($i=1, 2, 3, 4$) in Fig. 1 set to $\lbrace 64, 128, 320, 512\rbrace$, respectively.
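As a small illustration of the stage transition described above, the downsampling block can be sketched as a 3 × 3 convolution with stride 2 followed by batch normalization; the padding choice is an assumption. For the MSCAN-L configuration, successive blocks would map 64 to 128, 128 to 320, and 320 to 512 channels.

```python
import torch.nn as nn


def downsampling_block(in_channels: int, out_channels: int) -> nn.Sequential:
    """Sketch of the stage-transition block: 3 x 3 convolution with stride 2, then BN."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_channels),
    )
```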

2) Transformer-Based Auxiliary Encoder

As the auxiliary encoder of the proposed method, we employ the MiT encoder from SegFormer [21] to extract global semantics. MiT adopts a transformer architecture similar to ViT. However, MiT differentiates itself by initially partitioning the image into smaller patches of size 4 × 4 through the overlap patch embedding operation; smaller patches are advantageous for dense prediction tasks such as semantic segmentation. As illustrated in Fig. 1, similar to the CNN-based MSCAN, MiT consists of four stages and produces multilevel features with sizes of $\frac{H}{4}\times \frac{W}{4}$, $\frac{H}{8}\times \frac{W}{8}$, $\frac{H}{16}\times \frac{W}{16}$, and $\frac{H}{32}\times \frac{W}{32}$, respectively. Specifically, in each transformer stage, $N$ pairs of efficient self-attention and mix feed-forward network (Mix-FFN) blocks are applied consecutively, followed by an overlap patch merging operation for downsampling and channel mapping. In the auxiliary encoder of the proposed method, we utilize the MiT-B5 configuration, with the parameters $N$ of the four stages set to $\lbrace 1, 2, 5, 8\rbrace$. The output feature map channels of each transformer stage match those of the corresponding CNN stage, i.e., $\lbrace 64, 128, 320, 512\rbrace$.

During the multihead self-attention procedure, the dimensions of each head's $Q$, $K$, and $V$ are identical, represented as $N \times C$, where $N = W \times H$ denotes the sequence length. This can be formulated as
\begin{equation*} \text{Att}(Q,K,V) = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_{\text{head}}}}\right)V. \tag{3} \end{equation*}

However, the above equation introduces significant computational complexity ($O(N^{2})$). For this reason, MiT uses efficient self-attention to reduce the computational cost. By employing a reduction ratio, denoted $R$, this process shortens the sequence, which can be expressed as follows:
\begin{align*} \hat{K} &= \text{Reshape}\left(\frac{N}{R}, C\cdot R\right)(K) \tag{4}\\ K &= \text{Linear}(C\cdot R, C)(\hat{K}) \tag{5} \end{align*}

where $K$ represents the sequence to be reduced and $C$ represents the channel dimension of the input tensor. $\text{Reshape}(\cdot)$ denotes the operation that alters the shape of the input, while $\text{Linear}(\cdot)$ refers to the linear layer that modifies the channel dimension of the input. In this way, the complexity of the self-attention mechanism in MiT is lowered from $O(N^{2})$ to $O(\frac{N^{2}}{R})$.
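The sequence-reduction step in (4) and (5) can be sketched as follows. The tensor layout (batch, sequence, channels) and the module name are assumptions; the official SegFormer implementation realizes the same reduction with a strided convolution over the spatial map.

```python
import torch
import torch.nn as nn


class SequenceReduction(nn.Module):
    """Sketch of the key/value sequence reduction in efficient self-attention, Eqs. (4)-(5)."""

    def __init__(self, channels: int, reduction_ratio: int):
        super().__init__()
        self.reduction_ratio = reduction_ratio
        self.linear = nn.Linear(channels * reduction_ratio, channels)  # Linear(C*R, C)

    def forward(self, k: torch.Tensor) -> torch.Tensor:
        batch, seq_len, channels = k.shape                             # seq_len = H * W
        k_hat = k.reshape(batch, seq_len // self.reduction_ratio,
                          channels * self.reduction_ratio)             # Reshape(N/R, C*R)
        return self.linear(k_hat)                                      # sequence length reduced to N/R
```

Attending over the reduced keys and values is what lowers the cost from $O(N^{2})$ to $O(\frac{N^{2}}{R})$, as noted above.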

Unlike ViT, which uses positional encoding to introduce location information, MiT incorporates the Mix-FFN to capture positional relationships. This structure exploits the location information leaked by zero padding, directly employing a 3 × 3 convolution in the feed-forward network. The computation of Mix-FFN is as follows:
\begin{equation*} X_{\text{out}} = \text{MLP}(\sigma (\omega (\text{MLP}(X_{\text{in}}))))+X_{\text{in}} \tag{6} \end{equation*}

where $\omega$ and $\sigma$ represent the 3 × 3 convolution operation and the GELU activation function, respectively, and $\text{MLP}(\cdot)$ denotes a multilayer perceptron. The input feature is represented by $X_{\text{in}}$, whereas the output feature is represented by $X_{\text{out}}$.
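A minimal sketch of Mix-FFN as written in (6) follows; the token layout (batch, sequence, channels) and the hidden-dimension expansion ratio of 4 are assumptions not stated in the article.

```python
import torch
import torch.nn as nn


class MixFFNSketch(nn.Module):
    """Sketch of Mix-FFN, Eq. (6): MLP -> 3 x 3 convolution -> GELU -> MLP, plus a residual."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.mlp_in = nn.Linear(channels, hidden)
        # omega: 3 x 3 (depthwise) convolution whose zero padding leaks positional information.
        self.conv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()  # sigma
        self.mlp_out = nn.Linear(hidden, channels)

    def forward(self, x: torch.Tensor, height: int, width: int) -> torch.Tensor:
        # x: (batch, N, C) token sequence with N = height * width.
        y = self.mlp_in(x)
        y = y.transpose(1, 2).reshape(x.shape[0], -1, height, width)
        y = self.conv(y).flatten(2).transpose(1, 2)
        y = self.mlp_out(self.act(y))
        return x + y  # residual connection in Eq. (6)
```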

C. Global-Local Semantic Aggregation Module

Both global and local features are pivotal in the processing of SAR images [39]. However, the convolutional kernels in CNN-based encoders are limited by their locality, preventing the capture of global dependencies. As a solution, we have developed GLSM, a method that embeds global dependencies into local features while achieving the efficient fusion of both local and global semantics. This method results in a substantial reduction in the segmentation errors caused by high similarity features during building segmentation in SAR imagery.

The structure of GLSM is illustrated in Fig. 3. The local semantic $\mathbf{M}_{n}\in \mathbb{R}^{C\times H\times W}$ is the output feature of the $n$th stage of the main encoder, while the global semantic $\mathbf{A}_{n}\in \mathbb{R}^{C\times H\times W}$ is the output feature of the $n$th stage of the auxiliary encoder. Initially, the GLSM selectively collects visual primitives from the feature map $\mathbf{A}_{n}$ to produce $N$ global semantic descriptors, represented as $\mathbf{V}\in \mathbb{R}^{C\times N}$. The global semantic descriptors comprehensively describe the relationships between different spatial locations of the global semantics and are calculated as follows:
\begin{equation*} \mathbf{V} = \varphi (\mathbf{A}_{n}) \otimes \mathbf{Z}^{\top} \tag{7} \end{equation*}

where $\mathbf{Z}=\text{Softmax}(\delta (\mathbf{A}_{n}))\in \mathbb{R}^{HW\times N}$ denotes the attention map. Specifically, a 1 × 1 convolutional layer (denoted $\delta$) is applied to the global semantic $\mathbf{A}_{n}$ to convert the channel count to $N$, after which the softmax function is applied along the spatial dimension. In addition, $\varphi$ and $\otimes$ denote a 1 × 1 convolution operation and matrix multiplication, respectively.

Fig. 3. Detailed design of the proposed GLSM.

Subsequently, a 1 × 1 convolutional layer is applied to the local semantic $\mathbf{M}_{n}$ to adjust its channel count to $N$, and the representation at each position is transformed into an $N$-dimensional vector using the softmax function. The resulting vector is then fused with the global semantic descriptors to embed the global cues into the local features; the fused outcome is denoted $\mathbf{S}\in \mathbb{R}^{C\times H\times W}$. Finally, a sequence of aggregation operations is performed on the global semantic $\mathbf{A}_{n}$, the local semantic $\mathbf{M}_{n}$, and $\mathbf{S}$ to produce the final output of the GLSM, denoted $\mathbf{F}_{n}\in \mathbb{R}^{C\times H\times W}$. The entire process can be expressed as follows:
\begin{align*} \mathbf{S}&=\mathbf{V}\otimes \text{Softmax}(\gamma (\mathbf{M}_{n})) \tag{8}\\ \mathbf{F}_{n}&=\lambda (\epsilon (\mathbf{A}_{n} \oplus \mathbf{M}_{n})\oplus \mathbf{S}) \tag{9} \end{align*}

where the elementwise addition operation is denoted by $\oplus$, and $\gamma$ and $\epsilon$ each denote a 1 × 1 convolution operation. $\lambda$ stands for a 3 × 3 convolution operation followed by BN and ReLU layers.
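The following PyTorch sketch traces (7)-(9) step by step. The softmax axes follow the description above, while the number of global semantic descriptors $N$ and other implementation details are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn


class GLSMSketch(nn.Module):
    """Minimal sketch of the global-local semantic aggregation module, Eqs. (7)-(9)."""

    def __init__(self, channels: int, num_descriptors: int = 64):
        super().__init__()
        self.delta = nn.Conv2d(channels, num_descriptors, 1)   # delta in Eq. (7)
        self.varphi = nn.Conv2d(channels, channels, 1)         # varphi in Eq. (7)
        self.gamma = nn.Conv2d(channels, num_descriptors, 1)   # gamma in Eq. (8)
        self.epsilon = nn.Conv2d(channels, channels, 1)        # epsilon in Eq. (9)
        self.lam = nn.Sequential(                              # lambda: 3 x 3 conv + BN + ReLU
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, m_n: torch.Tensor, a_n: torch.Tensor) -> torch.Tensor:
        batch, channels, height, width = a_n.shape
        # Eq. (7): N global semantic descriptors V of shape (batch, C, N).
        z = torch.softmax(self.delta(a_n).flatten(2), dim=-1)          # softmax over space
        v = torch.bmm(self.varphi(a_n).flatten(2), z.transpose(1, 2))
        # Eq. (8): embed the global descriptors into the local semantics.
        q = torch.softmax(self.gamma(m_n).flatten(2), dim=1)           # N-dim vector per position
        s = torch.bmm(v, q).view(batch, channels, height, width)
        # Eq. (9): aggregate global, local, and fused semantics.
        return self.lam(self.epsilon(a_n + m_n) + s)
```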

D. Multiscale Semantic Fusion Network

Establishing an effective mechanism for multiscale fusion is crucial for accurately recognizing multiscale targets and improving segmentation precision [24]. To fully utilize the multiscale semantic features extracted by the dual encoder, we design the MSFN as our decoder to gradually recover the spatial resolution layer by layer, which optimizes the flow of semantic information across different scales. Semantic interactions between layers are established through upsampling, downsampling, and skip connections. For this purpose, we utilize the MSFM to fuse multiscale semantics more effectively.

The specific structure of the MSFN is illustrated in Fig. 1. We first rescale the GLSM output feature maps, which have inconsistent sizes, using upsampling and downsampling operations. After unifying the sizes of all feature maps, we feed the semantics of the various scales into the MSFM for feature fusion. For simplicity and clarity, we present the structure of the first MSFM of the MSFN decoder in Fig. 4. Fig. 4 shows the semantic features $\mathbf{F}_{1}\in \mathbb{R}^{C_{1}\times \frac{H}{4}\times \frac{W}{4}}$, $\mathbf{F}_{2}\in \mathbb{R}^{C_{2}\times \frac{H}{8}\times \frac{W}{8}}$, $\mathbf{F}_{3}\in \mathbb{R}^{C_{3}\times \frac{H}{16}\times \frac{W}{16}}$, and $\mathbf{D}_{1}\in \mathbb{R}^{C_{4}\times \frac{H}{32}\times \frac{W}{32}}$, each with a distinct scale. We apply upsampling and downsampling operations to bring them to a uniform size ($\mathbb{R}^{C_{3}\times \frac{H}{16}\times \frac{W}{16}}$). Subsequently, we concatenate these four feature maps along the channel dimension. Finally, a 3 × 3 convolution, BN, and ReLU are applied to produce the fusion result $\mathbf{D}_{2}\in \mathbb{R}^{C_{3}\times \frac{H}{16}\times \frac{W}{16}}$. Here, the upsampling and downsampling operations are implemented with transposed convolution and max pooling, respectively, each followed by a ReLU activation and a 1 × 1 convolution that adjusts the number of feature map channels. The remaining two fusion steps of the MSFM follow a similar approach. Specifically, $\mathbf{F}_{1}$, $\mathbf{F}_{2}$, $\mathbf{F}_{3}$, and $\mathbf{D}_{2}$ are uniformly resized to $\mathbb{R}^{C_{2}\times \frac{H}{8}\times \frac{W}{8}}$ and fed into the MSFM to produce $\mathbf{D}_{3}\in \mathbb{R}^{C_{2}\times \frac{H}{8}\times \frac{W}{8}}$. Likewise, $\mathbf{F}_{1}$, $\mathbf{F}_{2}$, and $\mathbf{F}_{3}$, along with $\mathbf{D}_{3}$, are uniformly resized to $\mathbb{R}^{C_{1}\times \frac{H}{4}\times \frac{W}{4}}$ and fed into the MSFM to generate $\mathbf{D}_{4}\in \mathbb{R}^{C_{1}\times \frac{H}{4}\times \frac{W}{4}}$. Refer to the MSFN structure in Fig. 1 for further details.
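A sketch of one MSFM fusion step is given below. Bilinear interpolation stands in for the transposed-convolution and max-pooling resizing described above purely for brevity, and the module name is hypothetical. For the first fusion in Fig. 4, the inputs would be $\mathbf{F}_{1}$, $\mathbf{F}_{2}$, $\mathbf{F}_{3}$, and $\mathbf{D}_{1}$, the common resolution $\frac{H}{16}\times \frac{W}{16}$, and the output channel count $C_{3}$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSFMSketch(nn.Module):
    """Sketch of one multiscale fusion step: resize, 1 x 1 conv + ReLU, concatenate, 3 x 3 conv."""

    def __init__(self, in_channels: list, out_channels: int):
        super().__init__()
        # 1 x 1 convolutions with ReLU that map each resized input to the target channel count.
        self.adjust = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, out_channels, 1), nn.ReLU(inplace=True))
            for c in in_channels
        ])
        self.fuse = nn.Sequential(                              # 3 x 3 conv + BN + ReLU
            nn.Conv2d(out_channels * len(in_channels), out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats: list, target_size: tuple) -> torch.Tensor:
        resized = []
        for feat, adjust in zip(feats, self.adjust):
            if feat.shape[-2:] != target_size:
                # Interpolation used here in place of transposed convolution / max pooling.
                feat = F.interpolate(feat, size=target_size, mode="bilinear", align_corners=False)
            resized.append(adjust(feat))
        return self.fuse(torch.cat(resized, dim=1))             # fused feature D_n
```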

Fig. 4. Detailed design of the proposed MSFM.

The design of the MSFN achieves efficient multiscale feature fusion while recovering the spatial resolution layer by layer. This approach ensures that multiscale semantics flow fully throughout the decoder. Furthermore, this design substantially mitigates the loss of spatial information in the network, enhancing the ability of the semantic segmentation task to accurately identify and localize multiscale target regions. Consequently, the MSFN effectively enhances the segmentation of multiscale buildings in SAR images, significantly reducing the missed and incorrect recognition of both large building fragments and small buildings.

SECTION IV.

Experiments and Results

A. Dataset

In order to verify the effectiveness of our proposed method on the task of semantic segmentation of SAR built-up area images, we employ the benchmark dataset created by the Chinese Academy of Sciences (CAS) team derived from SpaceNet 6 [40]. The dataset consists of copolarized band SAR images (HH and VV) from SpaceNet 6, which have been cropped to nonoverlapping patches with a size of 512 × 512 pixels. The dataset is annotated with two categories, namely buildings and background. In total, the dataset comprises 17 844 images, with 13 383 allocated for training and the remaining 4 461 used for validation. Moreover, this dataset serves as the competition dataset for the BIGSARDATA 2023 AI SAR Image Segmentation Contest (AISAR Contest) in the SAR built-up area extraction track. The BIGSARDATA 2023 AISAR Contest has been held with the objective of advancing research in semantic segmentation for built-up area extraction and addressing the key challenges. This initiative is a collaborative effort spearheaded by the Aerospace Information Research Institute of CAS, Southwest Jiaotong University, Beijing University of Chemical Technology, Jiangsu University, and Nanjing University of Science and Technology. The process and results of the competition have been announced at the BIGSARDATA 2023 conference (https://bigsardata2023.casconf.cn/), organized by the Aerospace Information Research Institute of CAS.

B. Implementation Details

1) Training Settings

The experiments use a single NVIDIA Tesla P100 with 16 GB of video memory to train and evaluate all methods, implemented in the PyTorch framework. To ensure experimental fairness, uniform parameter settings are applied across all methods. Specifically, the maximum number of training epochs is set to 100. Constrained by hardware limitations, the batch size is fixed at 2. The models are trained with the adaptive moment estimation (AdamW) optimizer [41], with the initial learning rate set to $6\times 10^{-5}$ and the weight decay set to 0.01. In addition, standard data augmentation methods are applied, including randomly scaling the images with a ratio ranging from 0.5 to 2, random flipping, photometric distortion, and normalization.
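For reference, the optimizer configuration described above corresponds to the following PyTorch call; the one-layer `model` is only a placeholder standing in for any of the trained networks.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(2, 2, 3, padding=1)  # placeholder module standing in for the actual network
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=0.01)
```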

2) Evaluation Metrics

In our experimental analyses, the mean intersection over union (mIoU) and the mean F1-score (mF1) are the primary metrics for evaluating the semantic segmentation results. The IoU is calculated by dividing the intersection of the true and predicted regions by their union for each category, while the F1-score provides a comprehensive assessment of both recall and precision for each category. The IoU and F1-score are averaged across all categories to serve as the final evaluation metrics. These metrics are computed from the confusion matrix as follows:
\begin{align*} \text{IoU}&=\frac{TP}{TP+FP+FN} \tag{10}\\ \text{precision}&=\frac{TP}{TP+FP} \tag{11}\\ \text{recall}&=\frac{TP}{TP+FN} \tag{12}\\ \text{F1-score}&=2\times \frac{\text{precision}\times \text{recall}}{\text{precision}+\text{recall}} \tag{13} \end{align*}

where true positive (TP) denotes the number of pixels for which the predicted and actual labels agree, false positive (FP) denotes the number of pixels predicted as positive whose actual label is negative, and false negative (FN) denotes the number of pixels predicted as negative whose actual label is positive.
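The per-class metrics in (10)-(13) can be computed directly from a confusion matrix, as in the NumPy sketch below; the function name and the matrix orientation are assumptions.

```python
import numpy as np


def miou_and_mf1(confusion: np.ndarray):
    """Compute mIoU and mF1 from a confusion matrix whose entry (i, j) counts
    pixels of true class i predicted as class j, following Eqs. (10)-(13)."""
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return float(iou.mean()), float(f1.mean())
```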

3) Loss Function

Semantic segmentation of SAR built-up areas involves complex pixel-level predictions. We employ the standard cross-entropy loss as the loss function for the proposed method in our experiments, which is formally expressed as follows:
\begin{equation*} \text{Loss}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}\left[p_{ij}\log (\hat{p}_{ij})+(1-p_{ij})\log (1-\hat{p}_{ij})\right] \tag{14} \end{equation*}

where $N$ represents the total number of samples and $M$ represents the number of categories. Furthermore, $p$ denotes the ground-truth segmentation values, while $\hat{p}$ denotes the corresponding predictions.
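In PyTorch, the standard cross-entropy loss is applied to the per-pixel class logits and the label mask. A minimal usage sketch for the two-class setting (building and background) with the 512 × 512 patch size of the dataset follows; the random tensors are placeholders only.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(2, 2, 512, 512)           # (batch, classes, height, width)
labels = torch.randint(0, 2, (2, 512, 512))    # per-pixel ground-truth class indices
loss = criterion(logits, labels)
```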

C. Ablation Study

To validate the efficacy of the different components of the proposed LGMFNet, we conducted ablation experiments to assess the contribution of each component to the segmentation of buildings in SAR images. Specifically, we verified the efficacy of the dual encoder structure, the GLSM, and the MSFN decoder, respectively. The outcomes of the ablation experiments, with and without each component, are presented in Table I. In particular, LGMFNet (w/o dual encoder) denotes a design that removes the dual encoder structure from LGMFNet, i.e., the auxiliary encoder is removed and only the semantic features produced by the main encoder are used rather than the aggregated features from the dual-branch encoder. In LGMFNet (w/o GLSM), the GLSM is replaced by an elementwise summation of the encoded features from the dual encoder to fuse the local and global semantics. LGMFNet (w/o MSFN) denotes the use of a U-Net-style decoder instead of our proposed MSFN decoder. The ablation experiments were conducted on the BIGSARDATA 2023 AISAR Contest dataset.

TABLE I Ablation Experiment of the Proposed Components on the BIGSARDATA 2023 AISAR Contest Dataset

1) Effect of Dual Encoder Structure

The data in Table I indicate a significant decline in semantic segmentation performance after the auxiliary encoder is removed from LGMFNet. Specifically, compared with LGMFNet, the mIoU of LGMFNet (w/o dual encoder) decreases by 1.41% from 91.17% to 89.76%, and the mF1 decreases by 0.83% from 95.24% to 94.41%. Clearly, removing the dual encoder structure leaves LGMFNet relying solely on CNNs to extract local semantic relations, rendering it incapable of modeling the entire scene. These findings demonstrate that the dual encoder effectively mitigates the limitations of CNNs in extracting global context, thereby enhancing the accuracy of semantic segmentation of buildings in SAR images.

2) Effect of Global-Local Semantic Aggregation Module

As shown in Table I, the mIoU of LGMFNet (w/o GLSM) decreases from 91.17% to 90.04%, while the mF1 decreases from 95.24% to 94.58% in comparison with LGMFNet. This indicates that replacing the GLSM with an elementwise summation operation results in decreases of 1.13% in mIoU and 0.66% in mF1. To illustrate the specific contribution of the GLSM more intuitively, we visualize the segmentation results of LGMFNet with and without the GLSM in Fig. 5. As the figure shows, in the absence of the GLSM, sizable regions of buildings whose features are highly similar to the background are erroneously segmented as background. This highlights that the GLSM can fully utilize the encoded features extracted by the dual encoder, effectively embedding global relations into local features and coupling them. By cascading hierarchically, the GLSM aggregates more discriminative information, significantly alleviating the interference that highly similar ground object features in SAR images cause in building segmentation.

Fig. 5. Visualization of segmentation results without and with GLSM in LGMFNet. (a) Image. (b) Ground truth. (c) LGMFNet (w/o GLSM). (d) LGMFNet.

3) Effect of Multiscale Semantic Fusion Network

The data in Table I reveal that the mIoU of LGMFNet (w/o MSFN) decreases from 91.17% to 90.63%, and the mF1 decreases from 95.24% to 94.92% when compared with LGMFNet, corresponding to drops of 0.54% and 0.32%, respectively. This suggests that employing the U-Net structure instead of the MSFN as the decoder negatively impacts segmentation accuracy. Similarly, we present the visualization results of the ablation experiments with and without the MSFN decoder in Fig. 6. In the absence of the MSFN, the network fails to segment multiscale buildings precisely, producing unclear outlines or even missing some smaller buildings entirely. This demonstrates that the MSFN can bridge the semantic gap between scales and establish effective multiscale information interaction, effectively addressing the multiscale segmentation challenge of buildings in SAR images and particularly enhancing the segmentation of large building fragments and small buildings.

Fig. 6. Visualization of segmentation results without and with MSFN in LGMFNet. (a) Image. (b) Ground truth. (c) LGMFNet (w/o MSFN). (d) LGMFNet.

D. Comparison With State-of-the-Art Methods

We conducted a thorough comparison of the proposed LGMFNet with state-of-the-art deep learning methods, namely FCN [13], DeepLabV3+ [25], DANet [18], SegNeXt [38], SETR [20], UPerNet [42], and SegFormer [21]. The first four methods rely solely on CNNs, while the last three include transformer architectures. Among them, FCN, DeepLabV3+, and DANet employ ResNet-101 [43] as their backbone, SegNeXt employs MSCAN-L, UPerNet employs Swin-B, SETR employs ViT-L, and SegFormer employs MiT-B5.

1) Evaluation in Accuracy

Table II presents the quantitative results of our proposed method and state-of-the-art deep learning methods on the BIGSARDATA 2023 AISAR Contest dataset. The data in the table reveal that the proposed LGMFNet outperforms the other methods, achieving the highest mIoU and mF1 at 91.17% and 95.24%, respectively. Furthermore, our proposed method attains the highest accuracy in both the background and building categories. Compared with the second-best method, UPerNet, LGMFNet improves the mIoU and mF1 by 0.78% and 0.46%, respectively. Notably, UPerNet utilizes the Swin transformer as its backbone, leveraging the superiority of transformers in global modeling. In addition, the accuracy of the CNN-based SegNeXt ranks third, which is attributed to its transformer-inspired structural design and its unique decoder, which enable it to extract global context. These results highlight the critical importance of utilizing global semantics. The dual encoder design of LGMFNet effectively addresses the deficiency in global semantics extraction, resulting in a substantial improvement in the segmentation of buildings in SAR images.

TABLE II Comparison of Segmentation Results on the BIGSARDATA 2023 AISAR Contest Dataset With State-of-the-Art Methods

To compare these methods visually, we generate visualizations of the predicted results for each method, divided into two groups. In the first group, we select cases in which buildings share highly similar features with the background, as depicted in Fig. 7. In the second group, we choose cases with numerous, densely packed small buildings, which represent the main difficulty of the multiscale building segmentation problem, as shown in Fig. 8. As Fig. 7 shows, in image regions where features are easily confused, certain methods produce significant classification errors due to a lack of discriminative information. In contrast, LGMFNet consistently produces relatively accurate predictions when handling highly similar features, which can be attributed to the effectiveness of the GLSM in utilizing both global and local semantics to acquire more discriminative information. In addition, the visualization results in Fig. 8 demonstrate that LGMFNet holds a greater advantage in addressing the multiscale building segmentation problem. Particularly for large building fragments and small buildings, the proposed method excels at segmenting their edges and contours. This capability is attributed to the MSFN, which bridges the gap between multiscale semantics and establishes effective interactions among multiscale features.

Fig. 7. Visualization of inference results on the first set of experiments.

Fig. 8. Visualization of inference results on the second set of experiments.

2) Evaluation in Efficiency

To conduct thorough comparisons, we assess the efficiency of each model by evaluating its floating-point operations (FLOPs) and number of parameters. The evaluation results are shown in Table III. It can be observed that, under the same experimental setup, models incorporating the transformer architecture typically have more parameters than pure CNN models. This difference arises because the self-attention mechanism in the transformer architecture requires a large number of parameters to model global dependencies, whereas CNNs share parameters across local receptive fields, thereby reducing the parameter count. Notably, although the dual encoder in the backbone of LGMFNet increases the number of parameters and the computational complexity, both its parameters and FLOPs remain within an acceptable range. This is because both encoders employ lightweight backbones, ensuring the applicability of LGMFNet in scenarios where algorithmic efficiency is required.

TABLE III Comparison of Model Parameters and FLOPs

E. Comparison With Existing Methods Designed for SAR Imagery

We compare the proposed LGMFNet with semantic segmentation methods designed specifically for SAR images, namely MS-FCN [15] and MP-ResNet [16]. The experimental results are presented in Table IV. As can be seen, the segmentation performance of our method significantly surpasses that of the other two SAR image segmentation methods. An analysis of the results suggests that the model structure employed by MS-FCN is overly simplistic and fails to leverage multiscale information, resulting in a substantial loss of semantic information. In contrast, MP-ResNet enhances local feature extraction by expanding the receptive field and employing a multiscale feature fusion strategy; however, it is constrained by the lack of global modeling capability in CNNs for encoding global context information. Therefore, these methods prove inadequate for the challenging task of segmenting buildings in SAR images.

TABLE IV Comparison Between the Proposed LGMFNet and Other Methods Designed for SAR Images on the BIGSARDATA 2023 AISAR Contest Dataset
SECTION V.

Conclusion

In this article, we propose a novel dual encoder–decoder architecture network called LGMFNet, which aims to provide precise semantic segmentation of buildings in urban scenes using SAR images. To overcome the limitations of CNNs in global modeling, we employ a design consisting of dual-branch encoders. Within the dual encoder, the CNN-based main encoder and the transformer-based auxiliary encoder are utilized to extract the local context and global context, respectively. In addition, we designed GLSM to extract more discriminative features. Within the decoder, we introduce MSFN to facilitate efficient multiscale feature interaction. Extensive experiments demonstrate that LGMFNet achieves superior performance in terms of mIoU and mF1 metrics compared to other state-of-the-art methods on the BIGSARDATA 2023 AISAR Contest dataset. Experimental results confirm the efficacy of the proposed method in effectively mitigating interference caused by similar features in the semantic segmentation of buildings in SAR images. Furthermore, it demonstrates that our method effectively addresses the multiscale segmentation challenges arising from the complexity of building structures.

However, although LGMFNet improves the semantic segmentation of buildings in SAR images, its adaptability to applications beyond SAR urban scenes has not been proven. Our future research will focus on investigating the semantic segmentation of SAR imagery in different application scenarios. We hope this work will inspire further researchers to explore the potential of deep learning in SAR image processing.
