Semantic Segmentation of Remote Sensing Images With Transformer-Based U-Net and Guided Focal-Axial Attention


Abstract:

In the field of remote sensing, semantic segmentation of unmanned aerial vehicle (UAV) imagery is crucial for tasks such as land resource management, urban planning, precision agriculture, and economic assessment. Traditional methods use convolutional neural networks (CNNs) for hierarchical feature extraction but are limited by their local receptive fields, restricting comprehensive contextual understanding. To overcome these limitations, we propose a combination of transformer and attention mechanisms to improve object classification, leveraging their superior information modeling capabilities to enhance scene understanding. In this article, we present Swin-based focal axial attention network (SwinFAN), a U-Net framework featuring a Swin transformer as encoder, equipped with a novel decoder that introduces two new components for enhanced semantic segmentation of urban remote sensing images. The first proposed component is a guided focal-axial (GFA) attention module that combines local and global contextual information, enhancing the model's ability to discern intricate details and complex structures. The second component is an innovative attention-based feature refinement head (AFRH) designed to improve the precision and clarity of segmentation outputs through self-attention and convolutional techniques. Comprehensive experiments demonstrate that the accuracy of our proposed architecture significantly outperforms state-of-the-art models. More specifically, our method achieves mean intersection over union (mIoU) improvements of 1.9% on UAVid, 3.6% on Potsdam, 1.9% on Vaihingen, and 0.8% on LoveDA.
Page(s): 18303 - 18318
Date of Publication: 27 September 2024

SECTION I.

Introduction

Urban scene segmentation from remote sensing imagery is a challenging topic in image processing, due to the complex and diverse nature of covered landscapes. High-resolution images are now increasingly available and offer detailed semantic and spatial datasets, crucial for applications such as land cover mapping, change detection, and environmental monitoring [1].

Deep learning advancements, particularly convolutional neural networks (CNNs), have significantly advanced semantic segmentation research and applications [2]. CNNs excel in representing local context information, significantly outperforming traditional machine learning techniques in feature representation and pattern recognition [3]. However, despite their effectiveness, CNNs inherently struggle with global contextual information [4], crucial for accurate per-pixel classification in complex urban scenes recorded from unmanned aerial vehicles (UAVs).

The integration of various attention mechanisms into CNNs marked a significant evolution, encompassing spatial [5], temporal [6], spectral [7], cross-modal [8], or graph [9] attention networks. This evolution led to the incorporation of self-attention mechanisms and paved the way for the adaptation of transformer networks to image processing tasks, especially for segmenting urban scenes [10]. Originally designed for natural language processing, transformer models have distinguished themselves from traditional approaches by focusing on attention mechanisms without relying on convolutional layers [11].

These developments led to the introduction of global-local network architectures [12], which aim to reconcile detailed features with complex spatial relationships present in urban scenes. This approach has been used for a large range of applications such as object classification [13], urban land cover mapping [14], [15], change detection [16], [17], and environmental monitoring [18].

In this article, to address the limitations and deficiencies of conventional CNN architectures and inspired by recent advancements in attention mechanisms, we introduce swin-based focal axial attention network (SwinFAN). The proposed framework employs a U-Net architecture, where the Swin transformer acts as an encoder and the decoder is composed of two novel components, a guided focal-axial (GFA) attention module and an attention-based feature refinement head (AFRH).

The main contributions of our work are as follows.

  1. We propose the GFA attention component, a hybrid attention mechanism that integrates a focal module with an axial attention system. The focal attention targets key salient regions, while the axial attention spans broader areas, ensuring a detailed and comprehensive understanding of both local and global features in the imagery.

  2. We introduce the AFRH, the final component of our decoder. This mechanism uses self-attention and convolutional techniques to merge spatial information with contextual data, significantly improving the clarity and accuracy of the segmentation output.

  3. Our methodology demonstrates superior performance compared to state-of-the-art models, being rigorously tested across several key public datasets, including UAVid [19], ISPRS Potsdam and Vaihingen [20], and LoveDA [21]. Our proposed network excels in diverse imaging conditions, showing robustness and adaptability for both oblique and nadir views, making it suitable for a wide range of remote sensing applications.

The rest of this article is organized as follows. In Section II, we survey semantic segmentation techniques, particularly of remote sensing imagery, discussing the evolution from CNN-based methods to transformer-based architectures, and the significance of global-local contextual models. In Section III, we elaborate on our proposed architecture, explaining how the Swin transformer functions as the encoder in a U-Net configuration, supported by our GFA attention module and the AFRH within the decoder. Section IV presents a comparative analysis of our model against current state-of-the-art methodologies, along with ablation studies that evaluate the individual contributions of each component in our proposed architecture, underscoring their effectiveness for the task of semantic segmentation of remote sensing images. Section V discusses the results, remaining challenges, and directions for future work. Finally, Section VI concludes this article.

SECTION II.

Related Work

A. CNN-Based Semantic Segmentation Networks

CNNs have been widely used for semantic segmentation in remote sensing, leveraging their strong feature extraction capabilities to analyze complex spatial patterns in high-resolution images [22]. A key development in CNN-based semantic segmentation is the DeepLabV3+ framework [23], which uses an encoder–decoder structure with atrous separable convolutions. This design effectively captures multiscale contextual information, addressing the segmentation of objects at varying scales in high-resolution imagery without significantly increasing computational demands [24].

This field also emphasizes foreground-focused segmentation with methods like FarSeg [25], which uses a foreground-aware relation network to highlight contextual relationships, and FactSeg [26], employing a foreground activation technique for accurate segmentation of small objects in detailed imagery.

Pyramid structures and feature fusion are also pivotal, exemplified by PSPNet [27] with its pyramid pooling module to gather multiscale contextual information, SemanticFPN [28], which integrates feature pyramids with segmentation, and hybrid networks that merge PSPNet with superpixel techniques for enhanced land cover classification in multispectral imagery [15]. SwiftNet [29] further demonstrates efficient segmentation through pyramidal fusion, optimizing accuracy and computational efficiency.

The U-Net architecture, originally designed for medical image segmentation, has expanded to address various remote sensing tasks with its encoder–decoder design and skip connections, enhancing detail and context integration [30], [31], [32]. Adaptations include deeper layers and residual connections to manage the complex variability of UAV data.

Challenges in CNN-based semantic segmentation include difficulties in segmenting small objects in dense urban settings [33], [34], high computational demands [35], [36] limiting real-time or low-capacity platform use, and model generalization issues across different environments [37], [38], which necessitate large training datasets and adaptable models for consistent performance.

B. Attention and Transformer-Based Semantic Segmentation Models

In CNNs, context refers to spatial information captured by convolutional filters at various layers, while attention mechanisms in neural networks, like those in transformer-based models, selectively prioritize input sequence elements based on task relevance, dynamically adjusting focus. Key developments in the field of remote sensing include the integration of attention within architectures such as A²-FPN [39], which enhances focus on salient features, DANet [40] employing self-attention for spatial and channel context, and RDNet [41] using a reverse difference mechanism for effective segmentation across diverse object sizes in urban aerial imagery.

Transformer-based models have significantly impacted semantic segmentation by utilizing self-attention to capture long-range dependencies and global contexts more effectively than CNNs. Examples include BANet [42] combining transformers with convolution layers, BoTNet [43] embedding self-attention in ResNet, and various configurations of the Swin transformer such as SwinTF-FPN [44], SwinUperNet [45], and AerialFormer [46], showcasing the adaptability of transformers in analyzing UAV datasets.

Combining CNNs with transformers, models like ICTNet [47], FT-UNetFormer [48], and CMTFNET [49] merge local feature extraction with global contextual insights, enhancing segmentation in complex environments. Additionally, models such as SegFormer [50] and TransUNet [51] improve multiscale feature representation using diverse transformer structures for detailed spatial analysis.

However, the deployment of attention and transformer models in semantic segmentation faces challenges, particularly the high computational demands of processing high-resolution images, which require significant power and memory resources [52]. These models also need large, diverse training datasets to effectively learn complex image patterns [53], presenting significant hurdles in acquiring well-annotated data, especially in scenarios where such resources are scarce or expensive.

C. Global Local Attention Architectures

In remote sensing semantic segmentation, the integration of global and local analyses significantly enhances model performance by melding broad scene insights with detailed recognition. Models like CANet [54] and CMTFNET [49] exemplify this approach; CANet achieves real-time segmentation through a context aggregation network that combines local and global information, while CMTFNET merges CNNs with multiscale transformers to balance nuanced detail with contextual understanding.

UNetFormer [48] introduced a global-local attention mechanism, blending attention-based global context capture with detailed local analysis, improving accuracy and context-awareness. Similarly, EDDformer [55] and EMRT [56] demonstrate transformers' ability to encode both broad and specific spatial features, enhancing multiscale representation.

While focusing on segmentation accuracy, these approaches also prioritize computational efficiency, a critical factor given the often extensive and complex datasets used in remote sensing. Models like ICTNet [47] and GLOTS [57] aim to balance computational demands with effective segmentation outcomes.

Despite the challenges associated with global-local approaches, such as computational complexity, heightened memory requirements, dependence on extensive training data, risks of overfitting, and optimization hurdles [58], their advantages significantly outweigh these issues. Building on the strengths of global-local methodologies like UNetFormer [48], we introduce two novel components to improve semantic segmentation: the GFA block and the AFRH. Unlike the predefined separation of global and local contexts in the global-local attention block (GLTB) of UNetFormer, our proposed GFA module dynamically fuses multiscale focal and axial attention features to more effectively capture both local details and broader context. Our AFRH component differentiates itself from the feature refinement head (FRH) in UNetFormer by incorporating self-attention, which enables more refined feature adjustment and enhances the segmentation results' precision and clarity.

SECTION III.

Methodology

A. Network Overview

The proposed SwinFAN architecture, depicted in Fig. 1, combines a Swin transformer encoder with a GFA attention decoder within a four-level U-Net structure for UAV image semantic segmentation. The architecture includes several Swin Blocks that progressively downsample the image, reducing its spatial resolution while increasing the feature dimension. This downsampling process allows the model to capture higher level semantic information at various scales. The attention-based decoder then progressively upsamples these features, refining the segmentation with the help of GFA attention and weighted sum operations to ensure precise boundary delineation and detailed scene comprehension.

Fig. 1. SwinFAN U-Net architecture. The encoder captures features at multiple scales using Swin transformer blocks, while the decoder employs GFA attention for integrating global and local attention features. As a final step, the AFRH refines the output to produce a precise segmentation map.

To combine the features extracted by the Swin transformer block in the encoder with the ones generated by the attention-based decoder module, we perform a weighted sum operation, similar to [48]. This combines the two output features based on their contributions, with the following formula:
\begin{equation*} \text{FF} = \alpha \cdot \text{SW} + (1 - \alpha) \cdot \text{GFA} \tag{1} \end{equation*}
where $\text{FF}$ is the fused feature resulting from the weighted sum operation, $\text{SW}$ is the feature produced by the Swin transformer, $\text{GFA}$ is the feature produced by the GFA attention block, and $\alpha$ is a matrix of values, with each value corresponding to the attention weight at a specific spatial location.
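For concreteness, the fusion in (1) can be sketched in PyTorch, the framework used in our implementation. The sketch below assumes, purely for illustration, that $\alpha$ is predicted from the encoder feature by a 1×1 convolution with a sigmoid; the class and layer names are hypothetical rather than taken from the released code.

```python
# Minimal sketch of the weighted-sum fusion in (1). The 1x1 conv + sigmoid
# used to predict alpha is an assumption for illustration, and the class
# name is hypothetical.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse an encoder (Swin) feature with a decoder (GFA) feature of the same shape."""

    def __init__(self, channels: int):
        super().__init__()
        # Per-pixel weight map alpha in [0, 1], broadcast over channels.
        self.alpha = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, sw: torch.Tensor, gfa: torch.Tensor) -> torch.Tensor:
        a = self.alpha(sw)                      # (B, 1, H, W)
        return a * sw + (1.0 - a) * gfa         # FF = alpha * SW + (1 - alpha) * GFA
```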

B. Swin-Based Encoder

The Swin transformer is a hierarchical architecture that diverges from traditional CNNs by partitioning images into patches and employing self-attention mechanisms within these patches across multiple scales [45]. It introduces shifted windowing in its layers to manage self-attention computation, effectively capturing local and global contextual information while maintaining computational efficiency. By merging patches at successive stages and alternating the window partitions, the Swin transformer ensures exhaustive coverage across the image, enabling the integration of multiscale features essential for complex vision tasks.

In mathematical terms, given an input image of size $H \times W$ and patch size $P \times P$, the number of patches is $N = \frac{H \times W}{P \times P}$. Each patch is linearly embedded into a $D$-dimensional space. Window-based multihead self-attention (W-MSA) is applied to each window within the input feature $X$
\begin{equation*} \text{W-MSA}(Q, K, V) = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \tag{2} \end{equation*}
where $Q$, $K$, and $V$ are the query, key, and value matrices, $d_{k}$ is the key dimension, and $T$ denotes the transpose operation.

Shifted window partitioning is introduced in alternate layers by shifting the feature map $x^{l}$ before self-attention. The shift size $S$ is half the window size, $S = M/2$
\begin{equation*} x^{l}_{\text{shifted}}(i, j) = x^{l}(i + S, j + S) \tag{3} \end{equation*}
where $i$ and $j$ denote the spatial positions within the feature map. SW-MSA is computed within shifted windows using
\begin{equation*} \text{SW-MSA}(Q, K, V) = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V. \tag{4} \end{equation*}
After SW-MSA, the output is shifted back to its original position. This shifting captures interactions between features in different windows, increasing the receptive field without a high computational cost. A two-layer MLP with GELU nonlinearity follows the W-MSA and SW-MSA blocks. LayerNorm (LN) is applied before each block, and residual connections enhance training stability, as represented in Fig. 2(a).
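A simplified sketch of the window partitioning and cyclic shift behind W-MSA/SW-MSA in (2)–(4) is given below. It is single-head, omits the relative position bias and attention masking of the full Swin block, and uses helper names that are illustrative rather than taken from the official implementation.

```python
# Simplified window partitioning and the cyclic shift behind W-MSA/SW-MSA in
# (2)-(4): single-head, no relative position bias or attention masking, and
# helper names are illustrative (not the official Swin implementation).
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, m: int) -> torch.Tensor:
    """Split a (B, H, W, C) map into non-overlapping M x M windows (H, W divisible by M)."""
    B, H, W, C = x.shape
    x = x.view(B, H // m, m, W // m, m, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, C)   # (num_windows * B, M*M, C)

def window_self_attention(windows: torch.Tensor, qkv: nn.Linear) -> torch.Tensor:
    """Scaled dot-product self-attention inside each window, as in (2) and (4)."""
    q, k, v = qkv(windows).chunk(3, dim=-1)                    # qkv: nn.Linear(C, 3 * C)
    d_k = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ v

# Shifted windows, Eq. (3): roll the map by S = M // 2 before partitioning,
# attend, then roll the output back to its original position.
x = torch.randn(1, 8, 8, 32)                                   # toy (B, H, W, C) feature map
m = 4
x_shifted = torch.roll(x, shifts=(-m // 2, -m // 2), dims=(1, 2))
out = window_self_attention(window_partition(x_shifted, m), nn.Linear(32, 96))
```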

Fig. 2. Key building blocks of the SwinFAN architecture. (a) Swin block: consists of a series of layer normalization, multihead self-attention mechanisms (window-based and shifted window), and multilayer perceptrons. (b) GFA attention block: implements an elementwise multiplication of the focal module, which refines features at two scales to emphasize local detail, and the axial attention, which sequentially targets the feature map's vertical and horizontal axes to capture global dependencies.

C. Guided Focal-Axial Attention Decoder

In our proposed decoder, the GFA attention module employs the focal module to generate a mask, which is then multiplied elementwise with the output of the axial attention applied sequentially along the height and width dimensions of the feature map
\begin{equation*} \text{Guided focal-axial mask}(x) = \text{FocalModule}(x) \odot \text{AxialAttention}(x) \tag{5} \end{equation*}
where $\odot$ denotes the elementwise (Hadamard) product. This mechanism allows the model to focus on finer details in areas of interest while still capturing the larger contextual information along each axis.

The focal module is a computational unit designed to enhance the processing of spatial features within an input feature map through the creation of a spatially adaptive weight mask. This mask helps identify and emphasize significant regions by reducing the resolution of the feature map, thereby simplifying data complexity and reducing computational demands. This allows the network to process larger blocks of information more efficiently. Additionally, the module enlarges the network's receptive field, enabling it to focus on broader, more impactful areas. This expanded view is crucial for maintaining the context necessary for understanding the overall scene structure. The broader regions identified are then refined in detail by axial attention, ensuring that the network effectively captures both the broad and detailed aspects of the image.

This is achieved by analyzing the input across multiple scales or levels of granularity, leveraging convolutional operations to assess and prioritize areas based on their contextual relevance or information density. Our implementation of the focal mechanism, as seen in Fig. 2(b), downscales the feature map, applies a convolution, and then upscales it back. Mathematically, it can be represented as
\begin{equation*} \text{FocalMask}(x) = \frac{1}{n} \sum_{i=1}^{n} \text{Upscale}\left(\text{Conv}\left(\text{Downscale}\left(x, \frac{1}{2^{i}}\right)\right), \text{size}(x)\right) \tag{6} \end{equation*}
where $x$ is the input feature map and $n$ is the number of scales. We perform bilinear interpolation and employ two scales, meaning the feature map is downsampled to one-half and one-quarter of its original resolution. The final output is the average of the rescaled feature maps.
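A minimal PyTorch sketch of the two-scale focal mask in (6) is shown below, assuming bilinear resampling and one 3×3 convolution per scale; the module name and kernel size are illustrative choices, not the released code.

```python
# Sketch of the two-scale focal mask in (6), assuming bilinear resampling and
# a 3x3 convolution per scale; module and parameter names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalModule(nn.Module):
    def __init__(self, channels: int, num_scales: int = 2):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(num_scales)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        masks = []
        for i, conv in enumerate(self.convs, start=1):
            down = F.interpolate(x, scale_factor=1 / 2 ** i,        # 1/2 and 1/4 resolution
                                 mode="bilinear", align_corners=False)
            up = F.interpolate(conv(down), size=(h, w),
                               mode="bilinear", align_corners=False)
            masks.append(up)
        return torch.stack(masks).mean(dim=0)                       # average over scales, Eq. (6)
```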

Axial attention decomposes the standard attention mechanism into axial components, focusing along one dimension at a time (height or width) [59]. Height attention computes self-attention for each vertical slice, aggregating contextual information along the feature map's columns, while width attention mirrors this process across each horizontal slice, integrating contextual information along rows. The axial attention mechanism, applied consecutively along the height and width dimensions, can be represented as follows:
\begin{gather*} H = \text{Softmax}\left(\frac{\text{QKV}_{\text{height}}(x)}{\sqrt{d_{k}}}\right) \tag{7}\\ P_{H} = \text{Proj}_{\text{height}}(H) \tag{8}\\ W = \text{Softmax}\left(\frac{\text{QKV}_{\text{width}}(P_{H})}{\sqrt{d_{k}}}\right) \tag{9}\\ \text{AxialAttention}(x) = \text{Proj}_{\text{width}}(W) \tag{10} \end{gather*}
where $\text{QKV}_{\text{height}}$ and $\text{QKV}_{\text{width}}$ refer to the computation of queries, keys, and values for the height and width dimensions, respectively, $\text{Softmax}$ represents the softmax operation applied to the scaled dot products, $\text{Proj}_{\text{height}}$ and $\text{Proj}_{\text{width}}$ are linear projections for each dimension, and $d_{k}$ is the dimension of each head's key/query vectors.
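The axial attention in (7)–(10) can be sketched as follows. The sketch is single-head, omits positional encodings, and uses layer names that are assumptions for illustration.

```python
# Single-head axial attention along height, then width, following (7)-(10);
# positional terms and multihead splitting are omitted, and layer names are
# assumptions for illustration.
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.qkv_h, self.proj_h = nn.Linear(channels, 3 * channels), nn.Linear(channels, channels)
        self.qkv_w, self.proj_w = nn.Linear(channels, 3 * channels), nn.Linear(channels, channels)

    @staticmethod
    def _attend(x: torch.Tensor, qkv: nn.Linear, proj: nn.Linear) -> torch.Tensor:
        # x: (..., L, C), attention computed over the axis of length L.
        q, k, v = qkv(x).chunk(3, dim=-1)
        d_k = q.shape[-1]
        attn = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
        return proj(attn @ v)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> attend along columns (height), then rows (width).
        x = x.permute(0, 3, 2, 1)                                # (B, W, H, C)
        x = self._attend(x, self.qkv_h, self.proj_h)             # Eqs. (7)-(8)
        x = x.permute(0, 2, 1, 3)                                # (B, H, W, C)
        x = self._attend(x, self.qkv_w, self.proj_w)             # Eqs. (9)-(10)
        return x.permute(0, 3, 1, 2)                             # back to (B, C, H, W)
```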

D. Attention-Based Feature Refinement Head

In our proposed network, the AFRH enhances the quality of feature representations, as the last step before the output processing stage. As illustrated in Fig. 3, the implementation begins with a preconvolution layer that aligns the channel dimensions of the residual input with the decode channels, preparing the features for fusion. The module then concatenates these aligned residual features with the upsampled input features.

Fig. 3. Architecture of the AFRH.

Subsequently, the feature map undergoes processing through a self-attention mechanism, which integrates a convolutional layer with a sigmoid activation function to produce attention weights. Important features are enhanced, while less significant ones are diminished, resulting in a feature map enriched with semantic significance and detailed spatial information. The refined features are then processed through a combination of a depthwise separable convolution and a shortcut convolutional connection, which preserve the integrity of the initial feature representations and mitigate the vanishing gradient problem while providing an alternate pathway for gradient flow during backpropagation.
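The AFRH data flow described above can be outlined as in the following sketch; channel arrangement, kernel sizes, and the exact form of the sigmoid attention are assumptions made for illustration rather than the exact implementation.

```python
# Rough sketch of the AFRH data flow: pre-convolution on the residual input,
# concatenation with the upsampled decoder feature, sigmoid attention
# weighting, then a depthwise separable convolution with a shortcut path.
# Channel sizes, kernel sizes, and the attention form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFRH(nn.Module):
    def __init__(self, residual_ch: int, decode_ch: int, num_classes: int):
        super().__init__()
        self.pre = nn.Conv2d(residual_ch, decode_ch, kernel_size=1)     # align residual channels
        fused = 2 * decode_ch
        self.attn = nn.Sequential(nn.Conv2d(fused, fused, kernel_size=3, padding=1), nn.Sigmoid())
        self.dwsep = nn.Sequential(                                      # depthwise separable conv
            nn.Conv2d(fused, fused, kernel_size=3, padding=1, groups=fused),
            nn.Conv2d(fused, decode_ch, kernel_size=1),
        )
        self.shortcut = nn.Conv2d(fused, decode_ch, kernel_size=1)       # alternate gradient path
        self.classifier = nn.Conv2d(decode_ch, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(x, size=residual.shape[-2:], mode="bilinear", align_corners=False)
        fused = torch.cat([x, self.pre(residual)], dim=1)
        fused = fused * self.attn(fused)                 # emphasize important features
        refined = self.dwsep(fused) + self.shortcut(fused)
        return self.classifier(refined)
```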

E. Loss Function

In our architecture, the training process is guided by a hybrid loss function that combines the strengths of cross-entropy loss and dice loss. This integrated approach is designed to improve pixel-level classification precision as well as the accuracy of spatial overlap, both vital for the effectiveness of image segmentation tasks.

The joint loss function is defined as
\begin{equation*} \text{Loss} = L_{\text{ce}} + L_{\text{dice}}. \tag{11} \end{equation*}

The cross-entropy loss is given by
\begin{equation*} L_{\text{ce}} = - \frac{1}{N}\sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \log (\hat{y}_{nk}) \tag{12} \end{equation*}
where $N$ is the number of samples, $K$ is the number of classes, $y_{nk}$ is the true label, and $\hat{y}_{nk}$ is the predicted probability for each class $k$ in sample $n$. The label smoothing applied to the cross-entropy loss contributes to the stability of the training process and enhances the model's ability to generalize well to unseen data.

Similarly, the dice loss is defined as
\begin{equation*} L_{\text{dice}} = 1 - \frac{2}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{y_{nk} \hat{y}_{nk}}{y_{nk} + \hat{y}_{nk}}. \tag{13} \end{equation*}
The incorporation of dice loss makes the training process less susceptible to issues arising from class imbalance, which is common in segmentation tasks on UAV imagery.
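A compact sketch of the joint loss in (11)–(13) is given below; the label-smoothing factor and the small stabilizing constant in the dice term are illustrative values not specified in the text.

```python
# Sketch of the joint loss in (11)-(13): label-smoothed cross-entropy plus a
# soft dice term; the smoothing factor and the stabilizing epsilon are
# illustrative values not specified in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLoss(nn.Module):
    def __init__(self, num_classes: int, smoothing: float = 0.05, eps: float = 1e-6):
        super().__init__()
        self.ce = nn.CrossEntropyLoss(label_smoothing=smoothing)
        self.num_classes = num_classes
        self.eps = eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, K, H, W); target: (B, H, W) integer class indices.
        ce = self.ce(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        denom = (probs + one_hot).sum(dim=(0, 2, 3))
        dice = 1.0 - (2.0 * inter / (denom + self.eps)).mean()   # soft dice over classes
        return ce + dice                                          # Eq. (11)
```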

SECTION IV.

Experimental Results

A. Datasets

1) UAVid

The UAVid dataset [19] contains urban scene recordings captured from UAVs, specifically tailored for semantic segmentation challenges within aerial images. The dataset contains high-resolution images of sizes 3840×2160 and 4096×2160, which provide detailed views of urban scenes from two cities (one in Germany and one in China) at an oblique camera angle. It includes a diverse range of urban landscapes and provides high-quality annotations for eight classes (building, road, tree, vegetation, moving car, static car, human, and clutter). The dataset is divided into 200 images for training, 70 images for validation, and 150 images for testing.

2) Potsdam

The Potsdam dataset [60] is a collection of aerial imagery, extensively used for urban scene analysis and semantic segmentation in remote sensing research. It includes 38 high-resolution images (6000×6000) of the Potsdam area in Germany, recorded from a nadir view. The dataset comprises a variety of urban elements annotated in five classes (impervious surfaces, buildings, low vegetation, trees, and cars). We use the same data distribution as in [39] and [48], employing 14 images for testing (2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, 7_13), image 2_10 for validation, and the remaining 22 images for training (except image 7_10, which has erroneous annotations).

3) Vaihingen

The Vaihingen dataset [61] consists of 33 high-resolution aerial images (6000×6000) of Vaihingen, Germany, captured from a nadir viewpoint. The dataset is characterized by its detailed annotations and diverse urban features, including labeled annotations for five classes (impervious surfaces, building, low vegetation, tree, and car). We utilize the same data distribution as in [39] and [48], using 15 images for training (1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 32, 34, 37), image 30 for validation, and the remaining 17 images for testing.

4) LoveDA

The LoveDA dataset [21] is a collection of remote sensing images tailored for semantic segmentation in urban, rural, and mixed scenes. It features high-resolution satellite imagery capturing diverse landscapes and urban features from three Chinese cities (Changzhou, Nanjing, and Wuhan). This dataset is designed to address challenges in land-cover classification and urban scene analysis. LoveDA provides rich annotations for seven classes (background, building, road, water, barren, forest, and agriculture). In total, it contains 5987 high-resolution optical remote sensing images of size 1024×1024, of which 2522 are used for training, 1669 for validation, and 1796 for testing.

B. Implementation Details

In the development of our network, we chose the PyTorch framework due to its flexibility and extensive support for deep learning applications. The computational experiments were performed on a single NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. For network optimization, we employed the AdamW optimizer for its weight decay and stability benefits, starting with a learning rate of 6e-4 that is dynamically adjusted by cosine annealing over 45 epochs, with a batch size of 8. This method balances computational efficiency and gradient accuracy.
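The training setup above can be summarized by the following sketch, which wires AdamW, an initial learning rate of 6e-4, cosine annealing, 45 epochs, and batches of 8; the weight-decay value and the function signature are assumptions for illustration.

```python
# Training-loop sketch matching the stated setup: AdamW, initial LR 6e-4,
# cosine annealing over 45 epochs, batch size 8 (handled by the data loader).
# The weight-decay value and function signature are assumptions.
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, criterion, epochs: int = 45) -> None:
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, targets in train_loader:        # batches of 8 cropped 1024x1024 tiles
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()                            # cosine decay of the learning rate per epoch
```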

In preprocessing, we cropped images from the UAVid, Potsdam, and Vaihingen datasets to 1024×1024. For UAVid, we used random flips and brightness adjustments for augmentation, plus horizontal and vertical flips in testing for enhanced model robustness. For the LoveDA, Potsdam, and Vaihingen datasets, our augmentation included random flips and scaling (0.5–1.5), with random flips and multiscaling in testing to ensure the model's adaptability to various sizes and conditions. Multiscale testing, utilized in multiple works [48], [55], [62], enhances the robustness and generalization of semantic segmentation models by exposing them to objects at various scales, simulating real-world UAV flight conditions.

C. Evaluation Metrics

In our experiments, we report on several key performance metrics, including the mean intersection over union (mIoU), mean F1 score (F1), and overall accuracy (OA). For the mIoU, the formula is as follows:
\begin{equation*} \text{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{\text{TP}_{c}}{\text{TP}_{c} + \text{FP}_{c} + \text{FN}_{c}}. \tag{14} \end{equation*}

The F1 score is a harmonic mean of Precision and Recall, given by
\begin{gather*} \text{Precision} = \frac{1}{C} \sum_{c=1}^{C} \frac{\text{TP}_{c}}{\text{TP}_{c} + \text{FP}_{c}} \tag{15}\\ \text{Recall} = \frac{1}{C} \sum_{c=1}^{C} \frac{\text{TP}_{c}}{\text{TP}_{c} + \text{FN}_{c}} \tag{16}\\ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{17} \end{gather*}

Lastly, the OA is defined as
\begin{equation*} \text{OA} = \frac{\sum_{c=1}^{C} \text{TP}_{c}}{\sum_{c=1}^{C} \left(\text{TP}_{c} + \text{TN}_{c} + \text{FP}_{c} + \text{FN}_{c}\right)}. \tag{18} \end{equation*}

In these formulas, TP, TN, FP, and FN represent the true positives, true negatives, false positives, and false negatives, respectively, while C is the number of classes in the dataset. These metrics collectively provide a comprehensive evaluation of our model's performance across various aspects of semantic segmentation accuracy.
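These metrics can be computed from a per-class confusion matrix as in the sketch below; mIoU, precision, recall, and F1 follow (14)–(17), while OA is computed in the standard pixel-accuracy form (correct pixels over all pixels). The function name is illustrative.

```python
# Computing mIoU, F1, and OA from a per-class confusion matrix. mIoU, precision,
# recall, and F1 follow (14)-(17); OA is computed in the standard pixel-accuracy
# form (correct pixels over all pixels). Function name is illustrative.
import numpy as np

def metrics_from_confusion(cm: np.ndarray, eps: float = 1e-12) -> dict:
    """cm[i, j] = number of pixels of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                      # column sums minus the diagonal
    fn = cm.sum(axis=1) - tp                      # row sums minus the diagonal
    miou = np.mean(tp / (tp + fp + fn + eps))                     # Eq. (14)
    precision = np.mean(tp / (tp + fp + eps))                     # Eq. (15)
    recall = np.mean(tp / (tp + fn + eps))                        # Eq. (16)
    f1 = 2 * precision * recall / (precision + recall + eps)      # Eq. (17)
    oa = tp.sum() / cm.sum()
    return {"mIoU": miou, "F1": f1, "OA": oa}
```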

D. Models for Comparison

We selected the following state-of-the-art networks for comparison with our proposed methodology.

1) CNN-Based

These include well-established architectures that have been widely recognized for their performance in semantic segmentation tasks, such as BEDSN [63], C-PNet [64], DCA [65], DeepLabV3+ [23], FactSeg [26], FarSeg [25], Mamba-UNet [66], MSD [19], PSPNet [27], SE-UNet [67], SemanticFPN [28], SwiftNet [29], and UNet++ [68].

2) Attention-Based

Networks that have shown promise in capturing contextual information and improving segmentation accuracy: A²-FPN [39], CANet [54], DANet [40], MFNet [69], MTANet [70], RDNet [41], and UNetFormer [48].

3) Transformer-Based

Architectures that have the ability to capture long-range dependencies in images: BANet [42], BoTNet [43], CG-Swin [62], CMTFNET [49], CoaT [71], CSTUNet [72], DC-Swin [73], DSCT [74], EDDformer [55], Efficient-L [52], EMRT [56], FT-UNetFormer [48], GLOTS [57], ICTNet [47], SegFormer [50], Segmenter [75], ST-UNet [76], STransFuse [77], SwinB-CNN+BD [78], SwinTF-FPN [44], SwinUperNet [45], TransUNet [51], and WSSS [79].

4) Global-Local Architectures

These models incorporate global and local attention mechanisms, making them particularly relevant for evaluating the impact of such architectures on semantic segmentation tasks: CMTFNET [49], EDDformer [55], EMRT [56], FT-UNetFormer [48], GLOTS [57], ICTNet [47], LGBSwin [80], and UNetFormer [48].

E. Results on the UAVid Dataset

In our evaluation on the official UAVid benchmark, presented in Table I, our methodology achieves the highest mIoU of 70.6%, surpassing competitors such as MFNet [69] and RDNet [41], which scored 68.7% and 68.2%, respectively. Our proposed network excels particularly in the building, road, and human categories, with scores of 89.5%, 82.6%, and 33.0%, demonstrating robust capability in handling less represented classes such as pedestrians. While our model leads in several categories, it faces strong competition in static car and clutter from EDDformer [55] and MFNet [69], ranking second in the clutter category, a segment noted for its complexity.

The comparative analysis of CNN-based, attention-based, and global-local architectures on the UAVid test dataset underscores the evolving strategies and methodologies in semantic segmentation. CNN-based models such as C-PNet [64], MSD [19], and SwiftNet [29] have proven to be fundamentally effective, excelling in specific segmentation tasks and affirming the continued relevance of convolutional approaches within the domain. Attention-based models such as MFNet [69], RDNet [41], and UNetFormer [48], through the incorporation of attention mechanisms, have been shown to significantly enhance segmentation accuracy by improving the model's capacity for contextual understanding.

TABLE I. Performance Evaluation on UAVid Test Dataset

Moreover, global-local architectures, particularly highlighted by EDDformer [55] and UNetFormer [48], achieve a synthesis of wide-ranging scene comprehension and precise detail analysis, illustrating the benefits of integrating both global and local attention mechanisms for improved segmentation precision. Relative to its predecessors, SwinFAN demonstrates superior capabilities in handling complex segmentation challenges. We also notice that our SwinFAN architecture, which employs Swin-Base as backbone, outperforms other transformer-based models with different backbones, such as ViT-Tiny, CoaT-Mini, and MiT-B1.

In assessing the effectiveness of various U-Net-based architectures for semantic segmentation on the UAVid dataset, our analysis highlights the superior performance of the SwinFAN model. Our proposed network consistently outperforms other architectures such as SE-UNet [67], UNet++ [68], Mamba-UNet [66], and UNetFormer [48] across a wide range of categories, including buildings, roads, and vegetation, as evidenced by its leading IoU scores. Particularly noteworthy is SwinFAN's ability to handle complex classes like moving cars and static cars, where it surpasses competing models by a significant margin.

We notice that classifying static cars is particularly challenging, especially when their appearance closely resembles moving cars. This often leads to misclassifications, as the model struggles to differentiate between the two classes based solely on static visual cues. However, this challenge can be addressed through various techniques, such as augmenting the training dataset with virtual data, as demonstrated in our previous work [81], or embedding temporal information [82].

Results can be visualized in Fig. 4 for China, and Fig. 5 for Germany, where we chose UNetFormer as a benchmark for its architectural resemblance and enhancements over the traditional U-Net framework, specifically its integration of global-local attention mechanisms, aligning closely with our proposed methodology.

Fig. 4. Comparative qualitative analysis of semantic segmentation performance of UNetFormer and SwinFAN, on the UAVid test set in China, from sequences 22, 25, 26, and 40. The white box highlights the various improvements our proposed model brings, such as dynamic car detection, and improved object boundaries.

Fig. 5. Comparative qualitative analysis of semantic segmentation performance of UNetFormer and SwinFAN, on the UAVid test set in Germany, from sequences 29, 30, and 38. With the white box, we highlighted the improved road, car, and human detections.

F. Results on the Potsdam and Vaihingen Datasets

In the Potsdam test results presented in Table II, our proposed model showcases superior performance, achieving the highest scores in several key categories, including impervious surfaces, buildings, cars, and the overall metrics of F1, mIoU, and OA. This illustrates SwinFAN's capability in accurately segmenting diverse urban elements, particularly in high-resolution aerial imagery.

TABLE II. Performance Evaluation on Potsdam Test Dataset

Comparatively, other leading models like MFNet [69], CG-Swin [62], and DC-Swin [73] also perform well, particularly in categories like building and tree. However, our model's comprehensive handling of image features obtains high scores across all categories, highlighting its robustness and precision in complex urban landscapes. In the Vaihingen test results, as detailed in Table III, our network leads across multiple metrics.

TABLE III. Performance Evaluation on Vaihingen Test Dataset

While models like EDDformer [55] and ICTNet [47] perform well in specific categories, SwinFAN consistently excels across all areas, demonstrating its versatility. The focal-axial attention system strategically amplifies important features while suppressing less relevant information, which significantly contributes to the model's ability to interpret and reconstruct detailed aerial urban scenes.

The comparative analysis of semantic segmentation on the Potsdam and Vaihingen datasets highlights the strengths of various architectures. CNN-based models offer a robust foundation, attention-based architectures improve accuracy with contextual awareness, and global-local architectures integrate broad and detailed views for enhanced precision. SwinFAN outperforms these traditional models, achieving top scores across key metrics on both datasets.

Additionally, when comparing network performance with other transformer backbones, SwinFAN outperforms GLOTS [57] with a ViT-Base backbone and Segmenter [75] using a ViT-Tiny backbone. Against models with Swin-Base architectures like STransFuse [77], SwinB-CNN+BD [78], DSCT [74], and FT-UNetFormer [48], our architecture demonstrates superior effectiveness in each category, indicating more efficient leverage of the Swin-Base architecture. It also shows enhanced performance over Swin-Small-based architectures such as SwinUperNet [45], SwinTF-FPN [44], CG-Swin [62], DC-Swin [73], and ICTNet [47].

Fig. 6 displays results on full images from the Potsdam and Vaihingen datasets, while Fig. 7 zooms into specific areas, highlighting segmentation details.

Fig. 6. Semantic segmentation results of full-scale images from Vaihingen (top row, image id 2) and Potsdam (bottom row, image id 3_14), comparing the ground truth annotations with the results of UNetFormer and our proposed methodology.

Fig. 7. Details of semantic segmentation results of images from Vaihingen (top row, image id 6) and Potsdam (bottom row, image id 3_14). The magenta box highlights the improvements our network brings over UNetFormer for objects such as cars and background.

G. Results on the LoveDA Dataset

We report the results on the official LoveDA benchmark, which can be seen in Table IV. Analyzing these test results, our proposed model outperforms other networks, achieving the highest mIoU score of 53.2%. It excels in categories like building, road, water, and agriculture, reflecting its strong segmentation capabilities in diverse landscapes. SwinFAN's performance is particularly notable against established models like SwinUperNet [45] and DC-Swin [73], underscoring its effectiveness in handling the varied terrains and features present in the LoveDA dataset, as seen in Fig. 8. Additionally, SwinFAN's high speed of 196.1 frames per second (FPS), while maintaining accuracy, further emphasizes its suitability for real-time applications in remote sensing.

TABLE IV. Performance Evaluation on LoveDA Test Dataset

Fig. 8. Experimental results comparing semantic segmentation outputs of multiple networks, for the LoveDA dataset, for images with IDs 3002, 3055, 3572, 3579, and 3746.

SwinFAN distinguishes itself by setting new performance benchmarks, surpassing CNN-based networks, attention-based models, and global-local architectures. CNN-based models like DeepLabV3+ [23] and PSPNet [27] demonstrate respectable capabilities in handling various segmentation tasks. They are surpassed by attention-centric models, such as UNetFormer [48], which enhance the model's sensitivity to contextual nuances, achieving notable improvements in segmentation accuracy. Moreover, global-local architectures, represented by EDDformer [55] and EMRT [56], combine wide-ranging scene understanding with detailed analysis, illustrating the power of integrated attention strategies for capturing both broad and fine-grained features. Within this framework, SwinFAN emerges as a leading solution, demonstrating enhanced segmentation capabilities that surpass the established benchmarks set by its predecessors.

In the comparative analysis on the LoveDA test dataset, SwinFAN, utilizing a Swin-base backbone, outperforms networks with other transformer-based backbones. Against ViT-Tiny (Segmenter [75]) and Swin-Tiny (DC-Swin [73]), SwinFAN shows enhanced segmentation accuracy and a significant increase in processing speed. Despite the similar transformer architecture, SwinFAN surpasses SwinUperNet [45] in mIoU and speed, emphasizing its optimized performance and architectural advancements.

We observe that our network faces challenges in accurately classifying the Barren class due to several factors. This class typically lacks distinct textural or structural features, making it sensitive to scale variations. Additionally, architectural differences contribute to performance variance, potentially hindering the capture of minimal Barren class features. Moreover, class imbalances prioritize learning from more feature-rich classes, thereby reducing model sensitivity to barren areas.

H. Ablation Studies

1) Decoder Attention Mechanism

To determine the best decoder approach for effectively combining the focal module with axial attention in our architecture, we carried out a series of comparative experiments across various methodologies:
\begin{gather*} \text{Sequential axial-focal} = \text{Focal}(\text{Axial}(X)) \tag{19}\\ \text{Sequential focal-axial} = \text{Axial}(\text{Focal}(X)) \tag{20}\\ \text{Parallel axial-focal} = \text{Axial}(X) + \text{Focal}(X) \tag{21}\\ \text{Guided axial-focal} = \text{Axial}(X) \odot \text{Focal}(X) \tag{22}\\ \text{Guided focal-axial} = \text{Focal}(X) \odot \text{Axial}(X). \tag{23} \end{gather*}

Sequential axial-focal first captures long-range dependencies, then refines them with multiscale context. In contrast, sequential focal-axial starts with multiscale context before focusing on long-range dependencies. Parallel attention runs both mechanisms independently, merging their outcomes to harness long-range and multiscale features simultaneously. Guided axial-focal uses a focal mask to enhance multiscale features before addressing long-range dependencies, while GFA reverses this process, concentrating on long-range dependencies before multiscale refinement.
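The five strategies in (19)–(23) can be wired as in the following sketch, which reuses focal and axial modules such as those sketched in Section III; the wiring is illustrative. Note that in this simplified form the two guided variants reduce to the same elementwise product and differ only in which branch is computed first and treated as the guiding mask.

```python
# Sketch of the five combination strategies in (19)-(23), reusing focal and
# axial modules such as those sketched earlier; the wiring is illustrative.
import torch
import torch.nn as nn

class FocalAxialBlock(nn.Module):
    def __init__(self, focal: nn.Module, axial: nn.Module, mode: str = "guided_focal_axial"):
        super().__init__()
        self.focal, self.axial, self.mode = focal, axial, mode

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.mode == "sequential_axial_focal":
            return self.focal(self.axial(x))                 # Eq. (19)
        if self.mode == "sequential_focal_axial":
            return self.axial(self.focal(x))                 # Eq. (20)
        if self.mode == "parallel":
            return self.axial(x) + self.focal(x)             # Eq. (21)
        if self.mode == "guided_axial_focal":
            return self.axial(x) * self.focal(x)             # Eq. (22)
        return self.focal(x) * self.axial(x)                 # Eq. (23), GFA
```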

To compare these variants, we opted for ResNet18 as the backbone due to its computational efficiency. For evaluation, we utilized the UAVid dataset given its small size and benchmark accessibility. Among the tested approaches, the GFA attention method emerges as the most effective, achieving the highest mIoU of 68.1% (see Table V). This strategy excels in key categories such as road, tree, moving car, static car, human, and clutter. In contrast, the other options demonstrate comparatively lower effectiveness overall. This evaluation enables us to pinpoint the optimal approach for integrating the focal module with axial attention mechanism in our U-Net architecture, ensuring effective segmentation performance.

TABLE V. Comparison of Decoder Strategies for Our Proposed U-Net Architecture on UAVid Test Dataset, With ResNet18 Backbone

2) FRH Implementation

Our study evaluates the performance of our FRH implementation compared to that of UNetFormer [48]. In this experiment, we utilized the Swin transformer as the encoder and GFA attention as the decoder. Table VI presents the results of this comparative analysis across classes from the UAVid dataset. Our AFRH implementation showcases improvements in each category, particularly in road, moving car, static car, and human. Notably, significant enhancements can be observed for small-sized objects such as static car, moving car, and human, indicating the proficiency of our module in handling fine-grained details. Overall, our refinement head contributes to a 1.1% increase in mIoU.

TABLE VI. Comparison of FRH Implementations, Highlighting the Improvement Our Proposed Module Brings to Each Class

3) Computational Evaluation of Swin-Based Architectures

In the presented analysis, as depicted in Table VII, various architectures utilizing the Swin-Base backbone are evaluated on the Potsdam dataset. Our SwinFAN architecture demonstrated superior performance after only 45 training epochs. This performance advantage underscores the architectural optimizations that facilitate efficient feature extraction and integration, particularly in complex urban landscapes. Conversely, larger models like SwinB-CNN+BD [78], which underwent an extensive training duration of 160 000 epochs, and CSTUNet [72], trained for 100 epochs, indicate varying degrees of parameter efficiency and training effectiveness, obtaining lower results than our proposed method. The former, despite its extensive training, did not report mIoU, suggesting potential overfitting or inefficiencies in handling specific segmentation classes.

TABLE VII. Performance Comparison of Architectures Employing Swin-Base Backbones, on the Potsdam Dataset, Detailing Parameters in Millions and Number of Training Epochs

Reducing the number of training epochs is advantageous as it directly correlates with lower computational costs and faster model development cycles. It also reduces the environmental impact associated with extensive computational tasks. Moreover, a model that achieves high accuracy with fewer epochs has a better generalization capability, indicating robustness across diverse datasets without the need for prolonged or intensive training schedules.

SECTION V.

Discussion

Our proposed SwinFAN model represents a significant advancement in the field of remote sensing image segmentation, particularly for complex urban environments captured from UAVs. The combination of transformer-based architectures, GFA attention, and feature refinement modules facilitates better contextual understanding and precision in object classification compared to traditional CNN-based models and other attention mechanisms.

A. Performance Across Datasets

Our model consistently demonstrated superior performance across several well-established datasets, such as UAVid, ISPRS Potsdam, ISPRS Vaihingen, and LoveDA, achieving notable improvements in mIoU and accuracy over state-of-the-art models. For instance, in the UAVid dataset, SwinFAN achieved a 1.9% improvement in mIoU compared to previous methods, underscoring the model's ability to accurately segment smaller and more complex urban elements such as moving cars and pedestrians. This success can be attributed to the hybrid attention mechanism (GFA), which effectively merges local and global contextual information, enabling the model to focus on critical details without losing broader spatial relationships.

Similarly, in the Potsdam and Vaihingen datasets, SwinFAN excelled in detecting intricate structures like buildings and vegetation, often surpassing other models in terms of overall accuracy and F1 scores. In these datasets, the model's ability to discern between objects of varying scales was particularly important, given the diverse urban and natural elements present in the high-resolution aerial imagery. The attention mechanisms in SwinFAN enabled it to handle these variations effectively, especially when segmenting fine details like cars or pedestrians.

B. Key Architectural Contributions

The novel GFA attention module proved critical for improving both local and global context comprehension. By dynamically fusing focal and axial attention, the GFA module enhances the model's ability to capture both intricate and large-scale scene elements, making it particularly effective in urban environments where objects may vary significantly in size and detail.

Additionally, the AFRH played a pivotal role in improving the clarity of segmentation outputs. This module's use of self-attention combined with convolutional techniques allowed for enhanced spatial information refinement, significantly improving segmentation accuracy, especially in challenging datasets like LoveDA. The refinement head was instrumental in handling small objects like cars and pedestrians, which often pose challenges for standard attention mechanisms.

C. Challenges and Future Directions

Despite these successes, SwinFAN encountered challenges in classifying certain types of areas, particularly those lacking distinct textures or structural features, such as roads or barren landscapes. These areas often had fewer unique visual features, making it difficult for the model to differentiate between them and other categories. Furthermore, the imbalance of training samples across different categories sometimes led to reduced precision for underrepresented classes.

To address these limitations, future work could explore incorporating techniques to handle imbalanced datasets, such as data augmentation or more sophisticated loss functions that emphasize less represented categories. Additionally, expanding SwinFAN's capabilities to 3-D datasets or integrating temporal information from UAV videos could further enhance its performance in dynamic environments, such as disaster monitoring or land-use change detection.

Overall, our proposed SwinFAN architecture sets a strong foundation for future research in remote sensing, combining the strengths of transformer-based architectures with attention mechanisms tailored for high-resolution aerial imagery. The model's scalability and adaptability to diverse imaging conditions also highlight its potential for broader applications beyond traditional semantic segmentation tasks.

SECTION VI.

Conclusion

In this article, we construct a framework for improved semantic segmentation of remote sensing images, called SwinFAN. Based on the U-Net model, our architecture employs the Swin transformer as its encoder and features a novel decoder structure, comprised of two innovative components: a GFA attention module and an AFRH. Together, these elements enhance our model's proficiency in identifying salient regions and maintaining essential spatial relationships, thereby facilitating more accurate detection of smaller objects within complex images. We extensively evaluate our methodology across four public datasets, including UAVid, ISPRS Vaihingen and Potsdam, and LoveDA. Our results demonstrate that SwinFAN consistently surpasses the performance of existing state-of-the-art networks, showcasing the versatility and adaptability of our model across diverse imaging conditions, including oblique and nadir camera angles.

Our network demonstrates strong overall performance but faces some challenges when classifying areas with less distinct textural or structural features, making them more sensitive to scale variations. Additionally, the learning process tends to favor more well-represented labels due to class imbalances in the training dataset. In the future, we aim to refine these aspects, expand our work to 3-D datasets, and explore specific applications in remote sensing, including change detection, damage assessment, and visual question answering.
