Introduction
High-resolution remote sensing (HRS) image semantic segmentation involves assigning an object type to each image pixel, which is crucial for applications such as natural resource management, ecological evaluation, urban planning, and crop assessment. Despite extensive research, achieving accurate semantic segmentation of HRS images remains an ongoing scientific challenge.
Traditional methods [1], [2], [3], [4], [5] rely on manual feature design and struggle to represent high-level semantic information in HRS images. Deep learning-based approaches have effectively addressed this issue, introducing many promising methods. Most methods have focused on CNN-based and transformer-based research [6], [7], [8], while the exploration of graph neural networks (GNNs) for HRS image semantic segmentation remains relatively limited.
Among these studies, CNN-based methods typically rely on regular grids of images [see Fig. 1(b)] [9], [10], [11], [12], [13]. These methods take advantage of the local receptive field and translation invariance of convolutions to extract features. While they leverage local information, their ability to capture global context is limited. To address this limitation, transformer-based methods have emerged and gained widespread attention [14], [15], [16]. Researchers divide the image into patches, convert them into sequences as input [see Fig. 1(c)], and use the attention mechanism to perform pairwise interactions, effectively capturing the global context of the image.
Although CNN-based and transformer-based methods have made significant progress in semantic segmentation, they still have shortcomings. The shapes of most geographic objects in HRS images are irregular. Using pixel-based regular grids or patch-based sequences to process these shapes is inflexible and inadequate. This limits the ability to understand and represent the topological relationships between objects, leading to missing boundaries and a higher risk of misclassification. In contrast, graph-based representations can flexibly model topologies between objects without being constrained by their geometric shapes or positions. As shown in Fig. 1(d), the image is segmented into superpixels aligned with boundaries, and a graph is constructed based on these superpixels.
Therefore, some researchers have turned their attention to GNNs, which can adaptively learn kernel parameters from object distributions and exploit strong correlations between objects to model topologies on graphs. For instance, Mou et al. [17] and Qin et al. [18] proposed nonlocal graph convolutional networks (GCNs) and spectral–spatial GCNs, respectively, for hyperspectral image classification. However, representing each pixel as a graph node significantly increases computational costs. To avoid this issue, researchers increasingly use superpixels as graph nodes, which reduces computational complexity and helps suppress noise. For instance, several studies [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29] segment the data into superpixels using simple linear iterative clustering (SLIC) [30] and subsequently use the superpixels as primitives for graph operations. Similarly, Shi et al. [31] used the Felzenszwalb and Huttenlocher (FH) algorithm [32] to generate superpixels, which were then used to construct graphs as input to a deep GCN for small waterbody extraction. Ouyang et al. [33] first segmented images into superpixels and then used a deep semantic segmentation module to extract features as the semantic initialization of graph nodes; the constructed graphs were subsequently input into a GCN for classification. In addition, Wu et al. [34] selected simple noniterative clustering (SNIC) [35] to generate superpixels and then fed them into GNNs to perform brain tissue segmentation. Although the methods mentioned above have achieved good results, superpixel segmentation in them is merely a preprocessing step because the superpixel algorithms are nondifferentiable, meaning that superpixel generation and the GNNs are executed separately. Therefore, Zhang et al. [36] introduced the differentiable superpixel sampling networks (SSN) [37] and integrated them with a mixhop GCN for hyperspectral image classification. Meanwhile, Eliasof et al. [38] utilized an unsupervised CNN [39] for superpixel segmentation, which was jointly trained with GNNs for unsupervised natural image semantic segmentation. While these two methods achieve end-to-end integration, they target hyperspectral and natural images rather than the HRS images we focus on, and to the best of our knowledge, no prior work addresses this issue for HRS images. Therefore, achieving an end-to-end integration of superpixel generation and GNNs for HRS images remains a significant challenge.
In summary, superpixel-based GNN methods still face two major limitations. First, in HRS image semantic segmentation, researchers primarily use superpixel segmentation methods as preprocessing steps, resulting in fixed graphs as input to GNNs. This process introduces undersegmentation errors, which can degrade segmentation performance. Second, these methods depend solely on a single graph construction approach, making them susceptible to interclass similarity (ICS) or intraclass variability (ICV), ultimately leading to inaccurate segmentation.
In light of the issues mentioned above, we combine superpixel generation with graph attention networks (GATs) [40] within a unified pipeline. First, we employ a learnable neural network as the superpixel generation module (SGM), which is directly coupled with the subsequent graph segmentation module (GSM). Second, the GSM constructs graphs and updates both graph structures and superpixel features. This module comprises a local module (LM) and a global module (GM), which construct and update graphs based on superpixel spatial positions and feature similarity, respectively. Finally, a classification stage maps the learned superpixel features back to pixels to produce the segmentation result.
The main contributions are summarized as follows.
We propose an end-to-end graph attention network with superpixel embedding (SEGAT) framework to achieve HRS image semantic segmentation with refined boundaries. Dynamically refined superpixels, obtained through collaborative training, make the constructed graphs align with object boundaries, reducing undersegmentation errors and effectively improving segmentation performance.
We propose a strategy that utilizes two graph construction methods to obtain more accurate graph structures, reducing segmentation inaccuracies caused by the interference of ICS or ICV. This strategy comprehensively considers strong correlations between adjacent objects and the similarity of long-range superpixel features, constructing local and global graphs based on superpixel spatial positions and features, respectively.
We design a GSM composed of the LM and the GM. This module flexibly updates graph structures and learns local and global superpixel features. Through the graph block with multihead GATs, edge weights are learned adaptively while superpixel features are enhanced and updated.
We conduct extensive experiments on three publicly available HRS datasets, and our method outperforms competitive approaches.
Fig. 1. Image representation of different methods. (a) Image. (b) Pixel-based regular grid representation (CNNs). (c) Patch-based sequence representation (transformers). (d) Superpixel-based graph representation (GNNs). We map the image to superpixel features with homogeneous semantics that match the geometric shapes of objects. Based on these features, a graph can better represent the topological relationships of objects. Since CNN- and transformer-based methods cannot utilize graph data, using GNNs is an ideal approach.
Fig. 2. Whole network architecture. The SEGAT consists of three components: the SGM, the GSM, and the classification stage.
Fig. 9. Visualization of the feature maps of each component. 1) Image. 2) Ground truth. 3) SGM. 4) SGM+1 (LM). 5) SGM+1 (GM). 6) SGM+1 (LM)+1 (GM). 7) SGM+1 (LM)+2 (GM). 8) SGM+1 (LM)+3 (GM). 9) SGM+(no-arc) LM+(no-arc) GM.
Fig. 12. Impact of the number of nearest neighbors in the GM on the model's performance. The baseline performance is illustrated with dashed lines to show the variation of performance.
Related Work
A. Superpixel
Superpixels, introduced by Ren and Malik in 2003 [41], are formed by grouping perceptually similar image pixels, reducing the number of image primitives for subsequent processing [42].
Traditional superpixel algorithms typically oversegment images by analyzing low-level features such as color properties and spatial distances [42]. The Watershed algorithm [43] identifies and segments objects based on the intensity of image edges: areas of high gradient are treated as edges and areas of low gradient as the interior of the same object. SLIC [30] adopts a k-means clustering strategy in a joint color and spatial space, grouping pixels into compact superpixels of roughly uniform size.
Driven by advancements in deep learning, superpixel algorithms with deep neural networks (DNNs) have emerged in succession. SSN [37] is the first end-to-end trainable algorithm. Nevertheless, this algorithm is not a pure DNN: it only uses a CNN to extract features, which are input to an iterative, differentiable SLIC-style clustering module that produces the superpixels.
Compared with traditional superpixel algorithms, DNN-based methods can automatically learn higher-dimensional features and handle HRS images more effectively. Furthermore, unlike other DNN-based algorithms, SFCN [47] is a pure DNN algorithm that integrates feature extraction and superpixel segmentation in one step, making it faster and easier to integrate into downstream networks than SSN [37]. In addition, because SFCN is supervised by segmentation labels, it produces superpixel segmentation results that match object boundaries better than the unsupervised LSN-Net [48]. For these reasons, we adopt the idea of SFCN [47], which directly predicts the pixel-superpixel association map, and use a UNet [9] to optimize superpixel segmentation.
B. Semantic Segmentation With CNNs or Transformers
In recent years, extensive studies have been conducted using both CNN-based and transformer-based approaches to achieve semantic segmentation of remote sensing (RS) images. In CNN-based methods, researchers have enhanced feature map quality and segmentation performance by enlarging the receptive field, integrating multiscale features, and employing attention mechanisms. For instance, Nguyen et al. [49] improved the capture of scale variations in images by fine-tuning the atrous convolution rates within the atrous spatial pyramid pooling module of DeepLabV3+ and optimized segmentation performance by using a feature aggregation network that consolidates features at different scales. Ma et al. [50] designed a multiscale network that integrates the local class-aware module with the global class-aware module, achieving effective segmentation. Yang et al. [51] developed an attention-fused network that fuses multilevel features by using attention modules, addressing the issue of multipath and multilevel feature fusion. In addition, segmentation performance has been improved by modifying traditional single-branch encoder–decoder architectures [9], [52], [53]. For example, HBSeNet [54] utilizes a dual-path network that combines the spatial and context paths, yielding effective segmentation. Similarly, Li et al. [55] also designed a network with a bilateral architecture for RS image semantic segmentation.
In transformer-based methods, we can classify them into pure transformer methods and hybrid designs combining CNNs and transformers. Pure transformers refer to methods where both the encoder and decoder are transformers. For instance, Strudel et al. [15] proposed Segmenter, a convolution-free, fully transformer-based model for semantic segmentation. This model uses output embeddings from image patches to get class labels through a linear or mask transformer decoder. Cao et al. [56] developed a U-shaped network that integrates the Swin transformer with the UNet architecture. Compared to pure transformer methods, hybrid designs combining CNNs and transformers have become more prevalent in semantic segmentation. Zeng et al. [57] proposed MSGCNet, a hybrid architecture that leverages the strengths of both CNNs and transformers, using efficient cross-attention to facilitate interaction among multiscale features of the encoder. Li et al. [58] designed an attention-focused feature enhancement network, which integrates a ResNet50-based encoder with a parallel multistage feature enhancement group, a global multiscale attention mechanism in the decoder, and a feature-weighted fusion module, effectively addressing challenges like complex structures and occlusions. Chen et al. [59] developed a hybrid architecture that combines a ResNet50 encoder with a transformer-based decoder constructed by channel-spatial transformer block and global cross-fusion module for optimizing representation performance. With advances in technology, a model called Mamba [60], which can model long-range relationships through linear computations, has gained attention. To address the limitations of CNNs in modeling long-range dependencies and the high computational complexity of transformers, researchers have gradually applied Mamba to the semantic segmentation of RS images, such as RS3Mamba [61] and RSM-CD [62].
Despite significant progress made by the aforementioned semantic segmentation methods, they still have certain limitations. Many geographical objects in HRS images have irregular shapes, and using pixel-based grids (CNNs) or patch-based sequences (transformers) to represent these objects is both limited and insufficient. These limitations hinder the accurate representation of topological relationships between objects, increasing boundary errors and misclassification. In contrast, GNNs depend on graph-based representations, which offer greater flexibility in modeling the topologies of irregular objects. Therefore, we use GNNs to achieve HRS image semantic segmentation.
C. Semantic Segmentation With GNNs
GNNs learn from graph-structured data and are applied in transportation networks, biomedicine, recommender systems, social networks, computer vision, etc. The concept of GNNs was first proposed by Gori et al. [63] in 2005 and further elaborated by Scarselli et al. [64] and Gallicchio et al. [65]. Subsequently, with the advancement of GNNs, numerous studies have proposed various models, including GCNs [66] and GATs [40]. GCNs apply the convolution operator of CNNs to graphs, updating the features of each node by aggregating neighbor information with the same weight. Unlike GCNs, GATs use attention mechanisms to dynamically learn the importance of neighboring nodes by assigning weights based on their relevance to the current node. Therefore, GATs can more flexibly capture dependencies between nodes. Compared to CNN-based and transformer-based methods, GNNs can treat images as undirected graphs and achieve effective and flexible topological modeling of objects with arbitrary shapes. Consequently, some researchers have explored GNN-based approaches for RS image semantic segmentation.
In these approaches, researchers primarily use superpixels as a preprocessing step to reduce computational complexity. Wan et al. [26] and Liu et al. [24] used SLIC to preprocess the images into superpixels, which were then used as graph nodes in the GCN for hyperspectral image classification. Similar methods have been proposed by works [22], [23], [28], [29], [67]. Wang et al. [27] proposed a method using multiscale superpixels segmented with SLIC, which were then incorporated into a weighted GCN for land cover classification of polarimetric synthetic aperture radar images. Diao et al. [21] also employed SLIC for superpixel segmentation of HRS images, using these superpixels as input to an attention GNN. However, the accuracy of these methods heavily depends on the quality of superpixel segmentation. If undersegmentation occurs, correcting this issue within the GNN framework is difficult.
Furthermore, these approaches consider only a single graph construction method. For instance, Diao et al. [21] and Eliasof et al. [38] each built the superpixel graph with a single construction criterion. Relying on one construction method alone makes the graph structure susceptible to ICS or ICV, which in turn degrades segmentation accuracy.
Therefore, based on the above-mentioned issues, we propose an HRS image semantic segmentation method that implements end-to-end superpixel-based graph segmentation by using two graph construction approaches. This method dynamically optimizes superpixels to provide more accurate object boundaries while providing more robust topological graph structure and contextual information.
Methodology
A. Overall Architecture
The SEGAT architecture, illustrated in Fig. 2, comprises three main components: the superpixel generation module (SGM), the graph segmentation module (GSM), and the classification stage. The process begins with image input, which the SGM processes to generate the pixel-superpixel association map and the initial superpixel features. The GSM then constructs local and global graphs on these superpixels and updates both the graph structures and the superpixel features. Finally, the classification stage classifies the superpixel features and maps them back to pixels to obtain the segmentation result.
B. Superpixel Generation Module
The primary purpose of the SGM is to obtain the pixel-superpixel association map and the superpixel features.
Specifically, we use a UNet with a ResNet101 backbone for superpixel segmentation. HRS images are fed into the network, which predicts for each pixel its association with the surrounding candidate superpixels, yielding the pixel-superpixel association map.
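To make the role of the SGM concrete, the following minimal PyTorch sketch shows a dense predictor that outputs a 9-channel soft association map, following the SFCN formulation in which each pixel is associated with the 9 surrounding grid cells. The tiny convolutional stack and the class name `SuperpixelGenerationModule` are illustrative stand-ins for the paper's UNet with a ResNet101 backbone, not the authors' code.

```python
import torch
import torch.nn as nn

class SuperpixelGenerationModule(nn.Module):
    """Illustrative SGM: predicts, for every pixel, a soft association over
    the 9 surrounding superpixel grid cells (SFCN formulation). A tiny
    convolutional stack stands in for the paper's UNet with a ResNet101
    backbone."""

    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.assoc_head = nn.Conv2d(hidden, 9, 1)   # one score per candidate grid cell

    def forward(self, x):
        feats = self.backbone(x)
        assoc = torch.softmax(self.assoc_head(feats), dim=1)   # (B, 9, H, W)
        return feats, assoc

sgm = SuperpixelGenerationModule()
feats, assoc = sgm(torch.randn(1, 3, 512, 512))
print(feats.shape, assoc.shape)   # (1, 64, 512, 512), (1, 9, 512, 512)
```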
In the process of calculating superpixel features, the feature h(s) of each superpixel is obtained as the association-weighted average of the features g(p) of its associated pixels, as shown in (1)
\begin{equation*} h(\mathbf {s}) = \frac{\sum _{\mathbf {p} : s \in \mathcal {N}} g(\mathbf {p}) \cdot V_{s}(\mathbf {p})}{\sum _{\mathbf {p} : s \in \mathcal {N}} V_{s}(\mathbf {p})}. \tag{1} \end{equation*}
Based on (1), the superpixel features are aggregated from the pixel features through the association map. The number of superpixels T_s is determined by the image height H, width W, and the grid interval M, as shown in (2)
\begin{equation*} T_{s} = \frac{H \times W}{M^{2}}. \tag{2} \end{equation*}
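As an illustration of (1) and (2), the sketch below pools pixel features into superpixel features and computes the number of superpixels from the grid interval. For brevity it uses a hard pixel-to-superpixel assignment rather than the soft association map, and the function name `superpixel_features` is hypothetical.

```python
import torch

def superpixel_features(pixel_feats, assignment):
    """Association-weighted pooling of pixel features into superpixel
    features, following (1). A hard assignment (one superpixel index per
    pixel) replaces the soft association map V_s(p) here.

    pixel_feats: (C, H, W) pixel features g(p)
    assignment:  (H, W) long tensor of superpixel indices in [0, T_s)
    returns:     (T_s, C) superpixel features h(s)
    """
    C = pixel_feats.size(0)
    T_s = int(assignment.max()) + 1
    flat_feats = pixel_feats.reshape(C, -1).t()            # (H*W, C)
    flat_idx = assignment.reshape(-1)                       # (H*W,)
    sums = torch.zeros(T_s, C).index_add_(0, flat_idx, flat_feats)
    counts = torch.zeros(T_s).index_add_(0, flat_idx,
                                         torch.ones_like(flat_idx, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)

# Number of superpixels from (2): T_s = (H * W) / M^2 for a grid interval M.
H, W, M = 512, 512, 16
T_s = (H * W) // (M ** 2)   # 1024, the setting that performs best in Table IV
```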
C. Graph Segmentation Module
Using superpixels as nodes in the GSM can significantly reduce the computational complexity. GATs rely on internode features and can assign different weights to neighbors of each node, allowing for more flexible and adaptive information aggregation. Therefore, we use the GATs to aggregate superpixel features. Our GSM is detailed in Fig. 5.
The whole GSM comprises two modules, the LM and the GM, each built as a stack of identical layers. The LM constructs a local graph from the spatial adjacency of superpixels, while the GM constructs a global graph from the similarity of superpixel features.
1) Local Graph Construction
A superpixel is a collection of pixels characterized by similar colors, textures, and spatial proximity. As shown in Fig. 6, superpixels close to each other are likely to belong to the same category. To better preserve local location information, we construct a local graph structure based on superpixels' spatial positions. The binary adjacency matrix A of the local graph is defined as
\begin{equation*} A_{ij} = \left\lbrace \begin{array}{ll}1, & \text{if } i \text{ and } j \text{ are adjacent} \\ 0, & \text{otherwise.} \end{array}\right. \tag{3} \end{equation*}
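A possible implementation of the local adjacency in (3) is sketched below: two superpixels are treated as adjacent if they share a boundary in the superpixel label map. The exact neighborhood definition used in the paper may differ, and `local_adjacency` is an illustrative helper, not the authors' code.

```python
import torch

def local_adjacency(sp_label):
    """Binary adjacency A of (3): two superpixels are connected if they
    share a boundary in the superpixel label map.

    sp_label: (H, W) long tensor of superpixel indices
    returns:  (T_s, T_s) float tensor
    """
    T_s = int(sp_label.max()) + 1
    A = torch.zeros(T_s, T_s)
    # Horizontally and vertically neighboring pixel pairs with different labels
    for a, b in [(sp_label[:, :-1], sp_label[:, 1:]),
                 (sp_label[:-1, :], sp_label[1:, :])]:
        mask = a != b
        A[a[mask], b[mask]] = 1.0
        A[b[mask], a[mask]] = 1.0
    return A
```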
2) Global Graph Construction
While spatially adjacent superpixels may belong to the same category, determining their category solely based on local positional relationships is insufficient. As shown in Fig. 7, “cars” in the same category may not be adjacent. Consequently, it is necessary to disregard positional relationships and construct the global graph structure from the perspective of superpixel feature similarity.
We use the Euclidean distance between superpixel features to measure their similarity, as defined in (4)
\begin{equation*} F_{\text{sim}} = \text{distance}(h(s_{i}), h(s_{j})) = \Vert h(s_{i}) - h(s_{j}) \Vert _{2}. \tag{4} \end{equation*}
Here, h(s_i) and h(s_j) denote the features of the i-th and j-th superpixels, and F_sim is the feature distance between them.
We build the binary adjacency matrix A' of the global graph by connecting each superpixel to its nearest neighbors in feature space, as defined in (5)
\begin{equation*} A^{\prime }_{ij} = \left\lbrace \begin{array}{ll}1, & \text{select only the 9 superpixels closest to} \\ & \text{the } i\text{th superpixel in feature distance} \\ 0, & \text{otherwise.} \end{array}\right. \tag{5} \end{equation*}
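The global graph of (4) and (5) can be built as in the following sketch, which connects each superpixel to its 9 nearest neighbors in feature space using pairwise Euclidean distances. Excluding self-connections is an assumption, and `global_adjacency` is a hypothetical helper.

```python
import torch

def global_adjacency(sp_feats, k=9):
    """Binary adjacency A' of (5): each superpixel is connected to the k
    superpixels closest to it in feature space (Euclidean distance, eq. (4)).

    sp_feats: (T_s, C) superpixel features h(s)
    returns:  (T_s, T_s) float tensor
    """
    dist = torch.cdist(sp_feats, sp_feats)            # pairwise L2 distances F_sim
    dist.fill_diagonal_(float("inf"))                 # assume no self-connections
    knn_idx = dist.topk(k, largest=False).indices     # k nearest superpixels per row
    A = torch.zeros(dist.shape)
    A.scatter_(1, knn_idx, 1.0)
    return A
```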
3) Graph Block
A superpixel generation map is viewed as a graph, in which the superpixels serve as nodes and the adjacency matrix defines the edges.
The input features S_i and S_j are first transformed by a shared weight matrix W, and the attention coefficient e_ij between node i and node j is computed as shown in (6)
\begin{equation*} e_{ij} = \text{LeakyReLU}\left(\mathbf {a}^{T} \left[ \mathbf {W} \mathbf {S}_{i} \parallel \mathbf {W} \mathbf {S}_{j} \right] \right). \tag{6} \end{equation*}
Here, a is a learnable weight vector, W is the shared linear transformation matrix, S_i and S_j are the features of nodes i and j, and the parallel bars denote concatenation.
When calculating attention scores, structural information is first ignored and scores are computed between all node pairs. Since not all superpixel nodes are interconnected, the adjacency matrix representing the graph structure is then required: through masked attention, the structural information is injected into the attention weight matrix, and the scores are normalized with the softmax function, as shown in (7)
\begin{equation*} \alpha _{ij} = \text{softmax}_{j}(e_{ij}) = \frac{\exp (e_{ij})}{\sum _{m \in \mathcal {N}_{i}} \exp (e_{im})}. \tag{7} \end{equation*}
In the graph, attention scores are computed and normalized only over the neighborhood of node i, so each node aggregates information solely from its connected neighbors.
The final attention weight matrix is used to aggregate the transformed features of neighboring nodes, yielding the updated feature of node i, as shown in (8)
\begin{equation*} S_{i}^{\prime } = \sigma \left(\sum _{j \in \mathcal {N}_{i}} \alpha _{ij} \mathbf {W} \mathbf {S}_{j} \right). \tag{8} \end{equation*}
Then, the attention weight matrix is computed multiple times using the multihead attention mechanism. The features from N independent attention heads are concatenated, as shown in (9)
\begin{equation*} S_{i}^{\prime } = \bigg \Vert _{n=1}^{N} \sigma \left(\sum _{j \in \mathcal {N}_{i}} \alpha _{ij}^{n} \mathbf {W}^{n} \mathbf {S}_{j} \right) \tag{9} \end{equation*}
Finally, the attention weight matrix is recomputed from the concatenated features and (8) is applied again. To achieve better results, we integrate the GAT within a transformer architecture to further enhance and update the superpixel features.
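For reference, a minimal PyTorch sketch of the attention computation in (6)–(9) is given below; the activation σ is taken here to be ELU, the default of three heads follows the ablation in Table V, and the class names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One attention head following (6)-(8): pairwise scores from the
    concatenated transformed features, masked by the adjacency matrix,
    softmax-normalized over each node's neighbors, then used to aggregate
    neighbor features (sigma taken here as ELU)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, S, A):
        # S: (T_s, in_dim) superpixel features; A: (T_s, T_s) adjacency matrix
        h = self.W(S)                                               # W S
        T_s = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(-1, T_s, -1),
                           h.unsqueeze(0).expand(T_s, -1, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))                 # eq. (6)
        e = e.masked_fill(A == 0, float("-inf"))                    # masked attention
        alpha = torch.softmax(e, dim=-1)                            # eq. (7), assumes every node has a neighbor
        return F.elu(alpha @ h)                                     # eq. (8)

class MultiHeadGAT(nn.Module):
    """Concatenation of N independent heads, as in (9); three heads match
    the best setting in Table V."""

    def __init__(self, in_dim, out_dim, heads=3):
        super().__init__()
        self.heads = nn.ModuleList([GraphAttentionLayer(in_dim, out_dim) for _ in range(heads)])

    def forward(self, S, A):
        return torch.cat([head(S, A) for head in self.heads], dim=-1)
```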
D. Classification
The classification process is shown in Fig. 8. Since superpixel features preserve well-delineated boundaries, we first perform superpixel-wise classification to retain this fine boundary information, followed by pixel-wise classification.
Initially, based on the superpixel features output by the GSM, we apply a linear layer to classify these features. Then, using (10), we map the classified superpixel features back to the pixel level, thereby achieving semantic segmentation
\begin{equation*} \hat{g(\mathbf {p})} = \sum _{s \in \mathcal {N}} h(\mathbf {s}) \cdot V_{s}(\mathbf {p}). \tag{10} \end{equation*}
Here, the left-hand side of (10) is the recovered pixel-level prediction, h(s) is the classified feature of superpixel s, and V_s(p) is the pixel-superpixel association map.
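A minimal sketch of the mapping in (10) is shown below; with a hard assignment the mapping reduces to a gather, whereas the soft association map would mix the candidate superpixels' scores with weights V_s(p). The helper name `superpixels_to_pixels` is hypothetical.

```python
import torch

def superpixels_to_pixels(sp_logits, assignment):
    """Maps superpixel-wise class scores back to the pixel grid, following (10).

    sp_logits:  (T_s, K) classified superpixel features
    assignment: (H, W) long tensor of superpixel indices
    returns:    (K, H, W) pixel-wise logits
    """
    H, W = assignment.shape
    pixel_logits = sp_logits[assignment.reshape(-1)]         # (H*W, K)
    return pixel_logits.t().reshape(sp_logits.size(1), H, W)
```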
E. Loss Function
To obtain superpixels that align with semantic boundaries and achieve more accurate semantic segmentation during network optimization, we incorporated four individual loss functions into the overall loss to optimize the parameters of the entire network.
The reconstruction loss and the compactness loss are calculated based on the predicted pixel-superpixel association map. Using this map, pixel-level properties are first aggregated into superpixel-level values, as shown in (11)
\begin{equation*} L(\mathbf {s}) = \frac{\sum _{\mathbf {p} : s \in \mathcal {N}} f(\mathbf {{l}}) \cdot V_{s}(\mathbf {p})}{\sum _{\mathbf {p} : s \in \mathcal {N}} V_{s}(\mathbf {p})}. \tag{11} \end{equation*}
Here, f(l) denotes a pixel-level property, V_s(p) is the association map, and L(s) is the corresponding superpixel-level aggregate. The superpixel-level values are then mapped back to the pixels to obtain the reconstructed property, as shown in (12)
\begin{equation*} f^{\prime }(\mathbf {{l}}) = \sum _{s \in \mathcal {N}} L(\mathbf {s}) \cdot V_{s}(\mathbf {p}). \tag{12} \end{equation*}
Therefore, the reconstruction loss and the compactness loss are calculated as follows:
\begin{align*} \mathcal {L}_{\text{recon}} =& \text{CE}(g(\mathbf {p}), g^{\prime }(\mathbf {p})) \tag{13} \\ \quad \mathcal {L}_{\text{compact}} =& \sum _{\mathbf {p}} \Vert f(\mathbf {{l}}) - f^{\prime }(\mathbf {{l}}) \Vert _{2}. \tag{14} \end{align*}
Here, CE denotes the cross-entropy function, g'(p) and f'(l) are the reconstructed pixel-level properties obtained through (11) and (12), and g(p) and f(l) are the corresponding original properties.
In addition, the prediction results are refined using both the cross-entropy loss and the Dice loss, as defined in (15) and (16)
\begin{align*} \mathcal {L}_{1} =& \text{CE}(\text{output}, \text{label}) \tag{15} \\ \mathcal {L}_{2} =& \text{Dice}(\text{output}, \text{label}). \tag{16} \end{align*}
The overall loss function used to train the entire network is given in (17)
\begin{equation*} \mathcal {L}_{\text{total}} = \mathcal {L}_{\text{recon}} + \lambda \mathcal {L}_{\text{compact}} + \mathcal {L}_{1} + \mathcal {L}_{2}. \tag{17} \end{equation*}
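Assembling the four terms of (17) might look as follows; the Dice formulation and the averaging in the compactness term are common choices rather than the paper's exact definitions, λ is left as an argument because its value is studied separately in the ablation, and all function names are illustrative.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss over K classes (one common formulation; the paper's
    exact variant is not spelled out)."""
    probs = torch.softmax(logits, dim=1)                            # (B, K, H, W)
    onehot = F.one_hot(target, probs.size(1)).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def total_loss(output, label, recon_pred, recon_label, pos, pos_recon, lam):
    """Overall objective of (17): reconstruction + lam * compactness + CE + Dice.
    recon_pred/recon_label stand for the reconstructed and original pixel
    properties of (13); pos/pos_recon for the positional terms of (14)."""
    l_recon = F.cross_entropy(recon_pred, recon_label)              # eq. (13)
    l_compact = (pos - pos_recon).norm(dim=1).mean()                # eq. (14), averaged
    l_ce = F.cross_entropy(output, label)                           # eq. (15)
    l_dice = dice_loss(output, label)                               # eq. (16)
    return l_recon + lam * l_compact + l_ce + l_dice                # eq. (17)
```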
Experiments
A. Datasets
The ISPRS Vaihingen dataset has 33 true orthophoto images with an average size of 2494 × 2064 pixels and a spatial resolution of 9 cm. This dataset contains 6 land-cover categories: impervious surfaces (imp.surf), buildings, low vegetation (low veg), trees, cars, and clutter/background. We conduct all experiments on the ground truth with eroded boundaries. In our experiments, we utilize 17 images for testing and 16 images for training.
The ISPRS Potsdam dataset has 38 true orthophoto images with an image size of 6000 × 6000 pixels and a spatial resolution of 5 cm. This dataset contains 6 land-cover categories: impervious surfaces (imp.surf), buildings, low vegetation (low veg), trees, cars, and clutter/background. We conduct all experiments on the ground truth with eroded boundaries. Our experiments use 14 images for testing and 24 images for training.
The UAVid dataset mainly includes 8 categories: clutter, building, road, tree, low vegetation (low veg), moving car, static car, and human. The image resolution is either 4096 × 2160 or 3840 × 2160. The dataset comprises 420 images, including 200 for training and 70 for validation; the official website provides the remaining 150 as the test set.
The images in all the aforementioned datasets are cropped into patches of 512 × 512 pixels.
B. Implementation Details
Experiments are conducted using PyTorch on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. We utilize the adaptive moment estimation (Adam) optimizer with a learning rate of 3e
C. Evaluation Metrics
The overall accuracy (OA), mean intersection over union (mIoU), and mean F1 score (mF1) are used to evaluate the model's performance. The mF1 is the average of the per-category F1 scores, which are computed as follows:
\begin{align*} \text{OA} =& \frac{\sum _{k=1}^{K} \text{TP}_{k}}{\sum _{k=1}^{K} (\text{TP}_{k} + \text{FP}_{k} + \text{TN}_{k} + \text{FN}_{k})} \tag{18}\\ \text{mIoU} =& \frac{1}{K} \sum _{k=1}^{K} \frac{\text{TP}_{k}}{\text{TP}_{k} + \text{FP}_{k} + \text{FN}_{k}} \tag{19}\\ F1_{k} =& 2 \times \frac{\text{precision}_{k} \times \text{recall}_{k}}{\text{precision}_{k} + \text{recall}_{k}} \tag{20} \end{align*}
\begin{align*} \text{precision}_{k} = & \frac{\text{TP}_{k}}{\text{TP}_{k} + \text{FP}_{k}} \tag{21} \\ \text{recall}_{k} =& \frac{\text{TP}_{k}}{\text{TP}_{k} + \text{FN}_{k}}. \tag{22} \end{align*}
Here, K is the number of categories, and TP_k, FP_k, TN_k, and FN_k denote the numbers of true positive, false positive, true negative, and false negative pixels for category k, respectively.
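A small NumPy sketch of (18)–(22), computed from a confusion matrix, is given below; the OA line uses the standard total-pixel denominator, and `metrics_from_confusion` is a hypothetical helper.

```python
import numpy as np

def metrics_from_confusion(conf):
    """OA, mIoU, and mF1 from a K x K confusion matrix whose entry (i, j)
    counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as k but actually another class
    fn = conf.sum(axis=1) - tp          # actually k but predicted as another class
    oa = tp.sum() / conf.sum()          # eq. (18), total-pixel denominator
    iou = tp / (tp + fp + fn + 1e-12)   # eq. (19), per class
    precision = tp / (tp + fp + 1e-12)  # eq. (21)
    recall = tp / (tp + fn + 1e-12)     # eq. (22)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)  # eq. (20)
    return oa, iou.mean(), f1.mean()

# Toy 3-class example
conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [2, 5, 60]])
print(metrics_from_confusion(conf))
```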
D. Ablation Study
All ablations are performed on the Vaihingen dataset.
1) Ablation Study for the SGM
The primary purpose of the SGM is to obtain the pixel-superpixel association map and superpixel features. We conduct experiments with different configurations in the SGM, as shown in Table I. The results reveal that using a simple convolutional layer (“1 Conv”) significantly reduces performance, with OA, mIoU, and mF1 dropping to 75.39%, 53.67%, and 68.28%, respectively. Similarly, using an FCN leads to only 90.03% OA, 81.24% mIoU, and 89.53% mF1, indicating the limitations of simpler architectures. In contrast, using a UNet improves performance and yields superior results, emphasizing the necessity of employing the UNet for superpixel generation. Moreover, we investigate the impact of different UNet backbones on the model's performance. The results indicate that shallower backbones, such as ResNet18 and ResNet34, yield lower OA, mIoU, and mF1 values. In contrast, deeper backbones, such as ResNet101, significantly improve superpixel quality and overall segmentation performance, achieving OA of 92.03%, mIoU of 84.56%, and mF1 of 91.53%, yielding the optimal result. This demonstrates the advantage of deeper architectures in capturing features. These findings underscore the critical role of a UNet with a ResNet101 backbone in enhancing our model's performance.
2) Ablation Study for SEGAT's Each Component
Table II outlines the combinations and abbreviations of SEGAT's components. The SGM serves as the backbone and baseline for incrementally designing our method in conjunction with the GSM. Ablation 1 includes only the SGM, which achieves semantic segmentation by mapping superpixels back to pixels. Ablation 2 adds one LM to the SGM base to validate the effectiveness of a single LM within the GSM. Similarly, Ablation 3 integrates one GM into the SGM base to assess the impact of a single GM. Due to the relatively slow computation of local graph construction, we limit the GSM to a single LM, as incorporating more would reduce the network's efficiency. Instead, we focus on increasing the number of GMs. Ablation 4 combines the SGM with one LM and one GM, Ablation 5 with one LM and two GMs, and Ablation 6 with one LM and three GMs. Finally, Ablation 7 employs an LM and a GM within the GSM, excluding the transformer architecture, to evaluate the effectiveness of integrating GAT into the transformer framework.
Table III presents the results for the components of SEGAT. First, Ablations 1–4 demonstrate that incorporating either LM or GM significantly enhances image semantic segmentation performance. Moreover, combining LM and GM further improves accuracy, indicating that employing two graph construction methods is superior to using only one. Second, a comparison of Ablations 4–6 shows a steady increase in accuracy as the number of GMs in the GSM increases, suggesting that a dynamically updated graph structure boosts the model's segmentation performance. However, due to computational limitations, the combination “SGM+1(LM)+4(GM)” is not feasible. Third, the comparison between Ablations 7 and 4 reveals that excluding the transformer architecture when using GAT leads to a significant reduction in precision, underscoring the critical role of the transformer architecture in achieving high performance. In summary, SEGAT (Ablation 6) achieves the highest accuracy, improving OA, mIoU, and mF1 by 1.19%, 1.74%, and 1.06%, respectively, compared to Ablation 1. The feature maps of each component are visualized in Fig. 9.
3) Ablation Study for the Number of Superpixels
Table IV and Fig. 10 reveal that as the number of superpixels increases, performance peaks at 1024, then declines. Considering the impact of graph nodes' numbers on computational efficiency, it is inadvisable to use too many superpixels. This is qualitatively validated in Fig. 11. Specifically, HRS images contain targets of varying scales, making the selection of the appropriate number of superpixels crucial for segmentation performance. If the number of superpixels is too small, segmentation boundaries fail to accurately capture finer details, particularly for small-scale objects like cars. As shown in Fig. 11, with 1024 superpixels, the segmentation aligns well with object boundaries, effectively considering objects of different scales. The edges of two buildings within the green box and the gap between them are also correctly segmented. Conversely, an excessive number of superpixels results in overly fine-grained segmentation. While this may capture finer details, it causes fragmented representations, introduces noise, reduces segmentation efficiency, and increases computational costs.
4) Ablation Study for the Number of the Nearest Neighbors in the GM
To evaluate the influence of the number of nearest neighbors in the GM on the overall performance, we conduct ablation studies by varying the number of nearest neighbors from 9 to 100. As shown in Fig. 12, the performance metrics, including OA, mIoU, and mF1, remain relatively stable across different settings. Specifically, the OA fluctuates slightly around 91.90%, while the mIoU and mF1 hover around 84.53% and 91.50%, respectively. These results indicate that the model's performance is not highly sensitive to the choice of the nearest neighbor count within this range. To maintain a balanced graph structure between the GM and LM, we chose 9-nearest neighbors, ensuring that the number of edges in both the local and global graphs remains approximately the same.
5) Ablation Study for the Number of GATs' Heads
Implementing the attention mechanism in GATs multiple times corresponds to multihead attention, where the number of heads represents the number of attention weights between a node and its neighboring nodes. According to Table V, the model's performance improves as the number of heads increases, peaking at three heads. However, due to limited computational resources, the model cannot support four heads, as it exceeds GPU memory capacity. While additional heads can capture diverse features, enriching information and enhancing the model's expressiveness, they also decrease computational efficiency. Considering the negative impact of excessive heads on computational efficiency, we recommend limiting the number of heads to ensure a balance between performance and resource utilization.
6) Ablation Study for Parameter λ of the Loss Function
To evaluate the impact of the parameter λ in (17) on the model's performance, we conduct a sensitivity analysis with different values of λ.
7) Computational Complexity Comparison With Other Models
Table VI compares different models' computational requirements in terms of parameter complexity and floating point operations (FLOPs), analyzing the tradeoffs with mIoU performance. The table shows that our SEGAT ranks fourth in computational demand, which is not the highest, while achieving the best segmentation performance.
E. Comparing With State-of-the-Art (SOTA) Methods
SEGAT results are compared with SOTA methods on Vaihingen, Potsdam, and UAVid datasets. Table VII lists all comparison methods.
In Table VIII, SEGAT demonstrates superior performance, achieving an OA of 92.03%, a mIoU of 84.56%, and an mF1 of 91.53%. As illustrated in columns 6–8 of Fig. 15, the first row indicates that LoG-CAN merges trees that are spaced apart, whereas SEGAT effectively preserves their separation, closely matching the ground truth. In the second row, LoG-CAN fails to accurately capture the local details of “buildings,” resulting in local pixels belonging to “building” being misclassified as “impervious surfaces.” In contrast, SEGAT better maintains the integrity of “buildings” and preserves semantic object boundaries, more accurately reflecting ground conditions. This highlights SEGAT's superior ability to handle interactions between local and global information. Although SEGAT's performance in segmenting “buildings,” “trees,” and “cars” is not the best, it excels in “impervious surfaces” and “low vegetation,” achieving F1 scores of 93.81% and 86.56%, respectively, outperforming other networks by at least 0.1% and 0.67%. In addition, the superpixel generation map visualization in Fig. 16 demonstrates that SEGAT effectively preserves areas of local semantic similarity and aligns accurately with category boundaries. The well-segmented superpixel generation maps further confirm SEGAT's effectiveness.
Tables IX and X demonstrate that SEGAT achieves optimal semantic segmentation performance. According to Table IX, SEGAT excels in segmenting the “trees” and “cars” classes, with F1 scores of 90.14% and 97%, respectively. Similarly, Table X shows that SEGAT achieves the highest IoU for “moving car” (76.01%) and “static car” (62.92%). SEGAT exhibits significant advantages in classifying the small object “car,” with its accuracy in the “moving car” and “static car” categories surpassing other networks by at least 1.43% and 3.82%, respectively.
Overall, SEGAT demonstrates strong performance in both small- and large-scale class segmentation, effectively preserving the integrity and independence of object categories. This success is attributed to its superior superpixel segmentation and the effective interaction of features within the graph module. The qualitative validation of this performance is illustrated in Figs. 17–20.
Discussion
This article explores the use of GNNs in the semantic segmentation of HRS images, focusing on integrating superpixel generation with GNNs. To the best of our knowledge, this issue has not been previously addressed in HRS images. In our approach, superpixel segmentation is collaboratively trained with the entire model, rather than as a standalone preprocessing step. We mitigate the effects of ICS and ICV by combining two graph construction methods: one based on superpixel spatial adjacency and the other on feature similarity. Although our experiments show the superior performance of SEGAT, several aspects require further exploration.
1) Exploration of graph construction methods: While local graphs and global graphs effectively capture both local and global contextual information, combining these two graph construction methods into a unified graph could also be a promising approach. This method can enable a more comprehensive representation of topological relationships in a single graph. In future work, we will explore this approach and its impact on model performance.
2) Exploration of superpixel optimization methods: Our model employs a fixed number of superpixels at one level, which is inflexible for HRS images with varying object scales and complex scenes. To improve superpixel segmentation and enhance model performance, we can consider the following methods to address these issues: 1) Design multilevel superpixels: introducing multilevel and multiscale superpixel representations leverages the advantages of both fine-grained and coarse-grained segmentation, making the method better suited to multiscale objects in HRS images. 2) Design an adaptive superpixel segmentation method that adjusts the number of superpixels based on image content. We will explore these methods in future work.
3) Exploration of graph model optimization methods: The introduction of SAM [87] in 2023 has established a new paradigm in semantic segmentation, drawing widespread attention for its powerful object segmentation capabilities. In the field of RS, researchers have utilized this approach for related studies. For instance, Yan et al. [88] introduced the RingMo-SAM model, which employs SAM for multimodal RS image segmentation, handling both optical and SAR images. Ma et al. [89] utilized objects and boundaries generated by SAM, designing suitable loss functions to achieve RS image semantic segmentation. We can leverage SAM's powerful object boundary segmentation capabilities to optimize incomplete boundaries of graph model segmentation. In summary, the integration of graph models and SAM offers immense potential for future exploration.
4) Exploration of cross-domain issues: Our method relies on supervised learning, which assumes that the training and testing data share the same closed-set label space and are identically distributed. Consequently, when we extend our approach to new datasets with significantly different distributions, SEGAT, which relies on a predefined target domain, requires additional training data to achieve better segmentation performance. Many researchers have studied this cross-domain issue, which is a specialized research direction that focuses on transferring knowledge from labeled source domains to unlabeled or unseen target domains. For example, Luo et al. [90] and Rombach et al. [91] utilized a diffusion model to extract valuable features from large amounts of unlabeled data. Chen et al. [92] employed contrastive learning to enhance domain adaptation capabilities. In the future, we will study the cross-domain problem by exploring domain adaptation techniques such as adversarial training, generative models, and contrastive learning to further enhance the adaptability of our method to new datasets.
Conclusion
This study proposes a novel SEGAT for semantic segmentation of HRS images. The SEGAT model innovatively integrates the SGM and the GSM into a unified pipeline, enabling the refinement of superpixel boundaries and the optimization of semantic segmentation to be carried out simultaneously during network training. Unlike other superpixel-based GNN methods, where superpixel segmentation is typically a preprocessing step, this approach not only reduces undersegmentation errors but also allows the graph structure to be continuously refined under supervised learning. In addition, we design a strategy employing two graph construction methods that fully consider both the spatial positional relationships and the feature similarity of superpixels, reducing the interference of ICS and ICV and thereby enhancing the perception of both local and global contextual information. In the classification step, to retain boundary information, we first classify the superpixel features and then map them back to pixel features by using the pixel-superpixel association map, effectively minimizing missing boundaries and ensuring object integrity. Extensive experiments on three datasets demonstrate the robustness of SEGAT, highlighting its superior performance in HRS image semantic segmentation. However, the model still relies on supervised learning and requires high-quality training data. While the collection of RS images has become more accessible in the era of RS big data, creating accurate labels remains time-consuming and labor-intensive. Given the advantages of GATs in semisupervised or unsupervised learning, which can automatically aggregate information from neighboring nodes to learn latent data structures, we plan to explore ways to reduce dependency on labels in future work, aiming to further improve our method through semisupervised or unsupervised approaches. In addition, as outlined in the discussion section, we will further explore the relevant issues in the future.