
Semantic Segmentation of High-Resolution Remote Sensing Imagery via an End-to-End Graph Attention Network With Superpixel Embedding



Abstract:

Semantic segmentation of high-resolution remote sensing images is crucial in ecological evaluation, natural resource surveys, etc. Compared with CNN-based and transformer-based methods, graph neural networks (GNNs) have drawn increasing attention because they can flexibly model topologies of arbitrary irregular objects on graphs. Researchers typically use superpixels as graph nodes to reduce image noise and computational complexity. However, most superpixel-based GNN methods view superpixel segmentation as a data preprocessing step. This results in fixed graphs input to GNNs and overlooks the effects of undersegmentation. In addition, these methods often employ one graph construction approach, which makes them susceptible to interclass similarity (ICS) or intraclass variability (ICV), leading to segmentation inaccuracies. To address these issues, we propose an end-to-end graph attention network with superpixel embedding (SEGAT) to achieve semantic segmentation with well-delineated boundaries. We first use a learnable neural network, the superpixel generation module (SGM), to generate superpixels, which is cotrained with the subsequent graph segmentation module (GSM) to refine boundaries continuously. Dynamically refined superpixels produce dynamically optimized graphs and mitigate undersegmentation errors. To reduce the interference of ICS and ICV, we then use the GSM to construct local and global graphs based on superpixel spatial positions and feature similarity, respectively, and update superpixel features and graph structure. Finally, the updated superpixel features are classified at the superpixel level, and the results are mapped back to pixel features through the pixel-superpixel association map. Extensive experiments on three datasets, Vaihingen, Potsdam, and UAVid, demonstrate that SEGAT can outperform state-of-the-art methods.
Page(s): 7236 - 7252
Date of Publication: 14 February 2025



SECTION I.

Introduction

High-resolution remote sensing (HRS) image semantic segmentation involves assigning an object type to each image pixel, which is crucial for applications such as natural resource management, ecological evaluation, urban planning, and crop assessment. Despite extensive research, achieving accurate semantic segmentation of HRS images remains an ongoing scientific challenge.

Traditional methods [1], [2], [3], [4], [5] rely on manual feature design and struggle to represent high-level semantic information in HRS images. Deep learning-based approaches have effectively addressed this issue, introducing many promising methods. Most methods have focused on CNN-based and transformer-based research [6], [7], [8], while the exploration of graph neural networks (GNNs) for HRS image semantic segmentation remains relatively limited.

Among these studies, CNN-based methods typically rely on regular grids of images [see Fig. 1(b)] [9], [10], [11], [12], [13]. These methods take advantage of the local receptive field and translation invariance of convolutions to extract features. While they leverage local information, their ability to capture global context is limited. To address this limitation, transformer-based methods have emerged and gained widespread attention [14], [15], [16]. Researchers divide the image into patches, convert them into sequences as input [see Fig. 1(c)], and use the attention mechanism to perform pairwise interactions, effectively capturing the global context of the image.

Although CNN-based and transformer-based methods have made significant progress in semantic segmentation, they still have shortcomings. The shapes of most geographic objects in HRS images are irregular. Using pixel-based regular grids or patch-based sequences to process these shapes is inflexible and inadequate. This limits the ability to understand and represent the topological relationships between objects, leading to missing boundaries and a higher risk of misclassification. In contrast, graph-based representations can flexibly model topologies between objects without being constrained by their geometric shapes or positions. As shown in Fig. 1(d), the image is segmented into superpixels aligned with boundaries, and a graph is constructed based on these superpixels.

Therefore, some researchers have turned their attention to GNNs, which can adaptively learn kernel parameters from object distributions and utilize strong correlations between objects to model topologies on graphs. For instance, Mou et al. [17] and Qin et al. [18] proposed nonlocal graph convolutional networks (GCNs) and spectral–spatial GCNs, respectively, for hyperspectral image classification. However, representing each pixel as a graph node significantly increases computational costs. To avoid this issue, researchers increasingly use superpixels as graph nodes, which reduces computational complexity and helps minimize noise. For instance, some studies [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29] segment data into superpixels using simple linear iterative clustering (SLIC) [30], subsequently using the superpixels as primitives for graph operations. Similarly, Shi et al. [31] used the Felzenswalb and Huttenlocher (FH) algorithm [32] to generate superpixels, which were then used to construct graphs as input to a deep GCN for small waterbody extraction. Ouyang et al. [33] first segmented images into superpixels and then used a deep semantic segmentation module to extract features as the semantic initialization of graph nodes. The constructed graphs were subsequently input into a GCN for classification. In addition, Wu et al. [34] selected simple noniterative clustering (SNIC) [35] to generate superpixels and then fed them into GNNs to perform brain tissue segmentation. Although the methods mentioned above have achieved good results, superpixel segmentation is merely a preprocessing step due to the use of non-differentiable superpixel algorithms, meaning that superpixel generation and GNNs are executed separately. Therefore, Zhang et al. [36] introduced the differentiable superpixel sampling network (SSN) [37], which is successfully integrated with a mixhop GCN for hyperspectral image classification. Meanwhile, Eliasof et al. [38] utilized an unsupervised CNN [39] for superpixel segmentation, which was jointly trained with GNNs for unsupervised natural image semantic segmentation. While the above-mentioned two methods overcome the end-to-end integration issue in hyperspectral and natural images, they do not address the HRS images we focus on. Furthermore, to the best of our knowledge, there is currently no prior work addressing this issue in HRS images. Therefore, achieving an end-to-end integration of superpixel generation and GNNs in HRS images remains a significant challenge.

In summary, superpixel-based GNN methods still face two major limitations. First, in HRS image semantic segmentation, researchers primarily use superpixel segmentation methods as preprocessing steps, resulting in fixed graphs as input to GNNs. This process introduces undersegmentation errors, which can degrade segmentation performance. Second, these methods depend solely on a single graph construction approach, making them susceptible to interclass similarity (ICS) or intraclass variability (ICV), ultimately leading to inaccurate segmentation.

In light of the issues mentioned above, we combine superpixel generation with graph attention networks (GATs) [40] within a unified pipeline. First, we employ a learnable neural network as the superpixel generation module (SGM), which is directly coupled with the subsequent graph segmentation module (GSM). Second, the GSM constructs graphs and updates both graph structures and superpixel features. This module comprises L local graph modules (LMs) and N global graph modules (GMs), where the number of modules determines the frequency of graph structure updates. The LM constructs graphs based on the spatial positions of superpixels, ensuring that nearby superpixels are more closely related than distant ones. In contrast, the GM constructs graphs based on superpixel feature similarity to enhance the understanding of global semantics. GATs within these modules enable more flexible and adaptive edge weight learning and information aggregation. Finally, the updated superpixel features are classified at the superpixel level and then mapped back to pixel-level features using the pixel-superpixel association map, achieving the final semantic segmentation.

The main contributions are summarized as follows.

  1. We propose an end-to-end graph attention network with superpixel embedding (SEGAT) framework to achieve HRS image semantic segmentation with refined boundaries. Dynamically refined superpixels make the constructed graphs align with object boundaries through collaborative training, reducing undersegmentation errors and effectively improving segmentation performance.

  2. We propose a strategy that utilizes two graph construction methods to obtain more accurate graph structures, reducing segmentation inaccuracies caused by the interference of ICS or ICV. This strategy comprehensively considers strong correlations between adjacent objects and the similarity of long-range superpixel features, constructing local and global graphs based on superpixel spatial positions and features, respectively.

  3. We design a GSM composed of LM and GM. This module flexibly updates graph structures and learns local and global superpixel features. Using the graph block with multihead GATs, adaptive edge weight learning is achieved, along with the enhancement and updating of superpixel features.

  4. We conduct extensive experiments on three publicly available HRS datasets, and our method outperforms competitive approaches.

Fig. 1. Image representation of different methods. (a) Image. (b) Pixel-based regular grid representation (CNNs). (c) Patch-based sequence representation (transformers). (d) Superpixel-based graph representation (GNNs). We map the image to superpixel features with homogeneous semantics that match the geometric shapes of objects. Based on these features, a graph can better represent the topological relationships of objects. Since CNN-based and transformer-based methods cannot utilize graph data, using GNNs is an ideal approach.

Fig. 2. Whole network architecture. The SEGAT consists of three components: the SGM, the GSM, and the classification stage.

Fig. 3. Superpixel generation module (SGM).

Fig. 4. Illustration of $\mathcal{N}$. This image shows the initial superpixel grid. To obtain the pixel-superpixel associations for each pixel $\mathbf{p}$ in the yellow box, we only consider the 9 surrounding superpixels within the green box.

Fig. 5. Graph segmentation module (GSM).

Fig. 6. Illustration of superpixel spatial positions.

Fig. 7. Illustration of superpixel feature similarity.

Fig. 8. Illustration of classification.

Fig. 9. Visualization of the feature maps of each component. 1) Image. 2) Ground truth. 3) SGM. 4) SGM+1 (LM). 5) SGM+1 (GM). 6) SGM+1 (LM)+1 (GM). 7) SGM+1 (LM)+2 (GM). 8) SGM+1 (LM)+3 (GM). 9) SGM+(no-arc) LM+(no-arc) GM.

Fig. 10. Model performance w.r.t. superpixels' numbers.

Fig. 11. Visualization of different numbers of superpixels.

Fig. 12. Impact of the number of nearest neighbors in the GM on the model's performance. The baseline performance is illustrated with dashed lines to show the variation of performance.

Fig. 13. Impact of $\lambda$ on superpixels. A larger $\lambda$ value reduces spatial variance in superpixels, whereas a smaller $\lambda$ value leads to higher spatial variance.

Fig. 14. Sensitivity analysis of $\lambda$ in the loss function. The baseline performance is illustrated with dashed lines to show the variation of performance.

Fig. 15. Comparison of SEGAT with SOTA networks on the Vaihingen dataset.

Fig. 16. Visualization of the superpixel generation map on the Vaihingen dataset.

Fig. 17. Comparison of SEGAT with SOTA networks on the Potsdam dataset.

Fig. 18. Visualization of the superpixel generation map on the Potsdam dataset.

Fig. 19. Comparison of SEGAT with SOTA networks on the UAVid dataset.

SECTION II.

Related Work

A. Superpixel

Superpixels, introduced by Ren and Malik in 2003 [41], are formed by grouping perceptually similar image pixels, reducing the number of image primitives for subsequent processing [42].

Traditional superpixel algorithms typically oversegment images by analyzing low-level features such as color properties and spatial distances [42]. The Watershed algorithm [43] identifies and segments objects based on the intensity of image edges, treating areas of high gradient as edges and areas of low gradient as the interior of the same object. SLIC [30] converts the image to the Commission Internationale de l'Eclairage Lab (CIELab) color space and adopts k-means clustering to group the nearest pixels with respect to color and spatial distance. In addition to SLIC, clustering-based algorithms include linear spectral clustering (LSC) [44], manifold simple linear iterative clustering (MSLIC) [45], and SNIC [35], among others. In addition, graph-based algorithms such as FH [32] and entropy rate superpixels (ERS) [46] view the image as an undirected graph and divide it into superpixels based on edge weights. Because the above-mentioned methods are non-differentiable, they are difficult to couple with downstream networks.

Driven by advancements in deep learning, superpixel algorithms based on deep neural networks (DNNs) have emerged in succession. SSN [37] is the first end-to-end trainable algorithm. Nevertheless, this algorithm is not a pure DNN: it only uses a CNN to extract features, which are fed into an iterative k-means clustering to achieve superpixel segmentation. Inspired by the computation of pixel-superpixel associations in SSN, the simple fully convolutional network (SFCN) [47] employs a standard fully convolutional network (FCN) architecture to train a DNN, directly predicting the pixel-superpixel association map for superpixel segmentation. This method is a pure DNN algorithm that can learn superpixels on regular grids. LSN-Net [48] introduces a noniterative lifelong learning strategy with an unsupervised CNN, reducing the spatial and temporal complexity of superpixel generation.

Compared with traditional superpixel algorithms, DNN-based methods can automatically learn higher-dimensional features and more effectively handle HRS images. Furthermore, unlike other DNN-based algorithms, SFCN [47] is a pure DNN algorithm that integrates feature extraction and superpixel segmentation in one step, making it faster and more easily integrated into downstream networks than SSN [37]. In addition, because SFCN is supervised by segmentation labels, it is more capable of producing superpixel segmentation results that match object boundaries than the unsupervised LSN-Net [48]. For these reasons, we adopt the idea of SFCN [47], which directly predicts the pixel-superpixel association map, and use a UNet [9] to optimize superpixel segmentation.

B. Semantic Segmentation With CNNs or Transformers

In recent years, extensive studies have been conducted using both CNN-based and transformer-based approaches to achieve semantic segmentation of remote sensing (RS) images. In CNN-based methods, researchers have enhanced feature map quality and segmentation performance by enlarging the receptive field, integrating multiscale features, and employing attention mechanisms. For instance, Nguyen et al. [49] improved the capture of scale variations in images by fine-tuning the atrous convolution rates within the atrous spatial pyramid pooling module of DeepLabV3+ and optimized segmentation performance by using a feature aggregation network that consolidates features at different scales. Ma et al. [50] designed a multiscale network that integrates the local class-aware module with the global class-aware module, achieving effective segmentation. Yang et al. [51] developed an attention-fused network that fuses multilevel features by using attention modules, addressing the issue of multipath and multilevel feature fusion. In addition, segmentation performance has been improved by modifying traditional single-branch encoder–decoder architectures [9], [52], [53]. For example, HBSeNet [54] utilizes a dual-path network that combines the spatial and context paths, yielding effective segmentation. Similarly, Li et al. [55] also designed a network with a bilateral architecture for RS image semantic segmentation.

In transformer-based methods, we can classify them into pure transformer methods and hybrid designs combining CNNs and transformers. Pure transformers refer to methods where both the encoder and decoder are transformers. For instance, Strudel et al. [15] proposed Segmenter, a convolution-free, fully transformer-based model for semantic segmentation. This model uses output embeddings from image patches to get class labels through a linear or mask transformer decoder. Cao et al. [56] developed a U-shaped network that integrates the Swin transformer with the UNet architecture. Compared to pure transformer methods, hybrid designs combining CNNs and transformers have become more prevalent in semantic segmentation. Zeng et al. [57] proposed MSGCNet, a hybrid architecture that leverages the strengths of both CNNs and transformers, using efficient cross-attention to facilitate interaction among multiscale features of the encoder. Li et al. [58] designed an attention-focused feature enhancement network, which integrates a ResNet50-based encoder with a parallel multistage feature enhancement group, a global multiscale attention mechanism in the decoder, and a feature-weighted fusion module, effectively addressing challenges like complex structures and occlusions. Chen et al. [59] developed a hybrid architecture that combines a ResNet50 encoder with a transformer-based decoder constructed by channel-spatial transformer block and global cross-fusion module for optimizing representation performance. With advances in technology, a model called Mamba [60], which can model long-range relationships through linear computations, has gained attention. To address the limitations of CNNs in modeling long-range dependencies and the high computational complexity of transformers, researchers have gradually applied Mamba to the semantic segmentation of RS images, such as RS3Mamba [61] and RSM-CD [62].

Despite significant progress made by the aforementioned semantic segmentation methods, they still have certain limitations. Many geographical objects in HRS images have irregular shapes, and using pixel-based grids (CNNs) or patch-based sequences (transformers) to represent these objects is both limited and insufficient. These limitations hinder the accurate representation of topological relationships between objects, increasing boundary errors and misclassification. In contrast, GNNs depend on graph-based representations, which offer greater flexibility in modeling the topologies of irregular objects. Therefore, we use GNNs to achieve HRS image semantic segmentation.

C. Semantic Segmentation With GNNs

GNNs learn from graph-structured data and are applied in transportation networks, biomedicine, recommender systems, social networks, computer vision, etc. The concept of GNNs was first proposed by Gori et al. [63] in 2005 and further elaborated by Scarselli et al. [64] and Gallicchio et al. [65]. Subsequently, with the advancement of GNNs, numerous studies have proposed various models, including GCNs [66] and GATs [40]. GCNs apply the convolution operator of CNNs to graphs, updating the features of each node by aggregating neighbor information with the same weight. Unlike GCNs, GATs use attention mechanisms to dynamically learn the importance of neighboring nodes by assigning weights based on their relevance to the current node. Therefore, GATs can more flexibly capture dependencies between nodes. Compared to CNN-based and transformer-based methods, GNNs can treat images as undirected graphs and achieve effective and flexible topological modeling of objects with arbitrary shapes. Consequently, some researchers have explored GNN-based approaches for RS image semantic segmentation.

In these approaches, researchers primarily use superpixels as a preprocessing step to reduce computational complexity. Wan et al. [26] and Liu et al. [24] used SLIC to preprocess the images into superpixels, which were then used as graph nodes in the GCN for hyperspectral image classification. Similar methods have been proposed by works [22], [23], [28], [29], [67]. Wang et al. [27] proposed a method using multiscale superpixels segmented with SLIC, which were then incorporated into a weighted GCN for land cover classification of polarimetric synthetic aperture radar images. Diao et al. [21] also employed SLIC for superpixel segmentation of HRS images, using these superpixels as input to an attention GNN. However, the accuracy of these methods heavily depends on the quality of superpixel segmentation. If undersegmentation occurs, correcting this issue within the GNN framework is difficult.

Furthermore, these approaches consider only a single graph construction method. For instance, Diao et al. [21] and Eliasof et al. [38] used the k-nearest neighbor algorithm based on superpixel features for graph construction. However, relying solely on feature similarity to determine the relationship between objects is susceptible to ICS or ICV. Wu et al. [34] and Zhang et al. [36] utilized spatial adjacency relationships for graph construction, where adjacent superpixels were assigned edge weights of 1, and nonadjacent superpixels were assigned 0. Similar methods were employed by the researchers in [24] and [29]. Some studies [26], [31], [33] used the spatial position of superpixels to construct graphs and assigned edge weights based on the spectral similarity between adjacent superpixels. While these methods satisfy the principle that closer objects have stronger correlations, they ignore long-range semantics.

Therefore, based on the above-mentioned issues, we propose an HRS image semantic segmentation method that implements end-to-end superpixel-based graph segmentation by using two graph construction approaches. This method dynamically optimizes superpixels to provide more accurate object boundaries while providing more robust topological graph structure and contextual information.

SECTION III.

Methodology

A. Overall Architecture

The SEGAT architecture, illustrated in Fig. 2, comprises three main components: the superpixel generation module (SGM), the graph segmentation module (GSM), and the classification stage. The process begins with image input, which the SGM processes to generate the pixel-superpixel association map \mathbf {Q}, pixel features \mathbf {P}, and initial superpixels. The superpixel feature \mathbf {S} is calculated using \mathbf {Q}, \mathbf {P}, and initialized superpixels. Next, the local graph module (LM) in the GSM utilizes \mathbf {Q} to calculate the adjacency matrix \mathbf {A} based on the spatial positions of the superpixels \mathbf {S}, forming the local topological graph (\mathbf {S}, \mathbf {A}). Graph blocks then enhance and update the superpixel features based on this graph structure. Subsequently, the global graph module (GM) in the GSM calculates the adjacency matrix \mathbf {A^{\prime }} based on the similarity of the updated superpixel features \mathbf {S^{\prime }}, creating a new global topological graph (\mathbf {S^{\prime }}, \mathbf {A^{\prime }}). The graph blocks once again enhance and update the superpixel features based on graph (\mathbf {S^{\prime }}, \mathbf {A^{\prime }}). The number of LM and GM modules determines how many times the graph structure is updated. Finally, the superpixel features \mathbf {S^{\prime \prime }} obtained through the GSM undergo classification, achieving the final semantic segmentation. The entire process is an end-to-end workflow. Further details on SEGAT will be elaborated in the following sections.

B. Superpixel Generation Module

The primary purpose of the SGM is to obtain the pixel-superpixel association map \mathbf {Q} and superpixel features \mathbf {S}. Fig. 3 portrays the SGM.

Specifically, we use a UNet with a ResNet101 backbone for superpixel segmentation. HRS images X\in \mathbb {R}^{H\times W \times B} are used as input for the UNet, yielding pixel features \mathbf {P}. Based on \mathbf {P}, \mathbf {Q} is obtained through a convolutional operation and a softmax layer. We can represent the mappings of \mathbf {Q} and \mathbf {P} as tensors Q\in \mathbb {R}^{H\times W \times |\mathcal {N}|} and P\in \mathbb {R}^{H\times W \times D}, respectively. Here, \mathcal {N} denotes the set of surrounding superpixels of a certain pixel p, where |\mathcal {N}|=9. Note that assigning each pixel of an image to an arbitrary initialized superpixel leads to high computational costs, so the search is constrained to the 9 grid cells surrounding a pixel p, as shown in Fig. 4. D represents the dimension of hidden features, where D=20.
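As a concrete illustration of how \mathbf {Q} can be produced from the UNet pixel features, the following PyTorch sketch implements a minimal association head. The class name AssociationHead, the 3 × 3 kernel size, and the batch-first tensor layout are illustrative assumptions; only the structure (one convolution followed by a softmax over the 9 candidate superpixels) follows the description above.

```python
import torch
import torch.nn as nn

class AssociationHead(nn.Module):
    """Minimal sketch of the SGM head: one convolution followed by a softmax over the
    |N| = 9 candidate superpixels surrounding each pixel (kernel size is an assumption)."""

    def __init__(self, feat_dim: int = 20, num_candidates: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(feat_dim, num_candidates, kernel_size=3, padding=1)

    def forward(self, pixel_features: torch.Tensor) -> torch.Tensor:
        # pixel_features: (B, D, H, W) from the UNet; returns Q: (B, 9, H, W),
        # with the 9 channel values of each pixel summing to 1.
        return torch.softmax(self.conv(pixel_features), dim=1)
```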

In the process of calculating superpixel features \mathbf {S}, V_{s}(\mathbf {p}) is the probability of a pixel p belonging to each superpixel s\in \mathcal {N}, such that \sum _{s\in \mathcal {N}}V_{s}(\mathbf {p})=1. Each pixel is assigned to the superpixel with the highest probability: S_{\text{highest}} = \arg \max _{s} \, V_{s}(\mathbf {p}). We initialize superpixels \text{sp}\_\text{index}\in \mathbb {R}^{H\times W \times |\mathcal {N}|}. Subsequently, \mathbf {Q}, \mathbf {P}, and \mathbf {sp\_index} are flattened into Q\in \mathbb {R}^{T_{p}\times |\mathcal {N}|}, P\in \mathbb {R}^{T_{p}\times D}, and \text{sp}\_\text{index}\in \mathbb {R}^{T_{p}\times |\mathcal {N}|}, respectively, where T_{p} represents the total number of pixels. Then, \mathbf {S} is obtained based on \mathbf {Q}, \mathbf {P}, and \mathbf {sp\_index}. g(\mathbf {p}) denotes the pixel feature value of a certain pixel p. The feature value of \mathbf {S} is denoted by the symbol h(\mathbf {s}). The superpixel feature value is computed as follows: \begin{equation*} h(\mathbf {s}) = \frac{\sum _{\mathbf {p} : s \in \mathcal {N}} g(\mathbf {p}) \cdot V_{s}(\mathbf {p})}{\sum _{\mathbf {p} : s \in \mathcal {N}} V_{s}(\mathbf {p})}. \tag{1} \end{equation*}


Based on the above-mentioned Formula (1), the superpixel features \mathbf {S}\in \mathbb {R}^{T_{s}\times D} are obtained, where T_{s} represents the total number of superpixels \begin{equation*} T_{s} = \frac{H \times W}{M^{2}}. \tag{2} \end{equation*}

H and W denote the height and width of an image, respectively. M represents the segmentation scale parameter, where M is set to 16.
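For illustration, the soft-assignment aggregation of (1) can be written compactly in matrix form. A minimal PyTorch sketch is given below, assuming a dense association map Q of shape (T_p, T_s) whose rows sum to 1; in SEGAT, Q is only defined over the 9 candidate superpixels of each pixel, and the dense form is used here purely for clarity.

```python
import torch

def superpixel_features(Q: torch.Tensor, P: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Soft-assignment aggregation of pixel features into superpixel features, cf. (1).

    Q: (T_p, T_s) pixel-superpixel association map, rows summing to 1 (dense here for clarity).
    P: (T_p, D) pixel features.
    Returns S: (T_s, D) superpixel features.
    """
    weighted_sum = Q.t() @ P                  # numerator of (1): sum_p g(p) * V_s(p)
    weights = Q.sum(dim=0).unsqueeze(1)       # denominator of (1): sum_p V_s(p)
    return weighted_sum / (weights + eps)

# For a 512 x 512 crop with M = 16, (2) gives T_s = (512 * 512) / 16**2 = 1024 superpixels.
```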

C. Graph Segmentation Module

Using superpixels as nodes in the GSM can significantly reduce the computational complexity. GATs rely on internode features and can assign different weights to neighbors of each node, allowing for more flexible and adaptive information aggregation. Therefore, we use the GATs to aggregate superpixel features. Our GSM is detailed in Fig. 5.

The whole GSM comprises two modules: the LM and the GM, with stacks of identical layers, L=1 and N=3, respectively. Each module is composed of graph construction and the graph block containing GATs with the transformer architecture. Specific details are described as follows.

1) Local Graph Construction

A superpixel is a collection of pixels characterized by similar color, texture, and spatial proximity. As shown in Fig. 6, superpixels close to each other are likely to belong to the same category. To better preserve local location information, we construct a local graph structure based on the superpixels' spatial positions. The binary adjacency matrix \mathbf {A} is built to represent the topological relationships of each superpixel position. If two superpixels share a boundary, they are adjacent and the value is 1; otherwise, the value is 0 \begin{equation*} A_{ij} = \left\lbrace \begin{array}{ll}1, & \text{if } i \text{ and } j \text{ are adjacent} \\ 0, & \text{otherwise.} \end{array}\right. \tag{3} \end{equation*}

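To make the construction of \mathbf {A} concrete, the sketch below builds the binary adjacency matrix from a hard superpixel index map (e.g., the argmax of \mathbf {Q}); two superpixels are marked adjacent whenever any of their pixels touch horizontally or vertically. The function name and the 4-connectivity choice are assumptions for illustration.

```python
import torch

def local_adjacency(sp_map: torch.Tensor, num_sp: int) -> torch.Tensor:
    """Binary adjacency matrix A of (3) from an (H, W) map of integer superpixel indices.

    Two superpixels are adjacent if any of their pixels share a horizontal or
    vertical boundary (4-connectivity, assumed here for illustration).
    """
    A = torch.zeros(num_sp, num_sp)
    left, right = sp_map[:, :-1].reshape(-1), sp_map[:, 1:].reshape(-1)   # horizontal pixel pairs
    top, bottom = sp_map[:-1, :].reshape(-1), sp_map[1:, :].reshape(-1)   # vertical pixel pairs
    for a, b in ((left, right), (top, bottom)):
        mask = a != b                       # pixel pairs lying on a superpixel boundary
        A[a[mask], b[mask]] = 1.0
        A[b[mask], a[mask]] = 1.0
    return A
```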

2) Global Graph Construction

While spatially adjacent superpixels may belong to the same category, determining their category solely based on local positional relationships is insufficient. As shown in Fig. 7, “cars” in the same category may not be adjacent. Consequently, it is necessary to disregard positional relationships and construct the global graph structure from the perspective of superpixel feature similarity.

We use the k-nearest neighbors algorithm to compute the similarity between superpixel features. Smaller distance values indicate greater similarity between superpixel features, and vice versa. For each pair of superpixels, the feature distance is calculated as follows: \begin{equation*} F_{\text{sim}} = \text{distance}(h(s_{i}), h(s_{j})) = \Vert h(s_{i}) - h(s_{j}) \Vert _{2}. \tag{4} \end{equation*}


Here, F_{\text{sim}} denotes the distance between superpixel features. h(s_{i}) is the feature value of the ith superpixel, and h(s_{j}) is the feature value of the jth superpixel.

We build the binary adjacency matrix \mathbf {A^{\prime }} to represent the feature distance relationships between superpixels. Based on the calculated feature distance values F_{\text{sim}}, we select the 9 superpixels closest to the ith superpixel and assign a value of 1 to their corresponding positions in the adjacency matrix; all others are assigned a value of 0 \begin{equation*} A^{\prime }_{ij} = \left\lbrace \begin{array}{ll}1, & \text{select only the 9 superpixels closest to} \\ & \text{the } i\text{th superpixel in feature distance} \\ 0, & \text{otherwise.} \end{array}\right. \tag{5} \end{equation*}

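A possible implementation of the k-nearest-neighbor construction in (4) and (5) is sketched below: it computes pairwise Euclidean distances between superpixel features, excludes self-distances, and connects each superpixel to its 9 closest neighbors. The function name is illustrative, and whether the resulting matrix is subsequently symmetrized is not specified in the text.

```python
import torch

def global_adjacency(S: torch.Tensor, k: int = 9) -> torch.Tensor:
    """Binary adjacency matrix A' of (5): each superpixel is connected to its k nearest
    neighbors in feature space, using the Euclidean distance of (4)."""
    dist = torch.cdist(S, S)                      # pairwise L2 distances between features
    dist.fill_diagonal_(float("inf"))             # never select the superpixel itself
    idx = dist.topk(k, largest=False).indices     # k closest superpixels per row
    A = torch.zeros_like(dist)
    A.scatter_(1, idx, 1.0)
    return A
```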

3) Graph Block

A superpixel generation map is viewed as a graph \mathcal {G} = (\mathcal {V}, \mathcal {E}) in the GSM, where \mathcal {V} represents the set of nodes and \mathcal {E} represents the set of edges between nodes. These correspond to the superpixel node feature matrix and the adjacency matrix. The input to a GAT layer is a set of superpixel node features, \mathcal {S} = \lbrace \mathcal {S}_{1}, \mathcal {S}_{2}, \mathcal {S}_{3}, \ldots, \mathcal {S}_{E} \rbrace, S_{i}\in \mathbb {R}^{K}, S\in \mathbb {R}^{E\times K}, where E is the number of superpixel nodes and K is the number of features in each superpixel node.

The input features S\in \mathbb {R}^{E\times K} are transformed into higher dimensional features S^{\prime }\in \mathbb {R}^{E\times K^{\prime }} through a learnable linear transformation to obtain a richer semantic representation. Next, the attention weight matrix H_{\text{score}}\in \mathbb {R}^{E\times E} is obtained by self-attention. The attention scores are calculated as in (6) \begin{equation*} e_{ij} = \text{LeakyReLU}\left(\mathbf {a}^{T} \left[ \mathbf {W} \mathbf {S}_{i} \parallel \mathbf {W} \mathbf {S}_{j} \right] \right). \tag{6} \end{equation*}


Here, e_{ij} represents the importance of node j's features to node i, \mathbf {W} represents the initial weight matrix W\in \mathbb {R}^{K^{\prime }\times K} used for linear transformations, and \mathbf {a} represents the initial learnable attention weight vector \mathbf {a}\in \mathbb {R}^{2K^{\prime }}.

The attention scores in (6) are defined independently of the graph structure and can, in principle, be computed between all node pairs. However, since not all superpixel nodes are interconnected, the adjacency matrix representing the graph structure is required. By performing masked attention, the structural information is incorporated into the attention weight matrix H_{\text{score}}. To make attention scores easily comparable across different nodes, the softmax function is used to normalize them between a node i and all its neighboring nodes j, as in (7) \begin{equation*} \alpha _{ij} = \text{softmax}_{j}(e_{ij}) = \frac{\exp (e_{ij})}{\sum _{m \in \mathcal {N}_{i}} \exp (e_{im})}. \tag{7} \end{equation*}


In the graph, attention scores e_{ij} are calculated only for nodes j\in \mathcal {N}_{i}, where \mathcal {N}_{i} denotes the neighborhood of node i.

The final attention weight matrix H_{\text{final}} is obtained using Formula (7) above. Using the normalized attention scores, the linear combination of corresponding features is calculated to serve as each node's output features \begin{equation*} S_{i}^{\prime } = \sigma \left(\sum _{j \in \mathcal {N}_{i}} \alpha _{ij} \mathbf {W} \mathbf {S}_{j} \right). \tag{8} \end{equation*}

\sigma is the nonlinear function ReLU.

Then, the attention weight matrix is computed multiple times using the multihead attention mechanism. The features from several independent attention mechanisms (with N=3 heads), which perform (8), are then concatenated to produce the following output feature representation \begin{equation*} S_{i}^{\prime } = \bigg \Vert _{n=1}^{N} \sigma \left(\sum _{j \in \mathcal {N}_{i}} \alpha _{ij}^{n} \mathbf {W}^{n} \mathbf {S}_{j} \right) \tag{9} \end{equation*}

where \alpha _{ij}^{n} represents the normalized attention scores calculated by the nth attention mechanism \alpha ^{n}, and \mathbf {W}^{n} denotes the corresponding weight matrix of the input linear transformation.

Finally, the attention weight matrix is computed from the concatenated features and (8) is applied once more. To achieve better results, we integrate the GAT within the transformer architecture to further enhance and update the superpixel features.
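The following single-head sketch summarizes how (6)–(8) combine: a shared linear transform, pairwise attention scores masked by the adjacency matrix, row-wise softmax, and weighted aggregation. It omits the multihead concatenation of (9) and the transformer-style block used in SEGAT, and it assumes every node has at least one neighbor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Single-head GAT layer following (6)-(8); a sketch, not the exact SEGAT graph block."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared linear transform W
        self.a_src = nn.Linear(out_dim, 1, bias=False)    # first half of the attention vector a
        self.a_dst = nn.Linear(out_dim, 1, bias=False)    # second half of the attention vector a

    def forward(self, S: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # S: (E, K) node features, A: (E, E) binary adjacency (local or global graph)
        h = self.W(S)                                            # (E, K')
        e = F.leaky_relu(self.a_src(h) + self.a_dst(h).t())      # e_ij of (6), all node pairs
        e = e.masked_fill(A == 0, float("-inf"))                 # masked attention
        alpha = torch.softmax(e, dim=1)                          # alpha_ij of (7)
        return F.relu(alpha @ h)                                 # aggregation of (8)
```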

D. Classification

The classification process is shown in Fig. 8. Since superpixel features preserve well-delineated boundaries, we first perform superpixel-wise classification to retain this fine boundary information, followed by pixel-wise classification.

Initially, based on the superpixel features output by the GSM, we apply a linear layer to classify these features. Then, using (10), we map the classified superpixel features back to the pixel features, thereby achieving semantic segmentation \begin{equation*} \hat{g(\mathbf {p})} = \sum _{s \in \mathcal {N}} h(\mathbf {s}) \cdot V_{s}(\mathbf {p}). \tag{10} \end{equation*}


Here, \hat{g(\mathbf {p})} denotes the pixel feature value of a certain pixel p obtained by inverse mapping. h(\mathbf {s}) represents the superpixel feature value, and V_{s}(\mathbf {p}) is the probability of a pixel p belonging to each superpixel.
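In matrix form, (10) reduces to a multiplication of the association map with the superpixel class scores; a minimal sketch with a dense Q and a hypothetical function name is shown below.

```python
import torch

def superpixel_to_pixel(Q: torch.Tensor, S_logits: torch.Tensor) -> torch.Tensor:
    """Map superpixel-wise class scores back to pixels, cf. (10).

    Q: (T_p, T_s) pixel-superpixel association map (dense here for clarity).
    S_logits: (T_s, C) class scores from the linear classifier on superpixel features.
    Returns (T_p, C) pixel-wise scores, which can be reshaped to (H, W, C).
    """
    return Q @ S_logits
```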

E. Loss Function

To obtain superpixels that align with semantic boundaries and achieve more accurate semantic segmentation during network optimization, we incorporated four individual loss functions into the overall loss to optimize the parameters of the entire network.

The reconstruction loss and compactness loss are calculated based on the predicted pixel-superpixel association map \mathbf {Q} to supervise superpixel segmentation. This involves separately computing the superpixel feature value h(\mathbf {s}) of the semantic labels and the superpixel location L(\mathbf {s}). The superpixel location is calculated in the same way as the superpixel feature, as shown in (11) \begin{equation*} L(\mathbf {s}) = \frac{\sum _{\mathbf {p} : s \in \mathcal {N}} f(\mathbf {{l}}) \cdot V_{s}(\mathbf {p})}{\sum _{\mathbf {p} : s \in \mathcal {N}} V_{s}(\mathbf {p})}. \tag{11} \end{equation*}


Here, f(\mathbf {{l}}) represents the pixel position by image coordinates. Then, we map the superpixel location back to the pixel location using the following (12), obtaining f^{\prime }(\mathbf {l}) \begin{equation*} f^{\prime }(\mathbf {{l}}) = \sum _{s \in \mathcal {N}} L(\mathbf {s}) \cdot V_{s}(\mathbf {p}). \tag{12} \end{equation*}


Therefore, the reconstruction loss and the compactness loss are calculated as follows: \begin{align*} \mathcal {L}_{\text{recon}} =& \text{CE}(g(\mathbf {p}), g^{\prime }(\mathbf {p})) \tag{13} \\ \quad \mathcal {L}_{\text{compact}} =& \sum _{\mathbf {p}} \Vert f(\mathbf {{l}}) - f^{\prime }(\mathbf {{l}}) \Vert _{2}. \tag{14} \end{align*}


Here, g(\mathbf {p}) denotes the one-hot encoding vector of the semantic label, and g^{\prime }(\mathbf {p}) represents the pixel feature value obtained by inverse mapping.

In addition, the prediction results are refined using both cross-entropy loss \mathcal {L}_{1} and Dice loss \mathcal {L}_{2} to optimize semantic segmentation \begin{align*} \mathcal {L}_{1} =& \text{CE}(\text{output}, \text{label}) \tag{15} \\ \mathcal {L}_{2} =& \text{Dice}(\text{output}, \text{label}). \tag{16} \end{align*}


The overall loss function is utilized by the entire network, as shown in (17) \begin{equation*} \mathcal {L}_{\text{total}} = \mathcal {L}_{\text{recon}} + \lambda \mathcal {L}_{\text{compact}} + \mathcal {L}_{1} + \mathcal {L}_{2}. \tag{17} \end{equation*}

\lambda is set to 0.3.
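To summarize how the four terms of (13)-(17) combine, a hedged PyTorch sketch is given below; the function names, the flattened (N, ·) tensor shapes, and the particular soft-Dice formulation are assumptions, since the paper does not spell out its Dice implementation.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs: torch.Tensor, target_onehot: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Soft Dice loss averaged over classes; probs and target_onehot have shape (N, C).
    inter = (probs * target_onehot).sum(dim=0)
    union = probs.sum(dim=0) + target_onehot.sum(dim=0)
    return 1.0 - (2.0 * inter / (union + eps)).mean()

def total_loss(recon, loc, loc_recon, logits, label, lam=0.3):
    """Overall objective of (17): reconstruction CE (13), lambda-weighted compactness (14),
    segmentation CE (15), and Dice (16).

    recon:     (N, C) labels mapped to superpixels and back to pixels, g'(p)
    loc:       (N, 2) pixel image coordinates, f(l)
    loc_recon: (N, 2) coordinates mapped to superpixels and back, f'(l)
    logits:    (N, C) pixel-wise class scores predicted by the network
    label:     (N,)   ground-truth class indices
    """
    onehot = F.one_hot(label, logits.size(1)).float()
    l_recon = F.cross_entropy(recon, label)                    # (13), reconstruction treated as scores
    l_compact = (loc - loc_recon).norm(dim=1).sum()            # (14)
    l_ce = F.cross_entropy(logits, label)                      # (15)
    l_dice = dice_loss(torch.softmax(logits, dim=1), onehot)   # (16)
    return l_recon + lam * l_compact + l_ce + l_dice           # (17)
```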

SECTION IV.

Experiments

A. Datasets

The ISPRS Vaihingen dataset has 33 true orthophoto images with an average size of 2494 × 2064 pixels and a spatial resolution of 9 cm. This dataset contains 6 land-cover categories: impervious surfaces (imp.surf), buildings, low vegetation (low veg), trees, cars, and clutter/background. We conduct all experiments on the ground truth with eroded boundaries. In our experiments, we utilize 17 images for testing and 16 images for training.

The ISPRS Potsdam dataset has 38 true orthophoto images with an image size of 6000 × 6000 pixels and a spatial resolution of 5 cm. This dataset contains 6 land-cover categories: impervious surfaces (imp.surf), buildings, low vegetation (low veg), trees, cars, and clutter/background. We conduct all experiments on the ground truth with eroded boundaries. Our experiments use 14 images for testing and 24 images for training.

The UAVid dataset mainly includes 8 categories: clutter, building, road, tree, low vegetation (low veg), moving car, static car, and human. The image resolution is either 4096 × 2160 or 3840 × 2160. The dataset comprises 420 images, including 200 training images and 70 validation images. The official website provides 150 test images.

The images in all the aforementioned datasets are cropped into patches of 512 × 512 pixels.

B. Implementation Details

Experiments are conducted using PyTorch on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. We utilize the adaptive moment estimation (Adam) optimizer with a learning rate of 3e-4 and a batch size of 2. Data augmentation techniques, including random horizontal flip, random vertical flip, random brightness-contrast, and random rotation, are applied during training. The number of training epochs is 200. During the inference stage, horizontal and vertical flips are employed as the test-time augmentation strategy.
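The flip-based test-time augmentation can be realized, for example, by averaging predictions over the identity and the two flips; the averaging step and the assumption that the model returns per-pixel logits are ours, since the paper only states which flips are used.

```python
import torch

@torch.no_grad()
def tta_predict(model, image: torch.Tensor) -> torch.Tensor:
    """Test-time augmentation with horizontal and vertical flips (averaging is an assumption).

    image: (B, C, H, W) input batch; the model is assumed to return per-pixel logits.
    """
    logits = model(image)
    logits = logits + torch.flip(model(torch.flip(image, dims=[-1])), dims=[-1])  # horizontal flip
    logits = logits + torch.flip(model(torch.flip(image, dims=[-2])), dims=[-2])  # vertical flip
    return logits / 3.0
```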

C. Evaluation Metrics

The OA, mIoU, and mF1 are used to evaluate the model's performance, where mF1 is the average F1 score over all categories \begin{align*} \text{OA} =& \frac{\sum _{k=1}^{K} \text{TP}_{k}}{\sum _{k=1}^{K} (\text{TP}_{k} + \text{FN}_{k})} \tag{18}\\ \text{mIoU} =& \frac{1}{K} \sum _{k=1}^{K} \frac{\text{TP}_{k}}{\text{TP}_{k} + \text{FP}_{k} + \text{FN}_{k}} \tag{19}\\ F1_{k} =& 2 \times \frac{\text{precision}_{k} \times \text{recall}_{k}}{\text{precision}_{k} + \text{recall}_{k}} \tag{20} \end{align*}

where \begin{align*} \text{precision}_{k} = & \frac{\text{TP}_{k}}{\text{TP}_{k} + \text{FP}_{k}} \tag{21} \\ \text{recall}_{k} =& \frac{\text{TP}_{k}}{\text{TP}_{k} + \text{FN}_{k}}. \tag{22} \end{align*}

Here, \text{TP}_{k}, \text{FP}_{k}, \text{TN}_{k}, and \text{FN}_{k} represent the true positive, false positive, true negative, and false negative pixels for class k, respectively.
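For reference, the three metrics can be computed from a class confusion matrix as sketched below (NumPy, hypothetical function name); OA is taken as the fraction of correctly classified pixels.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray, eps: float = 1e-10):
    """OA, mIoU, and mF1 from a (K, K) confusion matrix (rows: ground truth, columns: prediction)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    oa = tp.sum() / conf.sum()                                            # (18): correct pixels / all pixels
    miou = np.mean(tp / (tp + fp + fn + eps))                             # (19)
    precision = tp / (tp + fp + eps)                                      # (21)
    recall = tp / (tp + fn + eps)                                         # (22)
    mf1 = np.mean(2 * precision * recall / (precision + recall + eps))    # (20)
    return oa, miou, mf1
```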

D. Ablation Study

All ablations are performed on the Vaihingen dataset.

1) Ablation Study for the SGM

The primary purpose of the SGM is to obtain the pixel-superpixel association map and superpixel features. We conduct experiments with different configurations in the SGM, as shown in Table I. The results reveal that using a simple convolutional layer (“1 Conv”) significantly reduces performance, with OA, mIoU, and mF1 dropping to 75.39%, 53.67%, and 68.28%, respectively. Similarly, using an FCN leads to only 90.03% OA, 81.24% mIoU, and 89.53% mF1, indicating the limitations of simpler architectures. In contrast, using a UNet improves performance and yields superior results, emphasizing the necessity of employing the UNet for superpixel generation. Moreover, we investigate the impact of different UNet backbones on the model's performance. The results indicate that shallower backbones, such as ResNet18 and ResNet34, yield lower OA, mIoU, and mF1 values. In contrast, deeper backbones, such as ResNet101, significantly improve superpixel quality and overall segmentation performance, achieving OA of 92.03%, mIoU of 84.56%, and mF1 of 91.53%, yielding the optimal result. This demonstrates the advantage of deeper architectures in capturing features. These findings underscore the critical role of a UNet with a ResNet101 backbone in enhancing our model's performance.

TABLE I. Ablation Study for the SGM

2) Ablation Study for SEGAT's Each Component

Table II outlines the combinations and abbreviations of SEGAT's components. The SGM serves as the backbone and baseline for incrementally designing our method in conjunction with the GSM. Ablation 1 includes only the SGM, which achieves semantic segmentation by mapping superpixels back to pixels. Ablation 2 adds one LM to the SGM base to validate the effectiveness of a single LM within the GSM. Similarly, Ablation 3 integrates one GM into the SGM base to assess the impact of a single GM. Due to the relatively slow computation of local graph construction, we limit the GSM to a single LM, as incorporating more would reduce the network's efficiency. Instead, we focus on increasing the number of GMs. Ablation 4 combines the SGM with one LM and one GM, Ablation 5 with one LM and two GMs, and Ablation 6 with one LM and three GMs. Finally, Ablation 7 employs an LM and a GM within the GSM, excluding the transformer architecture, to evaluate the effectiveness of integrating GAT into the transformer framework.

TABLE II. Ablation Experimental Settings for Components of SEGAT

Table III presents the results for the components of SEGAT. First, Ablations 1–4 demonstrate that incorporating either LM or GM significantly enhances image semantic segmentation performance. Moreover, combining LM and GM further improves accuracy, indicating that employing two graph construction methods is superior to using only one. Second, a comparison of Ablations 4–6 shows a steady increase in accuracy as the number of GMs in the GSM increases, suggesting that a dynamically updated graph structure boosts the model's segmentation performance. However, due to computational limitations, the combination “SGM+1(LM)+4(GM)” is not feasible. Third, the comparison between Ablations 7 and 4 reveals that excluding the transformer architecture when using GAT leads to a significant reduction in precision, underscoring the critical role of the transformer architecture in achieving high performance. In summary, SEGAT (Ablation 6) achieves the highest accuracy, improving OA, mIoU, and mF1 by 1.19%, 1.74%, and 1.06%, respectively, compared to Ablation 1. The feature maps of each component are visualized in Fig. 9.

TABLE III. Ablation of SEGAT's Each Component

3) Ablation Study for the Number of Superpixels

Table IV and Fig. 10 reveal that as the number of superpixels increases, performance peaks at 1024, then declines. Considering the impact of graph nodes' numbers on computational efficiency, it is inadvisable to use too many superpixels. This is qualitatively validated in Fig. 11. Specifically, HRS images contain targets of varying scales, making the selection of the appropriate number of superpixels crucial for segmentation performance. If the number of superpixels is too small, segmentation boundaries fail to accurately capture finer details, particularly for small-scale objects like cars. As shown in Fig. 11, with 1024 superpixels, the segmentation aligns well with object boundaries, effectively considering objects of different scales. The edges of two buildings within the green box and the gap between them are also correctly segmented. Conversely, an excessive number of superpixels results in overly fine-grained segmentation. While this may capture finer details, it causes fragmented representations, introduces noise, reduces segmentation efficiency, and increases computational costs.

TABLE IV. Ablation of Superpixels' Numbers

4) Ablation Study for the Number of the Nearest Neighbors in the GM

To evaluate the influence of the number of nearest neighbors in the GM on the overall performance, we conduct ablation studies by varying the number of nearest neighbors from 9 to 100. As shown in Fig. 12, the performance metrics, including OA, mIoU, and mF1, remain relatively stable across different settings. Specifically, the OA fluctuates slightly around 91.90%, while the mIoU and mF1 hover around 84.53% and 91.50%, respectively. These results indicate that the model's performance is not highly sensitive to the choice of the nearest neighbor count within this range. To maintain a balanced graph structure between the GM and LM, we chose 9-nearest neighbors, ensuring that the number of edges in both the local and global graphs remains approximately the same.

5) Ablation Study for the Number of GATs' Heads

Implementing the attention mechanism in GATs multiple times corresponds to multihead attention, where the number of heads represents the number of attention weights between a node and its neighboring nodes. According to Table V, the model's performance improves as the number of heads increases, peaking at three heads. However, due to limited computational resources, the model cannot support four heads, as it exceeds GPU memory capacity. While additional heads can capture diverse features, enriching information and enhancing the model's expressiveness, they also decrease computational efficiency. Considering the negative impact of excessive heads on computational efficiency, we recommend limiting the number of heads to ensure a balance between performance and resource utilization.

TABLE V. Ablation of Heads' Numbers of GATs

6) Ablation Study for Parameter \lambda of Loss Function

To evaluate the impact of the \lambda parameter in the loss function on the model's performance, we conduct experiments with varying \lambda values, as illustrated in Fig. 14. In the loss function (17), \lambda is designed to control the contribution of the compactness loss (14). The compactness loss encourages superpixels to exhibit lower spatial variance. Consequently, a larger \lambda value reduces spatial variance in the segmented superpixels, while a smaller \lambda value leads to higher spatial variance. This relationship is visualized in Fig. 13. As shown in Fig. 14, the model performance improves gradually as \lambda increases up to 0.3. Beyond this value, performance significantly declines across all metrics. The model achieves optimal performance at \lambda = 0.3, with mIoU, OA, and mF1 reaching their peak values of 84.56%, 92.03%, and 91.53%, respectively. This result demonstrates that \lambda = 0.3 provides the optimal tradeoff, maximizing segmentation accuracy and overall performance. Hence, \lambda = 0.3 is selected for our experiments.

7) Computational Complexity Comparison With Other Models

Table VI compares the computational requirements of different models in terms of parameter complexity and floating point operations (FLOPs), analyzing the tradeoffs with mIoU performance. The table shows that our SEGAT ranks fourth in computational demand, which is not the highest, while achieving the best segmentation performance.

TABLE VI. Comparison of Computational Complexity and Performance

E. Comparing With State-of-the-Art (SOTA) Methods

SEGAT results are compared with SOTA methods on Vaihingen, Potsdam, and UAVid datasets. Table VII lists all comparison methods.

TABLE VII. Comparison Methods on Three Datasets

In Table VIII, SEGAT demonstrates superior performance, achieving an OA of 92.03%, a mIoU of 84.56%, and an mF1 of 91.53%. As illustrated in columns 6–8 of Fig. 15, the first row indicates that LoG-CAN merges trees that are spaced apart, whereas SEGAT effectively preserves their separation, closely matching the ground truth. In the second row, LoG-CAN fails to accurately capture the local details of “buildings,” resulting in local pixels belonging to “building” being misclassified as “impervious surfaces.” In contrast, SEGAT better maintains the integrity of “buildings” and preserves semantic object boundaries, more accurately reflecting ground conditions. This highlights SEGAT's superior ability to handle interactions between local and global information. Although SEGAT's performance in segmenting “buildings,” “trees,” and “cars” is not the best, it excels in “impervious surfaces” and “low vegetation,” achieving F1 scores of 93.81% and 86.56%, respectively, outperforming other networks by at least 0.1% and 0.67%. In addition, the superpixel generation map visualization in Fig. 16 demonstrates that SEGAT effectively preserves areas of local semantic similarity and aligns accurately with category boundaries. The well-segmented superpixel generation maps further confirm SEGAT's effectiveness.

TABLE VIII. Comparison Results With SOTA Networks on the Vaihingen Dataset

Tables IX and X demonstrate that SEGAT achieves optimal semantic segmentation performance. According to Table IX, SEGAT excels in segmenting the “trees” and “cars” classes, with F1 scores of 90.14% and 97%, respectively. Similarly, Table X shows that SEGAT achieves the highest IoU for “moving car” (76.01%) and “static car” (62.92%). SEGAT exhibits significant advantages in classifying the small object “car,” with its accuracy in the “moving car” and “static car” categories surpassing other networks by at least 1.43% and 3.82%, respectively.

TABLE IX. Comparison Results With SOTA Networks on the Potsdam Dataset
TABLE X. Comparison Results With SOTA Networks on the UAVid Dataset

Overall, SEGAT demonstrates strong performance in both small- and large-scale class segmentation, effectively preserving the integrity and independence of object categories. This success is attributed to its superior superpixel segmentation and the effective interaction of features within the graph module. The qualitative validation of this performance is illustrated in Figs. 17–20.

Fig. 20. Visualization of the superpixel generation map on the UAVid dataset.

SECTION V.

Discussion

This article explores the use of GNNs in the semantic segmentation of HRS images, focusing on integrating superpixel generation with GNNs. To the best of our knowledge, this issue has not been previously addressed in HRS images. In our approach, superpixel segmentation is collaboratively trained with the entire model, rather than as a standalone preprocessing step. We mitigate the effects of ICS and ICV by combining two graph construction methods: one based on superpixel spatial adjacency and the other on feature similarity. Although our experiments show the superior performance of SEGAT, several aspects require further exploration.

1) Exploration of graph construction methods: While local graphs and global graphs effectively capture both local and global contextual information, combining these two graph construction methods into a unified graph could also be a promising approach. This method can enable a more comprehensive representation of topological relationships in a single graph. In future work, we will explore this approach and its impact on model performance.

2) Exploration of superpixel optimization methods: Our model employs a fixed number of superpixels at one level, which is inflexible for HRS images with varying object scales and complex scenes. To improve superpixel segmentation and enhance model performance, we can consider the following methods: 1) design multilevel superpixels, i.e., introduce multilevel and multiscale superpixel representations that leverage the advantages of both fine-grained and coarse-grained segmentation and are better suited for multiscale objects in HRS images; and 2) design an adaptive superpixel segmentation method that adjusts the number of superpixels based on image content. We will explore these methods in future work.

3) Exploration of graph model optimization methods: The introduction of SAM [87] in 2023 has established a new paradigm in semantic segmentation, drawing widespread attention for its powerful object segmentation capabilities. In the field of RS, researchers have utilized this approach for related studies. For instance, Yan et al. [88] introduced the RingMo-SAM model, which employs SAM for multimodal RS image segmentation, handling both optical and SAR images. Ma et al. [89] utilized objects and boundaries generated by SAM, designing suitable loss functions to achieve RS image semantic segmentation. We can leverage SAM's powerful object boundary segmentation capabilities to optimize incomplete boundaries of graph model segmentation. In summary, the integration of graph models and SAM offers immense potential for future exploration.

4) Exploration of cross-domain issues: Our method relies on supervised learning, which assumes that the training and testing data share the same closed-set label space and are identically distributed. Consequently, when we extend our approach to new datasets with significantly different distributions, SEGAT, which relies on a predefined target domain, requires additional training data to achieve better segmentation performance. Many researchers have studied this cross-domain issue, which is a specialized research direction that focuses on transferring knowledge from labeled source domains to unlabeled or unseen target domains. For example, Luo et al. [90] and Rombach et al. [91] utilized a diffusion model to extract valuable features from large amounts of unlabeled data. Chen et al. [92] employed contrastive learning to enhance domain adaptation capabilities. In the future, we will study the cross-domain problem by exploring domain adaptation techniques such as adversarial training, generative models, and contrastive learning to further enhance the adaptability of our method to new datasets.

SECTION VI.

Conclusion

This study proposes a novel SEGAT for semantic segmentation of HRS images. The SEGAT model, which innovatively integrates the SGM and the GSM into a unified pipeline, enables the refinement of superpixel boundaries and the optimization of semantic segmentation to be implemented simultaneously during network training, unlike other superpixel-based GNN methods, where superpixel segmentation is typically a preprocessing step. This approach not only reduces undersegmentation errors but also allows the graph structure to be continuously refined under supervised learning. In addition, we design a strategy employing two graph construction methods that fully consider both the spatial positional relationships and feature similarity of superpixels, reducing the interference of ICS and ICV, thereby enhancing the perception of both local and global contextual information. In the classification step, to retain boundary information, we first classify the superpixel features and then map them back to pixel features by using the pixel-superpixel association map, effectively minimizing missing boundaries and ensuring object integrity. Extensive experiments on three datasets demonstrate the robustness of SEGAT, highlighting its superior performance in HRS image semantic segmentation. However, the model still relies on supervised learning and requires high-quality training data. While the collection of RS images has become more accessible in the era of RS big data, creating accurate labels remains time-consuming and labor-intensive. Given the advantages of GATs in semisupervised or unsupervised learning, which can automatically aggregate information from neighboring nodes to learn latent data structures, we plan to explore ways to reduce dependency on labels in future work, aiming to further improve our methods through semisupervised or unsupervised approaches. In addition, as outlined in the discussion section, we will further explore the relevant issues in the future.
