Introduction
High-resolution remote sensing (HRS) image semantic segmentation involves assigning an object type to each image pixel, which is crucial for applications such as natural resource management, ecological evaluation, urban planning, and crop assessment. Despite extensive research, achieving accurate semantic segmentation of HRS images remains an ongoing scientific challenge.
Traditional methods [1], [2], [3], [4], [5] rely on manual feature design and struggle to represent high-level semantic information in HRS images. Deep learning-based approaches have effectively addressed this issue, introducing many promising methods. Most methods have focused on CNN-based and transformer-based research [6], [7], [8], while the exploration of graph neural networks (GNNs) for HRS image semantic segmentation remains relatively limited.
Among these studies, CNN-based methods typically rely on regular grids of images [see Fig. 1(b)] [9], [10], [11], [12], [13]. These methods take advantage of the local receptive field and translation invariance of convolutions to extract features. While they leverage local information, their ability to capture global context is limited. To address this limitation, transformer-based methods have emerged and gained widespread attention [14], [15], [16]. Researchers divide the image into patches, convert them into sequences as input [see Fig. 1(c)], and use the attention mechanism to perform pairwise interactions, effectively capturing the global context of the image.
Although CNN-based and transformer-based methods have made significant progress in semantic segmentation, they still have shortcomings. The shapes of most geographic objects in HRS images are irregular. Using pixel-based regular grids or patch-based sequences to process these shapes is inflexible and inadequate. This limits the ability to understand and represent the topological relationships between objects, leading to missing boundaries and a higher risk of misclassification. In contrast, graph-based representations can flexibly model topologies between objects without being constrained by their geometric shapes or positions. As shown in Fig. 1(d), the image is segmented into superpixels aligned with boundaries, and a graph is constructed based on these superpixels.
Therefore, some researchers have turned their attention to GNNs, which can adaptively learn kernel parameters from object distributions and exploit strong correlations between objects to model topologies on graphs. For instance, Mou et al. [17] and Qin et al. [18] proposed nonlocal graph convolutional networks (GCNs) and spectral–spatial GCNs, respectively, for hyperspectral image classification. However, representing each pixel as a graph node significantly increases computational costs. To avoid this issue, researchers increasingly use superpixels as graph nodes, which reduces computational complexity and helps suppress noise. For instance, several studies [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29] segment the data into superpixels using simple linear iterative clustering (SLIC) [30] and subsequently use the superpixels as primitives for graph operations. Similarly, Shi et al. [31] used the Felzenszwalb and Huttenlocher (FH) algorithm [32] to generate superpixels, which were then used to construct graphs as input to a deep GCN for small waterbody extraction. Ouyang et al. [33] first segmented images into superpixels and then used a deep semantic segmentation module to extract features as the semantic initialization of graph nodes; the constructed graphs were subsequently input into a GCN for classification. In addition, Wu et al. [34] selected simple noniterative clustering (SNIC) [35] to generate superpixels and then fed them into GNNs to perform brain tissue segmentation. Although the methods mentioned above have achieved good results, superpixel segmentation in them is merely a preprocessing step because the superpixel algorithms are nondifferentiable, meaning that superpixel generation and the GNNs are executed separately. Therefore, Zhang et al. [36] introduced the differentiable superpixel sampling networks (SSN) [37] and integrated them with a mixhop GCN for hyperspectral image classification. Meanwhile, Eliasof et al. [38] utilized an unsupervised CNN [39] for superpixel segmentation, which was jointly trained with GNNs for unsupervised natural image semantic segmentation. While these two methods achieve end-to-end integration, they target hyperspectral and natural images rather than the HRS images we focus on, and to the best of our knowledge, no prior work addresses this issue for HRS images. Therefore, achieving an end-to-end integration of superpixel generation and GNNs for HRS images remains a significant challenge.
In summary, superpixel-based GNN methods still face two major limitations. First, in HRS image semantic segmentation, researchers primarily use superpixel segmentation methods as preprocessing steps, resulting in fixed graphs as input to GNNs. This process introduces undersegmentation errors, which can degrade segmentation performance. Second, these methods depend solely on a single graph construction approach, making them susceptible to interclass similarity (ICS) or intraclass variability (ICV), ultimately leading to inaccurate segmentation.
In light of the issues mentioned above, we combine superpixel generation with graph attention networks (GATs) [40] within a unified pipeline. First, we employ a learnable neural network as the superpixel generation module (SGM), which is directly coupled with the subsequent graph segmentation module (GSM). Second, the GSM constructs graphs and updates both graph structures and superpixel features. This module comprises a local module (LM) and a global module (GM), which construct and update graphs based on superpixel spatial positions and feature similarity, respectively. Finally, a classification stage maps the learned superpixel features back to pixels to produce the segmentation result.
The main contributions are summarized as follows.
We propose an end-to-end graph attention network with superpixel embedding (SEGAT) framework to achieve HRS image semantic segmentation with refined boundaries. Dynamically refined superpixels, obtained through collaborative training, make the constructed graphs align with object boundaries, reducing undersegmentation errors and effectively improving segmentation performance.
We propose a strategy that utilizes two graph construction methods to obtain more accurate graph structures, reducing segmentation inaccuracies caused by the interference of ICS or ICV. This strategy comprehensively considers strong correlations between adjacent objects and the similarity of long-range superpixel features, constructing local and global graphs based on superpixel spatial positions and features, respectively.
We design a GSM composed of the LM and the GM. This module flexibly updates graph structures and learns local and global superpixel features. Through the graph block with multihead GATs, edge weights are learned adaptively while superpixel features are enhanced and updated.
We conduct extensive experiments on three publicly available HRS datasets, and our method outperforms competitive approaches.
Fig. 1. Image representation of different methods. (a) Image. (b) Pixel-based regular grid representation (CNNs). (c) Patch-based sequence representation (transformers). (d) Superpixel-based graph representation (GNNs). We map the image to superpixel features with homogeneous semantics that match the geometric shapes of objects. Based on these features, a graph can better represent the topological relationships of objects. Since CNN- and transformer-based methods cannot utilize graph data, using GNNs is an ideal approach.
Fig. 2. Whole network architecture. The SEGAT consists of three components: the SGM, the GSM, and the classification stage.
Fig. 9. Visualization of the feature maps of each component. 1) Image. 2) Ground truth. 3) SGM. 4) SGM+1 (LM). 5) SGM+1 (GM). 6) SGM+1 (LM)+1 (GM). 7) SGM+1 (LM)+2 (GM). 8) SGM+1 (LM)+3 (GM). 9) SGM+(no-arc) LM+(no-arc) GM.
Fig. 12. Impact of the number of nearest neighbors in the GM on the model's performance. The baseline performance is illustrated with dashed lines to show the variation of performance.
Related Work
A. Superpixel
Superpixels, introduced by Ren and Malik in 2003 [41], are formed by grouping perceptually similar image pixels, reducing the number of image primitives for subsequent processing [42].
Traditional superpixel algorithms typically oversegment images by analyzing low-level features such as color properties and spatial distances [42]. The Watershed algorithm [43] identifies and segments objects based on the intensity of image edges: areas of high gradient are treated as edges and areas of low gradient as the interior of the same object. SLIC [30] adopts a k-means clustering strategy in a joint color and spatial space, grouping pixels into compact superpixels of roughly uniform size.
Driven by advancements in deep learning, superpixel algorithms with deep neural networks (DNNs) have emerged in succession. SSN [37] is the first end-to-end trainable algorithm. Nevertheless, this algorithm is not a pure DNN: it only uses a CNN to extract features, which are input to an iterative, differentiable SLIC-style clustering module that produces the superpixels.
Compared with traditional superpixel algorithms, DNN-based methods can automatically learn higher-dimensional features and handle HRS images more effectively. Furthermore, unlike other DNN-based algorithms, SFCN [47] is a pure DNN algorithm that integrates feature extraction and superpixel segmentation in one step, making it faster and easier to integrate into downstream networks than SSN [37]. In addition, because SFCN is supervised by segmentation labels, it produces superpixel segmentation results that match object boundaries better than the unsupervised LSN-Net [48]. For these reasons, we adopt the idea of SFCN [47], which directly predicts the pixel-superpixel association map, and use a UNet [9] to optimize superpixel segmentation.
B. Semantic Segmentation With CNNs or Transformers
In recent years, extensive studies have been conducted using both CNN-based and transformer-based approaches to achieve semantic segmentation of remote sensing (RS) images. In CNN-based methods, researchers have enhanced feature map quality and segmentation performance by enlarging the receptive field, integrating multiscale features, and employing attention mechanisms. For instance, Nguyen et al. [49] improved the capture of scale variations in images by fine-tuning the atrous convolution rates within the atrous spatial pyramid pooling module of DeepLabV3+ and optimized segmentation performance by using a feature aggregation network that consolidates features at different scales. Ma et al. [50] designed a multiscale network that integrates the local class-aware module with the global class-aware module, achieving effective segmentation. Yang et al. [51] developed an attention-fused network that fuses multilevel features by using attention modules, addressing the issue of multipath and multilevel feature fusion. In addition, segmentation performance has been improved by modifying traditional single-branch encoder–decoder architectures [9], [52], [53]. For example, HBSeNet [54] utilizes a dual-path network that combines the spatial and context paths, yielding effective segmentation. Similarly, Li et al. [55] also designed a network with a bilateral architecture for RS image semantic segmentation.
In transformer-based methods, we can classify them into pure transformer methods and hybrid designs combining CNNs and transformers. Pure transformers refer to methods where both the encoder and decoder are transformers. For instance, Strudel et al. [15] proposed Segmenter, a convolution-free, fully transformer-based model for semantic segmentation. This model uses output embeddings from image patches to get class labels through a linear or mask transformer decoder. Cao et al. [56] developed a U-shaped network that integrates the Swin transformer with the UNet architecture. Compared to pure transformer methods, hybrid designs combining CNNs and transformers have become more prevalent in semantic segmentation. Zeng et al. [57] proposed MSGCNet, a hybrid architecture that leverages the strengths of both CNNs and transformers, using efficient cross-attention to facilitate interaction among multiscale features of the encoder. Li et al. [58] designed an attention-focused feature enhancement network, which integrates a ResNet50-based encoder with a parallel multistage feature enhancement group, a global multiscale attention mechanism in the decoder, and a feature-weighted fusion module, effectively addressing challenges like complex structures and occlusions. Chen et al. [59] developed a hybrid architecture that combines a ResNet50 encoder with a transformer-based decoder constructed by channel-spatial transformer block and global cross-fusion module for optimizing representation performance. With advances in technology, a model called Mamba [60], which can model long-range relationships through linear computations, has gained attention. To address the limitations of CNNs in modeling long-range dependencies and the high computational complexity of transformers, researchers have gradually applied Mamba to the semantic segmentation of RS images, such as RS3Mamba [61] and RSM-CD [62].
Despite significant progress made by the aforementioned semantic segmentation methods, they still have certain limitations. Many geographical objects in HRS images have irregular shapes, and using pixel-based grids (CNNs) or patch-based sequences (transformers) to represent these objects is both limited and insufficient. These limitations hinder the accurate representation of topological relationships between objects, increasing boundary errors and misclassification. In contrast, GNNs depend on graph-based representations, which offer greater flexibility in modeling the topologies of irregular objects. Therefore, we use GNNs to achieve HRS image semantic segmentation.
C. Semantic Segmentation With GNNs
GNNs learn from graph-structured data and are applied in transportation networks, biomedicine, recommender systems, social networks, computer vision, etc. The concept of GNNs was first proposed by Gori et al. [63] in 2005 and further elaborated by Scarselli et al. [64] and Gallicchio et al. [65]. Subsequently, with the advancement of GNNs, numerous studies have proposed various models, including GCNs [66] and GATs [40]. GCNs apply the convolution operator of CNNs to graphs, updating the features of each node by aggregating neighbor information with the same weight. Unlike GCNs, GATs use attention mechanisms to dynamically learn the importance of neighboring nodes by assigning weights based on their relevance to the current node. Therefore, GATs can more flexibly capture dependencies between nodes. Compared to CNN-based and transformer-based methods, GNNs can treat images as undirected graphs and achieve effective and flexible topological modeling of objects with arbitrary shapes. Consequently, some researchers have explored GNN-based approaches for RS image semantic segmentation.
In these approaches, researchers primarily use superpixels as a preprocessing step to reduce computational complexity. Wan et al. [26] and Liu et al. [24] used SLIC to preprocess the images into superpixels, which were then used as graph nodes in the GCN for hyperspectral image classification. Similar methods have been proposed by works [22], [23], [28], [29], [67]. Wang et al. [27] proposed a method using multiscale superpixels segmented with SLIC, which were then incorporated into a weighted GCN for land cover classification of polarimetric synthetic aperture radar images. Diao et al. [21] also employed SLIC for superpixel segmentation of HRS images, using these superpixels as input to an attention GNN. However, the accuracy of these methods heavily depends on the quality of superpixel segmentation. If undersegmentation occurs, correcting this issue within the GNN framework is difficult.
Furthermore, these approaches consider only a single graph construction method. For instance, Diao et al. [21] and Eliasof et al. [38] each built the superpixel graph with a single construction criterion. Relying on one construction method alone makes the graph structure susceptible to ICS or ICV, which in turn degrades segmentation accuracy.
Therefore, based on the above-mentioned issues, we propose an HRS image semantic segmentation method that implements end-to-end superpixel-based graph segmentation by using two graph construction approaches. This method dynamically optimizes superpixels to provide more accurate object boundaries while providing more robust topological graph structure and contextual information.
Methodology
A. Overall Architecture
The SEGAT architecture, illustrated in Fig. 2, comprises three main components: the superpixel generation module (SGM), the graph segmentation module (GSM), and the classification stage. The process begins with image input, which the SGM processes to generate the pixel-superpixel association map and the initial superpixel features. The GSM then constructs local and global graphs on these superpixels and updates both the graph structures and the superpixel features. Finally, the classification stage classifies the superpixel features and maps them back to pixels to obtain the segmentation result.
B. Superpixel Generation Module
The primary purpose of the SGM is to obtain the pixel-superpixel association map and the superpixel features.
Specifically, we use a UNet with a ResNet101 backbone for superpixel segmentation. HRS images are fed into the network, which predicts for each pixel its association with the surrounding candidate superpixels, yielding the pixel-superpixel association map.
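To make the role of the SGM concrete, the following minimal PyTorch sketch shows a dense predictor that outputs a 9-channel soft association map, following the SFCN formulation in which each pixel is associated with the 9 surrounding grid cells. The tiny convolutional stack and the class name `SuperpixelGenerationModule` are illustrative stand-ins for the paper's UNet with a ResNet101 backbone, not the authors' code.

```python
import torch
import torch.nn as nn

class SuperpixelGenerationModule(nn.Module):
    """Illustrative SGM: predicts, for every pixel, a soft association over
    the 9 surrounding superpixel grid cells (SFCN formulation). A tiny
    convolutional stack stands in for the paper's UNet with a ResNet101
    backbone."""

    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.assoc_head = nn.Conv2d(hidden, 9, 1)   # one score per candidate grid cell

    def forward(self, x):
        feats = self.backbone(x)
        assoc = torch.softmax(self.assoc_head(feats), dim=1)   # (B, 9, H, W)
        return feats, assoc

sgm = SuperpixelGenerationModule()
feats, assoc = sgm(torch.randn(1, 3, 512, 512))
print(feats.shape, assoc.shape)   # (1, 64, 512, 512), (1, 9, 512, 512)
```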
In the process of calculating superpixel features, the feature h(s) of each superpixel is obtained as the association-weighted average of the features g(p) of its associated pixels, as shown in (1)
\begin{equation*} h(\mathbf {s}) = \frac{\sum _{\mathbf {p} : s \in \mathcal {N}} g(\mathbf {p}) \cdot V_{s}(\mathbf {p})}{\sum _{\mathbf {p} : s \in \mathcal {N}} V_{s}(\mathbf {p})}. \tag{1} \end{equation*}
Based on (1), the superpixel features are aggregated from the pixel features through the association map. The number of superpixels T_s is determined by the image height H, width W, and the grid interval M, as shown in (2)
\begin{equation*} T_{s} = \frac{H \times W}{M^{2}}. \tag{2} \end{equation*}
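As an illustration of (1) and (2), the sketch below pools pixel features into superpixel features and computes the number of superpixels from the grid interval. For brevity it uses a hard pixel-to-superpixel assignment rather than the soft association map, and the function name `superpixel_features` is hypothetical.

```python
import torch

def superpixel_features(pixel_feats, assignment):
    """Association-weighted pooling of pixel features into superpixel
    features, following (1). A hard assignment (one superpixel index per
    pixel) replaces the soft association map V_s(p) here.

    pixel_feats: (C, H, W) pixel features g(p)
    assignment:  (H, W) long tensor of superpixel indices in [0, T_s)
    returns:     (T_s, C) superpixel features h(s)
    """
    C = pixel_feats.size(0)
    T_s = int(assignment.max()) + 1
    flat_feats = pixel_feats.reshape(C, -1).t()            # (H*W, C)
    flat_idx = assignment.reshape(-1)                       # (H*W,)
    sums = torch.zeros(T_s, C).index_add_(0, flat_idx, flat_feats)
    counts = torch.zeros(T_s).index_add_(0, flat_idx,
                                         torch.ones_like(flat_idx, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)

# Number of superpixels from (2): T_s = (H * W) / M^2 for a grid interval M.
H, W, M = 512, 512, 16
T_s = (H * W) // (M ** 2)   # 1024, the setting that performs best in Table IV
```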
C. Graph Segmentation Module
Using superpixels as nodes in the GSM can significantly reduce the computational complexity. GATs rely on internode features and can assign different weights to neighbors of each node, allowing for more flexible and adaptive information aggregation. Therefore, we use the GATs to aggregate superpixel features. Our GSM is detailed in Fig. 5.
The whole GSM comprises two modules, the LM and the GM, each built as a stack of identical layers. The LM constructs a local graph from the spatial adjacency of superpixels, while the GM constructs a global graph from the similarity of superpixel features.
1) Local Graph Construction
A superpixel is a collection of pixels characterized by similar colors, textures, and spatial proximity. As shown in Fig. 6, superpixels close to each other are likely to belong to the same category. To better preserve local location information, we construct a local graph structure based on superpixels' spatial positions. The binary adjacency matrix A of the local graph is defined as
\begin{equation*} A_{ij} = \left\lbrace \begin{array}{ll}1, & \text{if } i \text{ and } j \text{ are adjacent} \\ 0, & \text{otherwise.} \end{array}\right. \tag{3} \end{equation*}
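A possible implementation of the local adjacency in (3) is sketched below: two superpixels are treated as adjacent if they share a boundary in the superpixel label map. The exact neighborhood definition used in the paper may differ, and `local_adjacency` is an illustrative helper, not the authors' code.

```python
import torch

def local_adjacency(sp_label):
    """Binary adjacency A of (3): two superpixels are connected if they
    share a boundary in the superpixel label map.

    sp_label: (H, W) long tensor of superpixel indices
    returns:  (T_s, T_s) float tensor
    """
    T_s = int(sp_label.max()) + 1
    A = torch.zeros(T_s, T_s)
    # Horizontally and vertically neighboring pixel pairs with different labels
    for a, b in [(sp_label[:, :-1], sp_label[:, 1:]),
                 (sp_label[:-1, :], sp_label[1:, :])]:
        mask = a != b
        A[a[mask], b[mask]] = 1.0
        A[b[mask], a[mask]] = 1.0
    return A
```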
2) Global Graph Construction
While spatially adjacent superpixels may belong to the same category, determining their category solely based on local positional relationships is insufficient. As shown in Fig. 7, “cars” in the same category may not be adjacent. Consequently, it is necessary to disregard positional relationships and construct the global graph structure from the perspective of superpixel feature similarity.
We use the Euclidean distance between superpixel features to measure their similarity, as defined in (4)
\begin{equation*} F_{\text{sim}} = \text{distance}(h(s_{i}), h(s_{j})) = \Vert h(s_{i}) - h(s_{j}) \Vert _{2}. \tag{4} \end{equation*}
Here, h(s_i) and h(s_j) denote the features of the i-th and j-th superpixels, and F_sim is the feature distance between them.
We build the binary adjacency matrix A' of the global graph by connecting each superpixel to its nearest neighbors in feature space, as defined in (5)
\begin{equation*} A^{\prime }_{ij} = \left\lbrace \begin{array}{ll}1, & \text{select only the 9 superpixels closest to} \\ & \text{the } i\text{th superpixel in feature distance} \\ 0, & \text{otherwise.} \end{array}\right. \tag{5} \end{equation*}
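The global graph of (4) and (5) can be built as in the following sketch, which connects each superpixel to its 9 nearest neighbors in feature space using pairwise Euclidean distances. Excluding self-connections is an assumption, and `global_adjacency` is a hypothetical helper.

```python
import torch

def global_adjacency(sp_feats, k=9):
    """Binary adjacency A' of (5): each superpixel is connected to the k
    superpixels closest to it in feature space (Euclidean distance, eq. (4)).

    sp_feats: (T_s, C) superpixel features h(s)
    returns:  (T_s, T_s) float tensor
    """
    dist = torch.cdist(sp_feats, sp_feats)            # pairwise L2 distances F_sim
    dist.fill_diagonal_(float("inf"))                 # assume no self-connections
    knn_idx = dist.topk(k, largest=False).indices     # k nearest superpixels per row
    A = torch.zeros(dist.shape)
    A.scatter_(1, knn_idx, 1.0)
    return A
```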
3) Graph Block
A superpixel generation map is viewed as a graph, in which the superpixels serve as nodes and the adjacency matrix defines the edges.
The input features S_i and S_j are first transformed by a shared weight matrix W, and the attention coefficient e_ij between node i and node j is computed as shown in (6)
\begin{equation*} e_{ij} = \text{LeakyReLU}\left(\mathbf {a}^{T} \left[ \mathbf {W} \mathbf {S}_{i} \parallel \mathbf {W} \mathbf {S}_{j} \right] \right). \tag{6} \end{equation*}
Here, a is a learnable weight vector, W is the shared linear transformation matrix, S_i and S_j are the features of nodes i and j, and the parallel bars denote concatenation.
When calculating attention scores, structural information is first ignored and scores are computed between all node pairs. Since not all superpixel nodes are interconnected, the adjacency matrix representing the graph structure is then required: through masked attention, the structural information is injected into the attention weight matrix, and the scores are normalized with the softmax function, as shown in (7)
\begin{equation*} \alpha _{ij} = \text{softmax}_{j}(e_{ij}) = \frac{\exp (e_{ij})}{\sum _{m \in \mathcal {N}_{i}} \exp (e_{im})}. \tag{7} \end{equation*}
In the graph, attention scores are computed and normalized only over the neighborhood of node i, so each node aggregates information solely from its connected neighbors.
The final attention weight matrix is used to aggregate the transformed features of neighboring nodes, yielding the updated feature of node i, as shown in (8)
\begin{equation*} S_{i}^{\prime } = \sigma \left(\sum _{j \in \mathcal {N}_{i}} \alpha _{ij} \mathbf {W} \mathbf {S}_{j} \right). \tag{8} \end{equation*}
Then, the attention weight matrix is computed multiple times using the multihead attention mechanism. The features from N independent attention heads are concatenated, as shown in (9)
\begin{equation*} S_{i}^{\prime } = \bigg \Vert _{n=1}^{N} \sigma \left(\sum _{j \in \mathcal {N}_{i}} \alpha _{ij}^{n} \mathbf {W}^{n} \mathbf {S}_{j} \right) \tag{9} \end{equation*}
Finally, the attention weight matrix is recomputed from the concatenated features and (8) is applied again. To achieve better results, we integrate the GAT within a transformer architecture to further enhance and update the superpixel features.
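For reference, a minimal PyTorch sketch of the attention computation in (6)–(9) is given below; the activation σ is taken here to be ELU, the default of three heads follows the ablation in Table V, and the class names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One attention head following (6)-(8): pairwise scores from the
    concatenated transformed features, masked by the adjacency matrix,
    softmax-normalized over each node's neighbors, then used to aggregate
    neighbor features (sigma taken here as ELU)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, S, A):
        # S: (T_s, in_dim) superpixel features; A: (T_s, T_s) adjacency matrix
        h = self.W(S)                                               # W S
        T_s = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(-1, T_s, -1),
                           h.unsqueeze(0).expand(T_s, -1, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))                 # eq. (6)
        e = e.masked_fill(A == 0, float("-inf"))                    # masked attention
        alpha = torch.softmax(e, dim=-1)                            # eq. (7), assumes every node has a neighbor
        return F.elu(alpha @ h)                                     # eq. (8)

class MultiHeadGAT(nn.Module):
    """Concatenation of N independent heads, as in (9); three heads match
    the best setting in Table V."""

    def __init__(self, in_dim, out_dim, heads=3):
        super().__init__()
        self.heads = nn.ModuleList([GraphAttentionLayer(in_dim, out_dim) for _ in range(heads)])

    def forward(self, S, A):
        return torch.cat([head(S, A) for head in self.heads], dim=-1)
```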
D. Classification
The classification process is shown in Fig. 8. Since superpixel features preserve well-delineated boundaries, we first perform superpixel-wise classification to retain this fine boundary information, followed by pixel-wise classification.
Initially, based on the superpixel features output by the GSM, we apply a linear layer to classify these features. Then, using (10), we map the classified superpixel features back to the pixel level, thereby achieving semantic segmentation
\begin{equation*} \hat{g(\mathbf {p})} = \sum _{s \in \mathcal {N}} h(\mathbf {s}) \cdot V_{s}(\mathbf {p}). \tag{10} \end{equation*}
Here, the left-hand side of (10) is the recovered pixel-level prediction, h(s) is the classified feature of superpixel s, and V_s(p) is the pixel-superpixel association map.
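A minimal sketch of the mapping in (10) is shown below; with a hard assignment the mapping reduces to a gather, whereas the soft association map would mix the candidate superpixels' scores with weights V_s(p). The helper name `superpixels_to_pixels` is hypothetical.

```python
import torch

def superpixels_to_pixels(sp_logits, assignment):
    """Maps superpixel-wise class scores back to the pixel grid, following (10).

    sp_logits:  (T_s, K) classified superpixel features
    assignment: (H, W) long tensor of superpixel indices
    returns:    (K, H, W) pixel-wise logits
    """
    H, W = assignment.shape
    pixel_logits = sp_logits[assignment.reshape(-1)]         # (H*W, K)
    return pixel_logits.t().reshape(sp_logits.size(1), H, W)
```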
E. Loss Function
To obtain superpixels that align with semantic boundaries and achieve more accurate semantic segmentation during network optimization, we incorporated four individual loss functions into the overall loss to optimize the parameters of the entire network.
The reconstruction loss and the compactness loss are calculated based on the predicted pixel-superpixel association map. Using this map, pixel-level properties are first aggregated into superpixel-level values, as shown in (11)
\begin{equation*} L(\mathbf {s}) = \frac{\sum _{\mathbf {p} : s \in \mathcal {N}} f(\mathbf {{l}}) \cdot V_{s}(\mathbf {p})}{\sum _{\mathbf {p} : s \in \mathcal {N}} V_{s}(\mathbf {p})}. \tag{11} \end{equation*}
Here, f(l) denotes a pixel-level property, V_s(p) is the association map, and L(s) is the corresponding superpixel-level aggregate. The superpixel-level values are then mapped back to the pixels to obtain the reconstructed property, as shown in (12)
\begin{equation*} f^{\prime }(\mathbf {{l}}) = \sum _{s \in \mathcal {N}} L(\mathbf {s}) \cdot V_{s}(\mathbf {p}). \tag{12} \end{equation*}
Therefore, the reconstruction loss and the compactness loss are calculated as follows:
\begin{align*} \mathcal {L}_{\text{recon}} =& \text{CE}(g(\mathbf {p}), g^{\prime }(\mathbf {p})) \tag{13} \\ \quad \mathcal {L}_{\text{compact}} =& \sum _{\mathbf {p}} \Vert f(\mathbf {{l}}) - f^{\prime }(\mathbf {{l}}) \Vert _{2}. \tag{14} \end{align*}
Here, CE denotes the cross-entropy function, g'(p) and f'(l) are the reconstructed pixel-level properties obtained through (11) and (12), and g(p) and f(l) are the corresponding original properties.
In addition, the prediction results are refined using both the cross-entropy loss and the Dice loss, as defined in (15) and (16)
\begin{align*} \mathcal {L}_{1} =& \text{CE}(\text{output}, \text{label}) \tag{15} \\ \mathcal {L}_{2} =& \text{Dice}(\text{output}, \text{label}). \tag{16} \end{align*}
The overall loss function used to train the entire network is given in (17)
\begin{equation*} \mathcal {L}_{\text{total}} = \mathcal {L}_{\text{recon}} + \lambda \mathcal {L}_{\text{compact}} + \mathcal {L}_{1} + \mathcal {L}_{2}. \tag{17} \end{equation*}
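Assembling the four terms of (17) might look as follows; the Dice formulation and the averaging in the compactness term are common choices rather than the paper's exact definitions, λ is left as an argument because its value is studied separately in the ablation, and all function names are illustrative.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss over K classes (one common formulation; the paper's
    exact variant is not spelled out)."""
    probs = torch.softmax(logits, dim=1)                            # (B, K, H, W)
    onehot = F.one_hot(target, probs.size(1)).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def total_loss(output, label, recon_pred, recon_label, pos, pos_recon, lam):
    """Overall objective of (17): reconstruction + lam * compactness + CE + Dice.
    recon_pred/recon_label stand for the reconstructed and original pixel
    properties of (13); pos/pos_recon for the positional terms of (14)."""
    l_recon = F.cross_entropy(recon_pred, recon_label)              # eq. (13)
    l_compact = (pos - pos_recon).norm(dim=1).mean()                # eq. (14), averaged
    l_ce = F.cross_entropy(output, label)                           # eq. (15)
    l_dice = dice_loss(output, label)                               # eq. (16)
    return l_recon + lam * l_compact + l_ce + l_dice                # eq. (17)
```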
Experiments
A. Datasets
The ISPRS Vaihingen dataset has 33 true orthophoto images with an average size of 2494 × 2064 pixels and a spatial resolution of 9 cm. This dataset contains 6 land-cover categories: impervious surfaces (imp.surf), buildings, low vegetation (low veg), trees, cars, and clutter/background. We conduct all experiments on the ground truth with eroded boundaries. In our experiments, we utilize 17 images for testing and 16 images for training.
The ISPRS Potsdam dataset has 38 true orthophoto images with an image size of 6000 × 6000 pixels and a spatial resolution of 5 cm. This dataset contains 6 land-cover categories: impervious surfaces (imp.surf), buildings, low vegetation (low veg), trees, cars, and clutter/background. We conduct all experiments on the ground truth with eroded boundaries. Our experiments use 14 images for testing and 24 images for training.
The UAVid dataset mainly includes 8 categories: clutter, building, road, tree, low vegetation (low veg), moving car, static car, and human. The image resolution is either 4096 × 2160 or 3840 × 2160. The dataset comprises 420 images, including 200 for training and 70 for validation; the official website provides the remaining 150 as the test set.
The images in all the aforementioned datasets are cropped into patches of 512 × 512 pixels.
B. Implementation Details
Experiments are conducted using PyTorch on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. We utilize the adaptive moment estimation (Adam) optimizer with a learning rate of 3e
C. Evaluation Metrics
The overall accuracy (OA), mean intersection over union (mIoU), and mean F1 score (mF1) are used to evaluate the model's performance. The mF1 is the average of the per-category F1 scores, which are computed as follows:
\begin{align*} \text{OA} =& \frac{\sum _{k=1}^{K} \text{TP}_{k}}{\sum _{k=1}^{K} (\text{TP}_{k} + \text{FP}_{k} + \text{TN}_{k} + \text{FN}_{k})} \tag{18}\\ \text{mIoU} =& \frac{1}{K} \sum _{k=1}^{K} \frac{\text{TP}_{k}}{\text{TP}_{k} + \text{FP}_{k} + \text{FN}_{k}} \tag{19}\\ F1_{k} =& 2 \times \frac{\text{precision}_{k} \times \text{recall}_{k}}{\text{precision}_{k} + \text{recall}_{k}} \tag{20} \end{align*}
\begin{align*} \text{precision}_{k} = & \frac{\text{TP}_{k}}{\text{TP}_{k} + \text{FP}_{k}} \tag{21} \\ \text{recall}_{k} =& \frac{\text{TP}_{k}}{\text{TP}_{k} + \text{FN}_{k}}. \tag{22} \end{align*}
Here, K is the number of categories, and TP_k, FP_k, TN_k, and FN_k denote the numbers of true positive, false positive, true negative, and false negative pixels for category k, respectively.
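A small NumPy sketch of (18)–(22), computed from a confusion matrix, is given below; the OA line uses the standard total-pixel denominator, and `metrics_from_confusion` is a hypothetical helper.

```python
import numpy as np

def metrics_from_confusion(conf):
    """OA, mIoU, and mF1 from a K x K confusion matrix whose entry (i, j)
    counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as k but actually another class
    fn = conf.sum(axis=1) - tp          # actually k but predicted as another class
    oa = tp.sum() / conf.sum()          # eq. (18), total-pixel denominator
    iou = tp / (tp + fp + fn + 1e-12)   # eq. (19), per class
    precision = tp / (tp + fp + 1e-12)  # eq. (21)
    recall = tp / (tp + fn + 1e-12)     # eq. (22)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)  # eq. (20)
    return oa, iou.mean(), f1.mean()

# Toy 3-class example
conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [2, 5, 60]])
print(metrics_from_confusion(conf))
```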
D. Ablation Study
All ablations are performed on the Vaihingen dataset.
1) Ablation Study for the SGM
The primary purpose of the SGM is to obtain the pixel-superpixel association map and superpixel features. We conduct experiments with different configurations in the SGM, as shown in Table I. The results reveal that using a simple convolutional layer (“1 Conv”) significantly reduces performance, with OA, mIoU, and mF1 dropping to 75.39%, 53.67%, and 68.28%, respectively. Similarly, using an FCN leads to only 90.03% OA, 81.24% mIoU, and 89.53% mF1, indicating the limitations of simpler architectures. In contrast, using a UNet improves performance and yields superior results, emphasizing the necessity of employing the UNet for superpixel generation. Moreover, we investigate the impact of different UNet backbones on the model's performance. The results indicate that shallower backbones, such as ResNet18 and ResNet34, yield lower OA, mIoU, and mF1 values. In contrast, deeper backbones, such as ResNet101, significantly improve superpixel quality and overall segmentation performance, achieving OA of 92.03%, mIoU of 84.56%, and mF1 of 91.53%, yielding the optimal result. This demonstrates the advantage of deeper architectures in capturing features. These findings underscore the critical role of a UNet with a ResNet101 backbone in enhancing our model's performance.
2) Ablation Study for SEGAT's Each Component
Table II outlines the combinations and abbreviations of SEGAT's components. The SGM serves as the backbone and baseline for incrementally designing our method in conjunction with the GSM. Ablation 1 includes only the SGM, which achieves semantic segmentation by mapping superpixels back to pixels. Ablation 2 adds one LM to the SGM base to validate the effectiveness of a single LM within the GSM. Similarly, Ablation 3 integrates one GM into the SGM base to assess the impact of a single GM. Due to the relatively slow computation of local graph construction, we limit the GSM to a single LM, as incorporating more would reduce the network's efficiency. Instead, we focus on increasing the number of GMs. Ablation 4 combines the SGM with one LM and one GM, Ablation 5 with one LM and two GMs, and Ablation 6 with one LM and three GMs. Finally, Ablation 7 employs an LM and a GM within the GSM, excluding the transformer architecture, to evaluate the effectiveness of integrating GAT into the transformer framework.
Table III presents the results for the components of SEGAT. First, Ablations 1–4 demonstrate that incorporating either LM or GM significantly enhances image semantic segmentation performance. Moreover, combining LM and GM further improves accuracy, indicating that employing two graph construction methods is superior to using only one. Second, a comparison of Ablations 4–6 shows a steady increase in accuracy as the number of GMs in the GSM increases, suggesting that a dynamically updated graph structure boosts the model's segmentation performance. However, due to computational limitations, the combination “SGM+1(LM)+4(GM)” is not feasible. Third, the comparison between Ablations 7 and 4 reveals that excluding the transformer architecture when using GAT leads to a significant reduction in precision, underscoring the critical role of the transformer architecture in achieving high performance. In summary, SEGAT (Ablation 6) achieves the highest accuracy, improving OA, mIoU, and mF1 by 1.19%, 1.74%, and 1.06%, respectively, compared to Ablation 1. The feature maps of each component are visualized in Fig. 9.
3) Ablation Study for the Number of Superpixels
Table IV and Fig. 10 reveal that as the number of superpixels increases, performance peaks at 1024, then declines. Considering the impact of graph nodes' numbers on computational efficiency, it is inadvisable to use too many superpixels. This is qualitatively validated in Fig. 11. Specifically, HRS images contain targets of varying scales, making the selection of the appropriate number of superpixels crucial for segmentation performance. If the number of superpixels is too small, segmentation boundaries fail to accurately capture finer details, particularly for small-scale objects like cars. As shown in Fig. 11, with 1024 superpixels, the segmentation aligns well with object boundaries, effectively considering objects of different scales. The edges of two buildings within the green box and the gap between them are also correctly segmented. Conversely, an excessive number of superpixels results in overly fine-grained segmentation. While this may capture finer details, it causes fragmented representations, introduces noise, reduces segmentation efficiency, and increases computational costs.
4) Ablation Study for the Number of the Nearest Neighbors in the GM
To evaluate the influence of the number of nearest neighbors in the GM on the overall performance, we conduct ablation studies by varying the number of nearest neighbors from 9 to 100. As shown in Fig. 12, the performance metrics, including OA, mIoU, and mF1, remain relatively stable across different settings. Specifically, the OA fluctuates slightly around 91.90%, while the mIoU and mF1 hover around 84.53% and 91.50%, respectively. These results indicate that the model's performance is not highly sensitive to the choice of the nearest neighbor count within this range. To maintain a balanced graph structure between the GM and LM, we chose 9-nearest neighbors, ensuring that the number of edges in both the local and global graphs remains approximately the same.
5) Ablation Study for the Number of GATs' Heads
Implementing the attention mechanism in GATs multiple times corresponds to multihead attention, where the number of heads represents the number of attention weights between a node and its neighboring nodes. According to Table V, the model's performance improves as the number of heads increases, peaking at three heads. However, due to limited computational resources, the model cannot support four heads, as it exceeds GPU memory capacity. While additional heads can capture diverse features, enriching information and enhancing the model's expressiveness, they also decrease computational efficiency. Considering the negative impact of excessive heads on computational efficiency, we recommend limiting the number of heads to ensure a balance between performance and resource utilization.
6) Ablation Study for Parameter λ of the Loss Function
To evaluate the impact of the parameter λ in (17) on the model's performance, we conduct a sensitivity analysis with different values of λ.
7) Computational Complexity Comparison With Other Models
Table VI compares different models' computational requirements in terms of parameter complexity and floating point operations (FLOPs), analyzing the tradeoffs with mIoU performance. The table shows that our SEGAT ranks fourth in computational demand, which is not the highest, while achieving the best segmentation performance.
E. Comparing With State-of-the-Art (SOTA) Methods
SEGAT results are compared with SOTA methods on Vaihingen, Potsdam, and UAVid datasets. Table VII lists all comparison methods.
In Table VIII, SEGAT demonstrates superior performance, achieving an OA of 92.03%, a mIoU of 84.56%, and an mF1 of 91.53%. As illustrated in columns 6–8 of Fig. 15, the first row indicates that LoG-CAN merges trees that are spaced apart, whereas SEGAT effectively preserves their separation, closely matching the ground truth. In the second row, LoG-CAN fails to accurately capture the local details of “buildings,” resulting in local pixels belonging to “building” being misclassified as “impervious surfaces.” In contrast, SEGAT better maintains the integrity of “buildings” and preserves semantic object boundaries, more accurately reflecting ground conditions. This highlights SEGAT's superior ability to handle interactions between local and global information. Although SEGAT's performance in segmenting “buildings,” “trees,” and “cars” is not the best, it excels in “impervious surfaces” and “low vegetation,” achieving F1 scores of 93.81% and 86.56%, respectively, outperforming other networks by at least 0.1% and 0.67%. In addition, the superpixel generation map visualization in Fig. 16 demonstrates that SEGAT effectively preserves areas of local semantic similarity and aligns accurately with category boundaries. The well-segmented superpixel generation maps further confirm SEGAT's effectiveness.
Tables IX and X demonstrate that SEGAT achieves optimal semantic segmentation performance. According to Table IX, SEGAT excels in segmenting the “trees” and “cars” classes, with F1 scores of 90.14% and 97%, respectively. Similarly, Table X shows that SEGAT achieves the highest IoU for “moving car” (76.01%) and “static car” (62.92%). SEGAT exhibits significant advantages in classifying the small object “car,” with its accuracy in the “moving car” and “static car” categories surpassing other networks by at least 1.43% and 3.82%, respectively.
Overall, SEGAT demonstrates strong performance in both small- and large-scale class segmentation, effectively preserving the integrity and independence of object categories. This success is attributed to its superior superpixel segmentation and the effective interaction of features within the graph module. The qualitative validation of this performance is illustrated in Figs. 17–20.
Discussion
This article explores the use of GNNs in the semantic segmentation of HRS images, focusing on integrating superpixel generation with GNNs. To the best of our knowledge, this issue has not been previously addressed in HRS images. In our approach, superpixel segmentation is collaboratively trained with the entire model, rather than as a standalone preprocessing step. We mitigate the effects of ICS and ICV by combining two graph construction methods: one based on superpixel spatial adjacency and the other on feature similarity. Although our experiments show the superior performance of SEGAT, several aspects require further exploration.
1) Exploration of graph construction methods: While local graphs and global graphs effectively capture both local and global contextual information, combining these two graph construction methods into a unified graph could also be a promising approach. This method can enable a more comprehensive representation of topological relationships in a single graph. In future work, we will explore this approach and its impact on model performance.
2) Exploration of superpixel optimization methods: Our model employs a fixed number of superpixels at one level, which is inflexible for HRS images with varying object scales and complex scenes. To improve superpixel segmentation and enhance model performance, we can consider the following methods to address these issues: 1) Design multilevel superpixels: introducing multilevel and multiscale superpixel representations leverages the advantages of both fine-grained and coarse-grained segmentation, making the method better suited to multiscale objects in HRS images. 2) Design an adaptive superpixel segmentation method that adjusts the number of superpixels based on image content. We will explore these methods in future work.
3) Exploration of graph model optimization methods: The introduction of SAM [87] in 2023 has established a new paradigm in semantic segmentation, drawing widespread attention for its powerful object segmentation capabilities. In the field of RS, researchers have utilized this approach for related studies. For instance, Yan et al. [88] introduced the RingMo-SAM model, which employs SAM for multimodal RS image segmentation, handling both optical and SAR images. Ma et al. [89] utilized objects and boundaries generated by SAM, designing suitable loss functions to achieve RS image semantic segmentation. We can leverage SAM's powerful object boundary segmentation capabilities to optimize incomplete boundaries of graph model segmentation. In summary, the integration of graph models and SAM offers immense potential for future exploration.
4) Exploration of cross-domain issues: Our method relies on supervised learning, which assumes that the training and testing data share the same closed-set label space and are identically distributed. Consequently, when we extend our approach to new datasets with significantly different distributions, SEGAT, which relies on a predefined target domain, requires additional training data to achieve better segmentation performance. Many researchers have studied this cross-domain issue, which is a specialized research direction that focuses on transferring knowledge from labeled source domains to unlabeled or unseen target domains. For example, Luo et al. [90] and Rombach et al. [91] utilized a diffusion model to extract valuable features from large amounts of unlabeled data. Chen et al. [92] employed contrastive learning to enhance domain adaptation capabilities. In the future, we will study the cross-domain problem by exploring domain adaptation techniques such as adversarial training, generative models, and contrastive learning to further enhance the adaptability of our method to new datasets.
Conclusion
This study proposes a novel SEGAT for semantic segmentation of HRS images. The SEGAT model innovatively integrates the SGM and the GSM into a unified pipeline, enabling the refinement of superpixel boundaries and the optimization of semantic segmentation to be carried out simultaneously during network training. Unlike other superpixel-based GNN methods, where superpixel segmentation is typically a preprocessing step, this approach not only reduces undersegmentation errors but also allows the graph structure to be continuously refined under supervised learning. In addition, we design a strategy employing two graph construction methods that fully consider both the spatial positional relationships and the feature similarity of superpixels, reducing the interference of ICS and ICV and thereby enhancing the perception of both local and global contextual information. In the classification step, to retain boundary information, we first classify the superpixel features and then map them back to pixel features by using the pixel-superpixel association map, effectively minimizing missing boundaries and ensuring object integrity. Extensive experiments on three datasets demonstrate the robustness of SEGAT, highlighting its superior performance in HRS image semantic segmentation. However, the model still relies on supervised learning and requires high-quality training data. While the collection of RS images has become more accessible in the era of RS big data, creating accurate labels remains time-consuming and labor-intensive. Given the advantages of GATs in semisupervised or unsupervised learning, which can automatically aggregate information from neighboring nodes to learn latent data structures, we plan to explore ways to reduce dependency on labels in future work, aiming to further improve our method through semisupervised or unsupervised approaches. In addition, as outlined in the discussion section, we will further explore the relevant issues in the future.