Introduction
Remote sensing semantic segmentation enables precise identification and monitoring of land surface cover types through pixel-level classification of remote sensing images, thereby providing information support for research in remote sensing-related fields [1]. With the rapid development of remote sensing image acquisition technologies, such as satellite remote sensing and aerial remote sensing [2], remote sensing semantic segmentation plays a crucial role in practical applications, such as land cover mapping [3], urban change detection [4], [5], environmental protection [6], and precision agriculture [7].
Traditional remote sensing image segmentation methods mainly rely on pixel-level features for supervised classification, such as support vector machines [8] and random forests [9]. However, these traditional methods are limited in their feature extraction capabilities, especially when modeling complex land cover and scenes in remote sensing images, and they often underutilize spatial information [10]. With the rise of deep learning, the introduction of convolutional neural networks (CNNs) and Transformers has overcome the limitations of traditional methods in feature representation [11].
However, deep learning-based semantic segmentation of remote sensing images also faces the following challenges. First, the distribution of land cover in remote sensing images is highly complex, and the same land cover type can appear differently in different scenes [12], leading to high interclass and intraclass variability, which makes accurate segmentation difficult [13], [14]. Second, segmentation targets are highly diverse, and remote sensing images contain richer information and details, requiring the extraction of more diverse detail and semantic features for precise segmentation [15]. Lastly, hyperspectral remote sensing images contain rich spectral information, which is challenging for common semantic segmentation methods to leverage. For example, backbone networks that use depthwise separable convolutions [16], [17] to reduce parameter and computational complexity lack the ability to model the spectral dimension effectively.
In the field of CNNs, addressing the issue of complex and inconsistent scale distribution of objects often involves increasing the receptive field or using multiscale modules.
A large receptive field allows the network to capture broader spatial information from the input imagery, thereby enhancing its adaptability to changes in object scale. Zhang et al. [18] introduced dilated convolutions with different strides in the backbone network to enlarge the receptive field, mitigating intraclass heterogeneity. Chen et al. [19] used adaptive effective receptive field convolution to control the sampling positions of convolutions, automatically adjusting the receptive field to alleviate the problem of varying object scales in high-resolution remote sensing imagery. Given the significant differences in road lengths and shapes in remote sensing imagery, Yang et al. [20] enhanced the U-Net architecture with dual decoding structures and introduced dilated convolutional attention mechanisms to accurately capture roads in the images.
Multiscale modules extract and integrate features at different scales to better capture the complexity and inconsistency of object scales. Different object categories are best captured at different scales. Based on this observation, Cai et al. [21] proposed a stacked semantic segmentation framework, which uses a learnable error correction module to fuse segmentation results and improve the final output. Liu et al. [22] focused on scale perception and fusion, annotating categories with large intraclass differences and different scales, and training their scale feature attention module accordingly. Bai et al. [23] developed a multiscale attention module based on fine-grained multiscale features and channel dependencies to extract features at multiple scales. Hang et al. [24] introduced a multiscale progressive segmentation network, cascading three subnetworks and designing a scale-guided module that uses the segmentation results of the previous subnetwork to guide the feature learning of the next.
However, despite the significant success achieved by CNNs, they still have certain limitations. CNNs are proficient at extracting local features, yet their convolutional layers' local nature hinders the network's ability to capture global context. To overcome this limitation, various methods [25], [26] have been proposed, but each of these methods has its own set of limitations. For instance, dilated convolutions may struggle to extract small objects [27], stacking multiple convolutional layers can lead to a substantial increase in computational complexity [28], and feature pyramids may result in information loss.
Given the complex distribution of objects in remote sensing images, leveraging global information is crucial. The Transformer, known for its ability to model global information effectively, has shown significant potential in this regard [29]. By stacking multiple self-attention modules, the Transformer constructs long-range dependencies among objects, thus addressing the limitations of convolutional approaches [30]. Transformer-based semantic segmentation networks can be divided into pure Transformer methods and hybrid CNN-Transformer networks.
Pure Transformer: Both the encoder and decoder are Transformer architectures. Unlike the conventional use of convolution or pooling, Xie et al. [31] achieved feature downsampling and upsampling by merging and expanding patches, capturing long-range contextual relationships between patches. The Swin Transformer, used as the backbone network, can vary the receptive field through shifted local windows, thus extracting multiscale features [32]. Liu et al. [33] utilized a pretrained Transformer encoder to extract the semantic features of the input image and designed a multiscale global-local Transformer decoder, ensuring consistent feature representations. Using a Transformer as the encoder maximizes the utilization of global information during feature extraction, and decoding the encoded semantic information in this way yields higher computational efficiency and stronger generalization capabilities.
CNNs and Transformer: Either the encoder or the decoder is a Transformer, and the other is a CNN. UNetFormer [34] adopts ResNet18 as the backbone network and incorporates a global-local attention mechanism to construct Transformer blocks in the decoder, allowing attention blocks to capture both global and local contexts. Li et al. [35] utilized a linear attention mechanism at the skip connections of the U-shaped network to reduce the memory and computational costs of dot-product attention. Zhang et al. [36] adopted the Swin Transformer as the backbone network while using a spatial pyramid pooling block based on depthwise separable convolutions to capture multiscale contexts. The advantage of using residual networks is their sensitivity to local information, while the Transformer can focus on different parts of the input sequence, thereby enhancing the model's performance in complex contexts.
In summary, while Transformer excels at capturing global information, its utilization of local information is often inadequate. For remote sensing imagery, local information is equally essential, as neighboring objects often exhibit strong correlations [37]. Current research often focuses on methods combining convolution and attention [38]. However, complex networks inevitably increase computational requirements. In addition, simple feature aggregation methods, such as feature addition or concatenation, may lead to information overlap and redundancy, making it challenging for the network to learn effective feature representations [39].
Finally, to address the issue of common semantic segmentation methods failing to fully leverage spectral information, recent semantic segmentation approaches integrate spectral features when applied to hyperspectral classification. The extraction of spectral information primarily involves two methods: spectral extraction modules and spectral attention modules.
Spectral extraction modules are typically designed to extract spectral information by incorporating 3-D convolutions or introducing additional spectral branches. FCN, combined with 3-D convolutions to form SS3FCN [40], is utilized for extracting both spectral-spatial and semantic information. Huang et al. [41] proposed a 3-D Swin Transformer based on the Swin Transformer to better fit hyperspectral data. In addition, a combination of convolutional layers and mixed convolution-Transformer modules has been employed to extract and fuse local and global features [42], [43]. Xie et al. [44] utilized the Swin Transformer as the backbone network and introduced a hyperpatch embedding module to extract spectral and local spatial information from hyperspectral images. Feng et al. [45], [46] proposed a multicomplementary GAN with contrastive learning (CMC-GAN) and further introduced a class-aligned and class-balanced generative domain adaptation (CCGDA) method, both of which effectively capture rich spatial-spectral features.
Spectral attention modules can assist the model in better extracting spectral information. FPGA [47] introduced spectral attention in the encoder, proposing an end-to-end trained patch-free global learning FCN framework. Xin et al. [48] proposed a triple-attention network for hyperspectral images, namely, the spectral-space-scale attention network, which adaptively weights each channel, each pixel, and each perceived scale of the feature map. Gui et al. [49] proposed an infrared attention network, which incorporates an additional infrared spectrum encoder and multiple attention blocks. This design focuses on vegetation features sensitive to infrared, enabling precise extraction of forested areas in multispectral images. The multiscale receptive fields graph attention neural network [50] introduces a spectral transformation mechanism and a multiscale receptive field graph attention network, addressing the limitations of GNNs in hyperspectral classification.
However, the aforementioned methods are designed for hyperspectral images with fewer samples and are typically trained by dividing patches. This significantly reduces the model's receptive field, making it unable to fully utilize the abundant spatial information in remote sensing images. Moreover, training on large-scale hyperspectral images increases computational complexity.
To sum up, existing methods still encounter the following issues.
The local nature of convolutional layers limits the network's ability to capture global context, while transformers overlook local information, both of which are necessary for remote sensing image segmentation. Combining convolution and transformers can lead to a sharp increase in computational costs, thereby increasing training expenses.
Remote sensing imagery contains rich semantic features. However, existing methods often aggregate information between different levels primarily through skip connections, without considering the correlation between different features. This oversight can lead to information overlap and redundancy, making it difficult for the network to learn effective feature representations.
Networks in the field of semantic segmentation often underutilize spectral information. Moreover, methods designed for hyperspectral remote sensing images mostly rely on patch-based approaches, limiting the extraction of spatial information from large-scale hyperspectral remote sensing images due to the constrained receptive fields.
In this article, we propose a novel U-shaped semantic segmentation network that combines spectral attention with deformable Transformers for large-scale hyperspectral remote sensing datasets. Inspired by Zhu et al. [51], we introduce deformable Transformers along with self-attention and cross-attention-based feature interaction methods. First, in the process of feature extraction, we integrate global spectral attention (GSA) to weight the more valuable feature layers, thereby extracting rich spectral information. Second, to overcome the difficulty convolutional networks have in extracting global features, this article adopts a multilevel feature interaction approach based on the deformable Transformer. By introducing deformable self-attention mechanisms, the model can more effectively capture features at different scales, thereby improving its performance on complex objects in large-scale images. Lastly, deformable cross-attention facilitates the interaction between deep and shallow features, further enhancing the model's ability to dynamically perceive semantic features. By synthesizing the challenges faced and the proposed solutions, we develop a novel, lightweight, and effective semantic segmentation method for large-scale hyperspectral remote sensing imagery.
The main contributions of this article are summarized as follows.
We propose a U-shaped semantic segmentation network, DTSU-Net, for large-scale hyperspectral image semantic segmentation. By performing hierarchical downsampling on the entire input image and applying GSA to weight the spectral features of the feature maps, the weighted features emphasize the importance of different spectral bands, effectively mitigating intraclass spectral variations.
We employ a deformable Transformer to facilitate multilevel and multiscale feature interaction, effectively capturing targets at different scales. The deformable Transformer offers low computational complexity and strong local dependency, thus efficiently extracting local information from remote sensing imagery.
We utilize deformable cross-attention to enhance deep features. Deep features possess rich semantic information but lack spatial information from shallow features. By leveraging shallow information, deep features are empowered to fully explore target characteristics and mitigate information loss caused by downsampling.
Related Work
A. Review of Transformer
Before introducing the deformable Transformer, let us briefly review the standard Transformer. The main principle of the multihead self-attention (MHSA) mechanism is to compute relationships between each pixel and all other pixels. Unlike convolution, which focuses on a fixed receptive field, the Transformer can model long-range dependencies at large scales, capturing global contextual information. In computer vision, the Transformer divides an image into a series of image patches and computes similarity relationships between these patches. If the input image is denoted as $x$, the query, key, and value vectors are obtained through the learnable projection matrices $W_{q}$, $W_{k}$, and $W_{v}$
\begin{equation*}
q = xW_{q}, k = xW_{k}, v = xW_{v}. \tag{1}
\end{equation*}
The multihead self-attention over the $K$ keys is then computed as
\begin{equation*}
\text{MHSA}(x) = \sum _{m=1}^{M} W_{m} \sum _{k=1}^{K} \sigma \left(\frac{q^{T} k}{\sqrt{d_{k}}}\right) \cdot \text{W'}_{m} v. \tag{2}
\end{equation*}
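For readers who prefer code, the following minimal PyTorch sketch illustrates the projections in (1) and the scaled dot-product attention in (2); the class name, head count, and tensor shapes are illustrative assumptions rather than the implementation used in this article.

```python
import torch
import torch.nn as nn

class SimpleMHSA(nn.Module):
    """Minimal multihead self-attention over a sequence of patch tokens,
    mirroring (1)-(2): linear projections to q, k, v, scaled dot-product
    attention, and an output projection."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.w_q, self.w_k, self.w_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.w_o = nn.Linear(dim, dim)  # output projection (W_m)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N image patches of dimension C.
        B, N, C = x.shape

        def split(t):  # (B, N, C) -> (B, heads, N, head_dim)
            return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # q k^T / sqrt(d_k)
        attn = attn.softmax(dim=-1)                              # sigma(.) in (2)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)        # weighted sum of values
        return self.w_o(out)


tokens = torch.randn(2, 64, 128)  # 2 images, 64 patches, 128-dim embeddings
print(SimpleMHSA(dim=128)(tokens).shape)  # torch.Size([2, 64, 128])
```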
B. Deformable Transformer
Although the standard Transformer excels in modeling long-range dependencies, it has significant drawbacks when handling large-resolution remote sensing images. Because it treats the relationships between every pair of sequence elements equally, the model incurs considerable computational and memory complexity when processing high-resolution images. Moreover, it requires substantial training time to focus the attention weights on meaningful positions. Consequently, using the Transformer for segmentation tasks consumes substantial computational resources. Furthermore, while the Transformer exhibits strong long-range relationship modeling capabilities, local information is equally crucial for remote sensing images, as neighboring pixels often exhibit strong correlations. Self-attention mechanisms are insensitive to local contextual information and may distribute attention weights indiscriminately. This causes the network's scope of interest to extend beyond the boundaries of the target, assigning attention weights to irrelevant information.
To address the aforementioned issues, our previous work has made significant progress in the field of hyperspectral image classification. Xue et al. [52] proposed a local Transformer with a spatial partition recovery network, which restricts attention computation to overlapping subblocks to enhance computational efficiency. Li et al. [53] utilized self-pooling instead of traditional MHSA, significantly reducing the model's parameter count. However, these approaches are not suitable for large-scale natural remote sensing images. Zhu et al. [51] proposed a deformable Transformer block to capture multiscale objects in object detection tasks. Later, Xiao et al. [54] and Zuo et al. [55] introduced it into the semantic segmentation tasks of high-resolution remote sensing imagery.
This approach combines the sparse spatial sampling of deformable convolutions [56] with the Transformer's long-range relationship modeling capabilities. It focuses on a small set of sampling positions and computes attention only for the feature map pixels at these sampling points. If the input feature map is denoted as $x$, with $p_{q}$ the reference point of query $q$ and $\Delta p_{qk}$ the learned offset of the $k$th sampling point, the multihead deformable self-attention (MHDSA) is formulated as
\begin{equation*}
\text{MHDSA}(x) = \sum _{m=1}^{M} W_{m} \sum _{k=1}^{K} \text{Attention}(q,k) \cdot \text{W'}_{m} v_{\phi (p_{q}+\Delta p_{qk})} \tag{3}
\end{equation*}
where the sampling position $p_{q}^{\prime }$ is obtained from the reference point and the learned offset as
\begin{equation*}
p_{q}^{\prime } = \phi (p_{q} + \Delta p_{qk}) = \phi (p_{q} (x+\Delta p_{x}, y+\Delta p_{y})). \tag{4}
\end{equation*}
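To make the sampling mechanism of (3) and (4) concrete, the sketch below shows a single-head, single-scale variant: linear layers predict the offsets $\Delta p_{qk}$ and the attention weights from each query, and bilinear sampling (`grid_sample`) plays the role of $\phi(\cdot)$ by gathering values at $p_{q}+\Delta p_{qk}$. This is an illustrative reimplementation under our own assumptions (offset scaling, normalized coordinates), not the authors' code or the full multiscale module of [51].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSingleHead(nn.Module):
    """Single-head, single-scale deformable attention (illustrative sketch)."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)   # W'_m in (3)
        self.offset_head = nn.Linear(dim, num_points * 2)      # predicts Delta p_qk
        self.weight_head = nn.Linear(dim, num_points)          # predicts attention weights
        self.out_proj = nn.Linear(dim, dim)                    # W_m in (3)

    def forward(self, query, value, ref_points):
        # query: (B, N, C); value: (B, C, H, W); ref_points: (B, N, 2) in [-1, 1].
        B, N, C = query.shape
        v = self.value_proj(value)                                       # (B, C, H, W)
        offsets = self.offset_head(query).view(B, N, self.num_points, 2).tanh()
        weights = self.weight_head(query).softmax(dim=-1)                # (B, N, K)
        # Sampling locations phi(p_q + Delta p_qk); the 0.1 scale is arbitrary here.
        locs = (ref_points.unsqueeze(2) + 0.1 * offsets).clamp(-1, 1)    # (B, N, K, 2)
        # grid_sample performs the bilinear lookup of v at each sampling location.
        sampled = F.grid_sample(v, locs, align_corners=False)           # (B, C, N, K)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1)              # weighted sum over K
        return self.out_proj(out.transpose(1, 2))                        # (B, N, C)


feat = torch.randn(2, 128, 32, 32)
queries = feat.flatten(2).transpose(1, 2)                    # one query per pixel
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 32), torch.linspace(-1, 1, 32), indexing="ij")
refs = torch.stack([xs, ys], dim=-1).view(1, -1, 2).expand(2, -1, -1)
print(DeformableAttentionSingleHead(128)(queries, feat, refs).shape)  # torch.Size([2, 1024, 128])
```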
Proposed Methodology
In this section, we provide a detailed overview of DTSU-Net, a U-shaped semantic segmentation method that combines GSA and the deformable Transformer for large-scale hyperspectral image semantic segmentation. The overall architecture of the proposed DTSU-Net is illustrated in Fig. 1. Initially, feature extraction is performed using convolutional blocks, followed by feature downsampling. Concurrently, during feature dimensionality expansion, GSA is applied to weight crucial feature layers. This step aims to extract both local spatial and spectral information from the hyperspectral images, resulting in three features spanning from shallow to deep layers. To further extract multiscale information from these features, they are uniformly projected into 128 dimensions and then fused to perform deformable self-attention, facilitating the learning of feature relationships across different scales. The interacted deep features are subsequently merged with shallow features using deformable cross-attention, resulting in enriched, semantically meaningful high-level features. Finally, the features are upsampled and successively fused to restore spatial details, producing prediction maps of the original image size. The pseudocode detailing the specific algorithm is presented in Algorithm 1.
Graphical illustration of the proposed deformable Transformer and spectral U-Net (DTSU-Net) for large-scale hyperspectral image semantic segmentation.
DTSU-Net for Large-Scale Hyperspectral Image Semantic Segmentation.
A. Global Spectral Attention
The encoder network comprises convolutional modules and a GSA section, aiming to extract rich spatial and spectral information from large-scale hyperspectral images. The convolutional module consists of 3 × 3 convolutions followed by group normalization and rectified linear unit (ReLU) activation. Given the computational demands of large-scale hyperspectral imagery, the network operates with a relatively small batch size. Group normalization is employed within each group to reduce reliance on batch size, enhancing network stability and suitability for training with small batches.
The GSA module incorporates global average pooling and global max pooling over the entire hyperspectral cube. As illustrated in Fig. 2, with an input hyperspectral cube of size H × W × C, the GSA mechanism performs max pooling and average pooling along the H × W dimensions, enhancing interchannel relationships while preserving individual channel information. Subsequently, two convolutional layers followed by ReLU activation are applied sequentially to map the features into a high-dimensional space, resulting in two global feature vectors of size 1 × 1 × C
\begin{align*}
\text{Avg}(Y_{(m,:,:)}) = &\frac{1}{H \times W} \sum _{i=1}^{H} \sum _{j=1}^{W} Y_{(m,i,j)} \tag{5}
\\
\text{Max}(Y_{(m,:,:)}) = &\max \lbrace Y_{(m,i,j)}, \quad i \in (1,H), \quad j \in (1,W) \rbrace. \tag{6}
\end{align*}
The formula for this step is
\begin{equation*}
F^{\prime } = \sigma (\text{FC}(\text{Avg}(Y)) + \text{FC}(\text{Max}(Y))). \tag{7}
\end{equation*}
The weighted feature map is then obtained by channel-wise multiplication
\begin{equation*}
F = F \cdot F^{\prime }. \tag{8}
\end{equation*}
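A compact PyTorch sketch of the GSA computation in (5)-(8) is given below; the shared two-layer mapping and the reduction ratio are our assumptions about unspecified details, so treat it as an illustration rather than the exact module.

```python
import torch
import torch.nn as nn

class GlobalSpectralAttention(nn.Module):
    """Channel (spectral) attention following (5)-(8): global average and max
    pooling over H x W, a shared two-layer mapping, sigmoid gating, and
    channel-wise reweighting of the input. Layer widths are illustrative."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Shared layers implemented as 1x1 convolutions with a ReLU in between.
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W) feature cube; pooling reduces H x W to 1 x 1 per channel.
        avg = torch.mean(f, dim=(2, 3), keepdim=True)        # Avg(Y) in (5)
        mx = torch.amax(f, dim=(2, 3), keepdim=True)         # Max(Y) in (6)
        f_prime = self.sigmoid(self.fc(avg) + self.fc(mx))   # F' in (7)
        return f * f_prime                                   # F = F . F' in (8)


x = torch.randn(2, 32, 128, 128)  # e.g., 32 spectral-derived channels
print(GlobalSpectralAttention(32)(x).shape)  # torch.Size([2, 32, 128, 128])
```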
B. Multihead Deformable Self-Attention
The multiscale feature mechanism is widely applied in remote sensing semantic segmentation models and can be seamlessly integrated into our model. If the input feature maps consist of multiple levels of features $X_{1}, X_{2}, \ldots, X_{l}$, they are first flattened and concatenated into a single token sequence
\begin{equation*}
x_{f} = \text{concat}\lbrace \text{flatten}(X_{1}), \text{flatten}(X_{2}), \ldots, \text{flatten}(X_{l})\rbrace \tag{9}
\end{equation*}
and the concatenated sequence is processed by MHDSA followed by an MLP with a residual connection
\begin{equation*}
x_{\text{out}} = \text{MLP}(\text{{MHDSA}}(x_{f})) + x_{f}. \tag{10}
\end{equation*}
Since the output feature size remains the same as the input and can be reshaped back into the shapes of the original three input features, we stack this module three times and pair it with the subsequent deformable cross-attention modules. The schematic diagram of this part is shown in Fig. 3, where the deformable Transformer provides information about the target across different features, thereby improving the model's performance under various complex conditions. Since each pixel in the multiscale feature map serves as an object query, the smaller deep-level features save significant computational and memory costs compared with shallow features.
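The sketch below illustrates the flatten-and-concatenate step of (9) and the residual MLP of (10); for brevity, PyTorch's standard `nn.MultiheadAttention` stands in for the MHDSA operator (so the attention here is dense rather than deformable), and the 128-dim projection and four-fold MLP expansion are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleSelfAttentionBlock(nn.Module):
    """Sketch of (9)-(10): flatten and concatenate multilevel features into one
    token sequence, apply self-attention plus an MLP with a residual connection,
    then split the tokens back into per-level feature maps."""

    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        # nn.MultiheadAttention stands in for the deformable MHDSA operator.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feats):
        # feats: list of (B, C, H_l, W_l) maps already projected to a common C = dim.
        shapes = [f.shape for f in feats]
        tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in feats], dim=1)  # x_f in (9)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        x = self.mlp(attn_out) + tokens                                           # x_out in (10)
        # Split the sequence and reshape each chunk back to its original layout.
        outputs, start = [], 0
        for shape in shapes:
            n = shape[-2] * shape[-1]
            outputs.append(x[:, start:start + n].transpose(1, 2).reshape(shape))
            start += n
        return outputs


levels = [torch.randn(2, 128, s, s) for s in (64, 32, 16)]
print([t.shape for t in MultiScaleSelfAttentionBlock()(levels)])
```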
C. Multihead Deformable Cross-Attention (MHDCA)
While the backbone network can model multiscale feature maps separately, uniform attention overlooks the importance of features at different downsampling levels, especially the deepest features, which contain rich abstract semantic information. On the contrary, shallow features typically contain more spatial information and details, necessitating multilevel feature fusion to further enrich the features. Leveraging deformable cross-attention together with the deformable Transformer described above facilitates feature interaction and strengthens the deep features of the network. Specifically, we use the attention-weighted bottom-level feature $x$ as the query and the projected shallow features $\bar{x}_{i}$ as the keys and values
\begin{equation*}
q=xW_{q}, \bar{k}= \bar{x}_{i} W_{k}, \bar{v}= \bar{x}_{i} W_{v}. \tag{11}
\end{equation*}
The process, as shown in Fig. 3, begins with self-attention on the query vectors embedded with positional information, fully exploiting the internal information of the target sequence. The information from self-attention is then fused with the original sequence information using residual connections, retaining some of the original information. Subsequently, cross-attention is computed between the obtained features and each of the projected shallow features $\bar{x}_{i}$
\begin{align*}
\text{MHDCA}(x_{1}, \ldots, x_{l}) =& \sum _{m=1}^{M} W_{m} \sum _{l=1}^{L} \sum _{k=1}^{K} A_{lqk}\\
&\cdot \text{W'}_{m} v_{\phi (p_{q}+\Delta p_{lqk})}. \tag{12}
\end{align*}
Next, for the obtained features, we perform upsampling using 2× nearest-neighbor interpolation after a 3 × 3 convolution. Simultaneously, at the same resolution, we use skip connections to continuously integrate shallow features and restore details. This process continues until the features are upsampled to the size of the original input image. Finally, we use a 1 × 1 convolution whose number of output channels equals the number of classes to obtain the prediction results.
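A decoder sketch matching this description (3 × 3 convolution, 2× nearest-neighbor upsampling, skip fusion at equal resolution, and a final 1 × 1 classifier) is shown below; the channel widths, number of stages, and concatenation-based fusion are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """One decoder stage: 3x3 conv, 2x nearest-neighbor upsampling, then fusion
    with the skip feature of matching resolution (concatenation assumed here)."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                  nn.GroupNorm(8, out_ch), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
                                  nn.GroupNorm(8, out_ch), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = F.interpolate(self.conv(x), scale_factor=2, mode="nearest")
        return self.fuse(torch.cat([x, skip], dim=1))

class Decoder(nn.Module):
    """Restores spatial detail stage by stage and predicts per-class logits."""

    def __init__(self, num_classes: int = 24):
        super().__init__()
        self.up1 = UpBlock(128, 128, 128)   # 1/32 -> 1/16, fuse with mid-level skip
        self.up2 = UpBlock(128, 128, 64)    # 1/16 -> 1/8, fuse with shallow skip
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)  # 1x1 classifier

    def forward(self, deep, mid, shallow, out_size):
        x = self.up1(deep, mid)
        x = self.up2(x, shallow)
        # Final upsampling to the original image size before the 1x1 classifier.
        x = F.interpolate(x, size=out_size, mode="nearest")
        return self.head(x)


deep, mid, shallow = torch.randn(1, 128, 16, 16), torch.randn(1, 128, 32, 32), torch.randn(1, 128, 64, 64)
print(Decoder()(deep, mid, shallow, out_size=(512, 512)).shape)  # torch.Size([1, 24, 512, 512])
```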
D. Cross-Entropy Loss and Segmentation
The final output of a semantic segmentation network typically consists of a 3-D feature tensor of size $H \times W \times C$, where $C$ is the number of classes and each spatial position holds the predicted class probabilities $\hat{y}$ for that pixel. The network is trained with the cross-entropy loss between the prediction $\hat{y}$ and the one-hot label $y$ over the $N$ pixels
\begin{equation*}
L(y, \hat{y})_{\text{ce}} = -\frac{1}{N} \sum _{i=1}^{N} \sum _{j=1}^{C} y_{i}^{(j)} \cdot \log (\hat{y}_{i}^{(j)}) \tag{13}
\end{equation*}
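In PyTorch, (13) corresponds directly to `nn.CrossEntropyLoss` applied to the per-pixel logits; the snippet below is a usage sketch with illustrative shapes (24 classes, 512 × 512 tiles), and the `ignore_index` for unlabeled pixels is an assumption.

```python
import torch
import torch.nn as nn

# Cross-entropy loss (13) for dense prediction: logits of shape (B, C, H, W)
# against integer labels of shape (B, H, W). The ignore_index can mask
# unlabeled pixels, which is common for remote sensing ground truth.
criterion = nn.CrossEntropyLoss(ignore_index=255)

logits = torch.randn(2, 24, 512, 512)            # 24 classes, 512 x 512 tiles
labels = torch.randint(0, 24, (2, 512, 512))     # per-pixel class indices
loss = criterion(logits, labels)                 # averages -y log(y_hat) over pixels
print(loss.item())
```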
Experimental Results
A. Dataset
This study utilizes the WHU-OHS dataset [57], characterized by a spatial resolution of 10 m and encompassing 32 spectral channels spanning from visible light to near-infrared, with an average spectral resolution of 15 nm. Comprising 42 OHS satellite images sourced from diverse cities across China, the dataset has been curated into 7795 subimages of 512 × 512 pixels. Among these, 4821 serve as training images, 515 as validation, and 2459 as testing. The dataset offers a wealth of information, featuring an average of thirteen land cover classes per city. There are a total of 24 categories, with their names and corresponding numbers shown in Table I. The distribution of land cover varies substantially between cities. Notably, cities in the southeast of China, such as Shanghai and Guangzhou, feature developed urban areas and intricate water networks, with farmland dominated by rice paddies. In contrast, northern cities such as Changchun contain dense forests and predominantly dry farmland, while cities in the northwest, exemplified by Karamay in Xinjiang, exhibit sparse vegetation and distinctive geographical features, such as deserts. Three-band false-color composite images and the corresponding ground truth maps of the WHU-OHS dataset are shown in Fig. 4.
False-color composite image for four cities (R: 670 nm, G: 566 nm, B: 480 nm). (a) Changchun, (b) Shanghai, (c) Guangzhou, and (d) Karamay.
B. Implementation Details
All experiments were conducted on an Intel(R) Xeon(R) processor and an NVIDIA RTX 3090 24-GB GPU, using the PyTorch 1.9.1 framework with CUDA 11.1. For training, the adaptive moment estimation (Adam) optimizer was chosen, with an initial learning rate of 0.0001 and a weight decay of 0.0001. The training process involved 100 epochs on the dataset, with a batch size of 6. No data augmentation techniques were applied, and the network was trained from scratch.
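The reported settings translate into the following training-loop sketch; the model is replaced by a 1 × 1 convolution stand-in and the data loader by a dummy batch so that the snippet is self-contained, and both are assumptions rather than the actual pipeline.

```python
import torch
import torch.nn as nn

# Training setup mirroring the reported settings: Adam with initial learning
# rate 1e-4 and weight decay 1e-4, 100 epochs, batch size 6, no augmentation,
# training from scratch. A 1x1-conv stand-in replaces the real network.
model = nn.Conv2d(32, 24, kernel_size=1)          # stand-in: 32 bands -> 24 classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

dummy_batch = [(torch.randn(6, 32, 512, 512), torch.randint(0, 24, (6, 512, 512)))]
for epoch in range(100):
    for images, labels in dummy_batch:            # replace with the WHU-OHS DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```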
C. Evaluation Metrics
To quantitatively validate the effectiveness and robustness of DTSU-Net, we employed the following evaluation metrics: the intersection over union (IoU) of each class, the mean intersection over union (mIoU), and the overall accuracy (OA), which measure the classification accuracy of the model on the test set
\begin{align*}
\text{IoU}_{i} =& \frac{\text{TP}_{i}}{\text{TP}_{i}+\text{FP}_{i}+\text{FN}_{i}} \tag{14}\\
\text{mIoU} = &\frac{1}{N} \sum _{i=1}^{N} \text{IoU}_{i} \tag{15}\\
\text{OA} = &\frac{\sum _{k=1}^{N} \text{TP}_{k}}{\sum _{k=1}^{N} (\text{TP}_{k} + \text{FP}_{k} + \text{TN}_{k} + \text{FN}_{k})}. \tag{16}
\end{align*}
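The metrics in (14)-(16) can be computed from a confusion matrix accumulated over the test set, as in the NumPy sketch below; note that OA is implemented here as the trace of the confusion matrix divided by the total pixel count, which is the usual reading of (16).

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Accumulate an N x N confusion matrix from flattened predictions and labels."""
    mask = (gt >= 0) & (gt < num_classes)            # ignore out-of-range labels if any
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_miou_oa(cm: np.ndarray):
    """Per-class IoU (14), mIoU (15), and OA (16, here TP over all pixels)."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp                         # predicted as class i but labeled otherwise
    fn = cm.sum(axis=1) - tp                         # labeled class i but predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou, iou.mean(), tp.sum() / np.maximum(cm.sum(), 1)


pred = np.random.randint(0, 24, (512, 512))
gt = np.random.randint(0, 24, (512, 512))
cm = confusion_matrix(pred.ravel(), gt.ravel(), 24)
iou, miou, oa = iou_miou_oa(cm)
print(miou, oa)
```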
D. Parameter Sensitive Analysis
1) Learning Rate
To determine the optimal learning rate for our model, we set the learning rate to values in [1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5] and conducted experiments on the datasets of the four cities, recording the mIoU obtained for each setting. As depicted in Fig. 5(a), we plotted the segmentation results for each learning rate as a line graph for intuitive analysis. Accuracy gradually increases as the learning rate decreases from 1e-2 to 1e-4, and the optimal segmentation performance is observed when the learning rate is set to 1e-4.
2) Sampling Points
This hyperparameter is a crucial component of the deformable Transformer, indicating the number of pixels sampled on each feature layer and thereby affecting the model's receptive field. Consequently, we conduct a sensitivity analysis on the number of sampling points to verify its impact on the model's performance. We set the parameter K to values in [2, 4, 6, 8, 10], where a higher number of sampling points corresponds to a larger receptive field but also an increased computational load. As shown in Fig. 5(b), the best results are obtained with 8 sampling points, so the number of sampling points is set to 8.
3) Number of Heads in Deformable Transformer
This parameter refers to the number of heads in the multihead attention mechanism we employ. It affects both the model's parameter count and the feature dimensions of intermediate layers. We set this parameter to values in [1, 2, 4, 8, 16] to evaluate its impact on the model; increasing it significantly raises the computational load. As shown in Fig. 5(c), when this hyperparameter is set to 8, the model exhibits superior performance across the three metrics compared to the other configurations.
4) Number of Projection Layers
To ensure uniform feature dimensions across the different convolutional blocks for ease of subsequent feature interaction, we project the features from different channels to a common dimension. In this experiment, we vary this projection dimension over [64, 96, 128, 192, 256] and analyze the impact of different dimensions on the results. As shown in Fig. 5(d), the optimal performance is achieved when the projection dimension is set to 128.
E. Performance Comparison
In this section, we validate the effectiveness of our method in large-scale hyperspectral image segmentation. To ensure the results are more universally applicable and compelling, we conduct comparisons across four cities in China: Changchun (northern region) in Jilin Province, Shanghai (eastern region), Guangzhou (southern region) in Guangdong Province, and Karamay (western region) in Xinjiang. Simultaneously, we compare DTSU-Net with common hyperspectral methods and state-of-the-art semantic segmentation methods, demonstrating the superiority of our approach. The methods for comparison include hyperspectral techniques, such as 1DCNN, 3DCNN, A2S2KResNet, SS3FCN, and FPGA; CNN-based methods, such as FCN, U-Net, DeepLabV3+, DANet, and LANet; and Transformer-based methods, such as SegFormer, UNetFormer, TransUNet, DC-Swin, ST-UNet, and EMRT. Specifically, when comparing hyperspectral methods, we did not adopt the traditional pixel-wise classification approach. Instead, we modified the final output layer by upsampling the high-level features to match the input size for semantic segmentation. This approach not only fully leverages their spectral extraction capabilities but also maintains consistency in model architecture. Moreover, it enables fast segmentation, making it more suitable for large-scale hyperspectral imagery. DTSU-Net does not belong to any of the aforementioned three categories. First, it differs from pixel-wise hyperspectral classification methods. Second, instead of using a common residual network as the encoder backbone, our method employs spectral attention and convolution blocks for feature extraction. Finally, our Transformer blocks are not used for feature extraction or decoding but are dedicated to multiscale feature interaction. As a result, our method offers advantages over typical semantic segmentation approaches, with lower computational memory requirements and higher segmentation accuracy.
To provide a more intuitive reflection of the segmentation performance of different models, Figs. 6–9, respectively, display partial ground truth and segmentation result images for the four datasets. Due to the extensive coverage and large volume of the dataset, the distribution of land cover types is more complex, and there is a significant imbalance in the number of samples between certain classes. This imbalance leads to the inability of some methods to segment more challenging classes, resulting in a large number of zero IoU values in the table. In addition, the backbone of DeepLabV3+ is Xception, which uses depthwise separable convolutions that decompose convolution into depthwise and pointwise convolutions. Although this reduces computational complexity, it fails to fully utilize the rich spectral information. As a result, DeepLabV3+ performs the worst, and its performance in Figs. 6(i)–9(i) is also unsatisfactory. In contrast, our method performs better on these more difficult-to-segment categories, achieving the highest mIoU.
Evolution of mIoU as a function of (a) learning rate, (b) sampling points, (c) number of heads, and (d) number of projection layers for four datasets.
Segmentation maps of different methods on the Changchun city. (a) Ground Truth, (b) 1DCNN (18.39%), (c) 3DCNN (24.26%), (d) A2S2KResNet (34.66%), (e) SS3FCN (40.91%), (f) FPGA (49.30%), (g) FCN (29.27%), (h) U-Net (28.44%), (i) DeepLabV3+ (2.31%), (j) DANet (42.25%), (k) LANet (41.99%), (l) SegFormer (41.79%), (m) TransUNet (17.70%), (n) DC-Swin (27.54%), (o) UNetFormer (23.33%), (p) ST-UNet (35.31%), (q) EMRT (40.91%), (r) DTSU-Net (56.19%).
Segmentation maps of different methods on the Shanghai city. (a) Ground truth, (b) 1DCNN (19.22%), (c) 3DCNN (20.39%), (d) A2S2KResNet (25.48%), (e) SS3FCN (29.58%), (f) FPGA (33.88%), (g) FCN (26.26%), (h) U-Net (25.17%), (i) DeepLabV3+(2.41%), (j) DANet (28.18%), (k) LANet (31.05%), (l) SegFormer (31.05%), (m) TransUNet (11.70%), (n) DC-Swin (22.31%), (o) UNetFormer (21.69%), (p) ST-UNet (21.38%), (q) EMRT (21.48%), and (r) DTSU-Net (37.89%).
Segmentation maps of different methods on the Guangzhou city. (a) Ground truth, (b) 1DCNN (16.09%), (c) 3DCNN (19.63%), (d) A2S2KResNet (26.36%), (e) SS3FCN (30.24%), (f) FPGA (42.43%), (g) FCN (26.22%), (h) U-Net (28.36%), (i) DeepLabV3+(4.48%), (j) DANet (35.17%), (k) LANet (32.64%), (l) SegFormer (34.79%), (m) TransUNet (14.66%), (n) DC-Swin (19.65%), (o) UNetFormer (23.57%), (p) ST-UNet (25.42%), (q) EMRT (31.97%), and (r) DTSU-Net (52.90%).
Segmentation maps of different methods on the Karamay city. (a) Ground Truth, (b) 1DCNN (20.29%), (c) 3DCNN (33.75%), (d) A2S2KResNet (31.82%), (e) SS3FCN (58.50%), (f) FPGA (54.64%), (g) FCN (43.00%), (h) U-Net (26.72%), (i) DeepLabV3+(1.93%), (j) DANet (46.77%), (k) LANet (46.43%), (l) SegFormer (50.19%), (m) TransUNet (20.55%), (n) DC-Swin (39.43%), (o) UNetFormer (36.15%), (p) ST-UNet (30.87%), (q) EMRT (52.67%), and (r) DTSU-Net (63.54%).
1) Comparison on the Changchun City
The segmentation results of all methods on the Changchun City dataset are summarized in Table II. The best results are highlighted in bold, and the second-best results are underlined. DTSU-Net achieves the best performance in both mIoU and OA, with an mIoU of 56.19%.
2) Comparison on the Shanghai City
The segmentation results of all methods on the Shanghai dataset are presented in Table III, where the best results are highlighted in bold. DTSU-Net achieves the best performance in both mIoU and OA, with an mIoU of 37.89%.
3) Comparison on the Guangzhou City
The segmentation results of all methods on the Guangzhou dataset are summarized in Table IV, with the best results highlighted in bold and the second-best results underlined. DTSU-Net achieves the best performance in both mIoU and OA, with an mIoU of 52.90%.
4) Comparison on the Karamay City
The segmentation results of all methods on the Karamay dataset are presented in Table V, where the best results are highlighted in bold and the second-best results are underlined. DTSU-Net achieves the best performance in both mIoU and OA, with an mIoU of 63.54%.
Discussion
A. Ablation Study
1) Effect of Spectral Attention
Spectral attention aims to assign larger weights to the more informative feature dimensions, mitigating within-class spectral variations. This is particularly crucial for processing hyperspectral images. In this module, we aggregate information for each dimension through global average pooling and global max pooling. Consequently, we assess their capability to aggregate spectral information and determine whether this module effectively concentrates attention on channels with richer information. In Tables VI and VIII, we validated the model's performance during the ablation of multihead deformable self-attention (MHDSA) and MHDCA by comparing the method with and without spectral attention. The comparison between the two tables reveals that the model consistently performs better when spectral attention is introduced, demonstrating that spectral attention effectively enhances the model's accuracy.
2) Effect of MHDSA and MHDCA
To validate the effectiveness of the modules in our method, we conducted ablation experiments by adding one module at a time, as shown in Table VI. The mIoU of the baseline on all four datasets is not high, indicating that CNNs have limitations in handling large-scale hyperspectral images. Transformers, with their ability to process sequential data, can capture relationships between pixels over longer distances than CNNs, which is crucial for extracting global information. Upon adding the MHDSA module alone, the model can aggregate multiscale features from different stages, leading to a significant improvement in accuracy. However, adding MHDCA alone does not significantly improve performance and even leads to a decline in results on the Changchun dataset. This is because the self-attention module primarily captures global features and contextual information, while the cross-attention module focuses on fusing information across multiple feature maps. Without first obtaining global features through self-attention, the cross-attention module may struggle to effectively utilize the available information, resulting in only minimal improvement in performance. When both MHDSA and MHDCA are added, MHDCA helps integrate multiscale information into the deepest feature maps, enhancing feature representation. At this point, our model achieves optimal performance, demonstrating the synergistic effect of combining the multiscale attention mechanisms.
3) Effectiveness of Deformable Transformer
The MHSA module is a core component of the Transformer, effectively extracting global features. However, the standard MHSA is computationally intensive and may not fully utilize local information compared to deformable self-attention. We conducted experiments comparing these two attention methods in both the self-attention and cross-attention modules to validate their performance and accuracy. The self-attention module consists of self-attention and MLP components, computing the query and value vectors of the aggregated multiscale features. Similarly, the cross-attention module consists of cross-attention and MLP components, where the deformable cross-attention calculates the query and value vectors between the two features. Specifically, we apply either a standard Transformer or a deformable Transformer in both the self-attention and cross-attention modules, resulting in four configurations: using a Transformer in both modules, using a deformable Transformer in both modules, using a deformable Transformer only in the self-attention module, and using a deformable Transformer only in the cross-attention module.
To highlight the advantages of deformable self-attention, we evaluated the configurations in terms of both accuracy and speed, as shown in Table VII. The deformable Transformer outperforms the standard Transformer in both accuracy and speed. This indicates that the standard Transformer, when performing global operations, not only has high complexity but also introduces unnecessary information. Since deformable self-attention handles the multiscale features concatenated together, it reduces computation time more significantly than cross-attention. We observed that using only deformable cross-attention did not achieve optimal results. This is because standard self-attention mechanisms are better at capturing global context, while deformable cross-attention focuses more on aggregating local information. When applied to globally focused features derived from self-attention, the deformable cross-attention struggles to effectively leverage its ability to handle local variations, leading to a loss of detail and, consequently, poorer results. The deformable Transformer, with its adaptability and efficiency, proves to be more suitable for capturing global features.
B. Comparison on All Cities
In this section, we validate our approach across all cities, with the training, validation, and testing sets consisting of 4821, 515, and 2459 subimages, respectively. With the significant increase in data volume, the complexity of land cover types is higher. In addition, the characteristics of land cover vary across regions, leading to significant performance differences for some methods between individual cities and all cities combined. Nevertheless, even with the larger dataset, our method still achieves the best results. As shown in Table IX, DTSU-Net achieves an mIoU of 57.22% and an OA of 79.28%. Specifically, its mIoU outperforms the strongest baseline method by 1.65% and the weakest baseline method by 17.84%. Compared to the results on a single city, methods based on Transformers perform better than those based on CNNs, indicating that Transformers require larger datasets to train effectively.
C. Complexity Analysis
Table IX presents the parameter count and floating point operations (FLOPs) for different models. The size of the parameter count reflects the complexity of the model, while FLOPs is commonly used to measure the computational requirements and speed of the model. Hyperspectral methods, compared to semantic segmentation methods, generally have fewer parameters but higher computational demands. Consequently, models in hyperspectral segmentation require more computing resources, leading to longer training times. On the other hand, semantic segmentation methods tend to have larger parameter counts and FLOPs.
DTSU-Net combines convolutional layers with deformable Transformers. Deformable Transformers, in comparison to traditional Transformers, have lower computational and parameter requirements. In deformable Transformers, the sampling offsets and attention weights are generated simultaneously from the query features by lightweight linear layers, so the additional parameter and computational overhead remains small.
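As a rough illustration of this overhead, the snippet below counts the parameters of the linear offset and weight branches of one deformable attention layer, following the layout of [51] with sizes matching the hyperparameters chosen earlier (128-dim queries, 8 heads, 3 levels, 8 points); the actual module sizes in DTSU-Net may differ.

```python
import torch.nn as nn

# Parameter count of the offset/weight prediction branches in one deformable
# attention layer (sizes assumed for illustration). Both are plain linear
# layers over the query features, so their overhead is modest compared with
# the dense q k^T interaction of standard MHSA.
dim, heads, levels, points = 128, 8, 3, 8
offset_branch = nn.Linear(dim, heads * levels * points * 2)   # predicts sampling offsets
weight_branch = nn.Linear(dim, heads * levels * points)       # predicts attention weights
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(offset_branch), count(weight_branch))             # 49536 and 24768 parameters
```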
Conclusion
This article proposes a semantic segmentation method for large-scale hyperspectral remote sensing images based on both spectral attention and deformable Transformer networks. The approach utilizes convolutional downsampling and GSA to focus on valuable spectral dimensions. It also employs MHDSA to extract multiscale global information and MHDCA to enhance semantic information in the high-level features. This allows for the segmentation of large-scale hyperspectral remote sensing images with reduced computational complexity. Experimental results on the public large-scale hyperspectral dataset WHU-OHS demonstrate that our method outperforms common hyperspectral methods and semantic segmentation methods based on CNNs and Transformers.
To fully exploit the rich information in large-scale hyperspectral remote sensing images, future research should focus on exploring transfer learning approaches that can effectively leverage prior knowledge for large-scale cross-scene and cross-sensor transferability.
ACKNOWLEDGMENT
The authors would like to thank Prof. J. Li for making the large-scale hyperspectral dataset WHU-OHS available to the community.
The authors would also like to thank Dr. W. Hu for sharing the code of 1DCNN, Dr. Y. Chen for 3DCNN, Dr. S. K. Roy for A2S2KResNet, Dr. Z. Zheng for FreeNet, Dr. J. Long for FCN, Dr. O. Ronneberger for U-Net, Dr. J. Fu for DANet, Dr. L. Ding for LANet, Dr. L.-C. Chen for DeepLabV3+, Dr. E. Xie for SegFormer, Dr. R. Li for MAResUNet, Dr. J. Chen for TransUNet, Dr. L. Wang for DC-Swin and UNetFormer, Dr. X. He for ST-UNet, and Dr. T. Xiao for EMRT.