
Deformable Transformer and Spectral U-Net for Large-Scale Hyperspectral Image Semantic Segmentation



Abstract:

Remote sensing semantic segmentation tasks aim to automatically extract land cover types by accurately classifying each pixel. However, large-scale hyperspectral remote sensing images possess rich spectral information, complex and diverse spatial distributions, significant scale variations, and a wide variety of land cover types with detailed features, which pose significant challenges for segmentation tasks. To overcome these challenges, this study introduces a U-shaped semantic segmentation network that combines global spectral attention and deformable Transformer for segmenting large-scale hyperspectral remote sensing images. First, convolution and global spectral attention are utilized to emphasize features with the richest spectral information, effectively extracting spectral characteristics. Second, deformable self-attention is employed to capture global-local information, addressing the complex scale and distribution of objects. Finally, deformable cross-attention is used to aggregate deep and shallow features, enabling comprehensive semantic information mining. Experiments conducted on a large-scale hyperspectral remote sensing dataset (WHU-OHS) demonstrate that: first, in different cities including Changchun, Shanghai, Guangzhou, and Karamay, DTSU-Net achieved the highest performance in terms of mIoU compared to the baseline methods, reaching 56.19%, 37.89%, 52.90%, and 63.54%, with an average improvement of 7.57% to 34.13%, respectively; second, module ablation experiments confirm the effectiveness of our proposed modules, and deformable Transformer significantly reduces training costs compared to conventional Transformers; third, our approach achieves the highest mIoU of 57.22% across the entire dataset, with a balanced trade-off between accuracy and parameter efficiency, demonstrating an improvement of 1.65% to 56.58% compared to the baseline methods.
Topic: Large-Scale Pretraining for Interpretation Promotion in Remote Sensing Domain
Page(s): 20227 - 20244
Date of Publication: 23 October 2024


SECTION I.

Introduction

Remote sensing semantic segmentation enables precise identification and monitoring of land surface cover types through pixel-level classification of remote sensing images, thereby providing information support for research in remote sensing-related fields [1]. With the rapid development of remote sensing image acquisition technologies, such as satellite remote sensing and aerial remote sensing [2], remote sensing semantic segmentation plays a crucial role in practical applications, such as land cover mapping [3], urban change detection [4], [5], environmental protection [6], and precision agriculture [7].

Traditional remote sensing image segmentation methods mainly rely on pixel-level features for supervised classification, using methods such as support vector machines [8] and random forests [9]. However, these traditional methods are limited in their feature extraction capabilities, especially when modeling the complex land cover and scenes in remote sensing images, and they often underutilize spatial information [10]. With the rise of deep learning, the introduction of convolutional neural networks (CNNs) and Transformers has overcome the limitations of traditional methods in feature representation [11].

However, deep learning-based semantic segmentation of remote sensing images also faces the following challenges. First, the distribution of land cover in remote sensing images is complex, and the same land cover type can appear differently in different scenes [12], leading to high interclass and intraclass variability, which makes accurate segmentation difficult [13], [14]. Second, there is a greater diversity of segmentation objects, and remote sensing images contain richer information and details, requiring the extraction of more diverse detail and semantic features for precise segmentation [15]. Lastly, hyperspectral remote sensing images contain rich spectral information, which is challenging for common semantic segmentation methods to leverage. For example, backbone networks that utilize depthwise separable convolutions [16], [17] to reduce parameter and computational complexity lack the ability to model spectral dimensions effectively.

In the field of CNNs, addressing the issue of complex and inconsistent scale distribution of objects often involves increasing the receptive field or using multiscale modules.

  1. A large receptive field allows the network to better capture broader spatial information from the input imagery, thereby enhancing its adaptability to changes in object scales. Zhang et al. [18] introduced dilated convolutions with different strides in the backbone network to enlarge the receptive field, mitigating intraclass heterogeneity. Chen et al. [19] used adaptive effective receptive field convolution to control the sampling positions of convolutions, automatically adjusting the receptive field to alleviate the problem of varying object scales in high-resolution remote sensing imagery. Given the significant differences in road lengths and shapes in remote sensing imagery, Yang et al. [20] enhanced the U-Net architecture with dual decoding structures and introduced dilated convolutional attention mechanisms to accurately capture roads in the images.

  2. Multiscale modules extract and integrate features at different scales to better capture the complexity and inconsistency of object scales. Objects of different categories are best handled at different scales; building on this observation, Cai et al. [21] proposed a stacked semantic segmentation framework, which uses a learnable error correction module to fuse segmentation results and improve the final output. Liu et al. [22] focused on scale perception and fusion, annotating categories with large intraclass differences and different scales, and training their scale feature attention module accordingly. Bai et al. [23] developed a multiscale attention module based on fine-grained multiscale features and channel dependencies to extract features at multiple scales. Hang et al. [24] introduced a multiscale progressive segmentation network, cascading three subnetworks and designing a scale-guided module that uses the segmentation results of the previous subnetwork to guide the feature learning of the next subnetwork.

However, despite the significant success achieved by CNNs, they still have certain limitations. CNNs are proficient at extracting local features, yet their convolutional layers' local nature hinders the network's ability to capture global context. To overcome this limitation, various methods [25], [26] have been proposed, but each of these methods has its own set of limitations. For instance, dilated convolutions may struggle to extract small objects [27], stacking multiple convolutional layers can lead to a substantial increase in computational complexity [28], and feature pyramids may result in information loss.

Given the complex distribution of objects in remote sensing images, leveraging global information is crucial. The Transformer, known for its ability to model global information effectively, has shown significant potential in this regard [29]. By stacking multiple self-attention modules, the Transformer constructs long-range dependencies among objects, thus addressing the limitations of convolutional approaches [30]. Transformer-based semantic segmentation networks are generally divided into pure Transformer methods and hybrid CNN–Transformer networks.

  1. Pure Transformer: Both the encoder and decoder are Transformer architectures. Unlike the conventional use of convolution or pooling, Xie et al. [31] achieved feature downsampling and upsampling by merging and expanding patches, capturing long-range contextual relationships between each patch. Swin Transformer, as the backbone network, can vary the receptive field by using local shifting windows, thus extracting multiscale features [32]. Liu et al. [33] utilized a pretrained transformer encoder to extract the semantic features of the input image and designed a multiscale global-local transformer decoder for encoding and decoding, ensuring consistent feature representations. With the Transformer as the encoder, it maximizes the utilization of global information during feature extraction and decodes the encoded semantic information, resulting in higher computational efficiency and stronger generalization capabilities.

  2. CNNs and Transformer: One of the encoder and decoder is a Transformer, and the other is a CNN. UNetFormer [34] adopts ResNet18 as the backbone network and incorporates a global-local attention mechanism to construct Transformer blocks in the decoder, allowing attention blocks to capture both global and local contexts. Li et al. [35] utilized a linear attention mechanism at the skip connections of the U-shaped network to reduce the memory and computational costs of dot-product attention. Zhang et al. [36] adopted the Swin Transformer as the backbone network and simultaneously utilized a spatial pyramid pooling block based on depthwise separable convolutions to capture multiscale contexts. The advantage of using residual networks is their sensitivity to local information, while the Transformer can focus on different parts of the input sequence, thereby enhancing the model's performance in complex contexts.

In summary, while Transformer excels at capturing global information, its utilization of local information is often inadequate. For remote sensing imagery, local information is equally essential, as neighboring objects often exhibit strong correlations [37]. Current research often focuses on methods combining convolution and attention [38]. However, complex networks inevitably increase computational requirements. In addition, simple feature aggregation methods, such as feature addition or concatenation, may lead to information overlap and redundancy, making it challenging for the network to learn effective feature representations [39].

Finally, to address the issue of common semantic segmentation methods failing to fully leverage spectral information, recent semantic segmentation approaches integrate spectral features when applied to hyperspectral classification. The extraction of spectral information primarily involves two methods: spectral extraction modules and spectral attention modules.

  1. Spectral extraction modules are typically designed to extract spectral information by incorporating 3-D convolutions or introducing additional spectral branches. FCN, combined with 3-D convolutions to form SS3FCN [40], is utilized for extracting both spectral spatial and semantic information. Huang et al. [41] proposed 3-D-Swin Transformer based on the Swin Transformer to better fit hyperspectral data. In addition, a combination of convolutional layers and mixed convolution-Transformer modules has been employed to extract and fuse local and global features [42], [43]. Xie et al. [44] utilized Swin Transformer as the backbone network and introduced a hyperpatch embedding module to extract spectral and local spatial information from hyperspectral images. Feng et al. [45], [46] proposed a multicomplementary GAN with contrastive learning (CMC-GAN) and further introduced a class-aligned and class-balanced generative domain adaptation (CCGDA) method, both of which effectively capture rich spatial–spectral features.

  2. Spectral attention modules can assist the model in better extracting spectral information. FPGA [47] introduced spectral attention in the encoder, proposing an end-to-end trained patch-free global learning FCN framework. Xin et al. [48] proposed a triple-attention network for hyperspectral images, namely, the spectral-space-scale attention network, which adaptively weights each channel, each pixel, and each scale perception of the feature map. Gui et al. [49] proposed an infrared attention network, which incorporates an additional infrared spectrum encoder and multiple attention blocks. This design focuses on vegetation features sensitive to infrared, enabling precise extraction of forested areas in multispectral images. The multiscale receptive fields graph attention neural network [50] introduces a spectral transformation mechanism and a multiscale receptive field graph attention network, addressing the limitations of GNNs in hyperspectral classification.

However, the aforementioned methods are designed for hyperspectral images with fewer samples and are typically trained by dividing patches. This significantly reduces the model's receptive field, making it unable to fully utilize the abundant spatial information in remote sensing images. Moreover, training on large-scale hyperspectral images increases computational complexity.

To sum up, existing methods still encounter the following issues.

  1. The local nature of convolutional layers limits the network's ability to capture global context, while transformers overlook local information, both of which are necessary for remote sensing image segmentation. Combining convolution and transformers can lead to a sharp increase in computational costs, thereby increasing training expenses.

  2. Remote sensing imagery contains rich semantic features. However, existing methods often aggregate information between different levels primarily through skip connections, without considering the correlation between different features. This oversight can lead to information overlap and redundancy, making it difficult for the network to learn effective feature representations.

  3. Networks in the field of semantic segmentation often underutilize spectral information. Moreover, methods designed for hyperspectral remote sensing images mostly rely on patch-based approaches, limiting the extraction of spatial information from large-scale hyperspectral remote sensing images due to the constrained receptive fields.

In this article, we propose a novel U-shaped semantic segmentation network that combines spectral attention with deformable Transformers for large-scale hyperspectral remote sensing datasets. Inspired by Zhu et al. [51], we introduce deformable Transformers along with self-attention and cross-attention-based feature interaction methods. First, in the process of feature extraction, we integrate global spectral attention (GSA) to weight more valuable feature layers, thereby extracting rich spectral information. Second, to overcome the issue of convolutional networks struggling to extract global features, this article adopts a multilevel feature interaction approach based on deformable Transformer. By introducing deformable self-attention mechanisms, the model can more effectively capture features at different scales, thereby improving its performance in complex objects within large-scale images. Lastly, utilizing deformable cross-attention facilitates deep and shallow feature interaction, further enhancing the model's ability to dynamically perceive semantic features. By synthesizing the challenges faced and the proposed solutions, we develop a novel, lightweight, and effective semantic segmentation method for large-scale hyperspectral remote sensing imagery.

The main contributions of this article are summarized as follows.

  1. We propose a U-shaped semantic segmentation network, DTSU-Net, for large-scale hyperspectral image semantic segmentation. By performing hierarchical downsampling on the entire input image and applying GSA to weight the spectral features of the feature maps, the weighted spectral features emphasize the importance of different spectral information bands, effectively mitigating intraclass spectral variations.

  2. We employ a deformable Transformer to facilitate multilevel and multiscale feature interaction, effectively capturing targets at different scales. The deformable Transformer offers low computational complexity and strong local dependency, thus efficiently extracting local information from remote sensing imagery.

  3. We utilize deformable cross-attention to enhance deep features. Deep features possess rich semantic information but lack spatial information from shallow features. By leveraging shallow information, deep features are empowered to fully explore target characteristics and mitigate information loss caused by downsampling.

SECTION II.

Related Work

A. Review of Transformer

Before introducing the deformable transformer, let us briefly review the Transformer. The main principle of the multihead self-attention (MHSA) mechanism is to compute relationships between each pixel and all other pixels. Unlike convolution, which focuses on a fixed receptive field, Transformer can model long-range dependencies at large scales, capturing global contextual information. In computer vision, Transformer divides an image into a series of image patches and computes similarity relationships between these patches. If the input image is denoted as X \in \mathbb {R}^{H \times W \times C}, after segmentation and positional embedding, it is transformed into a sequence format, denoted as x. The input sequence is then mapped to different query, key, and value spaces based on trainable parameter matrices W_{q}, W_{k}, and W_{v} \begin{equation*} q = xW_{q}, k = xW_{k}, v = xW_{v}. \tag{1} \end{equation*}

In the Transformer model, the query vector q \in \mathbb {R}^{d_{k}}, key vector k \in \mathbb {R}^{d_{k}}, and value vector v \in \mathbb {R}^{d_{k}}, where d_{k} = C / M and C represents the feature dimension. Specifically, the process involves computing the dot product of the query and key vectors, followed by scaling. Then, the softmax function is applied to normalize the attention scores, yielding weights for each position. These weights are then multiplied by the value vector v. Finally, a linear transformation W^{\prime }_{m} is applied to map the result back to the original input space, resulting in the final output. The mathematical expression for this process is as follows: \begin{equation*} \text{MHSA}(x) = \sum _{m=1}^{M} W_{m} \sum _{k=1}^{K} \sigma \left(\frac{q^{T} k}{\sqrt{d_{k}}}\right) \cdot \text{W'}_{m} v. \tag{2} \end{equation*}
Here, \sigma represents the softmax function, ensuring that the sum of the results inside the parentheses is equal to 1. M denotes the number of heads in the MHSA mechanism, and m indexes an individual head. W_{m} and W^{\prime }_{m} are trainable weights used to determine the aggregation relationship between different attention heads. Multihead attention allows the model to focus on content from different subspaces and positions.
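For readers who prefer code to notation, the following is a minimal PyTorch sketch of the scaled dot-product MHSA in (1) and (2) applied to a flattened image sequence; the module name, the fused qkv projection, and the per-head dimension C/M are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    # Minimal multihead self-attention over a flattened sequence x of shape (B, N, C),
    # where N = H * W flattened pixels/patches and C is the feature dimension.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads           # per-head dimension d_k = C / M
        self.scale = self.head_dim ** -0.5         # 1 / sqrt(d_k)
        self.qkv = nn.Linear(dim, dim * 3)         # W_q, W_k, W_v fused into one projection
        self.proj = nn.Linear(dim, dim)            # output projection aggregating the heads

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # scaled dot products, (B, heads, N, N)
        attn = attn.softmax(dim=-1)                # softmax over all N keys
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

x = torch.randn(2, 64 * 64, 128)                   # e.g., a 64 x 64 feature map, flattened
print(MHSA(128)(x).shape)                          # torch.Size([2, 4096, 128])
```

The quadratic (N × N) attention map in this sketch is precisely the cost that the deformable variant discussed next avoids.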

B. Deformable Transformer

Although the standard Transformer excels in modeling long-range dependencies, it has significant drawbacks when handling large-resolution remote sensing images. Due to its equal treatment of relationships between every pair of sequences, the model incurs considerable computational and memory complexity when processing images with high resolutions. Moreover, it requires substantial training time to focus attention weights on meaningful positions. Consequently, using Transformer for segmentation tasks consumes substantial computational resources. Furthermore, while Transformer exhibits strong long-range relationship modeling capabilities, local information is equally crucial for remote sensing images, as neighboring pixels often exhibit strong correlations. Self-attention mechanisms are insensitive to local contextual information and may distribute attention weights indiscriminately. This phenomenon causes the network's scope of interest to extend beyond the boundaries of the target, leading to attention weights being assigned to irrelevant information.

To address the aforementioned issues, our previous work has made significant progress in the field of hyperspectral image classification. Xue et al. [52] proposed a local Transformer with a spatial partition recovery network, which restricts attention computation to overlapping subblocks to enhance computational efficiency. Li et al. [53] utilized self-pooling instead of traditional MHSA, significantly reducing the model's parameter count. However, these approaches are not suitable for large-scale natural remote sensing images. Zhu et al. [51] proposed a deformable Transformer block to capture multiscale objects in object detection tasks. Later, Xiao et al. [54] and Zuo et al. [55] introduced it into the semantic segmentation tasks of high-resolution remote sensing imagery.

This approach combines the advantages of deformable convolutions' sparse spatial sampling [56] with Transformer's long-range relationship modeling capabilities. It focuses on a small set of sampling positions and computes attention for the feature map pixels corresponding to these sampling points. If the input image is X \in \mathbb {R}^{H \times W \times C}, it undergoes segmentation and position embedding, transforming into sequence format x. The input sequence is linearly mapped to query vectors q and value vectors v, with k representing the sampling points. The formula is expressed as follows: \begin{equation*} \text{{MHDSA}}(x) {=} \sum _{m=1}^{M} W_{m} \sum _{k=1}^{K} \text{Attention}(q,k) \cdot \text{W'}_{m} v_{\phi (p_{q}+\Delta p_{qk})} \tag{3} \end{equation*}

where M represents the number of heads in the MHSA mechanism, and K denotes the number of sampling points (K \leq H \times W). \text{Attention}(q,k) represents the attention weight of the kth sampling point, which is also parameterized by the learned query vector. The weight \text{Attention}(q,k) lies within the range [0,1] and is normalized such that \sum _{k=1}^{K} \text{Attention}(q,k) = 1. p_{q} represents the 2-D reference point, which is derived from the mapped feature x as a set of linear 2-D coordinates \lbrace (0,0),\ldots, (H-1,W-1) \rbrace normalized to the range of [0,1]. The sampling offset \Delta p_{qk}, which is an unconstrained learnable parameter, is generated by a subnetwork that consumes the query features and outputs the corresponding offset for each reference point, i.e., \Delta p_{q} = \text{offset}(q). Since each reference point covers a local region, the generating network uses convolutional blocks to sense local features and learn reasonable offset values. Thus, the K sampling points will generate new sampling pixels through the corresponding sampling offsets \Delta p_{qk}. Considering that the sampling offsets may not be integers, bilinear interpolation \phi () is used to connect the K sampling pixels and generate the corresponding sampling feature map Z \in \mathbb {R}^{K \times C}, i.e., \begin{equation*} p_{q}^{\prime } = \phi (p_{q} + \Delta p_{qk}) = \phi (p_{q} (x+\Delta p_{x}, y+\Delta p_{y})). \tag{4} \end{equation*}
The final result is a deformable attention feature map, incorporating both the learned sampling offsets and the interpolated features for more effective multiscale feature representation. Because each query in the deformable attention module computes attention only over the K sampled features, the computational complexity is reduced from O(N^{2}) to O(KN), where N is the number of pixels, effectively transitioning from quadratic to linear complexity. Consequently, deformable attention not only addresses the issue of insufficient local information in the Transformer but also significantly reduces the computational cost of dense prediction maps at high resolutions.
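To make the sampling step in (3) and (4) concrete, the sketch below turns learned fractional offsets around normalized reference points into bilinearly interpolated values using torch.nn.functional.grid_sample. The single-head, single-scale setup and all tensor names are simplifications assumed for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def deformable_sample(value, ref_points, offsets):
    """Bilinearly sample K deformed locations per query from one feature map.

    value:      (B, C, H, W) feature map providing the values v.
    ref_points: (B, N, 2) reference points p_q in [0, 1], (x, y) order.
    offsets:    (B, N, K, 2) learned offsets delta p_qk in normalized coordinates.
    returns:    (B, N, K, C) sampled features phi(p_q + delta p_qk).
    """
    loc = ref_points.unsqueeze(2) + offsets             # (B, N, K, 2) sampling locations
    grid = 2.0 * loc - 1.0                              # grid_sample expects [-1, 1]
    sampled = F.grid_sample(value, grid, mode="bilinear",
                            padding_mode="zeros", align_corners=False)
    return sampled.permute(0, 2, 3, 1)                  # (B, C, N, K) -> (B, N, K, C)

value = torch.randn(2, 128, 32, 32)
ref = torch.rand(2, 1024, 2)                            # one reference point per query
off = torch.randn(2, 1024, 4, 2) * 0.05                 # K = 4 small learned offsets
print(deformable_sample(value, ref, off).shape)         # torch.Size([2, 1024, 4, 128])
```

The attention weights predicted from the query are then applied to these K sampled values and summed per query, which is what reduces the cost to O(KN) as noted above.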

SECTION III.

Proposed Methodology

In this section, we provide a detailed overview of DTSU-Net, a U-shaped semantic segmentation method that combines GSA and deformable Transformer for large-scale hyperspectral image semantic segmentation. The overall architecture of our proposed DTSU-Net is illustrated in Fig. 1. Initially, feature extraction is performed using convolutional blocks, followed by feature downsampling. Concurrently, during feature dimensionality expansion, GSA is applied to weight crucial feature layers. This step aims to extract both local spatial and spectral information from the hyperspectral images, resulting in three features spanning from shallow to deep layers. To further extract multiscale information from these features, they are uniformly projected into 128 dimensions and then fused to perform deformable self-attention, facilitating the learning of feature relationships across different scales. The deep features obtained after this interaction are subsequently merged with shallow features using deformable cross-attention, resulting in enriched, semantically meaningful high-level features. Finally, the features are upsampled and successively fused to restore spatial details, producing prediction maps of the original image size. The pseudocode detailing the specific algorithm is presented in Algorithm 1.

Fig. 1. Graphical illustration of the proposed deformable Transformer and spectral U-Net (DTSU-Net) for large-scale hyperspectral image semantic segmentation.

Algorithm 1: DTSU-Net for Large-Scale Hyperspectral Image Semantic Segmentation.
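Since the algorithm listing itself is not reproduced here, the following runnable skeleton restates the data flow described above: a five-stage encoder with spectral weighting, projection of the /8, /16, and /32 features to 128 channels, attention-based interaction, and a decoder that restores the input resolution. The GSA, MHDSA, and MHDCA blocks are replaced by identity placeholders and all channel widths are illustrative assumptions, so this is a structural sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, stride=1):
    # 3x3 conv -> group normalization -> ReLU; stride 2 performs downsampling.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
                         nn.GroupNorm(8, cout), nn.ReLU(inplace=True))

class DTSUNetSkeleton(nn.Module):
    # Data-flow skeleton only: GSA, MHDSA, and MHDCA are nn.Identity placeholders so
    # the routing of Fig. 1 can be traced end to end with real tensor shapes.
    def __init__(self, bands=32, num_classes=24, dim=128):
        super().__init__()
        widths = [64, 64, 96, 128, 192, 256]                       # illustrative widths
        self.stem = conv_block(bands, widths[0])
        self.stages = nn.ModuleList(conv_block(widths[i], widths[i + 1], stride=2)
                                    for i in range(5))             # /2, /4, /8, /16, /32
        self.gsa = nn.ModuleList(nn.Identity() for _ in range(5))  # global spectral attention
        self.proj = nn.ModuleList(nn.Conv2d(widths[i + 1], dim, 1) for i in range(5))
        self.mhdsa = nn.Identity()                                 # multiscale self-attention
        self.mhdca = nn.Identity()                                 # deep-shallow cross-attention
        self.dec = nn.ModuleList(conv_block(dim, dim) for _ in range(5))
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, x):
        h, skips = self.stem(x), []
        for gsa, stage, proj in zip(self.gsa, self.stages, self.proj):
            h = stage(gsa(h))                                      # spectral weighting + conv
            skips.append(proj(h))                                  # unify channels to 128
        c3, c4, c5 = self.mhdsa(tuple(skips[2:]))                  # interact /8, /16, /32
        out = self.mhdca(c5)                                       # c3, c4 would feed MHDCA here
        for i, dec in enumerate(self.dec):                         # decode /32 -> full size
            out = F.interpolate(dec(out), scale_factor=2, mode="nearest")
            if 3 - i >= 0:
                out = out + skips[3 - i]                           # skip at matching resolution
        return self.head(out)

x = torch.randn(1, 32, 512, 512)                                   # a WHU-OHS-sized tile
print(DTSUNetSkeleton()(x).shape)                                  # torch.Size([1, 24, 512, 512])
```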

A. Global Spectral Attention

The encoder network comprises convolutional modules and a GSA section, aiming to extract rich spatial and spectral information from large-scale hyperspectral images. The convolutional module consists of 3 × 3 convolutions followed by group normalization and rectified linear unit (ReLU) activation. Given the computational demands of large-scale hyperspectral imagery, the network operates with a relatively small batch size. Group normalization is employed within each group to reduce reliance on batch size, enhancing network stability and suitability for training with small batches.

The GSA module incorporates global average pooling and global max pooling over the entire hyperspectral cube. As illustrated in Fig. 2, given an input hyperspectral cube of size H × W × C, the GSA mechanism performs max pooling and average pooling along the H × W dimensions, enhancing interchannel relationships while preserving individual channel information. Two convolutional layers followed by ReLU activation are then applied sequentially to map the features into a high-dimensional space, resulting in two global feature vectors of size 1 × 1 × C. The global adaptive average pooling and global adaptive max pooling are defined as \begin{align*} \text{Avg}(Y_{(m,:,:)}) = &\frac{1}{H \times W} \sum _{i=1}^{H} \sum _{j=1}^{W} Y_{(m,i,j)} \tag{5} \\ \text{Max}(Y_{(m,:,:)}) = &\max \lbrace Y_{(m,i,j)}, \quad i \in (1,H), \quad j \in (1,W) \rbrace. \tag{6} \end{align*}


Fig. 2. Graphical illustration of the proposed GSA module.

The formula for this step is \begin{equation*} F^{\prime } = \sigma (\text{FC}(\text{Avg}(Y)) + \text{FC}(\text{Max}(Y))). \tag{7} \end{equation*}

Here, m represents a specific channel, Y_{(m,i,j)} denotes the pixel value at position (i,j) in the mth channel, and FC applies 1 \times 1 convolution to adjust the channel dimension, followed by the ReLU activation function. F^{\prime } represents the proportional weight of the feature map for each spectral dimension, obtained through the sigmoid function. Multiplying by the original spectral features yields the features weighted by spectral attention, i.e., the output result of GSA F \in \mathbb {R}^{H \times W \times C} \begin{equation*} F = F \cdot F^{\prime }. \tag{8} \end{equation*}
In the backbone of the network, spectral attention modules are first employed to weight the features, followed by convolutional modules to enrich the spatial–spectral information. Subsequently, instead of using max pooling, 3 × 3 convolutions with a stride of 2 are employed for downsampling to retain more information during the downsampling process. This process is repeated 5 times, resulting in a maximum downsampling of 32 times, thus obtaining more abstract semantic features while reducing computational complexity. In addition, the feature maps downscaled to 8 times, 16 times, and 32 times are further processed to extract features at different scales.
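A minimal PyTorch sketch of the GSA weighting in (5)–(8) is given below: global average and max pooling over H × W, a shared two-layer 1 × 1 convolution with ReLU, a sigmoid gate, and channel-wise reweighting of the input. The reduction ratio and module name are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class GlobalSpectralAttention(nn.Module):
    # Weights each spectral channel with a gate derived from global average and max pooling.
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.fc = nn.Sequential(                     # shared 1x1 conv "FC" layers
            nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1))
        self.avg = nn.AdaptiveAvgPool2d(1)           # Avg(Y) over H x W, Eq. (5)
        self.max = nn.AdaptiveMaxPool2d(1)           # Max(Y) over H x W, Eq. (6)

    def forward(self, x):                            # x: (B, C, H, W)
        gate = torch.sigmoid(self.fc(self.avg(x)) + self.fc(self.max(x)))  # Eq. (7)
        return x * gate                              # Eq. (8): channel-wise reweighting

f = torch.randn(2, 32, 128, 128)                     # 32-band hyperspectral feature cube
print(GlobalSpectralAttention(32)(f).shape)          # torch.Size([2, 32, 128, 128])
```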

B. Multihead Deformable Self-Attention

The multiscale feature mechanism is widely applied in remote sensing semantic segmentation models and can be seamlessly integrated into our model. If the input feature maps consist of multiple levels of features \lbrace X_{l}\rbrace _{l=1}^{L}, where X_{l} \in \mathbb {R}^{H_{l} \times W_{l} \times C}, then for the deformable transformer introduced above, the L scale feature layers can be flattened and concatenated together as follows: \begin{equation*} x_{f} = \text{concat}\lbrace \text{flatten}(X_{1}), \text{flatten}(X_{2}), \ldots, \text{flatten}(X_{L})\rbrace \tag{9} \end{equation*}

where x_{f} represents the multiscale feature map obtained by concatenating different-scale feature maps along the 2-D direction. The multiscale feature map replaces the single-scale feature map for deformable attention operations, following the same principles as a standard deformable transformer. However, sampling points will be sampled from the multiscale feature map, followed by a multilayer perceptron composed of linear, GELU, and layer normalization operations \begin{equation*} x_{\text{out}} = \text{MLP}(\text{{MHDSA}}(x_{f})) + x_{f}. \tag{10} \end{equation*}
Therefore, for the multiscale features extracted by the backbone network, the multiscale deformable transformer interacts with the information from the three deepest feature levels. Specifically, we select three feature maps downsampled by factors of 8, 16, and 32, respectively. They not only have different resolutions but also different levels of abstraction of geographic information. For these three feature maps, we unify the channel number to 128 dimensions and flatten them to form 2-D long vectors as input features. Since position information is lost after flattening, we add position embeddings and scale embeddings to represent the position and scale level of each pixel in the feature representation. In addition, we keep track of the scale size corresponding to each feature. The input features are linearly projected to obtain queries Q and keys K, and the feature grid is rasterized to facilitate obtaining reference points within each scale. As objects are composed of multiple feature maps of different scales, the obtained reference points are concatenated together to form a long vector. Subsequently, based on the queries Q, we train sampling offsets and attention weights. By adding the sampling offsets across the entire scale, we obtain the sampling point positions and their corresponding attention weights. Finally, the values V of the three features are sequentially bilinearly interpolated onto the sampling grid, and the attention of the reference point to them is calculated and outputted. The sampling grid is obtained by a linear transformation of the sampling points: for each sampling position with coordinates (x, y), the coordinates are mapped to the range [-1, 1] by \text{grid} = 2p_{q}^{\prime} - 1 for normalization.

Since the output feature size remains the same as the input and can be reshaped back into the shapes of the original three input features, we stack this module three times and correspondingly pair it with the subsequent deformable cross-attention modules. The schematic diagram of this part is shown in Fig. 3, where the deformable transformer provides information about the target in different features, thereby improving the model's performance under various complex conditions. Since each pixel in the multiscale feature map serves as an object query, the smaller deep-level features save significant computational and memory costs compared to shallow features.
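The preprocessing in (9) can be sketched as follows: the three projected feature maps are flattened, tagged with a learnable scale (level) embedding, and concatenated into one token sequence, while normalized reference points are generated on a per-level grid. The embedding scheme and names are illustrative assumptions; the deformable sampling itself then follows the grid_sample pattern sketched in Section II-B.

```python
import torch
import torch.nn as nn

def flatten_multiscale(feats, level_embed):
    """feats: list of (B, C, H_l, W_l) maps; level_embed: (L, C) learnable embeddings.

    Returns the concatenated token sequence x_f (B, sum_l H_l*W_l, C), the per-level
    spatial shapes, and normalized reference points in [0, 1] for every token.
    """
    tokens, shapes, refs = [], [], []
    for l, f in enumerate(feats):
        B, C, H, W = f.shape
        tokens.append(f.flatten(2).transpose(1, 2) + level_embed[l])   # (B, H*W, C)
        shapes.append((H, W))
        ys, xs = torch.meshgrid(torch.linspace(0.5 / H, 1 - 0.5 / H, H),
                                torch.linspace(0.5 / W, 1 - 0.5 / W, W),
                                indexing="ij")
        refs.append(torch.stack((xs, ys), -1).reshape(-1, 2))          # (H*W, 2)
    return torch.cat(tokens, 1), shapes, torch.cat(refs, 0)

feats = [torch.randn(2, 128, s, s) for s in (64, 32, 16)]   # /8, /16, /32 of a 512 input
level_embed = nn.Parameter(torch.zeros(3, 128))
x_f, shapes, ref = flatten_multiscale(feats, level_embed)
print(x_f.shape, ref.shape)   # torch.Size([2, 5376, 128]) torch.Size([5376, 2])
```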

Fig. 3. Graphical illustration of the proposed MHDSA and MHDCA modules.

C. Multihead Deformable Cross-Attention (MHDCA)

While the backbone network can model multiscale feature maps separately, uniform attention overlooks the importance of features with different downsampling levels, especially the deepest features that contain rich abstract semantic information. On the contrary, shallow features typically contain more spatial information and details, necessitating multilevel feature fusion to further enrich the features. Leveraging deformable cross-attention together with the deformable transformer described above facilitates feature interaction and strengthens the deep features of the network. Specifically, we use the attention-weighted bottom-level feature c5^{\prime } as the query vector to extract features from the similarly weighted shallow features c3^{\prime } and c4^{\prime }. If x denotes the deep features used as query vectors and \bar{x}_{i} represents the shallow features, the queries, keys, and values are obtained as follows: \begin{equation*} q=xW_{q}, \bar{k}= \bar{x}_{i} W_{k}, \bar{v}= \bar{x}_{i} W_{v}. \tag{11} \end{equation*}

Here, \bar{k} and \bar{v} represent the key embedding and value embedding of the modified shallow features, obtained from the shallow features \bar{x}_{i} after sampling interpolation.

The process, as shown in Fig. 3, begins with self-attention on the query vectors embedded with positional information, fully exploiting the internal information of the target sequence. The information from self-attention is then fused with the original sequence information using residual connections, retaining some of the original information. Subsequently, cross-attention is computed between the obtained features and the projected shallow features c3^{\prime } and c4^{\prime }, integrating shallow information. The sampling offsets \Delta p_{lqk} = \text{offset}(q), computed from q, are added to the reference points, followed by bilinear interpolation to obtain the sampling points p_{q}^{\prime }. The attention weights, obtained through training, are applied here. The principle of deformable cross-attention is the same as deformable self-attention, except that the target changes from a concatenated 2-D long vector to two shallow feature vectors. The formula is as follows: \begin{align*} \text{MHDCA}(x_{1}, \ldots, x_{L}) =& \sum _{m=1}^{M} W_{m} \sum _{l=1}^{L} \sum _{k=1}^{K} A_{lqk}\\ &\cdot \text{W'}_{m} v_{\phi (p_{q}+\Delta p_{lqk})}. \tag{12} \end{align*}

The attention matrix A_{lqk} is obtained through training and is a projection of c5^{\prime }. The value v_{\phi (p_{q} + \Delta p_{lqk})} is derived from the two weighted shallow features, interpolated to the value size using bilinear interpolation at the sampling points, and then multiplied by the attention weights to produce the weighted features. This process is repeated twice, with skip connections corresponding to the shallow features. The output weighted features undergo layer normalization and a multilayer perceptron, retaining the same size as the input. Through this sequential information interaction, the deep features continuously learn from the shallow features, enriching features at multiple levels.
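A compact, single-head sketch of the cross-attention wiring in (11) and (12) is shown below: queries come from the flattened deep feature c5', while values are bilinearly sampled from the weighted shallow maps c3' and c4' at offset locations predicted from the query, and softmax-normalized weights aggregate the sampled values. The offset scaling, the shared value projection, and all names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention(nn.Module):
    # Queries from the deep tokens attend to K sampled points on each shallow map.
    def __init__(self, dim=128, levels=2, points=4):
        super().__init__()
        self.levels, self.points = levels, points
        self.offset = nn.Linear(dim, levels * points * 2)    # delta p_lqk from the query
        self.weight = nn.Linear(dim, levels * points)        # A_lqk from the query
        self.value_proj = nn.Linear(dim, dim)                # value projection (shared here)
        self.out_proj = nn.Linear(dim, dim)                  # output aggregation

    def forward(self, query, ref, shallow_maps):
        # query: (B, N, C); ref: (B, N, 2) in [0, 1]; shallow_maps: list of (B, C, H, W).
        B, N, C = query.shape
        off = self.offset(query).view(B, N, self.levels, self.points, 2) * 0.1
        attn = self.weight(query).view(B, N, self.levels * self.points).softmax(-1)
        sampled = []
        for l, feat in enumerate(shallow_maps):
            grid = 2.0 * (ref[:, :, None] + off[:, :, l]) - 1.0          # (B, N, K, 2)
            v = F.grid_sample(feat, grid, mode="bilinear", align_corners=False)
            sampled.append(v.permute(0, 2, 3, 1))                        # (B, N, K, C)
        sampled = self.value_proj(torch.cat(sampled, 2))                 # (B, N, L*K, C)
        out = (attn.unsqueeze(-1) * sampled).sum(2)                      # weighted sum
        return self.out_proj(out)

c5 = torch.randn(2, 16 * 16, 128)                    # deep queries (flattened /32 map)
ref = torch.rand(2, 16 * 16, 2)
shallow = [torch.randn(2, 128, 64, 64), torch.randn(2, 128, 32, 32)]     # c3', c4'
print(DeformableCrossAttention()(c5, ref, shallow).shape)   # torch.Size([2, 256, 128])
```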

Next, for the obtained features, we perform upsampling using 2× nearest-neighbor interpolation after a 3 × 3 convolution. Simultaneously, at the same resolution, we use skip connections to continuously integrate shallow features to restore details. This process continues until upsampling to the size of the original input image. Finally, we use a 1 × 1 convolution with the channel set to the preset number of classes to obtain the prediction results.

D. Cross-Entropy Loss and Segmentation

The final output of a semantic segmentation network typically consists of a 3-D feature tensor (C, H, W), representing the predetermined number of classes and the size of the original input image. The values in each of the C channels at each pixel position correspond to the likelihood of belonging to a particular class. During the training process, the model optimizes the feature maps using a loss function to ensure that the probability distribution of class labels at each pixel position closely resembles the ground truth labels. The cross-entropy loss function used in this study is formulated as follows: \begin{equation*} L(y, \hat{y})_{\text{ce}} = -\frac{1}{N} \sum _{i=1}^{N} \sum _{j=1}^{C} y_{i}^{(j)} \cdot \log (\hat{y}_{i}^{(j)}) \tag{13} \end{equation*}

where N denotes the number of pixels, while C represents the number of classes. y_{i}^{(j)} and \hat{y}_{i}^{(j)} are, respectively, used to denote the ground truth label and the predicted probability of the ith pixel belonging to the jth class. During training, the model computes and tries to minimize the disparity between them, thereby aiming for a more accurate prediction of the class for each pixel in the image.
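For reference, (13) is the standard pixel-wise cross-entropy, which PyTorch computes directly from the (C, H, W) logits and an integer label map; the shapes and the ignore_index convention below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 24, 512, 512)                 # (B, C, H, W) class scores
labels = torch.randint(0, 24, (2, 512, 512))          # (B, H, W) ground-truth class indices

# Averages -sum_j y^(j) * log(softmax(logits)^(j)) over all N pixels, as in Eq. (13).
loss = F.cross_entropy(logits, labels, ignore_index=255)
print(loss.item())
```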

SECTION IV.

Experimental Results

A. Dataset

This study utilizes the WHU-OHS dataset [57], characterized by a spatial resolution of 10 m and 32 spectral channels spanning from visible light to near-infrared, with an average spectral resolution of 15 nm. Comprising 42 OHS satellite images sourced from diverse cities across China, the dataset has been curated into 7795 subimages of 512 × 512 pixels, of which 4821 serve for training, 515 for validation, and 2459 for testing. The dataset offers a wealth of information, featuring an average of thirteen land cover classes per city and 24 categories in total, with their names and corresponding numbers shown in Table I. The distribution of land cover exhibits substantial variation between cities. Notably, cities in the southeast of China, such as Shanghai and Guangzhou, showcase developed urban areas and intricate water networks, with rice fields as the predominant farmland. In contrast, northern cities like Changchun feature dense forests and predominantly arid farmlands, while cities in the northwest, exemplified by Karamay in Xinjiang, exhibit sparse vegetation and distinct geographical features, such as deserts. Three-band false-color composite images and ground truth maps for the WHU-OHS dataset are shown in Fig. 4.

TABLE I Class Names of Datasets

Fig. 4. False-color composite image for four cities (R: 670 nm, G: 566 nm, B: 480 nm). (a) Changchun, (b) Shanghai, (c) Guangzhou, and (d) Karamay.

B. Implementation Details

All experiments were conducted on an Intel(R) Xeon(R) processor and an RTX 3090 24-GB GPU, utilizing the PyTorch 1.9.1 framework with CUDA 11.1.1. For training, the adaptive moment estimation (Adam) optimizer was chosen, with an initial learning rate of 0.0001 and weight decay of 0.0001. The training process involved 100 epochs on the dataset, with a batch size of 6. No data augmentation techniques were applied, and the network was trained from scratch on the dataset.

C. Evaluation Metrics

To quantitatively validate the effectiveness and robustness of DTSU-Net, we employ the following evaluation metrics: the intersection over union (IoU) of each class, the mean intersection over union (mIoU), and the overall accuracy (OA), which evaluate the classification accuracy of the model on the test set \begin{align*} \text{IoU}_{i} =& \frac{\text{TP}_{i}}{\text{TP}_{i}+\text{FP}_{i}+\text{FN}_{i}} \tag{14}\\ \text{mIoU} = &\frac{1}{N} \sum _{i=1}^{N} \text{IoU}_{i} \tag{15}\\ \text{OA} = &\frac{\sum _{k=1}^{N} \text{TP}_{k}}{\sum _{k=1}^{N} \left(\text{TP}_{k} + \text{FN}_{k}\right)}. \tag{16} \end{align*}

In addition, training time (s), the floating-point operation (FLOPs) count, and the number of model parameters (M) are used to evaluate the model complexity.
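The per-class IoU, mIoU, and OA in (14)–(16) can be computed from a confusion matrix as sketched below, with OA taken as the fraction of correctly classified pixels; the function name and the handling of unlabeled pixels are illustrative assumptions.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """pred, gt: integer label arrays of the same shape; returns (per-class IoU, mIoU, OA)."""
    mask = (gt >= 0) & (gt < num_classes)              # ignore unlabeled pixels if any
    cm = np.bincount(num_classes * gt[mask] + pred[mask],
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)             # Eq. (14), guarding empty classes
    miou = iou.mean()                                  # Eq. (15)
    oa = tp.sum() / cm.sum()                           # Eq. (16): correct pixels / all pixels
    return iou, miou, oa

pred = np.random.randint(0, 24, (512, 512))
gt = np.random.randint(0, 24, (512, 512))
print(segmentation_metrics(pred, gt, 24)[1:])
```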

D. Parameter Sensitivity Analysis

1) Learning Rate

To determine the optimal learning rate for our model, we systematically set the learning rates to [1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5] and conducted experiments on datasets from four different cities. We recorded the mIoU values corresponding to each learning rate. As depicted in Fig. 5(a), we documented the segmentation results for each learning rate and utilized a line graph for intuitive analysis. The mIoU gradually increases as the learning rate decreases from 1e-2 to 1e-4, and the optimal segmentation performance is observed when the learning rate is set to 1e-4.

2) Sampling Points

This hyperparameter is a crucial component of the deformable Transformer, indicating the number of sampled pixels on each feature layer, thereby affecting the model's receptive field. Consequently, we conduct a sensitivity analysis on the parameter of sampling points to verify its impact on the model's performance. We set the parameter K to [2, 4, 6, 8, 10], where a higher number of sampling points corresponds to a larger receptive field but also an increased computational load. As shown in Fig. 5(b), the best performance is obtained with 8 sampling points, so K is set to 8.

3) Number of Head in Deformable Transformer

The parameter in question refers to the number of heads in the multihead attention mechanism we employ. It affects both the model's parameter count and the feature dimensions of intermediate layers. We set this parameter to [1, 2, 4, 8, 16] to evaluate its impact on the model. Increasing this parameter significantly increases computational load. As shown in Fig. 5(c), when this hyperparameter is set to 8, the model exhibits superior performance across the three metrics on the dataset compared to other configurations.

4) Number of Projection Layers

To ensure uniform feature layer counts across various convolutional blocks for ease of subsequent feature interaction, we project the features from different channels to appropriate dimensions. In this experiment, we vary the projection layer dimensionality over [64, 96, 128, 192, 256] and analyze the impact of different dimensions on the results. As shown in Fig. 5(d), the optimal performance is achieved when the projection dimension is set to 128.

E. Performance Comparison

In this section, we will validate the effectiveness of our method in large-scale hyperspectral image segmentation. To ensure the results are more universally applicable and compelling, we will conduct comparisons across four cities nationwide in China. These cities include Changchun City (northern region) from Jilin Province, Shanghai City (eastern region), Guangzhou City (southern region) from Guangdong Province, and Karamay City (western region) from Xinjiang. Simultaneously, we will compare DTSU-Net with common hyperspectral methods and state-of-the-art semantic segmentation methods, demonstrating the superiority of our approach. The methods for comparison include hyperspectral techniques (1DCNN, 3DCNN, A2S2KResNet, SS3FCN, and FPGA), CNN-based methods (FCN, U-Net, DeepLabV3+, DANet, and LANet), and Transformer-based methods (SegFormer, UNetFormer, TransUNet, DC-Swin, ST-UNet, and EMRT). Specifically, when comparing hyperspectral methods, we did not adopt the traditional pixel-wise classification approach. Instead, we modified the final output layer by upsampling the high-level features to match the input size for semantic segmentation. This approach not only fully leverages their spectral extraction capabilities but also maintains consistency in the model architecture. Moreover, it enables fast segmentation, making it more suitable for large-scale hyperspectral imagery. DTSU-Net does not belong to the aforementioned three categories. First, it differs from pixel-wise hyperspectral classification methods. Second, instead of using a common residual network as the encoder backbone, our method employs spectral attention and convolution blocks for feature extraction. Finally, our Transformer blocks are not used for feature extraction or decoding but are dedicated to multiscale feature interaction. As a result, our method offers advantages over typical semantic segmentation approaches, with lower computational memory requirements and higher segmentation accuracy.
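As a small illustration of this adaptation, the reduced-resolution class scores of a patch-free classifier can simply be bilinearly upsampled to the input size before computing the segmentation loss; the shapes below are examples, not values taken from the compared implementations.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 24, 64, 64)      # high-level class scores at reduced resolution
full = F.interpolate(logits, size=(512, 512), mode="bilinear", align_corners=False)
print(full.shape)                         # torch.Size([2, 24, 512, 512])
```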

To provide a more intuitive reflection of the segmentation performance of different models, Figs. 6–9, respectively, display partial ground truth and segmentation result images for the four datasets. Due to the extensive coverage and large volume of the dataset, the distribution of land cover types is more complex, and there is a significant imbalance in the number of samples between certain classes. This imbalance leads to the inability of some methods to segment more challenging classes, resulting in a large number of zero IoU values in the tables. In addition, the backbone of DeepLabV3+ is Xception, which uses depthwise separable convolutions that decompose convolution into depthwise and pointwise convolutions. Although this reduces computational complexity, it fails to fully utilize the rich spectral information. As a result, DeepLabV3+ performs the worst, and its performance in Figs. 6(i)–9(i) is also unsatisfactory. In contrast, our method performs better on these more difficult-to-segment categories, achieving the highest mIoU.

Fig. 5. Evolution of mIoU as a function of (a) learning rate, (b) sampling points, (c) number of heads, and (d) number of projection layers for four datasets.

Fig. 6. Segmentation maps of different methods on the Changchun city. (a) Ground Truth, (b) 1DCNN (18.39%), (c) 3DCNN (24.26%), (d) A2S2KResNet (34.66%), (e) SS3FCN (40.91%), (f) FPGA (49.30%), (g) FCN (29.27%), (h) U-Net (28.44%), (i) DeepLabV3+ (2.31%), (j) DANet (42.25%), (k) LANet (41.99%), (l) SegFormer (41.79%), (m) TransUNet (17.70%), (n) DC-Swin (27.54%), (o) UNetFormer (23.33%), (p) ST-UNet (35.31%), (q) EMRT (40.91%), (r) DTSU-Net (56.19%).

Fig. 7. Segmentation maps of different methods on the Shanghai city. (a) Ground truth, (b) 1DCNN (19.22%), (c) 3DCNN (20.39%), (d) A2S2KResNet (25.48%), (e) SS3FCN (29.58%), (f) FPGA (33.88%), (g) FCN (26.26%), (h) U-Net (25.17%), (i) DeepLabV3+ (2.41%), (j) DANet (28.18%), (k) LANet (31.05%), (l) SegFormer (31.05%), (m) TransUNet (11.70%), (n) DC-Swin (22.31%), (o) UNetFormer (21.69%), (p) ST-UNet (21.38%), (q) EMRT (21.48%), and (r) DTSU-Net (37.89%).

Fig. 8. Segmentation maps of different methods on the Guangzhou city. (a) Ground truth, (b) 1DCNN (16.09%), (c) 3DCNN (19.63%), (d) A2S2KResNet (26.36%), (e) SS3FCN (30.24%), (f) FPGA (42.43%), (g) FCN (26.22%), (h) U-Net (28.36%), (i) DeepLabV3+ (4.48%), (j) DANet (35.17%), (k) LANet (32.64%), (l) SegFormer (34.79%), (m) TransUNet (14.66%), (n) DC-Swin (19.65%), (o) UNetFormer (23.57%), (p) ST-UNet (25.42%), (q) EMRT (31.97%), and (r) DTSU-Net (52.90%).

Fig. 9. Segmentation maps of different methods on the Karamay city. (a) Ground Truth, (b) 1DCNN (20.29%), (c) 3DCNN (33.75%), (d) A2S2KResNet (31.82%), (e) SS3FCN (58.50%), (f) FPGA (54.64%), (g) FCN (43.00%), (h) U-Net (26.72%), (i) DeepLabV3+ (1.93%), (j) DANet (46.77%), (k) LANet (46.43%), (l) SegFormer (50.19%), (m) TransUNet (20.55%), (n) DC-Swin (39.43%), (o) UNetFormer (36.15%), (p) ST-UNet (30.87%), (q) EMRT (52.67%), and (r) DTSU-Net (63.54%).

1) Comparison on the Changchun City

The segmentation results of all methods on the Changchun City dataset are summarized in Table II. The best results are highlighted in bold, and the second-best results are underlined. DTSU-Net achieves the best performance in mIoU and OA, reaching 56.19 \pm 0.90% and 87.79 \pm 0.26%, respectively. Our results outperform other models, with improvement percentages ranging from 6.89% to 37.8% for mIoU and 1.28% to 29.2% for OA. Changchun, located in northeastern China, is rich in forestry and agricultural resources, with extensive dry farmland and rice paddies, and serves as a typical representation of northern China's landscape. The imagery shows relatively limited urban land use and sparse water systems, replaced by vast fields of vegetation and widely distributed rural areas. Therefore, the dataset includes categories, such as agricultural land, forestry land, reservoirs, small patches of bare land and rock, as well as scattered residential areas. In addition, among the 13 classes present in the Changchun City dataset, DTSU-Net achieves the highest accuracy in ten classes. Particularly notable is the bare land category, which is not successfully segmented by other methods, but our approach performs exceptionally well. Fig. 6 provides a visual comparison of the various networks. Hyperspectral methods that rely on 3-D convolution or operate without downsampling, such as 1DCNN, 3DCNN, and A2S2KResNet, can produce fragmented and blotchy results when applied to semantic segmentation. Medium-covered grassland and bare land are challenging to segment, with the IoU for these classes generally being close to zero. Only DTSU-Net successfully segments the bare land, as indicated by the yellow areas in the lower right corner of Fig. 6.

TABLE II Segmentation Results of Different Methods on the Changchun City

2) Comparison on the Shanghai City

The segmentation results of all methods on the Shanghai dataset are presented in Table III, where the best results are highlighted in bold. DTSU-Net achieves the best performance in mIoU and OA, reaching 37.89 \pm 0.99% and 87.23 \pm 0.20%, respectively. These results surpass those of other models, with improvement percentages of 4.01% to 18.67% for mIoU, 0.53% to 11.43% for OA. Fig. 7 provides a visual comparison of various networks. Shanghai, located in eastern China, is characterized by dense water networks and highly developed urban areas, with rice paddies as the dominant type of agricultural land. As a representative city in southeastern China, its imagery primarily features extensive urban land coverage with no rural land. Vegetation is relatively sparse, and most of the agricultural land consists of rice paddies. Consequently, the dataset is dominated by categories, such as rice fields, vegetation, water bodies, and urban land, with other land cover types being less prevalent. In addition, due to the selected image coverage, certain land cover classes occupy very few pixels in the remote sensing images, such as dry farmland in Shanghai. This results in insufficient feature information for model training, leading to poor segmentation performance in categories, such as dryland, other forested areas, reservoirs, and tidal flats across all methods, with IoU scores in the single digits or even zero. Consequently, all methods yield relatively low mIoU scores on this dataset, even though the OA scores are within a normal range. DTSU-Net, however, achieves comparatively the best results in harder-to-segment categories, such as woodlands and marshlands.

TABLE III Segmentation Results of Different Methods on the Shanghai City

3) Comparison on the Guangzhou City

The segmentation results of all methods on the Guangzhou dataset are summarized in Table IV, with the best results highlighted in bold and the second-best results underlined. DTSU-Net achieves the best performance in mIoU and OA, reaching 52.90 \pm 1.15% and 88.18 \pm 0.22%, respectively. These results outperform other models, with improvement percentages of 10.47% to 36.81% for mIoU and 1.4% to 23.07% for OA. Fig. 8 provides a visual comparison of various networks. Guangzhou, located in southern China, is characterized by large river systems and extensive rice paddies as the primary form of agricultural land. As one of the major cities in southern China, Guangzhou is highly developed. In the imagery, rivers occupy a significant portion, while rice fields are widespread. The dataset also includes diverse urban, rural, and various other types of construction land. The comparison methods show poor segmentation results for the shoal and bare land categories, whereas our method achieves the best performance. Among the 12 categories in the Guangzhou dataset, DTSU-Net achieves the highest accuracy in ten categories.

TABLE IV Segmentation Results of Different Methods on the Guangzhou City

4) Comparison on the Karamay City

The segmentation results of all methods on the Karamay dataset are presented in Table V, where the best results are highlighted in bold and the second-best results are underlined. DTSU-Net achieves the best performance in mIoU and OA, reaching 63.54 \pm 1.38% and 85.67 \pm 0.39%, respectively. These results outperform other models, with improvement percentages of 8.9% to 43.25% for mIoU and 1.5% to 21.06% for OA. Fig. 9 provides a visual comparison of various networks. Unlike the first three cities, Karamay is located in the northwestern part of Xinjiang. Situated in western China, this region has an arid climate and features vast deserts and saline-alkali lands, embodying the typical characteristics of northwestern China. As a result, the dataset primarily consists of dry farmland, Gobi desert, and saline-alkali land. Urban and rural infrastructures are relatively underdeveloped, and vegetation cover is minimal, mostly comprising grasslands. The comparison methods show poor segmentation results for the rural settlement and bare land categories, whereas our method achieves the best performance. Among the 11 categories in the Karamay dataset, DTSU-Net achieves the highest accuracy in six categories.

TABLE V Segmentation Results of Different Methods on the Karamay City
SECTION V.

Discussion

A. Ablation Study

1) Effect of Spectral Attention

Spectral attention assigns larger weights to the feature dimensions that carry more important and informative content, mitigating within-class spectral variations, which is particularly crucial when processing hyperspectral images. In this module, we aggregate information for each dimension through global average pooling and global max pooling. We therefore assess how well these operations aggregate spectral information and whether the module effectively concentrates attention on the channels with richer information. In Tables VI and VIII, we compare the method with and without spectral attention while ablating multihead deformable self-attention (MHDSA) and multihead deformable cross-attention (MHDCA). The comparison between the two tables reveals that the model consistently performs better when spectral attention is introduced, demonstrating that spectral attention effectively improves the model's accuracy.
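To make the aggregation step concrete, the following is a minimal PyTorch sketch of a channel-wise spectral attention block that combines global average pooling and global max pooling through a shared MLP; the reduction ratio, the shared MLP, and the sigmoid gating are illustrative assumptions rather than the exact GSA design used in this article.

```python
import torch
import torch.nn as nn

class SpectralAttentionSketch(nn.Module):
    """Illustrative channel (spectral) attention: pool each channel globally with
    average and max pooling, score the channels with a shared MLP, and re-weight
    the input so that more informative spectral dimensions are emphasized."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W), where C indexes the spectral feature dimensions
        b, c, _, _ = x.shape
        avg_score = self.mlp(x.mean(dim=(2, 3)))   # global average pooling per channel
        max_score = self.mlp(x.amax(dim=(2, 3)))   # global max pooling per channel
        weights = torch.sigmoid(avg_score + max_score).view(b, c, 1, 1)
        return x * weights                          # emphasize information-rich channels
```

In the ablation, the rows without spectral attention in Tables VI and VIII correspond to omitting this kind of re-weighting step.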

TABLE VI Ablation Study of Different Attention Modules in Four Datasets
TABLE VII Ablation Study of Deformable Transformer in Four Datasets
TABLE VIII Ablation Study of Spectral Attention Modules in Four Datasets

2) Effect of MHDSA and MHDCA

To validate the effectiveness of the modules in our method, we conducted ablation experiments by adding one module at a time, as shown in Table VI. The baseline mIoU is relatively low on all four datasets, indicating that CNNs alone have limitations in handling large-scale hyperspectral images. Transformers, by processing features as sequences, can capture relationships between pixels over longer distances than CNNs, which is crucial for extracting global information. Adding the MHDSA module alone allows the model to aggregate multiscale features from different stages, leading to a significant improvement in accuracy. However, adding MHDCA alone does not significantly improve performance and even degrades the results on the Changchun dataset. This is because the self-attention module primarily captures global features and contextual information, while the cross-attention module focuses on fusing information across multiple feature maps; without global features obtained through self-attention first, the cross-attention module struggles to exploit the available information, yielding only minimal gains. When both MHDSA and MHDCA are added, MHDCA integrates multiscale information into the deepest feature maps, enhancing the feature representation. At this point, our model achieves its best performance, demonstrating the synergy between the two attention mechanisms.

3) Effectiveness of Deformable Transformer

The MHSA module is a core component of the Transformer and effectively extracts global features. However, standard MHSA is computationally intensive and, compared to deformable self-attention, may not fully exploit local information. We therefore conducted experiments comparing the two attention mechanisms in both the self-attention and cross-attention modules to evaluate their speed and accuracy. The self-attention module consists of self-attention and MLP components and computes the query and value vectors of the aggregated multiscale features. Similarly, the cross-attention module consists of cross-attention and MLP components, where the deformable cross-attention computes the query and value vectors between the two features. Specifically, we apply either a standard Transformer or a deformable Transformer in the self-attention and cross-attention modules, resulting in four configurations: a Transformer in both modules, a deformable Transformer in both modules, a deformable Transformer only in the self-attention module, and a deformable Transformer only in the cross-attention module.
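For reference, the following is a minimal, single-scale PyTorch sketch of the offset-and-sample mechanism underlying deformable attention, implemented with F.grid_sample. The actual MHDSA and MHDCA modules operate on concatenated multiscale features and add residual connections and an MLP, so the shapes, head and point counts, and projection layout here are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale sketch of multihead deformable attention: each query predicts
    K sampling offsets and attention weights per head, samples the value map with
    bilinear interpolation, and takes a weighted sum of the sampled features."""

    def __init__(self, dim: int = 64, heads: int = 4, points: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.points, self.head_dim = heads, points, dim // heads
        self.sampling_offsets = nn.Linear(dim, heads * points * 2)   # offsets per query
        self.attention_weights = nn.Linear(dim, heads * points)      # weights per query
        self.value_proj = nn.Linear(dim, dim)
        self.output_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, value_map):
        # query: (B, Nq, C); ref_points: (B, Nq, 2) normalized to [0, 1] as (x, y);
        # value_map: (B, C, H, W).
        B, Nq, C = query.shape
        M, K, Dh = self.heads, self.points, self.head_dim
        H, W = value_map.shape[-2:]

        # Project values and split channels into heads: (B*M, Dh, H, W)
        v = self.value_proj(value_map.flatten(2).transpose(1, 2))    # (B, H*W, C)
        v = v.transpose(1, 2).reshape(B * M, Dh, H, W)

        # Per-head sampling offsets and softmax-normalized attention weights
        # (a full implementation scales the offsets by the feature-map size)
        offsets = self.sampling_offsets(query).view(B, Nq, M, K, 2)
        weights = self.attention_weights(query).view(B, Nq, M, K).softmax(dim=-1)

        # Sampling locations mapped to [-1, 1] for grid_sample
        loc = ref_points[:, :, None, None, :] + offsets              # (B, Nq, M, K, 2)
        grid = (2.0 * loc - 1.0).permute(0, 2, 1, 3, 4).reshape(B * M, Nq, K, 2)

        # Bilinear sampling of K points per query and head: (B*M, Dh, Nq, K)
        sampled = F.grid_sample(v, grid, mode="bilinear", align_corners=False)

        # Weighted sum over the sampled points, then merge heads back to C channels
        w = weights.permute(0, 2, 1, 3).reshape(B * M, 1, Nq, K)
        out = (sampled * w).sum(dim=-1)                              # (B*M, Dh, Nq)
        out = out.view(B, M, Dh, Nq).permute(0, 3, 1, 2).reshape(B, Nq, C)
        return self.output_proj(out)
```

Used with queries and values derived from the same feature map, such a block behaves as deformable self-attention; feeding queries from one feature map and values from another gives the cross-attention variant.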

To highlight the advantages of deformable self-attention, we evaluated the four configurations in terms of both accuracy and speed, as shown in Table VII. The deformable Transformer outperforms the standard Transformer on both counts, indicating that the standard Transformer's global operations are not only computationally expensive but also introduce unnecessary information. Because deformable self-attention operates on the concatenated multiscale features, it reduces computation time more substantially than deformable cross-attention. We also observed that using deformable cross-attention alone does not achieve optimal results: standard self-attention is better at capturing global context, while deformable cross-attention focuses on aggregating local information. When applied to the globally focused features produced by standard self-attention, deformable cross-attention struggles to exploit its ability to handle local variations, losing detail and yielding poorer results. With its adaptability and efficiency, the deformable Transformer proves more suitable for capturing global features.

B. Comparison on All Cities

In this section, we validate our approach across all cities, with training, validation, and testing sets consisting of 4821, 515, and 2459 subimages, respectively. With this significant increase in data volume, the land cover types become more complex; moreover, land cover characteristics vary across regions, leading to notable performance differences for some methods between individual cities and all cities combined. Nevertheless, even on the full dataset, our method still achieves the best results. As shown in Table IX, DTSU-Net achieves an mIoU of 57.22% and an OA of 79.28%; its mIoU exceeds the strongest baseline by 1.65% and the weakest baseline by 17.84%. Compared with the single-city results, Transformer-based methods outperform CNN-based methods here, indicating that Transformers require larger datasets to train effectively.

TABLE IX Complexity Analysis of Various Comparison Methods

C. Complexity Analysis

Table IX presents the parameter count and floating point operations (FLOPs) of the different models. The parameter count reflects the complexity of a model, while FLOPs are commonly used to measure its computational requirements and speed. Compared with general semantic segmentation methods, hyperspectral methods generally have fewer parameters but higher computational demands, so they require more computing resources and longer training times; semantic segmentation methods, on the other hand, tend to have larger parameter counts and FLOPs.
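The figures in Table IX can be reproduced with standard tooling; the snippet below is a minimal sketch for counting trainable parameters, and the commented lines show one common way to estimate FLOPs/MACs with the thop profiler (the profiler choice and input shape are assumptions, not part of the original experimental setup).

```python
import torch

def count_parameters_m(model: torch.nn.Module) -> float:
    """Trainable parameter count in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# FLOPs/MACs are commonly estimated with a profiler such as thop, e.g.:
#   from thop import profile
#   macs, params = profile(model, inputs=(torch.randn(1, bands, height, width),))
# where (bands, height, width) must match the hyperspectral input patches used in training.
```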

DTSU-Net combines convolutional layers with deformable Transformers, which require less computation and fewer parameters than traditional Transformers. In the deformable Transformer, to simultaneously generate the sampling offsets $\Delta p_{mqk}$ and attention weights $A_{mqk}$, the query feature $z_{q}$ is linearly projected into $3MK$ channels; the complexity of computing the attention weights and sampling offsets is therefore $O(3N_{q} \times M \times K \times C)$. In the self-attention mechanism, the query and key vectors are obtained through linear transformations of the input feature map. With a weight matrix of size $C \times C$, the complexity of computing the query and key is $O(NC^{2})$, and the complexity of computing the value is $O(N_{q} \times K \times C^{2})$. Applying the attention weights requires bilinear interpolation followed by a weighted sum; since bilinear interpolation involves four nearest-neighbor points, this step has complexity $O(4N_{q} \times K \times C + N_{q} \times K \times C)$. The total complexity is thus $O(NC^{2} + NKC^{2} + 5N_{q}KC)$. In traditional self-attention modules, $N_{q} = N_{k} = N$, giving a total complexity of $O(2NC^{2} + N^{2}C)$, which is quadratic in $N$ and therefore significantly more expensive. As a result, our method achieves the best performance among all methods with a relatively low parameter count and moderate FLOPs: 8.99 million parameters and 127.29 billion FLOPs.
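As a rough illustration of the gap between the two totals above, the sketch below plugs hypothetical sizes into the stated complexity terms; the token count, channel width, and number of sampling points are assumed values, not the network's actual dimensions.

```python
# Hypothetical sizes: N = Nq = 64*64 tokens, C = 256 channels, K = 4 sampling points.
N = Nq = 64 * 64
C, K = 256, 4

deformable = N * C**2 + N * K * C**2 + 5 * Nq * K * C   # O(NC^2 + NKC^2 + 5NqKC)
standard = 2 * N * C**2 + N**2 * C                      # O(2NC^2 + N^2C)

print(f"deformable ~ {deformable / 1e9:.2f} G ops, standard ~ {standard / 1e9:.2f} G ops, "
      f"ratio ~ {standard / deformable:.1f}x")
```

With these assumed sizes the quadratic $N^{2}C$ term dominates the standard attention cost, which is the behavior the analysis above describes.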

SECTION VI.

Conclusion

This article proposes a semantic segmentation method for large-scale hyperspectral remote sensing images based on both spectral attention and deformable Transformer networks. The approach utilizes convolutional downsampling and GSA to focus on valuable spectral dimensions. It also employs MHDSA to extract multiscale global information and MHDCA to enhance semantic information in the high-level features. This allows for the segmentation of large-scale hyperspectral remote sensing images with reduced computational complexity. Experimental results on the public large-scale hyperspectral dataset WHU-OHS demonstrate that our method outperforms common hyperspectral methods and semantic segmentation methods based on CNNs and Transformers.

To fully exploit the rich information in large-scale hyperspectral remote sensing images, future research should focus on exploring transfer learning approaches that can effectively leverage prior knowledge for large-scale cross-scene and cross-sensor transferability.

ACKNOWLEDGMENT

The authors would like to thank Prof. J. Li for making the large-scale hyperspectral dataset WHU-OHS available to the community.

The authors would also like to thank Dr. W. Hu for sharing the code of 1DCNN, Dr. Y. Chen for 3DCNN, Dr. S. K. Roy for A2S2KResNet, Dr. Z. Zheng for FreeNet, Dr. J. Long for FCN, Dr. O. Ronneberger for U-Net, Dr. J. Fu for DANet, Dr. L. Ding for LANet, Dr. L.-C. Chen for DeepLabV3+, Dr. E. Xie for SegFormer, Dr. R. Li for MAResUNet, Dr. J. Chen for TransUNet, Dr. L. Wang for DC-Swin and UNetFormer, Dr. X. He for ST-UNet, and Dr. T. Xiao for EMRT.
