Introduction
Hyperspectral imagery (HSI) captures hundreds of narrow, contiguous spectral bands from the Earth's surface, providing abundant characteristics that enhance the ability to identify ground materials [1]. With the rapid development of high-resolution imaging technology, HSI has become an ideal tool for effectively observing the surface across a broad range of applications, including the detection of mineral substances [2], the monitoring of plant diseases [3], anomaly detection [4], and land-cover mapping [5]. HSI classification plays a substantial role in these fields, aiming to analyze the discriminative characteristics of HSI and assign each pixel to its corresponding land-cover category [6]. Therefore, two major characteristics of HSI should be considered. First, the high-dimensional nonlinear spectral signature, which originates from redundant spectral bands, enables the accurate distinction of homologous surface categories. Second, the high spatial correlation derived from homogeneous regions provides auxiliary spatial context for accurate pixelwise classification [7].
Since spectral information natively reflects the characteristics of different materials, one set of traditional methods produces classification maps in a pixelwise manner and can be divided into two steps: 1) feature engineering, such as principal component analysis (PCA) [8] and band selection [9], and 2) classifier development, including the support vector machine (SVM) [10] and random forest [11]. This kind of approach is constrained by the high-dimensional nonlinear characteristics of HSI, which leads to unsatisfactory results. To further improve the representation of HSIs, another set of approaches exploits joint spectral–spatial expression by introducing spatial context in the feature engineering step. For instance, Kang et al. [12] proposed a feature fusion framework combining edge-preserving filtering (EPF) with an SVM. Jiang et al. [13] regarded the superpixel as a carrier to extract latent features. However, the models mentioned above consist of shallow structures and cannot provide an efficient description.
With the advancement of artificial intelligence, CNN-based approaches have attracted increasing attention because their objective functions aim directly at classification rather than two independent steps, which yields remarkable results [14], [15]. In 2016, Zhao et al. [16] adopted a CNN to learn local spatial context for HSI classification. Chen et al. [17] designed a 3-D CNN to extract features from neighboring spectral cubes drawn from the raw HSI rather than from dimensionality-reduced data. Nonetheless, a deeper network may suffer from the Hughes phenomenon, given both the complexity of the spectral–spatial distribution and the scarcity of training samples.
Meanwhile, with the development of deep learning, a series of deep-learning-derived methods has been applied to HSI classification and proven successful. Many classification frameworks obtain superior results by constructing highly efficient spectral–spatial feature extraction. For instance, Zhong et al. [18] built a spectral–spatial residual network (SSRN) to reduce the complexity of network design and achieved advanced performance. In [19], a dense convolutional block was employed for accurate identification. A 3D-Conv-Capsule model [20] was presented for HSI classification, which considers pixel position attributes to enhance spatial awareness. In addition, in Sellami's work [21], a spectral–spatial graph was constructed to fully exploit the inherent spatial distribution.
Another line of approaches accomplishes spectral–spatial classification by exploiting attention mechanisms, which perform classification after aggregating features from homogeneous regions. Xu et al. [22] designed a control-gate attention mechanism for the quick acquisition of key features. In [23], a spectral–spatial classification framework was proposed that augments a CNN with a self-attention module to enhance feature correlation. In [24], a multiattention fusion network (MAFN) was designed to mine significant features for classification. Yu et al. [25] presented a dense CNN framework with a feedback attention mechanism to further improve computational efficiency. However, in these works the attention weights are embedded after the spectral–spatial representation, which introduces the influence of interfering pixels and redundant spectral bands. He et al. [26] designed HSI-BERT to capture global dependence among pixels within the receptive field. However, this transformer-based method needs multiple nonlocal areas to capture global long-term dependence.
In contrast to classical optical image classification in computer vision, which involves hundreds of categories, land-cover classification of HSI targets far fewer classes. Therefore, the assumption that deep learning requires a large amount of training data may not hold for HSIs, which lack labeled samples. Several works focus on semisupervised learning that uses both labeled and unlabeled HSI samples for training. For instance, Fang et al. [27] presented a resampling strategy to train a CNN sufficiently. In [28], the uncertainty of unlabeled HSI samples is considered for classification. Although these studies have achieved significant results, the gains may stem from regions of high spatial correlation rather than from the deep learning methods themselves.
Recently, generative adversarial networks (GANs) have been applied to HSI classification to alleviate the issue of limited labeled samples. Specifically, GAN-based classifiers started from the semisupervised HS-GAN proposed by Zhan et al. [29], which used 1-D spectral vectors as the input. To exploit spatial information, a neighborhood majority voting strategy [30] was later applied to the prediction. He et al. [31] built a 3-D bilateral-filtering-based GAN framework to improve spatial awareness. A 3D-GAN [32] was proposed for HSI classification that keeps only the first three principal components of the raw data as input. In [33], a semisupervised GAN with a conditional random field (GAN-CRF) was designed that regards the softmax predictions as conditional probabilities of the HSI to refine classification maps. To enhance meaningful semantic context, an adaptive DropBlock-enhanced GAN (AD-GAN) [34] was established to stabilize the training of the model.
Although these GAN-based methods have achieved satisfying performance over contemporaneous benchmarks, two drawbacks remain to be solved for HSI classification.
The first challenge is the mode collapse of GANs. The generator G deceives the discriminator D by generating data from the limited labeled data distribution [35]. The restricted, narrow, and redundant spectral signatures limit the representation ability of the GAN and lead to poor data generation. In Wang's work [34], an adaptive DropBlock is employed as a regularization method to alleviate mode collapse. However, supervised GANs generate a data distribution similar to that of the labeled training samples and thus find it difficult to learn the complete real HSI distribution. In addition, the unlabeled data of HSI remain an unexploited resource for efficient data utilization. Recently, in response to this characteristic, Liang et al. [36] implemented a mean minimization loss that imposes a constraint over the unlabeled HSI data and achieved superior results. The reason for this improvement is that the loss minimizes the values and variances of the high-dimensional feature maps produced by D. In this way, the GAN model is hardly affected by the complex parameter computation, which guarantees the stability of the training state.
Another critical issue is the complex and inefficient description of spectral–spatial characteristics. The classification performance tends to deteriorate when the extraction of spectral–spatial characteristics is affected by interfering pixels. Therefore, it is hard to guarantee that the GAN always converges toward the authentic HSI distribution, particularly for high-dimensional spectral signatures or texture-dependent context. In Feng's work [37], a joint spatial–spectral hard attention mechanism was employed in G to help D discard misleading and confounding information for HSI classification. However, it only focused on a specific area of the input patches in one batch, which requires more complex techniques for training. In a different line of work, an attention-aware block [38] was designed in ResNets to enhance the representation of HSI data, and it demonstrated that such a block can learn more valuable and valid representations. However, when dealing with objects with variable spectra or irregular areas, this attentive architecture is inefficient. We argue that if the homogeneous spectrum and adaptive receptive fields are taken into account, the complexity issue of HSI data can be alleviated.
To tackle the above-mentioned challenges of GAN-based methods, we propose a spectral–spatial attention feature extraction approach based on GANs (SSAT-GAN) for HSI classification. The proposal builds a significant representation of spectral–spatial characteristics and enhances the robustness and stability of GANs in a semisupervised manner. On the one hand, SSAT-GAN takes unlabeled data into account to alleviate the scarcity of labeled samples, which enables the generator G to implicitly reconstruct real HSI cubes. Meanwhile, we adopt the mean minimization loss as an unsupervised constraint in the discriminator D to avoid overfitting. On the other hand, the complicated spectral–spatial characteristics of local adjacent pixels cause redundancy and inefficiency, which result in weaker classification in more complex regions. Inspired by the fact that attention weights can enhance the effective representation of the salient neighborhood of an object, spectral–spatial attention modules (SSAT) are designed separately in this article to capture discriminative representations, in which both the intraspectrum and contextual relations of HSIs participate in the attention calculation through feedback, and the weighted feature maps are used to enhance intraclass consistency. In this way, we extend the SSAT to consecutive feature spreading and generation blocks and stack them to build D and G, respectively. Unlike traditional semisupervised GANs, which require a deeper convolutional architecture for feature representation, our proposal is feature-efficient because both D and G share parameter weights with the corresponding attention modules, further improving the feature description. To this end, the well-trained D achieves satisfactory classification accuracy.
The main contributions of this article are listed as follows.
We design a novel semisupervised GAN-based HSI classification framework that uses a small number of labeled samples together with unlabeled data for training. The mean minimization loss is employed for unsupervised learning, which boosts the backpropagation of gradients and stabilizes the training of the GAN.
To alleviate the inefficient description of spectral–spatial characteristics, we integrate spectral–spatial attributes into the SSAT for discriminative representation of the HSI data.
The alternately optimized architecture makes SSAT-GAN a framework that generalizes well on three real HSI datasets and achieves satisfactory classification accuracy compared with state-of-the-art methods.
The rest of this article is organized as follows. Section II reviews the basic concepts of GANs. The scheme of the proposed SSAT-GAN and its components are introduced in Section III. Experimental results and analysis are presented in Section IV. The superiority of SSAT-GAN is discussed in Section V. Finally, the conclusion is drawn in Section VI.
Related Work
A. Generative Adversarial Network
GAN is an unsupervised deep learning model proposed by Goodfellow et al. [39], which provides a reasonable scheme to implicitly estimate the real data distribution. A GAN incorporates a generator G and a discriminator D in a unified network, where G generates samples to fool D and D distinguishes real samples from generated ones. The adversarial objectives drive G and D toward a Nash equilibrium in a zero-sum game, which is finally expressed as the minimax optimization problem
\begin{equation*}
\begin{split} \underset{G}{\text{min}}\,\underset{D}{\text{max}}\,\text{Loss} = &\ E_{{\bf {z}}\sim p_{z}}\left[ \text{log}\left(1-D\left(G\left({\bf {z}} \right) \right) \right) \right] \\
& +E_{{\bf {x}}\sim p_{\text{data}}}\left[ \text{log}D\left({\bf {x}} \right) \right] \end{split} \tag{1}
\end{equation*}
In the optimization process of GAN, G and D are optimized alternately. Given
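To make the alternating optimization concrete, the following PyTorch-style sketch shows one training step under (1); the toy network shapes, the latent dimension, and the non-saturating generator objective are illustrative assumptions rather than the exact configuration used in this article.

```python
# Minimal sketch of the alternating GAN optimization in (1), assuming a
# PyTorch-style setup; the toy networks, batch shape, and latent dimension
# are illustrative placeholders, not the authors' exact code.
import torch
import torch.nn as nn

latent_dim = 100                                                   # assumed size of the noise vector z
D = nn.Sequential(nn.Flatten(), nn.Linear(64, 1), nn.Sigmoid())    # toy discriminator
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh())            # toy generator
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    """One alternating update: first D (maximize), then G (minimize)."""
    n = x_real.size(0)
    z = torch.randn(n, latent_dim)

    # --- update D: push D(x) -> 1 and D(G(z)) -> 0 ---
    opt_D.zero_grad()
    loss_D = bce(D(x_real), torch.ones(n, 1)) + \
             bce(D(G(z).detach()), torch.zeros(n, 1))
    loss_D.backward()
    opt_D.step()

    # --- update G: push D(G(z)) -> 1 (non-saturating form of (1)) ---
    opt_G.zero_grad()
    loss_G = bce(D(G(z)), torch.ones(n, 1))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```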
SSAT-GAN Framework
The SSAT-GAN flowchart is shown in Fig. 1. Suppose the raw HSI dataset
Flowchart of SSAT-GAN framework for HSI classification. First, the unlabeled group
SSAT-GAN incorporates the spectral and spatial attention modules in both the discriminator and the generator to extract discriminative features, where the discriminator and the generator are composed of convolutional and transposed convolutional layers, respectively.
A. Spectral and Spatial Attention Modules
The purpose of the SSAT is to enhance the feature analysis of salient and effective domains, which is inspired by CBAM [40]. Given an intermediate feature map, it sequentially calculates attention weights along the spectral and spatial dimensions.
1) Spectral Attention Module
The spectral attention module aims to explore the intraclass consistency of spectra. Since each spectral band can be regarded as a class feature detector, our spectral attention focuses on “which” bands are meaningful given an input cube. To highlight discriminative signatures from the spectral knowledge while retaining uniform characteristics, we use both depthwise separable convolution (Depth_CONV) [41] and 3-D convolution (3D_CONV) operations. For aggregating intraspectrum information, average-pooling and max-pooling have been commonly adopted. In addition, we employ a spectral squeeze mechanism to assign an independent weight to each element along the spectral dimension. Finally, the spectral attention weights are generated by a dynamic activation function.
As demonstrated in Fig. 2, take n HSI cubes
\begin{align*}
{\bf {Atte}}_{\text{spc}} & = \sigma \left(\text{FCs}\left(\text{AvgPool}\left({\bf {X}}^{\text{Dep}}\right) +\text{MaxPool}\left({\bf {X}}^{\text{3-D}}\right)\right)\right)\\
&= \sigma \left(\mathbf {W}_{1}\left(\mathbf {W}_{0}\left({\bf {X}}_{{\text{avg}}}^{\text{Dep}}+{\bf {X}}_{{\text{max}}}^{\text{3-D}}\right)\right)\right), \tag{2}
\end{align*}
Spectral attention module utilizes both Depth_CONV and 3D_CONV descriptors with pooling operations, followed by a spectral squeeze mechanism to predict spectral attention weights.
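A minimal PyTorch-style sketch of the spectral attention computation in (2) is given below; the kernel sizes, the reduction ratio of the shared FC layers, and the use of 2-D depthwise convolution over the band axis are illustrative assumptions rather than the exact implementation.

```python
# Hedged sketch of the spectral attention module in (2): a depthwise-separable
# branch and a 3-D convolution branch are pooled over the spatial dims, summed,
# passed through shared FC layers (W0, W1), and squashed with a sigmoid.
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    def __init__(self, bands: int, reduction: int = 8):
        super().__init__()
        # depthwise separable conv over the spectral tensor (bands treated as channels)
        self.depth_conv = nn.Sequential(
            nn.Conv2d(bands, bands, kernel_size=3, padding=1, groups=bands),
            nn.Conv2d(bands, bands, kernel_size=1),
        )
        # 3-D conv branch; the band axis is restored by squeezing the unit channel
        self.conv3d = nn.Conv3d(1, 1, kernel_size=(3, 3, 3), padding=1)
        self.fcs = nn.Sequential(                      # shared W0 and W1
            nn.Linear(bands, bands // reduction), nn.ReLU(),
            nn.Linear(bands // reduction, bands),
        )

    def forward(self, x):                              # x: (n, bands, h, w)
        x_dep = self.depth_conv(x)                     # (n, bands, h, w)
        x_3d = self.conv3d(x.unsqueeze(1)).squeeze(1)  # (n, bands, h, w)
        avg = x_dep.mean(dim=(2, 3))                   # AvgPool over spatial dims
        mx = x_3d.amax(dim=(2, 3))                     # MaxPool over spatial dims
        atte = torch.sigmoid(self.fcs(avg + mx))       # (n, bands) attention weights
        return x * atte.view(-1, x.size(1), 1, 1)      # reweight spectral bands
```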
2) Spatial Attention Module
To exploit the interclass differences of spatial contexts, we build a spatial attention module to generate a spatial attention map. Since a pixel does not always share the category of its neighbors, the spatial attention indicates “where” the informative areas are, which is complementary to the spectral attention. As illustrated in Fig. 3, we apply the atrous convolution (Atrous_CONV) and 3D_CONV operations along the spectral axis, generating two different intermediate feature maps, with the aim of extending the receptive field and reducing the interference of abnormal pixels. Atrous_CONV, which uses a shared kernel with multiple dilations to learn spatial contexts, has been proven effective for learning intraclass-consistent homogeneous areas of HSIs [42]. To alleviate the “gridding problem” [43], both intermediate maps are then forwarded to average-pooling and max-pooling to enhance the spatial representation and generate a contextual descriptor with a pixelwise addition strategy. The spatial attention weights are then predicted after activating the neural parameters of the contextual descriptor.
Spatial attention module combines two similar feature maps that are convolved with Atrous_CONV and 3D_CONV, and pooled along the spectral axis, and then feeds them to a convolution layer.
Supposing an HSI cube
\begin{align*}
{\bf {Atte}}_{\text{spa}} & = \sigma \left(\left[\text{AvgPool}\left({\bf {X}}^{\text{Atr}}\right); \text{MaxPool}\left({\bf {X}}^{\text{3-D}}\right)\right]\right)\ast {\bf {H}}^{{q}+1}+ {\bf {b}}^{{q}+1}\\
& = \sigma \left(\left[{\bf {X}}_{{\text{avg}}}^{\text{Atr}};{\bf {X}}_{{\text{max}}}^{\text{3-D}}\right]\right)\ast {\bf {H}}^{{q}+1}+ {\bf {b}}^{{q}+1}. \tag{3}
\end{align*}
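The spatial attention computation around (3) can be sketched in the same style; the dilation rate, kernel sizes, and the conv-then-sigmoid ordering follow a CBAM-like layout and are assumptions rather than the authors' exact implementation.

```python
# Hedged sketch of the spatial attention module around (3): an atrous (dilated)
# branch and a 3-D convolution branch are pooled along the spectral axis,
# concatenated, and fused by a convolution (H^{q+1}, b^{q+1}) with a sigmoid
# to produce the spatial attention map.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, bands: int, dilation: int = 2):
        super().__init__()
        self.atrous = nn.Conv2d(bands, bands, kernel_size=3,
                                padding=dilation, dilation=dilation)
        self.conv3d = nn.Conv3d(1, 1, kernel_size=3, padding=1)
        # fuses the two pooled descriptors into one attention map
        self.fuse = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                               # x: (n, bands, h, w)
        x_atr = self.atrous(x)                          # dilated branch
        x_3d = self.conv3d(x.unsqueeze(1)).squeeze(1)   # 3-D conv branch
        avg = x_atr.mean(dim=1, keepdim=True)           # AvgPool along spectral axis
        mx = x_3d.amax(dim=1, keepdim=True)             # MaxPool along spectral axis
        atte = torch.sigmoid(self.fuse(torch.cat([avg, mx], dim=1)))  # (n, 1, h, w)
        return x * atte                                 # reweight spatial positions
```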
B. Spectral–Spatial Attention Discriminator and Generator
We incorporate the SSAT into the discriminator and generator and extend it to four spectral–spatial attention feature spreading and generation blocks. Fig. 4 shows the architecture of the four attention blocks, each of which can be regarded as an extension of successive convolution or transposed convolution.
Four feature spreading blocks with lightweight spectral–spatial attention modules aiming for HSI feature extraction and generation in SSAT-GAN. (a) and (b) Spectral and spatial attention feature spread blocks in discriminators; (c) and (d) Spectral and spatial attention feature generation blocks in generators.
1) Spectral Attention Feature Spread Block
To handle the redundant spectral bands, as shown in Fig. 4(a), the spectral attention module is introduced in the (p+1)th layer to assign attention weights to the spectral tensor of the HSI and aggregate the intraclass correlation of the narrow spectrum. Next, the (p+2)th layer utilizes a 3-D convolution layer with batch normalization [44] (CONV_BN) to update the parameters according to the spectral attention feature. A skip connection is applied instead of a direct mapping between the (p+1)th and (p+2)th layers, building the spectral attention feature extraction function
\begin{align*}
{\bf {X}}^{{p}+2}&={\bf {X}}^{p}+{F}\left({\bf {X}}^{p};\boldsymbol{\theta }\right), \tag{4}
\\
{F}\left({\bf {X}}^{p};\boldsymbol{\theta }\right) &=\left({\bf {Atte}}_{\text{spc}}\otimes \text {R}\left(\hat{{\bf {X}}}^{{p}+1}\right)\right)\ast {\bf {h}}^{{p}+2}+{\bf {b}}^{{p}+2}, \tag{5}
\\
{\bf {X}}^{{p}+1}& = \text {R}\left(\hat{{\bf {X}}}^{{p}}\right)\ast {\bf {h}}^{{p}+1}+{\bf {b}}^{{p}+1}, \tag{6}
\\
\hat{{\bf {X}}}^{{p}} &= \frac{{\bf {X}}^{{p}}-E\left({\bf {X}}^{{p}}\right)}{\mathrm{Var}\left({\bf {X}}^{{p}}\right)} \tag{7}
\end{align*}
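The residual structure of (4)–(7) can be sketched as follows, reusing the SpectralAttention class from the earlier sketch; 2-D convolutions over the band axis stand in for the 3-D CONV_BN layers as a simplification. The spatial attention feature spread block in (8)–(11) follows the same pattern with the spatial attention module in place of the spectral one.

```python
# Hedged sketch of the spectral attention feature spread block in (4)-(7):
# BN + ReLU + conv (h^{p+1}), then BN + ReLU + spectral attention + conv
# (h^{p+2}), added back to the block input through a skip connection.
# Channel counts and kernel sizes are illustrative assumptions; the
# SpectralAttention class is the one sketched earlier.
import torch
import torch.nn as nn

class SpectralSpreadBlock(nn.Module):
    def __init__(self, bands: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(bands)                      # normalizes X^p, cf. (7)
        self.conv1 = nn.Conv2d(bands, bands, 3, padding=1)    # h^{p+1}, b^{p+1}, cf. (6)
        self.bn2 = nn.BatchNorm2d(bands)
        self.att = SpectralAttention(bands)                   # Atte_spc in (5)
        self.conv2 = nn.Conv2d(bands, bands, 3, padding=1)    # h^{p+2}, b^{p+2}

    def forward(self, x):                                     # x: (n, bands, h, w)
        x1 = self.conv1(torch.relu(self.bn1(x)))              # X^{p+1}, cf. (6)-(7)
        f = self.conv2(self.att(torch.relu(self.bn2(x1))))    # (Atte_spc ⊗ R(X̂^{p+1})) ∗ h^{p+2}, cf. (5)
        return x + f                                          # skip connection, cf. (4)
```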
2) Spatial Attention Feature Spread Block
The spatial attention feature spread block aims to explore the neighboring correlation and intraclass consistency of central pixels in high-spatial-correlation regions. Fig. 4(b) shows the details of the spatial block. The depths of the kernels along the spectral dimension are identical to that of the input cubes
\begin{align*}
{\bf {X}}^{{q}+2}&={\bf {X}}^{q}+{F}\left({\bf {X}}^{q};\boldsymbol{\xi }\right), \tag{8}
\\
{F}\left({\bf {X}}^{q};\boldsymbol{\xi }\right) &=\left({\bf {Atte}}_{\text{spa}}\otimes \text {R}\left(\hat{{\bf {X}}}^{{q}+1}\right)\right)\ast {\bf {H}}^{{q}+2}+{\bf {b}}^{{q}+2}, \tag{9}
\\
{\bf {X}}^{{q}+1} &= \text {R}\left(\hat{{\bf {X}}}^{{q}}\right)\ast {\bf {H}}^{{q}+1}+{\bf {b}}^{{q}+1}, \tag{10}
\\
\hat{{\bf {X}}}^{{q}}& = \frac{{\bf {X}}^{{q}}-E\left({\bf {X}}^{{q}}\right)}{\mathrm{Var}\left({\bf {X}}^{{q}}\right)} \tag{11}
\end{align*}
3) Spectral–Spatial Attention Feature Generation Blocks
To overcome the challenge of the small-sample scenario, the idea of spectral–spatial attention is extended to the generator to improve the variety of generated samples. Fig. 4(c) and (d) shows the details of the spectral–spatial attention generation blocks; they embed both attention modules into feature generation, which contains successive transposed 3-D convolution (CONV
\begin{align*}
{\bf {z}}^{{p}+2}&={\bf {z}}^{p}+{F}\left({\bf {z}}^{p};\boldsymbol{\theta }\right), \tag{12}
\\
{F}\left({\bf {z}}^{p};\boldsymbol{\theta }\right) &= \left({\bf {Atte}}_{\text{spc}}\otimes \text {R}\left(\hat{{\bf {z}}}^{{p}+1}\right)\right)\ast ^{T}{\bf {h}}^{{p}+2}+{\bf {b}}^{{p}+2}, \tag{13}
\\
{\bf {z}}^{{p}+1}&=\text {R}\left(\hat{{\bf {z}}}^{p} \right) \ast ^{T}{\bf {h}}^{{p}+1}+{\bf {b}}^{{p}+1} \tag{14}
\end{align*}
\begin{align*}
{\bf {Z}}^{{q}+2}&={\bf {Z}}^{q}+{F}\left({\bf {Z}}^{q};\boldsymbol{\xi }\right), \tag{15}
\\
{F}\left({\bf {Z}}^{q};\boldsymbol{\xi }\right) &= \left({\bf {Atte}}_{\text{spa}}\otimes \text {R}\left(\hat{{\bf {Z}}}^{{q}+1}\right)\right)\ast ^{T}{\bf {H}}^{{q}+2}+{\bf {b}}^{{q}+2}, \tag{16}
\\
{\bf {Z}}^{{q}+1}&=\text {R}\left(\hat{{\bf {Z}}}^{q} \right) \ast ^{T} {\bf {H}}^{{q}+1}+ {\bf {b}}^{{q}+1} \tag{17}
\end{align*}
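A sketch of the corresponding spectral attention generation block, cf. (12)–(14), replaces the convolutions with transposed convolutions so that the generator mirrors the discriminator; the spatial attention generation block in (15)–(17) is built analogously. Stride and kernel choices, and the reuse of the SpectralAttention sketch above, are illustrative assumptions.

```python
# Hedged sketch of a spectral attention feature generation block, cf. (12)-(14):
# the same residual pattern as the spread block, but built from transposed
# convolutions for generation. With stride 1 and padding 1 the spatial size is
# preserved, keeping the skip connection shape-compatible.
import torch
import torch.nn as nn

class SpectralGenBlock(nn.Module):
    def __init__(self, bands: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(bands)
        self.tconv1 = nn.ConvTranspose2d(bands, bands, 3, padding=1)  # h^{p+1}, b^{p+1}
        self.bn2 = nn.BatchNorm2d(bands)
        self.att = SpectralAttention(bands)                           # shared attention design with D
        self.tconv2 = nn.ConvTranspose2d(bands, bands, 3, padding=1)  # h^{p+2}, b^{p+2}

    def forward(self, z):                                    # z: (n, bands, h, w)
        z1 = self.tconv1(torch.relu(self.bn1(z)))            # cf. (14)
        f = self.tconv2(self.att(torch.relu(self.bn2(z1))))  # cf. (13)
        return z + f                                         # skip connection, cf. (12)
```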
Unlike traditional feature representation blocks, which perform the attention mechanism after feature extraction for HSI data characterization, the proposed spectral–spatial attention feature spread and generation blocks are feature-efficient, i.e., the attention maps are applied during feature extraction. This can be described from two aspects. 1) The consecutive spectral–spatial attention feature spread blocks of the discriminator draw the SSAT into the architecture for training, which provides learnable spectral and spatial attributes. On the one hand, similar to SNR enhancement, the spectral attention weight
C. Semisupervised SSAT-GAN
Taking the Pavia University (UP) dataset as the source of raw input HSI cubes, Fig. 5 details the SSAT-GAN algorithm stream. The discriminator D contains a spectral attention feature spread block, a spatial attention feature spread block, and one fully connected (FC) layer, and it outputs vectors through a softmax layer. The generator G includes one FC layer, a spectral attention generation block, and a spatial attention generation block to generate HSI cubes. In addition, we extend the SSAT-GAN to semisupervised classification, which adopts unlabeled training samples from the raw HSI cube to improve HSI classification.
Spectral–spatial discriminator (top), which contains successive spectral and spatial attention feature spread blocks and outputs a vector consisting of an indicative entry of real or fake data and categorical probabilities; spectral–spatial generator (bottom), which contains successive spectral and spatial attention feature generation blocks and transforms a vector from random noise to a synthetic HSI cube.
In contrast to original GANs, the semisupervised SSAT-GAN introduces a supervised term into the GAN loss to achieve HSI classification. The labeled HSI cube
\begin{equation*}
\hat{{\bf {Y}}}^{1}=D\left({\bf {X}}^{1};\theta _{D} \right) \tag{18}
\end{equation*}
Semisupervised GANs aim to alleviate the issue of small samples by using both labeled and unlabeled HSI data. The point of view in [33] illustrated that D needs a bad G as a regularizer for training GANs. An opposite view cited in [34] pointed out that high-quality synthetic samples help D improve its generalization ability for HSIs. In our proposal, we extend the spectral–spatial attention weights to G to reconstruct HSI cubes implicitly. The training can be divided into two phases. First, G is considered as the regularizer of D to improve HSI classification, and it updates the penalty factor with the discriminative loss. Thus, the optimized loss function of D takes the form
\begin{equation*}
\begin{split} L_{\text {SEMI}}\left(\theta _{D},\theta _{G} \right) & =L_{\text {SUP}}\left(\theta _{D},\theta _{G} \right) +L_{\text {UNSUP}}\left(\theta _{D},\theta _{G} \right) \\
& = L_{\text {SUP}}\left(\theta _{D} \right) + L_{D1}\left(\theta _{D}\right) \\
& \qquad\; + L_{D2}\left(\theta _{D},\theta _{G} \right) \end{split} \tag{19}
\end{equation*}
\begin{align*}
L_{\text {SUP}}\left(\theta _{D} \right) & = -E_{{\bf {X}}^{1}\sim {p}_{\text{data}}}\text{log}D\left({\bf {X}}^{1};\theta _{D} \right) \left[ 1:{n} \right]\\
& = -E_{{\bf {X}}^{1}\sim {p}_{\text{data}}}\text{log}\hat{{\bf {Y}}}^{1}\left[ 1:{n} \right], \tag{20}
\\
L_{D1}\left(\theta _{D} \right) & = -E_{{\bf {X}}^{1}\sim {p}_{\text{data}}}\text{log}\left(1-D\left({\bf {X}}^{1};\theta _{D} \right) \left[ 0 \right] \right) \\
& = -E_{{\bf {X}}^{1}\sim {p}_{\text{data}}}\text{log}\left(1-\hat{{\bf {Y}}}^{1}\left[ 0 \right] \right), \tag{21}
\\
L_{D2}\left(\theta _{D},\theta _{G} \right) & = -E_{{\bf {z}}\sim {p}_{z}}\text{log}D\left(G\left({\bf {z}};\theta _{G} \right);\theta _{D} \right) \left[ 0 \right]\\
& = -E_{{\bf {z}}\sim {p}_{z}}\text{log}D \left({\bf {Z}};\theta _{D} \right) \left[ 0 \right]\\
& = -E_{{\bf {z}}\sim {p}_{z}}\text{log}\hat{{\bf {Y}}}^{1}\left[ 0 \right] \tag{22}
\end{align*}
It should be observed that the optimization of a semisupervised GAN focuses on exploring the real HSI data distribution from limited labeled samples, which often causes overfitting. As the high-dimensional feature learning of LD1 is not constrained, it contributes little to, and may even jeopardize, the discriminator's capability for HSI classification.
Thus, we minimize the high-dimensional output of (21) to update the gradient in reverse and decrease its value and variance to inhibit overfitting, which is known as the mean minimization loss in another work [36]. The function takes the form
\begin{equation*}
\theta ^{\ast }=\text {arg}\underset{\theta }{\text {min}}\left(\frac{1}{N}\sum _{i=1}^{N}\text{average}\left(f\left(x_{i}; \theta \right) \right) \right) \tag{23}
\end{equation*}
\begin{equation*}
\begin{split} L_{G}\left(\theta _{D},\theta _{G} \right) & = -E_{{\bf {z}}\sim {p}_{z}}\text{log}\left(1-D\left(G\left({\bf {z}};\theta _{G} \right);\theta _{D} \right) \left[ 0 \right] \right) \\
& = -E_{{\bf {z}}\sim {p}_{z}}\text{log}\left(1-D \left({\bf {Z}};\theta _{D} \right) \left[ 0 \right] \right) \\
& = -E_{{\bf {z}}\sim {p}_{z}}\text{log}\left(1-\hat{{\bf {Y}}}^{1}\left[ 0 \right] \right). \end{split} \tag{24}
\end{equation*}
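The loss terms (19)–(24), together with the mean minimization constraint (23), can be assembled as in the following hedged sketch; the assumption that entry [0] of the discriminator output is a real/fake logit and the feature-map hook used for (23) are illustrative, not the exact implementation.

```python
# Hedged sketch of the semisupervised losses (19)-(24), assuming the
# discriminator returns a vector whose entry [0] is a real/fake score and
# entries [1:n] are class logits, as in (18). Tensor names and the
# feature_maps argument used for the mean minimization loss (23) are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def discriminator_losses(D, G, x_labeled, y_labeled, z, feature_maps):
    eps = 1e-8
    y_hat_real = D(x_labeled)                                # (n, 1 + num_classes)
    y_hat_fake = D(G(z).detach())

    # L_SUP (20): cross-entropy over the class entries of labeled samples
    l_sup = F.cross_entropy(y_hat_real[:, 1:], y_labeled)

    # L_D1 (21): real samples should have a low "fake" score
    l_d1 = -torch.log(1.0 - torch.sigmoid(y_hat_real[:, 0]) + eps).mean()

    # L_D2 (22): generated samples should have a high "fake" score
    l_d2 = -torch.log(torch.sigmoid(y_hat_fake[:, 0]) + eps).mean()

    # Mean minimization loss (23): shrink the mean activation of D's
    # high-dimensional feature maps to curb overfitting
    l_mean = feature_maps.mean()

    return l_sup + l_d1 + l_d2 + l_mean

def generator_loss(D, G, z):
    # L_G (24): the generator pushes its samples toward the "real" side
    eps = 1e-8
    y_hat_fake = D(G(z))
    return -torch.log(1.0 - torch.sigmoid(y_hat_fake[:, 0]) + eps).mean()
```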
Algorithm 1: Training Process of SSAT-GAN.
Input:
The labeled training data:
Output: The labels of the test samples
Begin
Initialize: Randomly initialize the parameters
for
for
Generate
Concatenate noises with labels
Calculate
Calculate
Predict classification vectors
Compute
Input noises
Input
Predict authentic vectors
Compute
Update
Update
end for
end for
Classify
The training of SSAT-GAN involves two steps that alternate in every epoch and are optimized in an adjacent fashion with RMSProp. First, the gradients of the discriminator
Experimental Analysis
We detail the experimental results on three real hyperspectral datasets: Indian Pines (IN), University of Pavia (UP), and Kennedy Space Center (KSC). Each dataset is standardized by a mean–variance operation. Three classification evaluation metrics, including overall accuracy (OA), average accuracy (AA), and the kappa coefficient (
A. Experimental Datasets
1) Indian Pines
IN was acquired by airborne visible/infrared imaging spectrometer (AVIRIS) from Northwest Indiana in 1992 and includes 16 vegetation categories, with an imbalance in pixel numbers over categories. It contains
2) University of Pavia
UP was gathered by reflective optics system imaging spectrometer (ROSIS) in 2001 from Northern Italy, consisting of
3) Kennedy Space Center
KSC was obtained by AVIRIS in 1996 from Florida and includes 13 upland and wetland land-cover types, with
Figs. 6–8 illustrate the datasets, the corresponding ground reference maps, and category information. All labeled samples are split into two groups: the training group and the test group. For the unlabeled group, the unlabeled training samples are randomly selected from the background. GANs have relatively high computational complexity, which often leads to mode collapse. Thus, we refer to the Monte Carlo sampling [45] mentioned in [33] to marginalize noise during training.
B. Parameter Tuning
Fig. 5 takes the UP neighboring cube as an instance to show the details of the discriminator D and the generator G. The
1) Evaluation of Different Depths of Spectral–Spatial Attention Block
We assessed the impact of different depths of the spectral–spatial attention feature spreading blocks on the classification results. For SSAT-GAN, block depths from four to eight convolutional layers were validated on all datasets. To maintain the stability of the model, the depth of the generator was kept symmetric to that of the discriminator. As illustrated in Fig. 9, the highest evaluation results on both the IN and UP datasets were achieved when the depths of the spectral–spatial attention feature spread blocks were set to “3 + 3,” i.e., a discriminator consisting of three spectral and three spatial convolutional layers, compared with other convolution settings. As for KSC, the differences in OA between deeper SSAT-GANs and their shallower counterparts are small. Meanwhile, in contrast to the obvious overfitting of deeper layers under limited training samples reviewed in [33], the quantitative HSI classification performance of SSAT-GANs with varying depths illustrates that our attention modules mitigate the overfitting observed in other GANs.
OAs of SSAT-GAN with different depths of convolutional layers in their spectral–spatial attention feature spread blocks using 500 labeled samples on IN and UP, and 250 on KSC for training. The
2) Evaluation of Different Numbers of Kernels for SSAT-GAN
The number of kernels in each layer of the feature spreading blocks greatly affects the computational cost and expressiveness of SSAT-GAN. We evaluated the impact of different numbers of kernels in the spectral–spatial attention feature spreading blocks on the results. In Fig. 10, the discriminator and the generator of SSAT-GAN use the same kernel number in their convolution and transposed convolution layers, with the number of kernels varied from
OAs of SSAT-GAN for varying kernel numbers in their spectral–spatial attention spreading blocks using 500 labeled samples on IN and UP, and 250 on KSC for training.
3) Influence of Unlabeled Real HSI Cubes
To evaluate the influence of unlabeled real HSI cubes, we tested SSAT-GAN and its three extensions using different numbers of unlabeled HSI samples on the IN, UP, and KSC datasets. The three extensions of SSAT-GAN are denoted Spa-AT-GAN (containing only the spatial attention feature spreading part), Spc-AT-GAN (containing only the spectral attention feature spreading part), and Spa-Spc-AT-GAN (containing both spreading blocks, with the spatial attention module placed before the spectral attention module). Table I records the classification results of the SSAT-GANs. Each experiment randomly selected 0, 300, 1000, and 5000 unlabeled samples for training.
For IN and KSC, the classification of Spa-Spc-AT-GAN did not improve efficiently as the number of unlabeled samples increased. Among the four methods, SSAT-GAN achieved the best evaluation on each dataset with various numbers of unlabeled samples, owing to the spectral–spatial attentive feature learning guidance. Moreover, models with 300 unlabeled samples achieved the most accurate evaluation on all three datasets, and the improvement with 1000 unlabeled samples was not obvious. When the number increased to 5000, the results showed a downward trend for all extensions of SSAT-GAN on the three datasets. This indicates that adding too many real samples does not greatly improve the classification, which is caused by the abnormal distribution of unlabeled pixels. In addition, it can be seen that the HSI classification is significantly improved when the number of unlabeled samples is set equal to that of the labeled samples. This conclusion is consistent with the opinions reported by Zhong et al. [33] and Liang et al. [36].
4) Evaluation of Different Spatial Data Sizes
To assess the impact of spatial size on the experimental results, we tested SSAT-GAN with spatial data sizes of
C. Comparison With Various Algorithms
This experiment aimed to compare the performance of the proposed SSAT-GAN with EPF-SVM [12] (EPF-based SVM) and state-of-the-art deep-learning-derived methods, such as SSRN [18], 3D-Conv-Capsule [20], and HSI-BERT [26]. To verify the improvement over GANs, we adopted three GAN-based methods for comparison: 3D-GAN [32], GAN-CRF [33], and AD-GAN [34]. Moreover, to demonstrate the effectiveness of the SSAT module, we also introduced the extensions of SSAT-GAN: Spa-AT-GAN (comprising only one spatial attention feature spreading block), Spc-AT-GAN (comprising only one spectral attention feature spreading block), and Spa-Spc-AT-GAN (comprising one spatial and one spectral attention feature spreading block). For a fair comparison, all competing algorithms were tuned to their optimal settings.
Regarding the EPF-SVM, the two parameters of the joint bilateral filter were set as follows:
As for the proposed SSAT-GAN, we set the spatial size of input HSI cubes to
1) Experimental Results on IN Dataset
For the various methods, 500 labeled pixels were employed as training samples on the IN dataset. Table II lists the quantitative classification results of the comparison methods, and the visualization maps are illustrated in Fig. 13. As shown in Table II, EPF-SVM yielded poor accuracies in the “Corn,” “Soybean-notill,” and “Buildings-Grass-Trees-Drives” classes: 63.09%, 67.93%, and 52.74%, respectively. This is caused by the similarity of their spectral curves, which makes them difficult to identify. In contrast, we observed that SSRN, 3D-Conv-Capsule, and HSI-BERT acquired better results than EPF-SVM in these three classes; in particular, HSI-BERT improved by at least 12.21% in the “Corn” class. This indicates that deep-learning methods have a positive effect on interpreting complex spectral characteristics. Different from the former, GAN-based methods showed superior prediction in the three classes: 3D-GAN improved by at least 26.66% in “Corn,” GAN-CRF achieved 93.05% in “Soybean-notill,” and AD-GAN classified “Corn” completely accurately. As for the SSAT-GANs, both Spa-AT-GAN and Spc-AT-GAN achieved advanced prediction in the three classes.
Accuracy values on the IN dataset with the SSAT-GAN model under different training data sampling strategies. We report the average results of ten experiments. (a) Random selection strategy. (b) Monte Carlo sampling strategy.
Classification visualization of comparison models on IN dataset. (a) EPF-SVM. (b) SSRN. (c) 3D-Conv-Capsule. (d) HSI-BERT. (e) 3D-GAN. (f) GAN-CRF. (g) AD-GAN. (h) Spa-AT-GAN. (i) Spc-AT-GAN. (j) Spa-Spc-AT-GAN. (k) SSAT-GAN.
SSAT can improve intraclass aggregation, which effectively distinguishes the differences between spectra during hyperspectral interpretation. In the “Alfalfa,” “Grass-pasture-mowed,” and “Oats” categories, only 5, 3, and 5 pixels were used as training samples, respectively. SSAT-GAN attained superior classification, with accuracies of 100% in all three. This indicates that our SSAT-GAN can extract sensitive features for classes with few samples. Among the competing methods, SSAT-GAN also obtained the best accuracies in “Soybean-mintill” and “Soybean-clean,” which contain redundant spectral signatures. Meanwhile, SSAT-GAN outperformed the various comparison methods in terms of OA, AA, and kappa. In contrast to its extensions, SSAT-GAN improved OA by at least 1.01%, AA by 1.35%, and kappa by 1.25%. This illustrates that the SSAT module can extract discriminative spectral signatures and adaptive homogeneous areas to mitigate the impact of interfering pixels in HSIs. Besides, it should be noted that Spa-Spc-AT-GAN showed inferior performance because the abundant spectral features are more difficult to learn than the spatial features.
To verify the results of SSAT-GAN under Monte Carlo sampling, we also experimented with the random selection strategy with the training sampling ratio (SR) fixed at 5% (500 training samples) and 10% (1000 training samples). As shown in Fig. 12(a), the detailed class accuracy (CA) of each class provides qualitative comparisons under different circumstances, in which the “Soybean-notill” class obtained unsatisfactory results under the random selection strategy, with 79.79% at SR = 5% and 91.65% at SR = 10%. Besides, it is worth mentioning that the experiment with SR = 10% achieves better performance than that with SR = 5%. The reason is that, when the data distribution is imbalanced, the randomly chosen samples of the other classes contain more salient characteristics that confuse the small-sample classes. To confirm this, we adopted Monte Carlo sampling to redo the experiment with our SSAT-GAN at MCR = 5% and MCR = 10%. Monte Carlo sampling considers the interclass sample distribution while preserving the total random sampling size (as shown in Table II). As can be seen from Fig. 12(b), the performance of each class under Monte Carlo sampling is superior to that shown in Fig. 12(a). From Fig. 13(a)–(k), EPF-SVM and Spa-Spc-AT-GAN show more visual noise and the most misclassified pixels; besides, the visualizations of SSRN and HSI-BERT show rough boundaries in most classes. The reason is the imbalanced sample distribution of the IN dataset, in which classes with a large number of samples may contain more discriminative characteristics for identification. 3D-GAN, GAN-CRF, Spa-AT-GAN, and Spc-AT-GAN produce relatively little visual noise. In contrast, 3D-Conv-Capsule and AD-GAN significantly reduce the impact of noise and establish homogeneous areas. Among them, SSAT-GAN has more uniform regions and sets up an adaptive neighboring relationship, from which it can be noted that SSAT can effectively suppress information detrimental to classification.
2) Experimental Results on UP Dataset
The evaluation of the comparison methods on the UP dataset using 500 labeled samples is listed in Table III. We can see that the OAs yielded by SSRN, 3D-GAN, and GAN-CRF are 95.31%, 93.89%, and 94.95%, respectively. Our proposed model further increases the performance to 98.09% by incorporating the SSAT module. Similarly, the AA values are 94.33%, 94.25%, and 97.16% for 3D-Conv-Capsule, HSI-BERT, and AD-GAN, respectively. As can be observed, the proposed SSAT-GAN has a relatively stable and balanced classification effect for each category under high-resolution neighboring relationships and achieved the maximum AA value (98.21%). It can be noted that the proposed SSAT module captures discriminative interclass differences and is essential and beneficial to the proposed architecture.
The classification visualization on the UP dataset is shown in Fig. 14. It can be seen that the comparison methods produced rough prediction maps, especially 3D-GAN and Spa-Spc-AT-GAN, which is caused by atmospheric effects and instrument noise. SSAT-GAN exploits the neighboring correlation context as auxiliary information and produced the smoothest results and the clearest boundaries.
Classification visualization of comparison models on UP dataset. (a) EPF-SVM. (b) SSRN. (c) 3D-Conv-Capsule. (d) HSI-BERT. (e) 3D-GAN. (f) GAN-CRF. (g) AD-GAN. (h) Spa-AT-GAN. (i) Spc-AT-GAN. (j) Spa-Spc-AT-GAN. (k) SSAT-GAN.
3) Experimental Results on KSC Dataset
The last experiment was performed on the KSC dataset using 250 labeled pixels as training samples. As shown in Table IV, SSAT-GAN achieved the best OA of 97.72%, higher than GAN-CRF (95.38%) and AD-GAN (96.15%). In comparison, SSRN yielded an OA of only 94.19%. A reasonable explanation is that the KSC dataset is relatively sparse, so traditional networks generally have more difficulty interpreting its spectral–spatial features. With the SSAT operation, the proposed model achieved superior performance in contrast to the other state-of-the-art methods. In addition, it should be noted that the PCA-based 3D-GAN yielded the worst assessment, with an OA of 93.38%, which illustrates that the representation of the principal components is poor for spectral–spatial feature extraction from HSIs with high sparsity. In contrast, our proposed architecture with the SSAT module acquires better robustness to this sparsity.
Classification maps are shown in Fig. 15. It can be seen that SSAT-GAN achieved smoother and more adaptive visual results, which indicates that its SSAT module can both emphasize intraclass consistency and increase interclass differences for HSI classification under a highly sparse distribution. All the quantitative experiments conducted on the three datasets demonstrate the excellence and robustness of the SSAT-GAN framework for HSI classification.
Classification visualization of comparison models on KSC dataset. (a) EPF-SVM. (b) SSRN. (c) 3D-Conv-Capsule. (d) HSI-BERT. (e) 3D-GAN. (f) GAN-CRF. (g) AD-GAN. (h) Spa-AT-GAN. (i) Spc-AT-GAN. (j) Spa-Spc-AT-GAN. (k) SSAT-GAN.
D. Investigation of the Impact of Attention Mechanism
To evaluate the effectiveness and contribution of the attention mechanism, we compared various classical and representative attention modules applied to our GAN baseline in Table V, including SE_Block [47], CBAM [40], FA [48], and MAFN [24], and reported the OAs on the three datasets. It can be seen that both CBAM and our SSAT obtain considerable results in all three cases. This is because their cascade connection forms fit our architecture better. Besides, the FA module has more promising results on the UP and KSC datasets than on the IN dataset. The reason is that it requires a high spatial resolution for the covariance matrices computed in FA to be useful.
Moreover, we also investigated feature visualization with the guidance of the attention weights under the SSAT modules. In this experiment, only the
Feature visualization with the guidance of the attention weights over SSAT on the UP dataset. Each land cover category is randomly selected from the labeled training set and described with the false color. The corresponding feature visualization is obtained by applying Grad-CAM [49].
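For reference, a hedged sketch of the Grad-CAM computation used to produce these visualizations is given below; the hook-based implementation, layer choice, and variable names are assumptions, not the authors' exact code.

```python
# Hedged sketch of Grad-CAM [49] for visualizing which regions drive the class
# score, assuming a PyTorch model whose target convolutional layer can be hooked.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    """Return a normalized heatmap over the input's spatial extent."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(x)[:, class_idx].sum()        # class score for the chosen category
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    fmap, grad = feats[0], grads[0]             # (n, c, h, w) activations and gradients
    weights = grad.mean(dim=(2, 3), keepdim=True)            # global-average-pooled gradients
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))  # weighted activation map
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)  # normalize to [0, 1]
```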
E. Execution Time Analysis on Different Datasets
The training and testing times on the three datasets are also reported in Tables II–IV. To assess the computational complexity, we report the execution time (in milliseconds, i.e., ms) per epoch or iteration for the various methods.
In general, EPF-SVM consumed the shortest training time in all three cases. 3D-Conv-Capsule took the longest time because it needs to construct a dynamic route for optimal vector search during training. GAN-based methods need to optimize the discriminator and the generator alternately and thus require a relatively long training time. In addition, among the deep learning methods, Spa-AT-GAN took the shortest time to train on all three datasets, being about 4–6 times faster than GAN-CRF. In contrast, we find that the time cost of SSAT-GAN is similar to that of HSI-BERT, while our SSAT-GAN achieves better accuracy, as illustrated in Tables II–IV.
For testing, 3D-GAN took more time because of its large candidate neighboring areas and deep network architecture. In contrast, SSAT-GAN consumes relatively less time owing to the efficient feature representation of the spectral–spatial attention spread and generation blocks. In summary, the proposed framework is the most efficient method with advanced performance under a fair comparison.
F. Sensitivity Analysis on Different Numbers of Labeled Samples for Training
To observe the effect of different numbers of labeled samples on OA, we randomly selected labeled pixels in the range of
Impact of different number of labeled samples on OA results for training. OA results were obtained by all algorithms on (a) IN dataset, (b) UP dataset, and (c) KSC dataset.
To verify the contribution of different categories to both AA and kappa as the number of labeled training samples changes, a new experiment was performed on the three datasets with the proposed SSAT-GAN. Fig. 18 illustrates the CA of each class on the three datasets. It is observed that SSAT-GAN acquires stable CAs for the “Grass-trees,” “Hay-windrowed,” and “Woods” classes regardless of the total number of labeled samples, owing to the discriminative spectral characteristics of these three ground materials in the IN dataset. Therefore, it still achieves satisfactory classification performance even with relatively few labeled samples. Furthermore, the CAs of the remaining classes in the IN dataset first improve and then stabilize as the number of labeled samples increases.
Class accuracy results for each class with different number of total labeled samples for training over the SSAT-GAN on (a) IN dataset, (b) UP dataset, and (c) KSC dataset.
For the UP dataset illustrated in Fig. 18(b), the CAs of the “Meadows,” “Painted metal sheets,” and “Shadows” classes show negligible variation as the number of labeled samples increases. For the other classes, the accuracy values of the proposed SSAT-GAN tend to stabilize as the number of samples increases. Similar observations can be made in Fig. 18(c). Overall, not all classes contribute to AA and kappa to the same degree as the number of labeled training samples changes. The reason may be that the spectral signatures suffer from spectral variability caused by illumination and atmospheric conditions. However, our SSAT modules can alleviate such limitations of the spectral characteristics, as illustrated by the advanced accuracy values on the three datasets.
Discussion
There are three differences between the proposed SSAT-GAN and previous GAN-based methods for HSI classification [29], [32], [33]. First, SSAT-GAN takes the attention information of HSIs into account in both the discriminator and the generator. Second, the discriminator in the adversarial framework adds unlabeled samples for semisupervised learning, alleviating the impact of small samples. Third, a mean minimization loss is employed for the unsupervised learning of SSAT-GAN to reduce the complex parameter computation of high-dimensional features and thus achieve steady-state performance of the GAN.
The SSAT-GAN incorporates the SSAT as a feature perception enhancement step in the feature extraction stage, which builds a high-SNR spectral domain and a physically denoised contextual area along the spectral and spatial dimensions, respectively. Compared with the attention mechanisms used in the vision community, the SSAT considers the long-range correlations between neighboring HSI cubes. This property helps the SSAT-GAN framework better filter noise in areas with different spectral purity and texture information.
We gain three major insights from the semisupervised HSI classification outcomes of GANs on all three datasets. First, by taking the spectral–spatial discriminative features of the training data into account, the discriminators of SSAT-GANs extract efficient and significant HSI characteristics and achieve better classification accuracies. Second, the unlabeled samples and the generated HSI samples used in unsupervised learning make the discriminators more robust within the adversarial framework and help them learn the complex real data distribution of HSIs for prediction. This alternate training mode enables semisupervised GANs to achieve superior classification outcomes compared with supervised deep-learning-derived frameworks. Third, the mean minimization loss constrains the optimization of the high-dimensional feature maps generated by the discriminators, acting like a smoothing filter that imposes correlation within homogeneous regions, including high-texture areas and spectrally pure domains.
Conclusion
In this article, an SSAT-GAN approach for HSI classification was proposed that uses a cascade feature representation of spectral–spatial attributes with the SSAT. The proposed model improves the transmission of characteristics with extended spectral–spatial attention feature spread and generation blocks for feature representation. It effectively applies attention weights to emphasize both spectral bands and spatial correlations, improving the characterization during feature extraction. Besides, SSAT-GAN constructs a semisupervised architecture by adding unlabeled samples for training to alleviate the scarcity of training samples. Furthermore, we employ the mean minimization loss for the unsupervised learning of the discriminator to avoid mode collapse. In terms of the accuracy and computation of the experiments, the analysis on the three HSI datasets indicates that our model achieves excellent performance.
ACKNOWLEDGMENT
The authors would like to thank the Associate Editor and the three anonymous reviewers for their outstanding comments and suggestions, which greatly helped the authors to improve the technical quality and presentation of this article.