Spectral–Spatial Attention Feature Extraction for Hyperspectral Image Classification Based on Generative Adversarial Network


Abstract:

Recent research shows that deep learning frameworks based on generative adversarial networks (GANs) can improve the accuracy of hyperspectral image (HSI) classification with limited labeled samples. However, several studies point out that existing GAN-based methods are heavily affected by the complexity and inefficient description of HSIs. The discriminator in a GAN always attempts to interpret the high-dimensional nonlinear spectral knowledge of HSIs, which results in the Hughes phenomenon. Another critical issue is sample generation. The generator is only used as a regularizer for the discriminator, which seriously restricts classification performance. In this article, we propose SSAT-GAN, a semisupervised spectral–spatial attention feature extraction approach based on the GAN that feeds raw data into a deep learning framework in an end-to-end fashion. First, unlabeled data are added to the discriminator to alleviate the shortage of training samples and to supply a reconstructed real HSI data distribution through adversarial training. Second, to enhance the description of HSIs, we build spectral–spatial attention modules (SSAT) and extend them to the discriminator and the generator to extract discriminative characteristics from abundant spatial contexts and spectral signatures. The SSAT modules learn a three-dimensional filter bank with spectral–spatial attention weights to obtain meaningful feature maps that improve the discrimination of the feature representation. To address the mode collapse of GANs, the mean minimization loss is employed for unsupervised learning. Experimental results on three real datasets indicate that SSAT-GAN has certain advantages over state-of-the-art methods.
Page(s): 10017 - 10032
Date of Publication: 28 September 2021


CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.
SECTION I.

Introduction

Hyperspectral imagery (HSI) records hundreds of narrow, contiguous spectral bands from the surface, which provide abundant characteristics that enhance the ability to identify ground materials [1]. With the rapid development of high-resolution imaging technology, HSI has become an ideal tool for observing the surface across a broad range of applications, including mineral substance identification [2], monitoring of plant diseases [3], anomaly detection [4], and land-cover mapping [5]. HSI classification plays a substantial role in these fields; it aims to analyze the discriminative characteristics of HSI and assign each pixel to its corresponding land-cover category [6]. Two major characteristics of HSI should therefore be considered. First, the high-dimensional nonlinear spectral signature, which originates from redundant spectral bands, enables the accurate distinction of homologous surface categories. Second, the high spatial correlation, which derives from homogeneous regions, provides auxiliary spatial context for accurate pixelwise classification [7].

Since spectral information directly reflects the characteristics of different materials, one set of traditional methods produces classification maps in a pixelwise way, which can be divided into two steps: 1) feature engineering, such as principal component analysis (PCA) [8] and band selection [9]; and 2) classifier development, including the support vector machine (SVM) [10] and random forest [11]. This kind of approach is constrained by the high-dimensional nonlinear characteristics, which leads to unsatisfactory results. To further improve the representation of HSIs, another set of approaches exploits spectral–spatial expression by introducing spatial contexts in the feature engineering step. For instance, Kang et al. [12] proposed a feature fusion framework that combines edge-preserving filtering (EPF) with SVM. Jiang et al. [13] regarded the superpixel as a carrier to extract potential features. However, the models mentioned above consist of shallow structures, which cannot provide an efficient description.

With the advancement of artificial intelligence, CNN-based approaches have attracted increased attention because their objective functions aim directly at classification, rather than relying on two independent steps, and obtain remarkable results [14], [15]. In 2016, Zhao et al. [16] adopted a CNN to learn local spatial contexts for HSI classification. Chen et al. [17] designed a 3-D CNN to extract features from neighboring spectral cubes taken from the HSIs themselves instead of from dimensionality-reduced data. Nonetheless, a deeper network may lead to the Hughes phenomenon when the spectral–spatial distribution is complex and training samples are scarce.

Meanwhile, with the development of deep learning, a series of deep-learning-derived methods have been applied to HSI classification and proven successful. Many classification frameworks obtain superior results by constructing highly efficient spectral–spatial feature extraction. For instance, Zhong et al. [18] built a spectral–spatial residual network (SSRN) to reduce the complexity of the network design and achieved advanced performance. In [19], a dense convolutional block was employed for accurate identification. A 3D-Conv-Capsule model [20] was presented for HSI classification, which attempted to consider the pixel position attributes to enhance spatial awareness. In addition, in Sellami's work [21], a spectral–spatial graph was constructed to fully exploit the inherent spatial distribution.

Another line of approaches accomplishes spectral–spatial classification by exploiting attention mechanisms, which perform classification after aggregating features from homogeneous regions. Xu et al. [22] designed a control gate attention mechanism for the quick acquisition of key features. In [23], a spectral–spatial classification framework was proposed that combines a CNN with a self-attention module to enhance the correlation of features. In [24], a multiattention fusion network (MAFN) was designed to mine significant features for classification. Yu et al. [25] presented a dense CNN framework with a feedback attention mechanism to further improve computational efficiency. However, the attention weight embedding was placed behind the spectral–spatial representation, which introduced the influence of interference pixels and redundant spectral bands. He et al. [26] designed HSI-BERT to capture global dependence among pixels within the receptive field. However, the transformer-based method needs multiple nonlocal areas to capture global long-term dependence.

In contrast to classical optical image classification in computer vision, which involves hundreds of categories, land-cover classification of HSI involves far fewer targets. Therefore, the assumption that deep learning requires a large amount of training data might not apply to HSIs, which lack labeled samples. Several works focus on semisupervised learning that uses both labeled and unlabeled HSI samples for training. For instance, Fang et al. [27] presented a resampling strategy for training a CNN sufficiently. In [28], the uncertainty of unlabeled HSI samples is considered for classification. Although these studies have achieved significant results, the gains may stem from regions of high spatial correlation rather than from the deep learning methods themselves.

Recently, generative adversarial networks (GANs) have been applied to HSI classification to alleviate the issue of limited labeled samples. Specifically, GAN-based classifiers start from the semisupervised HS-GAN proposed by Zhan et al. [29], which used 1-D spectral vectors as the input. To exploit the benefit of spatial information, a neighborhood majority voting strategy [30] was later applied to the prediction. He et al. [31] built a 3-D bilateral filtering-based GAN framework to improve spatial awareness. A 3D-GAN was proposed for HSI classification that keeps only the first three principal components of the raw data as input. In [33], a semisupervised GAN with a conditional random field (GAN-CRF) was designed that regards the softmax prediction as conditional probabilities of HSI to refine classification maps. To enhance the meaningful semantic contexts, an adaptive DropBlock-enhanced GAN (AD-GAN) [34] was established to stabilize the training state of the model.

Although these GAN-based methods have achieved satisfactory performance over contemporaneous benchmarks, two drawbacks in HSI classification remain to be solved.

The first challenge is the mode collapse of GANs. The generator G deceives the discriminator D by generating data from the limited labeled data distribution [35]. The restricted, narrow, and redundant spectral signatures limit the representation ability of the GAN and lead to poor data generation. In Wang's work [34], an adaptive DropBlock is employed as a regularization method to alleviate mode collapse. However, supervised GANs generate a data distribution similar to that of the labeled training samples and thus have difficulty learning the complete real HSI distribution. In addition, the unlabeled data of HSI remain an unexploited gold mine for efficient data utilization. Recently, in response to this, Liang et al. [36] implemented the mean minimization loss, which imposes a constraint over the unlabeled data of HSI, and achieved superior results. A possible reason is that this loss may minimize the values and variances of the high-dimensional feature maps from D. As a result, the GAN model is hardly subject to the impact of complex parameter calculation, which helps guarantee the stability of the training state.

Another critical issue is the complexity and inefficient description of spectral–spatial characteristics. Classification performance tends to deteriorate when the extraction of spectral–spatial characteristics is affected by interference pixels. Therefore, it is hard to guarantee that the GAN always works toward the authentic HSI distribution, particularly for high-dimensional spectral signatures or texture-dependent contexts. In Feng's work [37], a joint spatial–spectral hard attention mechanism was employed in G to help D discard misleading and confounding information for HSI classification. However, it only focused on a specific area of the input patches in one batch, which requires more complex training techniques. In a separate line of work, an attention-aware block [38] was designed in ResNets to enhance the representation of HSI data, and it was demonstrated that this block can learn more valuable and valid representations. However, when dealing with objects with variable spectra or irregular areas, the attentive architecture is inefficient. We argue that if the homogeneous spectrum and adaptive receptive fields are taken into account, the complexity issue of the HSI data can be alleviated.

To tackle the above-mentioned challenges of GAN-based methods, we propose a spectral–spatial attention feature extraction approach based on GANs (SSAT-GAN) for HSI classification. The proposal aims to build a significant representation of spectral–spatial characteristics and to enhance the robustness and stability of GANs through semisupervised learning. On the one hand, SSAT-GAN takes unlabeled data into account to alleviate the scarcity of labeled samples, which enables the generator G to implicitly reconstruct real HSI cubes. Meanwhile, we adopt the mean minimization loss as an unsupervised constraint term in the discriminator D to avoid overfitting. On the other hand, the complicated spectral–spatial characteristics of local adjacent pixels bring redundancy and inefficiency, which degrade classification further in more complex regions. Inspired by the fact that attention weights can enhance the effective representation of the salient neighborhood of an object, the spectral–spatial attention modules (SSAT) are designed separately in this article to capture the discriminative representation; both intraspectrum and contextual relations of HSIs participate in the attention calculation through feedback, and the weighted feature maps are used to enhance intraclass consistency. We then extend the SSAT to consecutive feature spreading and generation blocks, which are stacked to build D and G, respectively. Unlike traditional semisupervised GANs, which require a deeper convolutional architecture for feature representation, our proposal is feature-efficient because both D and G share parameter weights with the corresponding attention modules, which further improves the feature description. As a result, the well-trained D can achieve satisfactory classification accuracy.

The main contributions of this article are listed as follows.

  1. We design a novel semisupervised GAN-based HSI classification framework that uses a small number of labeled and unlabeled samples for training. The mean minimization loss is employed for unsupervised learning, which boosts the backpropagation of the gradient and stabilizes the training of the GAN.

  2. For the purpose of alleviating the inefficient description, we integrate the spectral–spatial attributes into SSAT for representation discrimination of the HSI data.

  3. The alternately optimized architecture design makes the SSAT-GAN a framework that generalizes well in three real HSI datasets and achieves satisfactory classification accuracy over state-of-the-art methods.

The rest of this article is organized as follows. Section II reviews the basic concepts of GANs. The scheme of the proposed SSAT-GAN and its components are introduced in Section III. Experimental results and analysis are presented in Section IV. The superiority of SSAT-GAN is discussed in Section V. Finally, the conclusion is drawn in Section VI.

SECTION II.

Related Work

A. Generative Adversarial Network

The GAN is an unsupervised deep learning model proposed by Goodfellow et al. [39], which provides a reasonable scheme to implicitly estimate the real data distribution. A GAN incorporates a generator G and a discriminator D in a unified network, where G generates samples to fool D into accepting them as real, and D judges the genuineness of the samples. These opposing objectives drive G and D toward a Nash equilibrium in a zero-sum game, which is expressed as the minimax optimization problem
\begin{equation*} \begin{split} \underset{G}{\text{min}}\,\underset{D}{\text{max}}\ \text{Loss} = {} & \text{E}_{{\bf {z}}\sim p_{z}}\left[ \text{log}\left(1-D\left(G\left({\bf {z}} \right) \right) \right) \right] \\ & +\text{E}_{{\bf {x}}\sim p_{\text{data}}}\left[ \text{log}D\left({\bf {x}} \right) \right] \end{split} \tag{1} \end{equation*}
where {\bf {z}}\sim p_{z} and {\bf {x}}\sim p_{\text{data}} denote the random noise vectors and the input images following the real data distribution, respectively, and \text{E}(\cdot) is the expectation. D({\bf {x}}) is the sigmoid output of D for a real input, and G({\bf {z}}) is the synthetic data produced by G from random noise. D(G({\bf {z}})) is the probability that D assigns to the generated input G({\bf {z}}) being real.

In the optimization process of the GAN, G and D are optimized alternately. Given the output G({\bf {z}}) of G, the model optimizes D by maximizing \text {E}_{{\bf {x}}\sim p_{\text{data}}}[ \text{log}D({\bf {x}}) ]+\text {E}_{{\bf {z}}\sim p_{z}}[ \text{log}(1-D(G({\bf {z}}))) ]. When D arrives at a stationary score, G is optimized by minimizing \text {E}_{{\bf {z}}\sim p_{z}}[ \text{log}(1-D(G({\bf {z}}))) ]. Once D and G reach the Nash equilibrium during adversarial training, the GAN has learned an estimate of the real data distribution and can produce promising results.
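
As a rough illustration of the two objectives in (1), the following sketch estimates them from batches of discriminator outputs. The function names and the sample probabilities are hypothetical and chosen only for illustration; the networks D and G themselves are not modeled here.

```python
import math

def discriminator_objective(d_real, d_fake):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))], which D maximizes."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_objective(d_fake):
    """Batch estimate of E[log(1 - D(G(z)))], which G minimizes."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# Hypothetical sigmoid outputs of D on real and generated batches.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.1, 0.3, 0.2]
print(discriminator_objective(d_real, d_fake))
print(generator_objective(d_fake))
```

When D outputs 0.5 for every sample, which is the case at the equilibrium of the original GAN formulation, the discriminator objective equals -log 4.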

SECTION III.

SSAT-GAN Framework

The SSAT-GAN flowchart is shown in Fig. 1. Suppose the raw HSI dataset {\bf {X}} contains m pixels \lbrace {\bf {x}}_{1},{\bf {x}}_{2},{\bf {x}}_{3},\ldots, {\bf {x}}_{m} \rbrace \in \mathbb {R}^{1\times 1\times {b}}, where {b} is the number of spectral bands. The neighboring cubes centered at the labeled pixels form the labeled dataset {\bf {X}}^{1}=\lbrace {\bf {x}}_{i}^{1} \rbrace \in \mathbb {R}^{{w}\times {w}\times {b}\times {m}_{l}}, and the unlabeled cubes form {\bf {X}}^{2}=\lbrace {\bf {x}}_{i}^{2} \rbrace \in \mathbb {R}^{{w}\times {w}\times {b}\times {m}_{u}}, where w, {m}_{l}, and {m}_{u} are the spatial size of the HSI cubes and the numbers of labeled and unlabeled HSI samples, respectively. We send these two datasets to the discriminator to learn the real distribution of HSI. The generator synthesizes HSI cubes {\bf {Z}}=\lbrace {\bf {z}}_{1},{\bf {z}}_{2},{\bf {z}}_{3},\ldots, {\bf {z}}_{m} \rbrace, with samples of the same size as those in {\bf {X}}^{2}. In addition, the labeled {\bf {X}}^{1} has its corresponding annotation {\bf {Y}}^{1}=\lbrace {\bf {y}}_{i}^{1} \rbrace \in \mathbb {R}^{(1+{n}_{y}) \times {m}_{l}}, where {n}_{y} is the number of land-cover categories, and {\bf {y}}_{i}^{1}[ 0 ], the first item of {\bf {y}}_{i}^{1}, indicates the authenticity of the corresponding HSI cube. The classification of the HSI is carried out with the well-trained discriminator.
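
The data layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing: the helper names `extract_cube` and `make_label` are hypothetical, borders are zero-padded, w is assumed odd, the class part of the label is one-hot, and a value of 1 in entry 0 is taken to mean "real".

```python
def extract_cube(image, row, col, w):
    """Return the w x w x b cube centered at (row, col).

    `image` is a nested list indexed as image[row][col][band].
    Pixels outside the image are zero-padded (an assumption).
    """
    height, width, bands = len(image), len(image[0]), len(image[0][0])
    half = w // 2  # w is assumed to be odd
    cube = []
    for r in range(row - half, row + half + 1):
        cube_row = []
        for c in range(col - half, col + half + 1):
            if 0 <= r < height and 0 <= c < width:
                cube_row.append(list(image[r][c]))
            else:
                cube_row.append([0.0] * bands)
        cube.append(cube_row)
    return cube

def make_label(class_index, n_classes, is_real=True):
    """Build a (1 + n_y)-vector: entry 0 flags authenticity, the rest is one-hot."""
    label = [0.0] * (1 + n_classes)
    label[0] = 1.0 if is_real else 0.0
    label[1 + class_index] = 1.0
    return label

# Toy 4 x 5 image with b = 3 bands.
image = [[[float(r), float(c), float(r + c)] for c in range(5)] for r in range(4)]
cube = extract_cube(image, 0, 0, 3)
print(len(cube), len(cube[0]), len(cube[0][0]))
print(make_label(2, 4))
```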

Fig. 1. - Flowchart of the SSAT-GAN framework for HSI classification. First, the unlabeled group ${\bf {X}}^{2}$ is established to initialize the parameters of the discriminator, and the generator transforms the noise vectors ${\bf {z}}$ into a set of fake HSI cubes ${\bf {Z}}$, which implicitly learns the real HSI distribution. Then, the discriminator attempts to identify the authenticity of the input HSI cubes that derive from ${\bf {X}}^{2}$ or ${\bf {Z}}$. Finally, the categorical information $\hat{{\bf {Y}}}$ is predicted by the discriminator, which is fed the labeled ${\bf {X}}^{1}$ during training. The corresponding annotation ${\bf {Y}}^{1}$ is adopted for evaluation and for computing the supervised part of the GAN loss.

SSAT-GAN incorporates the spectral and spatial attention modules in both the discriminator and the generator to extract discriminative features, where the discriminator and the generator are composed of convolutional and transposed convolutional layers, respectively.

A. Spectral and Spatial Attention Modules

The purpose of SSAT is to enhance the feature analysis of a salient and effective domain, which is inspired by CBAM [40]. Given an intermediate feature map, it sequentially calculates attention weights along spectral and spatial dimensions, separately.

1) Spectral Attention Module

The spectral attention module aims at exploring the intraclass consistency of spectrums. As each band of spectral energy is considered as a class feature detector, our spectral attention focuses on “which” bands are meaningful given an input cube. To highlight discriminative signature from spectral knowledge while retaining uniform characteristics, we use both depthwise separable convolution (Depth_CONV) [41] and 3-D convolution (3D_CONV) operations. For aggregating intraspectrum information, average-pooling and max-pooling have been commonly adopted so far. In addition, we employ a spectral squeeze mechanism to assign an independent weight to each element along spectral dimension. Finally, the spectral attention weights will be generated by a dynamic activation function.

As demonstrated in Fig. 2, take n HSI cubes {\bf {X}}^{p} of size {w}\times {w}\times {d} as the ({p}+1)th input feature map. The module first captures the available homogeneous area using both Depth_CONV and 3D_CONV operations with {n} spectral kernels of size 1\times 1\times {m}, generating two different spatial context descriptors: {\bf {X}}^{\text{Dep}} and {\bf {X}}^{\text{3-D}}. Both descriptors are then forwarded to average-pooling and max-pooling operations, which denote the salient and effective features, respectively. After elementwise addition, the feature vectors are passed through a squeeze mechanism to extract the spectral energy relationship, producing the spectral attention weights {\bf {Atte}}_{\text{spc}}. These weights enlarge the contribution of HSI pixels with discriminative signatures in the spectral distribution and suppress that of pixels that hinder identification. The squeeze mechanism is composed of fully connected layers (FCs) with one embedding layer. To optimize parameter efficiency, the embedding units are set to \mathbb {N}^{1\times 1\times d/r}, where r is the optimization ratio. {\bf {Atte}}_{\text{spc}} can be formulated as
\begin{align*} {\bf {Atte}}_{\text{spc}} & = \sigma \left(\text{FCs}\left(\text{AvgPool}\left({\bf {X}}^{\text{Dep}}\right)+\text{MaxPool}\left({\bf {X}}^{\text{3-D}}\right)\right)\right)\\ &= \sigma \left(\mathbf {W}_{1}\left(\mathbf {W}_{0}\left({\bf {X}}_{\text{avg}}^{\text{Dep}}+{\bf {X}}_{\text{max}}^{\text{3-D}}\right)\right)\right) \tag{2} \end{align*}
where \sigma (\cdot) is a sigmoid activation, which constrains the weights to the range [0, 1], and \mathbf {W}_{0}\in \mathbb {N}^{d/r\times d} and \mathbf {W}_{1}\in \mathbb {N}^{d\times d/r} are the FC weights that squeeze along the spectral dimension. From a physical point of view, this can be considered as a signal-to-noise ratio (SNR) enhancement, that is, the ratio of the valid spectral energy, treated as signal energy and specified by \mathbf {W}_{1}, to the squeezed features, treated as noise energy and specified by \mathbf {W}_{0}.
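
The pooling and squeeze steps of (2) can be sketched in plain Python as below. This is only an illustration under assumptions: the two descriptors are taken as given (the Depth_CONV and 3D_CONV operations are not implemented), cubes are nested lists indexed as x[row][col][band], and the weight values and the function name `spectral_attention` are hypothetical.

```python
import math

def spectral_attention(x_dep, x_3d, w0, w1):
    """Compute sigmoid(W1 (W0 (AvgPool(x_dep) + MaxPool(x_3d)))) as in (2).

    Pooling runs over the spatial dimensions, leaving one value per band.
    w0 has shape (d/r) x d and w1 has shape d x (d/r).
    """
    d = len(x_dep[0][0])
    pixels_dep = [px for row in x_dep for px in row]
    pixels_3d = [px for row in x_3d for px in row]
    avg = [sum(px[k] for px in pixels_dep) / len(pixels_dep) for k in range(d)]
    mx = [max(px[k] for px in pixels_3d) for k in range(d)]
    pooled = [a + m for a, m in zip(avg, mx)]
    # Squeeze to d/r embedding units, then expand back to d bands.
    hidden = [sum(w * v for w, v in zip(w_row, pooled)) for w_row in w0]
    logits = [sum(w * h for w, h in zip(w_row, hidden)) for w_row in w1]
    return [1.0 / (1.0 + math.exp(-t)) for t in logits]

# Toy example with w = 2, d = 4, r = 2 (all values are illustrative).
x = [[[1.0, 2.0, 3.0, 4.0] for _ in range(2)] for _ in range(2)]
w0 = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.1]]
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
atte = spectral_attention(x, x, w0, w1)
print(atte)

# Each band of a feature cube would then be scaled by its weight.
weighted = [[[a * v for a, v in zip(atte, px)] for px in row] for row in x]
```

The result holds one weight per band, so multiplying it into a cube applies the same weight at every spatial position.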

Fig. 2. - Spectral attention module utilizes both Depth_CONV and 3D_CONV descriptors with pooling operations, followed by a spectral squeeze mechanism to predict spectral attention weights.

2) Spatial Attention Module

To exploit the interclass differences of spatial contexts, we build a spatial attention module to generate a spatial attention map. Since a pixel does not always share the category of its neighbors, spatial attention indicates “where” the informative areas are, which is complementary to spectral attention. As illustrated in Fig. 3, we apply atrous convolution (Atrous_CONV) and 3D_CONV operations along the spectral axis, generating two different intermediate feature maps, which aim at extending the receptive field and reducing the interference of abnormal pixels. Atrous_CONV, which uses a shared kernel with multiple dilations to learn spatial contexts, has proven effective for learning intraclass-consistent homogeneous areas of HSIs [42]. To mitigate the “gridding problem” [43], both intermediate maps are then forwarded to average-pooling and max-pooling to enhance the spatial representation and generate a contextual descriptor through pixelwise addition. The spatial attention weights are predicted after activating the neural parameters of the contextual descriptor.
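
To make the receptive-field effect of atrous convolution concrete, the following generic single-channel sketch applies one shared kernel at a chosen dilation. It is not the paper's Atrous_CONV layer; the function name `atrous_conv2d`, the zero padding, and the odd kernel size are assumptions made for illustration.

```python
def atrous_conv2d(image, kernel, dilation):
    """Same-size 2-D convolution (cross-correlation) with a dilated kernel.

    Kernel taps are spaced `dilation` pixels apart, so a k x k kernel covers
    a window of side dilation * (k - 1) + 1 without adding parameters.
    Positions outside the image are treated as zero.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2  # k is assumed to be odd
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr = r + (i - half) * dilation
                    cc = c + (j - half) * dilation
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out

# A single bright pixel in a 7 x 7 image shows which positions each tap reaches.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
kernel = [[1.0] * 3 for _ in range(3)]
for row in atrous_conv2d(image, kernel, dilation=2):
    print(row)
```

Reusing the same `kernel` with several `dilation` values gives responses at several scales, which is one way to read the shared-kernel description above.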

Fig. 3. - Spatial attention module combines two similar feature maps that are convolved with Atrous_CONV and 3D_CONV, and pooled along the spectral axis, and then feeds them to a convolution layer.

Supposing an HSI cube {\bf {X}}^{q} of size {w}\times {w}\times {d} is the ({q}+1)th input of the spatial attention module, we first aggregate spectral information by two convolutional operations with kernel size {a}\times {a}\times {d}, generating {\bf {X}}^{\text{Atr}} and {\bf {X}}^{\text{3-D}}. They are then passed through the pooling operations to generate two maps, {\bf {X}}_{\text{avg}}^{\text{Atr}} and {\bf {X}}_{\text{max}}^{\text{3-D}}, which denote local effective information and uniform contexts across the spectral knowledge, respectively. The spatial attention weights {\bf {Atte}}_{\text{spa}} are then predicted by a standard convolution after the pixelwise addition, which is computed as
\begin{align*} {\bf {Atte}}_{\text{spa}} & = \sigma \left(\left[\text{AvgPool}\left({\bf {X}}^{\text{Atr}}\right);\text{MaxPool}\left({\bf {X}}^{\text{3-D}}\right)\right]\right)\ast {\bf {H}}^{{q}+1}+ {\bf {b}}^{{q}+1}\\ & = \sigma \left(\left[{\bf {X}}_{\text{avg}}^{\text{Atr}};{\bf {X}}_{\text{max}}^{\text{3-D}}\right]\right)\ast {\bf {H}}^{{q}+1}+ {\bf {b}}^{{q}+1} \tag{3} \end{align*}
where \ast denotes the convolution operation, {\bf {H}}^{{q}+1}\in \mathbb {R}^{a\times a\times 2} is the spatial convolutional kernel, in which a denotes the spatial sampling size, and {\bf {b}}^{{q}+1} denotes the bias. Note that the spatial sizes of the maps are fixed at w\times w under the padding strategy, which means that the spatial attention module can explore the adaptive neighboring correlation over the dilated receptive regions. Therefore, the spatial attention module can provide supplementary information for accurate spectral feature mapping.
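
A plain-Python sketch of the pooling and convolution in (3) follows. Several points are assumptions rather than statements of the authors' implementation: the two descriptors are taken as given, the two pooled maps are stacked as two channels to match the a×a×2 kernel, the border is zero-padded, and the sigmoid is applied after the convolution so that the weights fall in [0, 1], following the CBAM convention [40]; the grouping printed in (3) places the sigmoid before the convolution. The function name `spatial_attention` is hypothetical.

```python
import math

def spatial_attention(x_atr, x_3d, kernel, bias):
    """Return a w x w attention map from two w x w x d descriptors.

    kernel[i][j] holds two weights: one for the average-pooled map and
    one for the max-pooled map. Pooling runs along the spectral axis.
    """
    w = len(x_atr)
    avg_map = [[sum(px) / len(px) for px in row] for row in x_atr]
    max_map = [[max(px) for px in row] for row in x_3d]
    a = len(kernel)
    half = a // 2  # a is assumed to be odd
    atte = [[0.0] * w for _ in range(w)]
    for r in range(w):
        for c in range(w):
            acc = bias
            for i in range(a):
                for j in range(a):
                    rr, cc = r + i - half, c + j - half
                    if 0 <= rr < w and 0 <= cc < w:  # zero padding keeps w x w
                        acc += kernel[i][j][0] * avg_map[rr][cc]
                        acc += kernel[i][j][1] * max_map[rr][cc]
            atte[r][c] = 1.0 / (1.0 + math.exp(-acc))
    return atte

# Toy example with w = 3, d = 2, and a 1 x 1 x 2 kernel (illustrative values).
x = [[[1.0, 3.0] for _ in range(3)] for _ in range(3)]
print(spatial_attention(x, x, [[[0.5, -0.5]]], 0.0))
```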

B. Spectral–Spatial Attention Discriminator and Generator

We incorporate our SSAT in the generator and discriminator and extend them to four spectral–spatial attention spread learning and generation blocks. Fig. 4 shows the architecture of four attention blocks, each of which can be regarded as an extension of successive convolution and transposed convolution.

Fig. 4. - Four feature spreading blocks with lightweight spectral–spatial attention modules aiming for HSI feature extraction and generation in SSAT-GAN. (a) and (b) Spectral and spatial attention feature spread blocks in discriminators; (c) and (d) Spectral and spatial attention feature generation blocks in generators.

1) Spectral Attention Feature Spread Block

For the redundant spectral bands, as shown in Fig. 4(a), the spectral attention module is introduced in the (p+1)th layer to assign an attention weight to the spectral tensor of the HSI and to aggregate the intraclass correlation of the narrow spectrum. Next, the (p+2)th layer uses a 3-D convolution layer with batch normalization [44] (CONV_BN) to update the parameters according to the spectral attention feature. A skip connection is applied instead of a direct mapping between the (p+1)th and the (p+2)th layers, building the spectral attention feature extraction function {F}({\bf {X}}^{p};\boldsymbol{\theta }). If {\bf {X}}^{p} and {\bf {X}}^{{p}+1} represent the input intermediate feature cube of the pth layer and the output feature cube of the ({p}+1)th spectral convolutional layer, respectively, then the architecture of {F}({\bf {X}}^{p};\boldsymbol{\theta }) can be formulated as
\begin{align*} {\bf {X}}^{{p}+2}&={\bf {X}}^{p}+{F}\left({\bf {X}}^{p};\boldsymbol{\theta }\right), \tag{4} \\ {F}\left({\bf {X}}^{p};\boldsymbol{\theta }\right) &=\left({\bf {Atte}}_{\text{spc}}\otimes \text {R}\left(\hat{{\bf {X}}}^{{p}+1}\right)\right)\ast {\bf {h}}^{{p}+2}+{\bf {b}}^{{p}+2}, \tag{5} \\ {\bf {X}}^{{p}+1}& = \text {R}\left(\hat{{\bf {X}}}^{{p}}\right)\ast {\bf {h}}^{{p}+1}+{\bf {b}}^{{p}+1}, \tag{6} \\ \hat{{\bf {X}}}^{{p}} &= \frac{{\bf {X}}^{{p}}-E\left({\bf {X}}^{{p}}\right)}{\mathrm{Var}\left({\bf {X}}^{{p}}\right)} \tag{7} \end{align*}
where \boldsymbol{\theta }=\lbrace {\bf {h}}^{{p}+1},{\bf {h}}^{{p}+2},{\bf {b}}^{{p}+1},{\bf {b}}^{{p}+2} \rbrace \in \mathbb {R}^{1\times 1\times m, n}. Note that \boldsymbol{\theta } denotes the weights and biases of the spectral convolutional kernels, whose parameters are shared throughout training. {\bf {Atte}}_{\text{spc}} is the spectral attention weights given by (2). \text {R}(\cdot) is the ReLU activation function, which sets negative values to zero. E(\cdot) and \mathrm{Var}(\cdot) denote the expectation and variance of the input HSI cubes, respectively, as applied in BN. \ast represents the convolution operation, and \otimes is the elementwise multiplication. Furthermore, {\bf {Atte}}_{\text{spc}} keeps the weights identical across the spatial dimensions while aggregating intraspectrum information to improve the radiant energy efficiency of each band.
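
The residual structure of (4)–(6) can be illustrated on a single spectral vector as below. This is a deliberately reduced sketch, not the block of Fig. 4(a): it uses one 1-D kernel along the spectral axis instead of n 3-D kernels, takes the attention weights as given, and normalizes with the common batch-norm form that divides by the square root of the variance plus a small epsilon, which is an assumption relative to (7) as printed. All function names are hypothetical.

```python
import math

def normalize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization (common BN form, assumed here)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

def conv1d_same(x, kernel, bias):
    """1-D convolution along the spectral axis with zero padding (odd kernel)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = bias
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(x):
                acc += k * x[idx]
        out.append(acc)
    return out

def spectral_attention_block(x_p, atte, h1, b1, h2, b2):
    """X^{p+2} = X^p + F(X^p), with F built as in (5) and (6)."""
    x_p1 = conv1d_same(relu(normalize(x_p)), h1, b1)                 # (6)
    weighted = [a * v for a, v in zip(atte, relu(normalize(x_p1)))]  # Atte (x) R(.)
    f = conv1d_same(weighted, h2, b2)                                # (5)
    return [v + r for v, r in zip(x_p, f)]                           # (4)

x_p = [1.0, 2.0, 3.0, 4.0, 5.0]
atte = [0.9, 0.1, 0.5, 0.5, 0.9]  # illustrative attention weights
print(spectral_attention_block(x_p, atte, [0.2, 0.5, 0.2], 0.0, [0.1, 0.3, 0.1], 0.0))
```

Because of the skip connection, the block reduces to the identity whenever F vanishes, for example when the second kernel and bias are zero.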

2) Spatial Attention Feature Spread Block

The spatial attention feature spread block aims to explore the neighboring correlation and intraclass consistency of central pixels in high-spatial regions. Fig. 4(b) shows the detail of the spatial block. The spectral depth of the kernels is identical to that of the input cubes {\bf {X}}^{q}, which means the block extracts adaptive spatial context while maintaining the spectral attention feature. The architecture of the block can be formulated as
\begin{align*} {\bf {X}}^{{q}+2}&={\bf {X}}^{q}+{F}\left({\bf {X}}^{q};\boldsymbol{\xi }\right), \tag{8} \\ {F}\left({\bf {X}}^{q};\boldsymbol{\xi }\right) &=\left({\bf {Atte}}_{\text{spa}}\otimes \text {R}\left(\hat{{\bf {X}}}^{{q}+1}\right)\right)\ast {\bf {H}}^{{q}+2}+{\bf {b}}^{{q}+2}, \tag{9} \\ {\bf {X}}^{{q}+1} &= \text {R}\left(\hat{{\bf {X}}}^{{q}}\right)\ast {\bf {H}}^{{q}+1}+{\bf {b}}^{{q}+1}, \tag{10} \\ \hat{{\bf {X}}}^{{q}}& = \frac{{\bf {X}}^{{q}}-E\left({\bf {X}}^{{q}}\right)}{\mathrm{Var}\left({\bf {X}}^{{q}}\right)} \tag{11} \end{align*}
where \boldsymbol{\xi }=\lbrace {\bf {H}}^{{q}+1},{\bf {H}}^{{q}+2},{\bf {b}}^{{q}+1},{\bf {b}}^{{q}+2} \rbrace \in \mathbb {R}^{a\times a\times d, 1}. {\bf {Atte}}_{\text{spa}} is the spatial attention weights given by (3). In contrast to spectral attention, spatial attention can also be regarded as a form of the image denoising widely used in computer vision; that is, {\bf {Atte}}_{\text{spa}} searches for an adaptive relationship in the local spatial neighborhood and feeds it back to the input feature tensor.

3) Spectral–Spatial Attention Feature Generation Blocks

To overcome the challenge of a small-sample scenario, the idea of spectral–spatial attention is extended to the generator to improve the variety of generation. Fig. 4(c) and (d) shows the details of the spectral–spatial attention generation blocks; they embed both attention modules to spread feature generation, contain successive transposed 3-D convolutions (CONV^{-1}_BN), and generate HSI cubes with spectral–spatial distributions. The architecture of the spectral attention generation block takes the form \begin{align*} {\bf {z}}^{{p}+2}&={\bf {z}}^{p}+{F}\left({\bf {z}}^{p};\boldsymbol{\theta }\right), \tag{12} \\ {F}\left({\bf {z}}^{p};\boldsymbol{\theta }\right) &= \left({\bf {Atte}}_{\text{spc}}\otimes \text {R}\left(\hat{{\bf {z}}}^{{p}+1}\right)\right)\ast ^{T}{\bf {h}}^{{p}+2}+{\bf {b}}^{{p}+2}, \tag{13} \\ {\bf {z}}^{{p}+1}&=\text {R}\left(\hat{{\bf {z}}}^{p} \right) \ast ^{T}{\bf {h}}^{{p}+1}+{\bf {b}}^{{p}+1} \tag{14} \end{align*} where each element of \boldsymbol{\theta } indicates parameters of the spectral transposed convolutional layers, {\bf {Atte}}_{\text{spc}} follows from (2), and \ast ^{T} denotes the transposed convolution operation. \hat{{\bf {z}}}^{p} is the normalization result of the batch feature cubes {\bf {z}}^{p}, computed as in (7). Similarly, the spatial attention generation block takes the form \begin{align*} {\bf {Z}}^{{q}+2}&={\bf {Z}}^{q}+{F}\left({\bf {Z}}^{q};\boldsymbol{\xi }\right), \tag{15} \\ {F}\left({\bf {Z}}^{q};\boldsymbol{\xi }\right) &= \left({\bf {Atte}}_{\text{spa}}\otimes \text {R}\left(\hat{{\bf {Z}}}^{{q}+1}\right)\right)\ast ^{T}{\bf {H}}^{{q}+2}+{\bf {b}}^{{q}+2}, \tag{16} \\ {\bf {Z}}^{{q}+1}&=\text {R}\left(\hat{{\bf {Z}}}^{q} \right) \ast ^{T} {\bf {H}}^{{q}+1}+ {\bf {b}}^{{q}+1} \tag{17} \end{align*}
where \boldsymbol{\xi } denotes the parameters of the spatial transposed convolutions, and {\bf {Atte}}_{\text{spa}} is obtained from (3). Furthermore, \hat{{\bf {Z}}}^{q} is the BN result of the batch feature input {\bf {Z}}^{q}, computed as in (11).

Unlike traditional feature representation blocks, which apply the attention mechanism after feature extraction for HSI data characterization, the proposed spectral–spatial attention feature spread and generation blocks are feature-efficient, i.e., the attention maps are applied during feature extraction. This can be described from two aspects. 1) The consecutive spectral–spatial attention feature spread blocks of the discriminator bring the SSAT into the architecture for training, which provides learnable spectral and spatial attributes. On the one hand, similar to SNR enhancement, the spectral attention weight {\bf {Atte}}_{\text{spc}} retains the high-frequency details of the HSI data and improves the discrimination of the high-level semantic description. On the other hand, the spatial attention weights {\bf {Atte}}_{\text{spa}}, following the denoising theory, can emphasize a broader receptive field to learn adaptive neighborhood relations. Under the guidance of SSAT, the discriminator can obtain excellent interpretation ability for the HSI data, whether in high-purity spectral domains or in highly textured local regions. 2) The spectral–spatial attention feature generation blocks of the generator share both {\bf {Atte}}_{\text{spc}} and {\bf {Atte}}_{\text{spa}} with the feature spread blocks. This means that the implicit synthetic HSI cubes produced by the generator help the discriminator learn more robust and efficient characteristics.

C. Semisupervised SSAT-GAN

Taking the Pavia University (UP) dataset as raw input HSI cubes, Fig. 5 details the SSAT-GAN algorithm stream. The discriminator D contains a spectral attention feature spread block, spatial attention feature spread block, and one FC, and it outputs the vectors with the softmax layer. The generator G includes one FC, spectral attention generation block, and spatial attention generation block to generate HSI cubes. In addition, we extend the SSAT-GAN to semisupervised classification, which adopts unlabeled training samples of a raw HSI cube to improve HSI classification.

Fig. 5. - Spectral–spatial discriminator (top), which contains successive spectral and spatial attention feature spread blocks and outputs a vector consisting of an indicative entry of real or fake data and categorical probabilities; spectral–spatial generator (bottom), which contains successive spectral and spatial attention feature generation blocks and transforms a vector from random noise to a synthetic HSI cube.

In contrast to original GANs, the semisupervised SSAT-GAN introduces a supervised item into the GAN loss to achieve HSI classification. The labeled HSI cube {\bf {X}}^{1}=\lbrace {\bf {x}}_{i}^{1} \rbrace \in \mathbb {R}^{7\times 7\times 103} has its corresponding annotation labels {\bf {Y}}^{1}=\lbrace {\bf {y}}_{i}^{1} \rbrace \in \mathbb {R}^{1 \times (1+{n}_{y}) }, where {n}_{y} is the total number of ground-truth categories, and the extra “1” category denotes whether the HSI cube is from synthetic or real data. Therefore, the prediction of the well-trained D can take the form \begin{equation*} \hat{{\bf {Y}}}^{1}=D\left({\bf {X}}^{1};\theta _{D} \right) \tag{18} \end{equation*} where \theta _{D} denotes the parameters of D, and each element {\bf {y}}_{i}^{1} includes (1+{n}_y) entries. In particular, {\bf {y}}_{i}^{1}[0] is the authenticity of {\bf {x}}_{i}^{1}, and {\bf {y}}_{i}^{1}[1:{n}_{y}] denotes the softmax output vector, which contains the probabilities that {\bf {x}}_{i}^{1} belongs to each category.
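The layout of the prediction vector in (18) can be illustrated with a short sketch. It assumes that the vector is stored as a flat Python list of length 1+{n}_{y}, that entry 0 is the authenticity entry, and that class labels are numbered from 1; the function name `split_prediction` is hypothetical.

```python
def split_prediction(y_hat, n_y):
    """Split a (1 + n_y)-entry prediction into authenticity and class probabilities."""
    if len(y_hat) != 1 + n_y:
        raise ValueError("expected a vector with 1 + n_y entries")
    authenticity = y_hat[0]      # entry 0: real-versus-synthetic indicator
    class_probs = y_hat[1:]      # entries 1..n_y: per-category probabilities
    # Index of the largest class probability, shifted to 1-based labels (assumption).
    predicted_class = max(range(n_y), key=lambda k: class_probs[k]) + 1
    return authenticity, class_probs, predicted_class
```

For the UP dataset, with nine land-cover classes, the vector would have ten entries under this reading.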

Semisupervised GAN aims to alleviate the issue of small samples by using both labeled and unlabeled HSI data. The view in [33] holds that D needs a bad G as a regularizer for training GANs. An opposite view in [34] points out that high-quality synthetic samples help D improve its generalization ability for HSIs. In our proposal, we extend our spectral–spatial attention weights to G, implicitly reconstructing HSI cubes. The procedure can be divided into two phases. First, G is considered as the regularizer of D to improve HSI classification, and it updates the penalty factor with the discriminative loss. Thus, the optimized loss function of D takes the form \begin{equation*} \begin{split} L_{\text {SEMI}}\left(\theta _{D},\theta _{G} \right) & =L_{\text {SUP}}\left(\theta _{D},\theta _{G} \right) +L_{\text {UNSUP}}\left(\theta _{D},\theta _{G} \right) \\ & = L_{\text {SUP}}\left(\theta _{D} \right) + L_{D1}\left(\theta _{D}\right) \\ & \qquad\; + L_{D2}\left(\theta _{D},\theta _{G} \right) \end{split} \tag{19} \end{equation*} where \theta _{D} and \theta _{G} are the optimization parameters of D and G, respectively. L_{\text {SEMI}} is the total objective loss for optimizing SSAT-GAN. L_{\text {SUP}}, L_{D1}, and L_{D2} are, respectively, the supervised and unsupervised items of D, and the unsupervised item of G.
These items are formulated as \begin{align*} \begin{split} L_{\text {SUP}}\left(\theta _{D} \right) & = -E_{{\bf {X}}^{1}\sim {p}_{\text{data}}}\text{log}D\left({\bf {X}}^{1};\theta _{D} \right) \left[ 1:{n}_{y} \right]\\ & = -E_{{\bf {X}}^{1}\sim {p}_{\text{data}}}\text{log}\hat{{\bf {Y}}}^{1}\left[ 1:{n}_{y} \right] \end{split}, \tag{20} \\ \begin{split} L_{D1}\left(\theta _{D} \right) & = -E_{{\bf {X}}^{1}\sim {p}_{\text{data}}}\text{log}\left(1-D\left({\bf {X}}^{1};\theta _{D} \right) \left[ 0 \right] \right) \\ & = -E_{{\bf {X}}^{1}\sim {p}_{\text{data}}}\text{log}\left(1-\hat{{\bf {Y}}}^{1}\left[ 0 \right] \right) \end{split}, \tag{21} \\ \begin{split} L_{D2}\left(\theta _{D},\theta _{G} \right) & = -E_{{\bf {z}}\sim {p}_{z}}\text{log}D\left(G\left({\bf {z}};\theta _{G} \right);\theta _{D} \right) \left[ 0 \right]\\ & = -E_{{\bf {z}}\sim {p}_{z}}\text{log}D \left({\bf {Z}};\theta _{D} \right) \left[ 0 \right]\\ & = -E_{{\bf {z}}\sim {p}_{z}}\text{log}\hat{{\bf {Y}}}^{1}\left[ 0 \right] \end{split} \tag{22} \end{align*} where L_{\text {SUP}} is applied for optimizing the softmax predictions for real HSI cubes, which correspond to {\bf {y}}_{i}^{1}[1:{n}_{y}] from (18). L_{D1} aims at updating the recognition degree by unlabeled HSI cubes, and L_{D2} focuses on increasing the authenticity of generated samples; both correspond to {\bf {y}}_{i}^{1}[0] from (18).
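The three items of the discriminator loss can be written out for a small batch as follows. This is a hedged sketch rather than the authors' code: it reads (20) as the usual cross-entropy that picks the probability of the true class, it treats entry 0 of the prediction as the probability that a cube is synthetic (the reading suggested by the signs in (21) and (22)), and the function names and 0-based labels are assumptions.

```python
import math

def l_sup(class_probs, labels):
    """Supervised item: mean negative log-probability of the true class.

    class_probs: one list of n_y softmax probabilities per labeled cube.
    labels: 0-based index of the true class for each cube (assumption).
    """
    return -sum(math.log(p[y]) for p, y in zip(class_probs, labels)) / len(labels)

def l_d1(entry0_real):
    """Unsupervised item of D on real cubes: push entry 0 toward 0."""
    return -sum(math.log(1.0 - s) for s in entry0_real) / len(entry0_real)

def l_d2(entry0_generated):
    """Unsupervised item on generated cubes: push entry 0 toward 1."""
    return -sum(math.log(s) for s in entry0_generated) / len(entry0_generated)

def l_semi(class_probs, labels, entry0_real, entry0_generated):
    """Total discriminator objective: sum of the three items, as in (19)."""
    return l_sup(class_probs, labels) + l_d1(entry0_real) + l_d2(entry0_generated)
```

Each item is zero exactly when D is fully confident and correct on its part of the batch, so the sum grows with classification errors and with real-versus-synthetic confusion.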

Note that the optimization of a semisupervised GAN focuses on exploring a real HSI data distribution from limited labeled samples, which often causes overfitting. As the high-dimensional feature learning of L_{D1} is not constrained, it contributes little to, and may even jeopardize, the capability of the discriminator for HSI classification. Thus, we minimize the high-dimensional output of (21) to update the gradient in reverse and decrease the value and variance to inhibit overfitting, an approach introduced in [36] as the mean minimization loss. The function takes the form \begin{equation*} \theta ^{\ast }=\text {arg}\underset{\theta }{\text {min}}\left(\frac{1}{N}\sum _{i=1}^{N}\text{average}\left(f\left(x_{i}; \theta \right) \right) \right) \tag{23} \end{equation*} where N is the number of samples in a batch, x_{i} is a training sample, and f(x_{i}; \theta) indicates the high-dimensional output of a model, which, in this article, is the output before the FC. Second, we employ the predictive spectral–spatial attention weights for generating high-quality samples. Furthermore, L_{D1}+L_{D2} is also part of the GAN loss for training G, whose corresponding loss function L_{G} can be formulated as \begin{equation*} \begin{split} L_{G}\left(\theta _{D},\theta _{G} \right) & = -E_{{\bf {z}}\sim {p}_{z}}\text{log}\left(1-D\left(G\left({\bf {z}};\theta _{G} \right);\theta _{D} \right) \left[ 0 \right] \right) \\ & = -E_{{\bf {z}}\sim {p}_{z}}\text{log}\left(1-D \left({\bf {Z}};\theta _{D} \right) \left[ 0 \right] \right) \\ & = -E_{{\bf {z}}\sim {p}_{z}}\text{log}\left(1-\hat{{\bf {Y}}}^{1}\left[ 0 \right] \right). \end{split} \tag{24} \end{equation*}
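The objective inside (23) and the generator loss (24) are both simple batch averages, as the following sketch shows. It assumes the pre-FC features are available as one flat list per sample and that entry 0 of the prediction is the probability of a cube being synthetic; the function names are hypothetical, and the arg min over \theta would be carried out by the optimizer, which is not shown.

```python
import math

def mean_minimization_objective(features):
    """Quantity minimized in (23): batch mean of the per-sample feature averages.

    features: list of N feature vectors f(x_i; theta), each a list of floats.
    """
    per_sample = [sum(f) / len(f) for f in features]
    return sum(per_sample) / len(per_sample)

def l_g(entry0_generated):
    """Generator loss (24): small when D assigns generated cubes a low entry-0 score."""
    return -sum(math.log(1.0 - s) for s in entry0_generated) / len(entry0_generated)
```

The generator loss decreases as the discriminator becomes less able to flag generated cubes, which is the opposite direction to the item in (22).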

Algorithm 1: Training Process of SSAT-GAN.

Input: The labeled training data: \boldsymbol{X}_{\text{train}}^{l}, unlabeled training data \boldsymbol{X}_{\text{train}}^{u}, and the test data \boldsymbol{X}_{\text{test}} from n_{y} classes, corresponding annotation of training data Y^{l}, the batch size \mathit {bt}, and the number of training epochs \mathit {e}.

Output: The labels of the test samples \boldsymbol{X}_{\text{test}}

1: Begin
2: Initialize: Randomly initialize the parameters \theta _{D} and \theta _{G} of the discriminator D and the generator G;
3: for i=0 to epoch \mathit {e} do
4:   for \mathit {bt} training samples of each batch do
5:     Generate \mathit {bt} noises \lbrace \boldsymbol{z}_{1},\boldsymbol{z}_{2},\ldots,\boldsymbol{z}_{\mathit {bt}}\rbrace from the Gaussian distribution \mu (-1,1);
6:     Concatenate noises with labels \lbrace {y}_{1},{y}_{2},\ldots,{y}_{\mathit {bt}}\rbrace;
7:     Input \boldsymbol{X}_{\text{train}}^{l} into D to obtain real HSI features via (4) and (8);
8:     Calculate {\bf {Atte}}_{\text{spc}} via (2);
9:     Calculate {\bf {Atte}}_{\text{spa}} via (3);
10:     Predict classification vectors D(\boldsymbol{x}_{i}^{l};\theta _{D})[1:n_{y}];
11:     Compute L_{\text {SUP}} via (20);
12:     Input noises \lbrace \boldsymbol{z}_{1},\boldsymbol{z}_{2},\ldots,\boldsymbol{z}_{\mathit {bt}}\rbrace, class labels \lbrace {y}_{1},{y}_{2},\ldots,{y}_{\mathit {bt}}\rbrace, {\bf {Atte}}_{\text{spc}}, and {\bf {Atte}}_{\text{spa}} to G;
13:     Generate samples \boldsymbol{Z} via (13) and (16);
14:     Input \boldsymbol{X}_{\text{train}}^{u} and \boldsymbol{Z} to D;
15:     Predict authentic vectors D(\boldsymbol{x}_{i}^{u};\theta _{D})[0];
16:     Compute L_{D1} and L_{D2} via (21) and (22);
17:     Compute L_{G} via (24);
18:     Update \theta _{D} by minimizing L_{\text {SUP}}+L_{G};
19:     Update \theta _{G} by minimizing 1-D(\boldsymbol{z}_{i};\theta _{D})[0];
20:     \mathit {bt} = \mathit {bt} + 1;
21:   end for
22:   i = i + 1;
23: end for
24: Classify \boldsymbol{X}_{\text{test}} by the well-trained D;

The training of SSAT-GAN involves two alternating steps through rms or adjacent optimization fashions in every epoch. First, the gradients of the discriminator -\boldsymbol{\triangledown }_{\theta _{D}}L_{\text {SEMI}} are employed to adjust \theta _{D} to capture discriminative spectral–spatial features of HSI. Second, the gradients -\boldsymbol{\triangledown }_{\theta _{G}}L_{G} are applied to adjust \theta _{G} to improve the adversarial training. The detailed training process of SSAT-GAN is described in Algorithm 1.
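The alternating schedule described above amounts to one discriminator update followed by one generator update for every batch. The skeleton below shows only this control flow; `d_step` and `g_step` are hypothetical callbacks that would each compute a loss and apply one optimizer step, and nothing here reproduces the actual network or optimizer.

```python
def alternating_training(num_epochs, batches, d_step, g_step):
    """Run one D update and then one G update per batch, for every epoch."""
    history = []
    for epoch in range(num_epochs):
        for batch in batches:
            # Hypothetical callback: adjusts theta_D using the gradient of L_SEMI.
            loss_d = d_step(batch)
            # Hypothetical callback: adjusts theta_G using the gradient of L_G.
            loss_g = g_step(batch)
            history.append((epoch, loss_d, loss_g))
    return history
```

Keeping the two updates in separate callbacks makes it easy to check that each step only touches its own parameter set.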

SECTION IV.

Experimental Analysis

We detail the experimental results on three real hyperspectral datasets: Indian Pines (IN), University of Pavia (UP), and Kennedy Space Center (KSC). Each of them is standardized by mean–variance normalization. Three classification evaluation metrics, namely overall accuracy (OA), average accuracy (AA), and kappa coefficient (\kappa), are employed to validate the experimental performance of SSAT-GAN and the comparison algorithms. In particular, OA is the percentage of correctly classified pixels among all test pixels; AA is the average of the per-category percentages of correctly classified pixels; and the kappa coefficient corrects the percentage of correctly classified pixels for the agreement expected purely by chance, based on the confusion matrix. All experiments are implemented on an NVIDIA TITAN V GPU with 12-GB graphics memory, TensorFlow GPU 1.8.0 with CUDA 9.0, and Python 3.5.
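The three metrics follow directly from a confusion matrix, as in the sketch below. It uses the standard definitions of OA, AA, and Cohen's kappa; the row/column convention (rows are true classes, columns are predictions) and the function name are assumptions, and every class is assumed to have at least one test pixel.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    oa = correct / total
    # Per-class accuracy: correct pixels of a class over all pixels of that class.
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
    aa = sum(per_class) / n
    # Chance agreement from the row (true) and column (predicted) marginals.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - p_e) / (1.0 - p_e)
    return oa, aa, kappa
```

On imbalanced data such as IN, OA is dominated by the large classes, whereas AA weights every class equally, which is why both are reported.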

A. Experimental Datasets

1) Indian Pines

IN was acquired by airborne visible/infrared imaging spectrometer (AVIRIS) from Northwest Indiana in 1992 and includes 16 vegetation categories, with an imbalance in pixel numbers over categories. It contains 145\times 145 spectral pixels with a spatial resolution of 20 m per pixel, retaining 200 bands of spectrum from 400 to 2500 nm after removing corrupted water-absorption effects.

2) University of Pavia

UP was gathered by reflective optics system imaging spectrometer (ROSIS) in 2001 from Northern Italy, consisting of 610\times 340 spectral pixels with nine urban land-cover classes, and 1.3 m spatial resolution per pixel, employing 103 bands of spectrum from 430 to 860 nm after abandoning 20 noisy bands.

3) Kennedy Space Center

KSC was obtained by AVIRIS in 1996 from Florida and includes 13 upland and wetland land-cover types, with 512\times 614 spectral pixels. After discarding low-SNR bands, 176 spectral bands ranging from 400 to 2500 nm are retained to assess the classification capacity.

Figs. 6–8 illustrate the datasets, the corresponding ground reference maps, and category information. All labeled samples are split into two groups: the training group and the test group. For the unlabeled group, the unlabeled training samples are randomly selected from the background. GANs have relatively high computational complexity and are prone to mode collapse. Thus, we adopt the Monte Carlo sampling [45] mentioned in [33] to marginalize noise during training.
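Drawing unlabeled training pixels from the background can be sketched as follows. The sketch assumes the ground reference map is a 2-D list of integers in which 0 marks background (unannotated) pixels, which is a common encoding but is not stated in the text; the function name and the fixed seed are also illustrative.

```python
import random

def sample_unlabeled(ground_truth, count, seed=0):
    """Pick `count` distinct background pixel coordinates uniformly at random."""
    background = [
        (r, c)
        for r, row in enumerate(ground_truth)
        for c, label in enumerate(row)
        if label == 0  # assumption: label 0 denotes background
    ]
    if count > len(background):
        raise ValueError("not enough background pixels")
    return random.Random(seed).sample(background, count)
```

Sampling without replacement keeps the unlabeled set free of duplicate cubes.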

Fig. 6. - Indian Pines dataset. (a) False-color image. (b) Ground-truth labels.

Fig. 7. - Pavia University dataset. (a) False-color image. (b) Ground-truth labels.

Fig. 8. - Kennedy Space Center dataset. (a) False-color image. (b) Ground-truth labels.

B. Parameter Tuning

Fig. 5 takes the UP neighboring cube as an example to show the details of the discriminator D and generator G. The 7\times 7\times 103 HSI cubes are randomly extracted directly from the raw 3-D HSI data as the real input and fed into D. G takes 1\times 1\times 200 noise vectors as input and outputs 7\times 7\times 103 fake HSI cubes. We alternately update the parameters of the SSAT-GAN through backpropagation of the gradients. For the efficiency of the grid search, we set the learning rate to 0.0005 and the batch size to 16 and employ the RMSProp optimizer [46] to alternately optimize D and G. Once the hyperparameters of SSAT-GAN are configured, we analyze four factors that help avoid mode collapse and influence the HSI interpretation performance of SSAT-GAN.
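Extracting a neighboring cube around a central pixel can be sketched in plain Python. The image is assumed to be a nested list indexed as [row][column][band]; the treatment of border pixels is not described in the text, so this sketch simply rejects centers that are too close to the border, and the function name is hypothetical.

```python
def extract_cube(image, row, col, size=7):
    """Return the size x size x bands neighborhood centered at (row, col)."""
    half = size // 2
    rows, cols = len(image), len(image[0])
    if not (half <= row < rows - half and half <= col < cols - half):
        # Assumption: border pixels are skipped rather than padded.
        raise ValueError("center pixel too close to the border for this sketch")
    return [
        [list(image[r][c]) for c in range(col - half, col + half + 1)]
        for r in range(row - half, row + half + 1)
    ]
```

For UP, each pixel carries 103 bands, so a cube extracted this way would have shape 7\times 7\times 103.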

1) Evaluation of Different Depths of Spectral–Spatial Attention Block

We assessed the impact of different depths of the spectral–spatial attention feature spreading blocks on classification results. For SSAT-GAN, the block depths were varied from four to eight convolutional layers on all datasets. To maintain the stability of the model, the depth of the generator was symmetric to that of the discriminator. As illustrated in Fig. 9, SSAT-GAN achieved the highest evaluation results on both the IN and UP datasets when the depth of the spectral–spatial attention feature spread blocks was set to “3 + 3,” i.e., a discriminator consisting of three spectral and three spatial convolutional layers. As for KSC, the differences in OA between the deeper SSAT-GANs and their shallower counterparts are small. Meanwhile, in contrast to the obvious overfitting of deeper layers under limited training samples reported in [33], the classification performance of SSAT-GANs with varying depths indicates that our attention modules mitigate the overfitting seen in other GANs.

Fig. 9. - OAs of SSAT-GAN with different depths of convolutional layers in their spectral–spatial attention feature spread blocks using 500 labeled samples on IN and UP, and 250 on KSC for training. The $\mathit {x}+\mathit {y}$ formation on the abscissa indicates $\mathit {x}$ spectral and $\mathit {y}$ spatial convolutional layers in discriminator.

2) Evaluation of Different Numbers of Kernels for SSAT-GAN

The number of kernels in each layer of the feature spreading blocks greatly affects the computational cost and expressiveness of SSAT-GANs. We evaluated the impact of different numbers of kernels in the spectral–spatial attention feature spreading blocks on the results. In Fig. 10, the discriminator and the generator of SSAT-GAN use the same kernel number in their convolution and transposed convolution layers, with the number of kernels chosen from \lbrace 20, 24, 28, 32, 36\rbrace. As can be seen from Fig. 10, SSAT-GAN achieved the highest classification results on all three datasets when the kernel number was fixed at 28 or 32.
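How the kernel number drives model size can be seen from the standard parameter count of a 3-D convolutional layer, which is the product of input channels, output channels, and kernel volume, plus one bias per output channel. The 3\times 3\times 3 kernel shape below is purely illustrative and is not taken from the paper; the function name is hypothetical.

```python
def conv3d_param_count(in_channels, out_channels, kernel_shape):
    """Weights plus biases of one 3-D convolutional layer."""
    kh, kw, kd = kernel_shape
    return in_channels * out_channels * kh * kw * kd + out_channels

# Parameter count of a layer whose input and output both have k kernels,
# for the kernel numbers examined in Fig. 10 (illustrative kernel shape).
for k in (20, 24, 28, 32, 36):
    print(k, conv3d_param_count(k, k, (3, 3, 3)))
```

Because the count grows quadratically with the kernel number for such a layer, moving from 20 to 36 kernels roughly triples its parameters.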

Fig. 10. - OAs of SSAT-GAN for varying kernel numbers in their spectral–spatial attention spreading blocks using 500 labeled samples on IN and UP, and 250 on KSC for training.

3) Influence of Unlabeled Real HSI Cubes

To evaluate the influence of unlabeled real HSI cubes, we tested SSAT-GAN and its three extensions using different numbers of unlabeled HSI samples on the IN, UP, and KSC datasets. The three extensions of SSAT-GAN are denoted as Spa-AT-GAN (containing only the spatial attention feature spreading part), Spc-AT-GAN (containing only the spectral attention feature spreading part), and Spa-Spc-AT-GAN (containing both spreading blocks, with the spatial attention module placed before the spectral attention module). Table I records the classification results of the SSAT-GANs. Each experiment randomly selected 0, 300, 1000, or 5000 unlabeled samples for training.

TABLE I OAs (\%) of SSAT-GANs Using Various Number of Unlabeled Samples and 300 Labeled Samples in the IN, UP, and KSC Datasets

For IN and KSC, the classification accuracy of Spa-Spc-AT-GAN did not improve noticeably as the number of unlabeled samples increased. Among the four methods, SSAT-GAN achieved the best results on each dataset for every number of unlabeled samples, owing to the guidance of spectral–spatial attentive feature learning. Moreover, models with 300 unlabeled samples were the most accurate on all three datasets, and 1000 unlabeled samples brought no obvious improvement. When the number increased to 5000, the results showed a downward trend in all extensions of SSAT-GAN on the three datasets. This indicates that adding too many unlabeled real samples does not greatly improve the classification, which is caused by the abnormal distribution of unlabeled pixels. In addition, HSI classification improves significantly when the number of unlabeled samples equals the number of labeled samples. This conclusion is consistent with the opinions reported by Zhong et al. [33] and Liang et al. [36].

4) Evaluation of Different Spatial Data Sizes

To assess the impact of spatial size on the experimental results, we tested SSAT-GAN with spatial data sizes of \lbrace 5\times 5, 7\times 7, 9\times 9, 11\times 11,13\times 13\rbrace. Fig. 11 shows that SSAT-GAN achieved relatively high and stable results when the spatial size was equal to or greater than 7\times 7. This is mainly because a larger spatial size carries more abundant spatial information. These experimental results also indicate that spatial context plays an important role in HSI classification.

Fig. 11. - OAs of SSAT-GAN containing various spatial sizes of input cubes on three datasets.

C. Comparison With Various Algorithms

This experiment aimed to compare the performance of the proposed SSAT-GAN with the EPF-SVM [12] (EPF-based SVM) and the state-of-the-art deep learning derived methods, such as SSRN [18], 3D-Conv-Capsule [20], and HSI-BERT [26]. To verify the improvement of GANs, we exploited three GAN-based methods for comparison, including 3D-GAN [32], GAN-CRF [33], and AD-GAN [34]. Moreover, to demonstrate the effectiveness of the SSAT module, we also introduced the extensions of SSAT-GAN: Spa-AT-GAN (only comprises one spatial attention feature spreading block), Spc-AT-GAN (only comprises one spectral attention feature spreading block), and Spa-Spc-AT-GAN (comprises one spatial attention feature spreading block and one spectral attention feature spreading block). To make a fair comparison, all the competitive algorithms were tuned to their optimal settings.

Regarding the EPF-SVM, the two parameters of the joint bilateral filter were set as follows: \delta _{s}=4 and \delta _{r}=0.2. Meanwhile, the hyperparameters of SVM were set as follows: \gamma = 4 and \epsilon = 0.01. For SSRN and HSI-BERT, we set the input HSI cubes to the same spatial size of 7\times 7. For 3D-Conv-Capsule, the number of routing iterations was set to three to determine its coupling coefficients. For 3D-GAN, the first three principal components of HSIs were used as the channel input, and the spatial size was set to 64\times 64. For GAN-CRF, a neighborhood of 9\times 9 pixels was employed, and three spectral and spatial convolutional layers were configured in the discriminator. For AD-GAN, 3-D HSI cubes of size 27 \times 27\times 3 were used as input, and an AdapDrop block was executed once in both the discriminator and the generator, each with k=40 and b\_size=7.

As for the proposed SSAT-GAN, we set the spatial size of the input HSI cubes to 7\times 7 and trained for 300 epochs. D and G were built with consecutive spectral–spatial attention feature spread blocks and spectral–spatial attention feature generation blocks, respectively, each with a kernel number of 28. The minibatch size was 16. To avoid mode collapse, we set the number of unlabeled samples equal to the number of labeled training samples. Furthermore, all the comparison methods were trained and evaluated over 10 randomly sampled experiments, and the average results and their standard deviations were recorded.
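Reporting a mean and standard deviation over repeated runs can be done with the Python standard library, as sketched below. Whether the paper uses the sample or the population standard deviation is not stated; the sample version (`statistics.stdev`) is an assumption here, and the values in the usage comment are toy numbers, not results from the paper.

```python
import statistics

def summarize_runs(values):
    """Return (mean, sample standard deviation) of a metric over repeated runs."""
    return statistics.mean(values), statistics.stdev(values)

# Toy usage with made-up accuracies from three hypothetical runs:
# mean_oa, std_oa = summarize_runs([1.0, 2.0, 3.0])
```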

1) Experimental Results on IN Dataset

For all methods, 500 labeled pixels were employed as training samples on the IN dataset. Table II lists the quantitative classification results of the comparison methods, and the visualization maps are illustrated in Fig. 13. As shown in Table II, EPF-SVM yielded poor accuracies in the “Corn,” “Soybean-notil,” and “Buildings-Grass-Trees-Drivers” classes, which are 63.09%, 67.93%, and 52.74%, respectively. This is caused by the similarity of their spectral curves, which makes them difficult to identify. In contrast, we observed that SSRN, 3D-Conv-Capsule, and HSI-BERT acquired better results than EPF-SVM in the three classes. In particular, HSI-BERT improved the accuracy in the “Corn” class by at least 12.21%. This suggests that deep-learning methods have a certain positive effect on interpreting complex spectral characteristics. Going further, GAN-based methods showed superior prediction in the three classes, and 3D-GAN improved the accuracy in “Corn” by at least 26.66%. Besides, GAN-CRF achieved 93.05% in “Soybean-notil,” and AD-GAN classified the “Corn” class with complete accuracy. As for SSAT-GANs, both Spa-AT-GAN and Spc-AT-GAN achieved advanced prediction in the three classes.

TABLE II Classification Accuracies and Training and Testing Times of Various Comparison Methods Using 500 Labeled Samples and 500 Unlabeled Samples for the IN Dataset
Fig. 12. - Accuracy value of the IN dataset with SSAT-GAN model under different training data sampling. We report the average results of ten experiments. (a) Randomly selection strategy. (b) Monte Carlo sampling strategy.

Fig. 13. - Classification visualization of comparison models on IN dataset. (a) EPF-SVM. (b) SSRN. (c) 3D-Conv-Capsule. (d) HSI-BERT. (e) 3D-GAN. (f) GAN-CRF. (g) AD-GAN. (h) Spa-AT-GAN. (i) Spc-AT-GAN. (j) Spa-Spc-AT-GAN. (k) SSAT-GAN.

SSAT can improve the intraclass aggregation, which effectively distinguishes the differences between spectra during hyperspectral interpretation. In the “Alfalfa,” “Grass-pasture-mowed,” and “Oats” categories, only 5, 3, and 5 pixels were used as training samples, yet SSAT-GAN achieved accuracies of 100% in all three. This indicates that SSAT-GAN can extract sensitive features for small-sample classes. Among the competitive methods, SSAT-GAN also obtained the best accuracies in “Soybean-mintil” and “Soybean-clean,” which contain redundant spectral signatures. Meanwhile, SSAT-GAN outperformed the comparison methods in terms of OA, AA, and kappa. Compared with its extensions, SSAT-GAN improved the OA by at least 1.01%, the AA by 1.35%, and kappa by 1.25%. This illustrates that the SSAT module can extract discriminative spectral signatures and adaptive homogeneous areas to mitigate the impact of interfering pixels of HSIs. Besides, it should be noted that Spa-Spc-AT-GAN showed inferior performance because the abundant spectral features are more difficult to learn than spatial features.

To verify the results of the SSAT-GAN under the Monte Carlo sampling, we also experimented with the random selection strategy, with the training sampling ratio (SR) fixed at 5% (500 training samples) and 10% (1000 training samples). As shown in Fig. 12(a), the class accuracy (CA) of each class allows a comparison across the different circumstances; the “Soybean-notil” class obtained unsatisfactory results under the random selection strategy, with 79.79% at SR = 5% and 91.65% at SR = 10%. Besides, it is worth mentioning that the experiment with SR = 10% achieved better performance than that with SR = 5%. The reason is that, when the data distribution is imbalanced, the randomly chosen samples of the other classes contain more prominent characteristics, which are confused with those of the small-sample classes. To confirm this, we adopted the Monte Carlo sampling to redo the experiment with our SSAT-GAN, with the sampling ratios MCR = 5% and MCR = 10%. The Monte Carlo sampling considers the interclass sample distribution while preserving the total sampling size (as shown in Table II). As can be seen from Fig. 12(b), each class performs better under the Monte Carlo sampling than under the random selection in Fig. 12(a). From Fig. 13(a)–(k), EPF-SVM and Spa-Spc-AT-GAN produced more visual noise and had the most misclassified pixels; besides, the maps of SSRN and HSI-BERT had rough boundaries in most classes. The reason is the imbalanced sample distribution of the IN dataset, in which classes with a large number of samples may contribute more discriminative characteristics to the identification. 3D-GAN, GAN-CRF, Spa-AT-GAN, and Spc-AT-GAN showed relatively little scattered visual noise. In contrast, 3D-Conv-Capsule and AD-GAN significantly reduced the impact of noise and established homogeneous areas.
Among them, SSAT-GAN had more uniform regions and established an adaptive neighboring relationship, which indicates that SSAT can effectively suppress information detrimental to classification.

2) Experimental Results on UP Dataset

The evaluation of the comparison methods on the UP dataset using 500 labeled samples is listed in Table III. The OAs yielded by SSRN, 3D-GAN, and GAN-CRF are 95.31%, 93.89%, and 94.95%, respectively. Our proposed model further increases the OA to 98.09% by incorporating the SSAT module. Similarly, the AA values are 94.33%, 94.25%, and 97.16% for 3D-Conv-Capsule, HSI-BERT, and AD-GAN, respectively. As can be observed, the proposed SSAT-GAN has a relatively stable and balanced classification effect for each category under high-resolution neighboring relationships and acquired the maximum AA value (98.21%). This indicates that the proposed SSAT module can capture discriminative interclass differences and is essential and beneficial for the proposed architecture.

TABLE III Classification Accuracies and Training and Testing Times of Various Comparison Methods Using 500 Labeled Samples and 500 Unlabeled Samples for the UP Dataset

The classification maps on the UP dataset are shown in Fig. 14. The comparison methods produced rough prediction maps, especially 3D-GAN and Spa-Spc-AT-GAN, which was caused by atmospheric effects and instrument noise. SSAT-GAN uses the neighboring correlation context as auxiliary information and produced the smoothest results and the clearest boundaries.

Fig. 14. Classification visualization of comparison models on UP dataset. (a) EPF-SVM. (b) SSRN. (c) 3D-Conv-Capsule. (d) HSI-BERT. (e) 3D-GAN. (f) GAN-CRF. (g) AD-GAN. (h) Spa-AT-GAN. (i) Spc-AT-GAN. (j) Spa-Spc-AT-GAN. (k) SSAT-GAN.

3) Experimental Results on KSC Dataset

The last experiment is performed on the KSC dataset using 250 labeled pixels as training samples. As shown in Table IV, SSAT-GAN achieved the best OA of 97.72%, higher than GAN-CRF (95.38%) and AD-GAN (96.15%). In comparison, SSRN yielded an OA of only 94.19%. A reasonable explanation is that the KSC dataset is relatively sparse, so traditional networks generally have more difficulty interpreting spectral–spatial features. With the SSAT operation, the proposed model achieved superior performance compared with the other state-of-the-art methods. In addition, the PCA-based 3D-GAN yielded the worst result, with an OA of 93.38%, which illustrates that the principal-component representation performs poorly in spectral–spatial feature extraction for highly sparse HSIs. In contrast, our proposed architecture with the SSAT module is more robust to sparsity.

TABLE IV Classification Accuracies and Training and Testing Times of Various Comparison Methods Using 250 Labeled Samples and 250 Unlabeled Samples for the KSC Dataset

Classification maps are shown in Fig. 15. SSAT-GAN achieved smoother and more adaptive visual results than the comparison methods, which indicates that its SSAT module can both emphasize intraclass consistency and increase interclass differences for HSI classification with a highly sparse distribution. The quantitative experiments conducted on the three datasets demonstrate the effectiveness and robustness of the SSAT-GAN framework for HSI classification.

Fig. 15. Classification visualization of comparison models on KSC dataset. (a) EPF-SVM. (b) SSRN. (c) 3D-Conv-Capsule. (d) HSI-BERT. (e) 3D-GAN. (f) GAN-CRF. (g) AD-GAN. (h) Spa-AT-GAN. (i) Spc-AT-GAN. (j) Spa-Spc-AT-GAN. (k) SSAT-GAN.

D. Investigation of the Impact of Attention Mechanism

To evaluate the effectiveness and contribution of the attention mechanism, we compared several classical and representative attention modules, namely SE_Block [47], CBAM [40], FA [48], and MAFN [24], each applied to our GAN baseline, and report the OAs on the three datasets in Table V. Both CBAM and our SSAT obtain good results in all three cases, which we attribute to their cascade connections fitting our architecture better. In addition, the FA module gives more promising results on the UP and KSC datasets than on the IN dataset. The reason is that FA requires a high spatial resolution for its covariance matrices to be useful.
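To make concrete what such an attention module does, the sketch below follows the general squeeze-and-excitation pattern behind SE-style blocks: pool each channel to one descriptor, pass the descriptors through a small bottleneck, and rescale the channels with the resulting gates. It is written in plain Python with hand-picked weights purely for illustration; it is not the SSAT module and not the configuration used in Table V, and the names `se_channel_attention`, `w_reduce`, and `w_expand` are assumptions.

```python
import math


def se_channel_attention(cube, w_reduce, w_expand):
    """Reweight the channels of `cube` (channels x height x width) SE-style.

    w_reduce: hidden x channels weights of the bottleneck layer.
    w_expand: channels x hidden weights of the gating layer.
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in cube
    ]
    # Excitation: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w_reduce
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every pixel of a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(cube, gates)
    ]
    return scaled, gates


# Two spectral channels over a 2 x 2 spatial patch, with hand-picked weights.
cube = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
w_reduce = [[-1.0, 1.0]]   # one hidden unit
w_expand = [[1.0], [-1.0]]
scaled, gates = se_channel_attention(cube, w_reduce, w_expand)
print(gates)
```

With these weights the hidden unit evaluates to 2, so the two gates are sigmoid(2) and sigmoid(-2), roughly 0.88 and 0.12. In a trained network the weights are learned, and spatial attention (as in CBAM) adds an analogous gate per pixel.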

TABLE V Overall Accuracies (%) of Semisupervised GAN Methods With Varying Attention Modules Using 300 Unlabeled Samples and Labeled Samples on the IN, UP, and KSC Datasets

Moreover, we investigated feature visualization guided by the attention weights of the SSAT modules. In this experiment, only the 7 × 7 neighboring HSI cubes were used to train SSAT-GAN on the UP dataset. HSI cubes of each category, displayed in false color, and their corresponding feature maps from the penultimate layer of the discriminator are shown in Fig. 16. As illustrated in Fig. 16, the more significant the features, the darker the gradient distribution of the attention. However, some target pixels to be classified do not in fact belong to the same category as their neighboring pixels in the corresponding HSI cubes, such as the central pixels and their surroundings in Fig. 16(c), (e), and (g). In these cases, the corresponding mixed attention distribution also contains some bright areas. The reason is that our SSAT modules can effectively activate the spectral–spatial attributes and assign an independent attention weight to each pixel of the HSI cubes. SSAT-GAN thus applies the guidance of the attention weights simultaneously with feature extraction, which can improve the efficiency of hyperspectral characterization.

Fig. 16. Feature visualization with the guidance of the attention weights over SSAT on the UP dataset. Each land cover category is randomly selected from the labeled training set and described with the false color. The corresponding feature visualization is obtained by applying Grad-CAM [49].

E. Execution Time Analysis on Different Datasets

The training and testing times on the three datasets are also listed in Tables II–IV. To assess the computational complexity, we report the execution time in milliseconds (ms) per epoch or iteration for the various methods.
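A per-epoch time in milliseconds of this kind can be obtained with a simple wall-clock measurement. The sketch below is an assumed measurement harness, not the authors' benchmarking code; `mean_time_ms` and `dummy_epoch` are hypothetical names, and the dummy workload merely stands in for one training epoch.

```python
import time


def mean_time_ms(step, repeats=5):
    """Average wall-clock time of calling `step()` once, in milliseconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        step()
        elapsed.append((time.perf_counter() - start) * 1000.0)
    return sum(elapsed) / len(elapsed)


def dummy_epoch():
    """Placeholder workload standing in for one training epoch."""
    return sum(i * i for i in range(10_000))


print(f"{mean_time_ms(dummy_epoch):.3f} ms per epoch")
```

Averaging over several repeats reduces the influence of one-off delays; for GPU training, the device would also need to be synchronized before each clock reading.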

In general, EPF-SVM consumed the shortest training time in all three cases. 3D-Conv-Capsule took the longest because it needs to construct a dynamic route for optimal vector search during training. GAN-based methods need to optimize the discriminator and the generator alternately and thus require a relatively long training time. Among the deep learning methods, Spa-AT-GAN took the shortest time to train on all three datasets, about 4–6 times faster than GAN-CRF. The time cost of SSAT-GAN is similar to that of HSI-BERT, while SSAT-GAN achieves better accuracy, as shown in Tables II–IV.

For testing, 3D-GAN took more time because of its large candidate neighboring areas and deep network architecture. In contrast, SSAT-GAN consumes relatively little time owing to the efficient feature representation of the spectral–spatial attention spread and generation blocks. In summary, it can be concluded that the proposed framework is the most efficient method with advanced performance under fair comparison.

F. Sensitivity Analysis on Different Numbers of Labeled Samples for Training

To observe the effect of the number of labeled samples on the OAs, we randomly selected labeled pixels from {100, 300, 500, 700, 1000} on the IN, {50, 100, 200, 350, 500} on the UP, and {50, 150, 250, 350, 500} on the KSC dataset with the Monte Carlo sampling strategy. Fig. 17(a)–(c) reports the OAs of all competitors on the three datasets, respectively. The OAs gradually increase and then stabilize as the number of labeled samples grows on the IN, UP, and KSC datasets. The reason is that the Monte Carlo sampling strategy can provide sufficient labeled samples and thus construct a complete dictionary for training. In addition, SSAT-GAN has an obvious advantage in classification performance over the other methods.

Fig. 17. Impact of different number of labeled samples on OA results for training. OA results were obtained by all algorithms on (a) IN dataset, (b) UP dataset, and (c) KSC dataset.

To verify the contribution of different categories to both AA and kappa as the number of labeled training samples changes, a further experiment was performed on the three datasets with the proposed SSAT-GAN. Fig. 18 illustrates the CA of each class on the three datasets. SSAT-GAN obtains stable CAs for the “Grass-trees,” “Hay-windrowed,” and “Woods” classes regardless of the total number of labeled samples, owing to the discriminative spectral characteristics of these three ground materials in the IN dataset. It therefore still achieves satisfactory classification performance even with relatively few labeled samples. Furthermore, the CA of the remaining classes in the IN dataset improves and then stabilizes as the number of labeled samples increases.

Fig. 18. Class accuracy results for each class with different number of total labeled samples for training over the SSAT-GAN on (a) IN dataset, (b) UP dataset, and (c) KSC dataset.

For the UP dataset illustrated in Fig. 18(b), the CAs of the “Meadows,” “Painted metal sheets,” and “Shadows” classes show negligible variation as the number of labeled samples increases. For the other classes, the accuracy of the proposed SSAT-GAN tends to stabilize as the number of samples increases. Similar behavior can be found in Fig. 18(b) and (c). Overall, not all classes contribute to AA and kappa to the same degree as the number of labeled training samples changes. The reason may be that the spectral signatures suffer from spectral variability, which stems from illumination and atmospheric conditions. However, our SSAT modules can alleviate such limitations of the spectral characteristics, as illustrated by the high accuracy values on the three datasets.

SECTION V.

Discussion

There are three differences between the proposed SSAT-GAN and the GAN-based methods for HSI classification [29], [32], [33]. First, SSAT-GAN takes the attention information of HSIs into account for both the discriminator and the generator. Second, the discriminator in the adversarial framework adds unlabeled samples for semisupervised learning and alleviates the impact of small samples. Third, a mean minimization loss is employed for the unsupervised learning of SSAT-GAN to reduce the complex calculation parameters of high-dimensional features so as to achieve steady-state performance of GAN.

The SSAT-GAN models incorporate the SSAT as a feature perception enhancement step in the feature extraction stage, which builds a high-SNR spectral domain and a physically denoised contextual area along the spectral and spatial dimensions, respectively. Compared with the attention mechanisms used in the vision community, the SSAT considers the long-range correlations between neighboring HSI cubes. This property helps the SSAT-GAN framework to better filter noise in areas with different spectral purity and texture information.

We gain three major insights from the semisupervised HSI classification outcomes of GANs on all three datasets. First, by taking the spectral–spatial discriminative features of the training data into account, the discriminators of SSAT-GANs extract efficient and significant HSI characteristics and achieve better classification accuracies. Second, the unlabeled samples and the generated HSI samples used in unsupervised learning make the discriminators more robust within the adversarial framework and help them learn the complex real data distribution of HSIs for prediction. This alternate training mode enables semisupervised GANs to achieve better classification outcomes than supervised deep learning frameworks. Third, the mean minimization loss treats the constrained optimization of the high-dimensional feature maps generated by the discriminators as a smooth filtering by calculating the efficiency values, which imposes correlation in homogeneous regions, including high-texture areas and spectrally pure domains.

SECTION VI.

Conclusion

In this article, an SSAT-GAN approach for HSI classification is proposed, which uses a cascade feature representation of spectral–spatial attributes with the SSAT. The proposed model improves the transmission of characteristics with extended spectral–spatial attention feature spread and generation blocks. It applies the attention weights to emphasize both spectral bands and spatial correlations, which improves the characterization during feature extraction. In addition, SSAT-GAN constructs a semisupervised architecture by adding unlabeled samples for training to alleviate the scarcity of training samples. Furthermore, we employ the mean minimization loss for the unsupervised learning of the discriminator to avoid mode collapse. In terms of accuracy and computation, the analysis on the three HSI datasets indicates that our model achieves excellent performance.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the three anonymous reviewers for their outstanding comments and suggestions, which greatly helped the authors to improve the technical quality and presentation of this article.
