Introduction
As an essential observation technology, hyperspectral remote sensing can simultaneously capture the spectral and spatial characteristics of ground objects in a scene. To date, hyperspectral images (HSI) have been widely used in urban planning [1], agricultural monitoring [2], mineral exploration [3], and military reconnaissance [4]. For many of these applications, HSI classification is a prerequisite.
Various supervised methods from the machine learning community have been applied to HSI classification, such as the support vector machine (SVM) [12] and random forest (RF) [9]. Generally, these algorithms partition the spectral feature space by learning a decision boundary.
Compared with the abovementioned shallow models, deep learning-based models can train the classifier in a data-driven manner while extracting hierarchical features, forming a unified end-to-end framework. Consequently, deep learning has gradually become a powerful tool for HSI classification in recent years [13]. Typical deep learning models include the convolutional neural network (CNN) [14], [24], stacked autoencoder (SAE) [25], recurrent neural network (RNN) [26], and deep belief network (DBN) [27]. Among these methods, the inputs of RNN, SAE, and DBN are spectral vectors that contain no spatial features, which often leads to unsatisfactory classification. In contrast, CNN can extract the spectral and spatial features of HSI simultaneously and adopts local connections and weight sharing to reduce the number of parameters, so it has drawn great attention in the field of HSI classification. Hu et al. [14] first designed a five-layer CNN that takes the spectrum of each pixel as input and extracts spectral features for classification. Besides, a pixel-pair voting strategy enabled a one-dimensional convolutional neural network (1D-CNN) to achieve promising classification results with limited training samples [15]. However, lacking the texture and context information of the samples, 1D-CNN is prone to misclassification. Therefore, some scholars [16], [21] have introduced spatial features into the network to construct joint spectral-spatial frameworks, which can be roughly divided into two categories. The first paradigm builds a two-branch structure in which each branch extracts the spectral or spatial features, respectively, and then concatenates these features for classification [16], [18]. For example, Xu et al. [16] developed a spectral-spatial unified network (SSUN), employing a long short-term memory (LSTM) model and a multiscale convolutional neural network to extract the spectral and spatial features, respectively. The other paradigm receives 3D cubes containing both spectral and spatial information and extracts joint features with one or more convolutional operators [19], [21]. For instance, the multiscale 3D deep convolutional neural network (M3D-DCNN) [20] used 3D convolutional operators to extract multiscale spatial and spectral features, reporting impressive results. In addition, some studies [22], [24] combine CNN with self-supervised learning, exploiting a large number of unlabeled data and achieving promising classification results.
Although CNN-based methods have achieved excellent classification results, they are prone to overfitting when their substantial numbers of learnable parameters are tuned with limited training data [28], [29]. Unfortunately, gathering labeled data is expensive and time-consuming in the field of remote sensing, and the available data often follow a long-tailed distribution, which hinders the application of CNN.
The generative adversarial network (GAN) [30] was put forward to generate high-quality images through its unique adversarial training process between a generator and a discriminator. With the advancement of GAN, hundreds of variants have been derived; among the most popular are the conditional generative adversarial network (CGAN) [31], the deep convolutional generative adversarial network (DCGAN) [32], and the Wasserstein GAN [33]. To alleviate the overfitting problem of CNN-based methods, some scholars [34], [44] introduced GAN into HSI classification, yielding encouraging results with small sample sizes. Zhan et al. [34] proposed a semi-supervised classification method based on a 1D-GAN, which was the first application of GAN to HSI classification. A DCGAN-based method was then proposed in which the discriminator takes as input the first three principal components obtained by principal component analysis (PCA) of the original image, achieving commendable classification results [35]. Zhan et al. [36] further refined the initial spectral-only classification through a voting mechanism over a dynamic neighborhood. A multiclass spatial-spectral GAN (MSGAN) [37] was developed with two generators that produce fake spectral and spatial samples, respectively, together with novel multiclass adversarial objectives, achieving impressive results. To exploit the rich information in unlabeled samples, the generator network in the multitask GAN (MTGAN) [38] was designed to undertake the reconstruction and classification tasks simultaneously. To improve generalization, a self-attention-based GAN [39] was combined with a variational autoencoder (VAE) [45], in which the generator receives both encoder-generated and random latent vectors to produce enhanced virtual samples.
Although the above GAN-based models have achieved satisfying HSI classification performance, the training quality of such models hinges on the gradients transmitted from the discriminator to the generator. When the GAN is too deep, these gradients may vanish as they accumulate through the layers. Furthermore, Arjovsky and Bottou [46] pointed out that when the overlap between the distributions of the real and generated data is negligible, the discriminator passes uninformative gradients to the generator. These problems are major contributors to the training instability of GAN, which limits its classification accuracy. To improve training stability, the multiscale gradients GAN (MSG-GAN) [47] was developed for synthesizing high-resolution faces; it connects the intermediate layers of the generator with those of the discriminator so that multiscale gradients can be passed directly from the discriminator to the generator. To address the training instability of GAN for HSI classification, this article establishes multiscale connections between the discriminator and the generators, inspired by MSG-GAN. The main contributions of this article are summarized as follows.
We propose a two-branch generative adversarial network with multiscale connections (TBGAN) for HSI classification. The generators in TBGAN produce virtual spectra and spatial patches to alleviate the small-sample problem.
To improve training stability, multiscale connections are established between the discriminator and the two generators. Moreover, a feature-matching term is added to the loss function to further increase stability.
A two-branch discriminator is designed in TBGAN to extract joint spectral-spatial features. After training, the discriminator can be directly employed as the classifier.
Methodology
A. Basic Framework of GAN
Before formally introducing the TBGAN method, we first review the basics of GAN. Motivated by two-person zero-sum game theory, the GAN model [30] adopts an adversarial training process to optimize deep learning models; it consists of a generator $G$ and a discriminator $D$, which are trained with the following minimax objective:\begin{align*} \min _{G} \max _{D} V(D,G)=&E_{x\sim p_{data} \left ({x }\right)} \left [{ {\log D(x)} }\right] \\&+\,E_{z\sim p_{z} \left ({z }\right)} \left [{ {\log \left ({{1-D(G(z))} }\right)} }\right]\tag{1}\end{align*}
where $p_{data}(x)$ denotes the distribution of the real data and $p_{z}(z)$ denotes the prior distribution of the noise vector $z$.
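For readers who prefer code to equations, the following PyTorch sketch illustrates how this minimax objective is typically optimized by alternating discriminator and generator updates; the toy network shapes and optimizer settings are placeholders rather than the networks used in this article.

```python
import torch
import torch.nn as nn

# Placeholder networks; the dimensions are illustrative only.
G = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 103))   # noise -> fake spectrum
D = nn.Sequential(nn.Linear(103, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_x):
    batch = real_x.size(0)
    z = torch.randn(batch, 100)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    d_loss = bce(D(real_x), torch.ones(batch, 1)) + \
             bce(D(G(z).detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step: the non-saturating form maximizes log D(G(z)).
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```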
B. Proposed Method
Inspired by the adversarial training mechanism of GAN, this article proposes the TBGAN framework for the classification of ground objects by extracting joint spectral-spatial features. Similar to the traditional GAN, TBGAN consists of generators and a discriminator. As can be seen from Fig. 2, two branches are devised in TBGAN, and the framework is composed of three modules: the spectral generator $G_{\textrm{spec}}$, the spatial generator $G_{\textrm{spat}}$, and the two-branch discriminator $D$.
It is worth noting that the intermediate layers of the two generators are connected with those of the discriminator, forming the multiscale connections that allow gradients at multiple scales to flow directly from the discriminator to the generators.
1) Spectral and Spatial Generators of TBGAN
The generators $G_{\textrm{spec}}$ and $G_{\textrm{spat}}$ take a random noise vector together with a class label as input and produce a virtual spectrum and a virtual spatial patch, respectively; their intermediate layer outputs are also delivered to the discriminator through the multiscale connections.
However, existing models still struggle to capture long-term dependencies across the numerous spectral bands of HSI [48]. Recently, the self-attention mechanism [49] has emerged as a promising way to address this issue by obtaining global information from the feature maps through simple query and assignment operations [50]. Therefore, self-attention is introduced into the spectral generator $G_{\textrm{spec}}$ to model the long-term dependencies among spectral bands.
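A minimal sketch of such a self-attention block applied to a 1D spectral feature map is shown below; it follows the common query-key-value formulation with a learned residual scale, and the channel sizes are illustrative assumptions rather than the exact configuration of $G_{\textrm{spec}}$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralSelfAttention(nn.Module):
    """Self-attention over a 1D spectral feature map of shape (B, C, L)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.key   = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual scale

    def forward(self, x):
        q = self.query(x)                                            # (B, C//8, L)
        k = self.key(x)                                              # (B, C//8, L)
        v = self.value(x)                                            # (B, C, L)
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)    # (B, L, L) attention map
        out = torch.bmm(v, attn.transpose(1, 2))                     # aggregate global context
        return self.gamma * out + x                                  # residual connection

# Example: attend across 103 spectral positions with 32 feature channels.
x = torch.randn(4, 32, 103)
y = SpectralSelfAttention(32)(x)
```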
2) Discriminator of TBGAN
In this article, a two-branch discriminator $D$ is designed, in which the spectral branch and the spatial branch extract the spectral and spatial features, respectively; the features from the two branches are then combined to perform the multi-classification.
Fig. 5 exhibits the structure of the Conv-Block in the spatial branch, which is nearly identical to that in the spectral branch. For the input feature maps, the height and width are denoted as $2w$, and $c$ is the number of channels. To obtain the spatial features, the Conv-Block first performs a strided convolution to halve the size of the feature maps and double the number of channels, and then concatenates the resulting feature maps with the multiscale features. These multiscale features consist of the intermediate layer outputs of the generator and downsampled versions of the real data. The concatenated feature maps are then passed through a further convolution layer for feature fusion.
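A rough sketch of how such a Conv-Block could be implemented is given below; the kernel sizes, channel counts, and activation are assumptions, since they are not fully specified here, and the multiscale features are assumed to arrive already at the target resolution.

```python
import torch
import torch.nn as nn

class SpatialConvBlock(nn.Module):
    """Halve the spatial size, double the channels, then fuse with multiscale features."""
    def __init__(self, in_ch, extra_ch):
        super().__init__()
        # Strided convolution: (B, in_ch, 2w, 2w) -> (B, 2*in_ch, w, w).
        self.down = nn.Conv2d(in_ch, 2 * in_ch, kernel_size=3, stride=2, padding=1)
        # Fuse the concatenation of downsampled features and multiscale features.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * in_ch + extra_ch, 2 * in_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x, multiscale):
        # `multiscale` gathers the generator's intermediate output and the
        # downsampled real patch, already at resolution (w, w).
        h = self.down(x)
        h = torch.cat([h, multiscale], dim=1)
        return self.fuse(h)

# Example: a 16x16 input with 32 channels and 6 extra multiscale channels.
block = SpatialConvBlock(in_ch=32, extra_ch=6)
out = block(torch.randn(2, 32, 16, 16), torch.randn(2, 6, 8, 8))
```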
To avoid the training instability caused by gradient accumulation, the intermediate layers of the generators are directly connected with the corresponding layers of the discriminator, so that the discriminator can pass multiscale gradients back to the generators.
Meanwhile, the number of intermediate layer outputs in each generator is kept consistent with the number of Conv-Blocks in the corresponding discriminator branch, so that the generated features and the downsampled real data can be concatenated at every scale.
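As an illustration of how the real samples could be matched to the intermediate resolutions of the discriminator, the snippet below builds a simple average-pooling pyramid; the number of scales and the pooling operator are assumptions.

```python
import torch
import torch.nn.functional as F

def real_data_pyramid(x_spat, num_scales=3):
    """Downsample a real patch (B, C, H, W) to each intermediate resolution."""
    pyramid = [x_spat]
    for _ in range(num_scales - 1):
        pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2))  # halve height and width
    return pyramid  # one tensor per discriminator scale, coarsest last

# Example: a 32x32 patch yields 32-, 16-, and 8-pixel versions.
scales = real_data_pyramid(torch.randn(2, 10, 32, 32))
```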
3) Loss Function of TBGAN
The discriminator in the classical GAN uses a sigmoid classifier to judge whether an input is real or fake, which is a binary classification. For multi-classification, the discriminator in the ACGAN [51] method is equipped with a softmax classifier to undertake the multi-classification task. In recent years, to improve the adversarial training for multi-classification, a multiclass adversarial strategy [37] was devised that enables the softmax layer to simultaneously discriminate the input source and complete the classification task. For this reason, this multiclass adversarial strategy is also introduced into TBGAN. Meanwhile, a feature-matching term is added to the loss function, encouraging the generated samples to follow the distribution of the real data. Consequently, the loss function of TBGAN is defined as follows:\begin{align*} \begin{cases} L_{G} =L_{c} +\lambda L_{s} \\ L_{D} =L_{real} +L_{fake} \\ \end{cases}\tag{2}\end{align*}
\begin{align*} \begin{cases} L_{c} =CE\left ({{D(G_{\textrm {spec}} (z_{\textrm {spec}},y),G_{\textrm {spat}} (z_{\textrm {spat}},y)),y} }\right) \\ L_{s} =\left \|{ {f_{1} (X_{\textrm {spec}})-f_{1} (G_{\textrm {spec}} (z_{\textrm {spec}},y))} }\right \|_{2}^{2} \\ \qquad +\,\left \|{ {f_{2} (X_{\textrm {spat}})-f_{2} (G_{\textrm {spat}} (z_{\textrm {spat}},y))} }\right \|_{2}^{2} \\ L_{real} =CE\left ({{D(X_{\textrm {spec}},X_{\textrm {spat}}),y} }\right) \\ L_{fake} =CE\left ({{D(G_{\textrm {spec}} (z_{\textrm {spec}},y),G_{\textrm {spat}} (z_{\textrm {spat}},y)),y_{fake}} }\right) \\ \end{cases}\tag{3}\end{align*}
Besides, to alleviate the overconfidence of the discriminator, the labels in (3) are smoothed following the strategy adopted in [52]. Concretely, a smoothing hyperparameter is introduced so that the hard targets of the real classes are slightly softened, which prevents the discriminator from assigning extreme probabilities.
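The sketch below shows one way (2) and (3) could be assembled, with $f_{1}$ and $f_{2}$ assumed to be intermediate feature maps returned by the two discriminator branches and a label-smoothing value applied to the real labels; the generator and discriminator signatures are hypothetical placeholders, not the exact interfaces of TBGAN.

```python
import torch
import torch.nn.functional as F

def tbgan_losses(D, G_spec, G_spat, x_spec, x_spat, y, num_classes, lam=1.0, eps=0.1):
    """Compute L_G and L_D of (2)-(3); index `num_classes` is reserved for the fake class."""
    z_spec = torch.randn(y.size(0), 100)
    z_spat = torch.randn(y.size(0), 100)
    g_spec, g_spat = G_spec(z_spec, y), G_spat(z_spat, y)

    # Discriminator loss: real samples -> smoothed true labels, fakes -> fake class.
    logits_real, feat_spec_r, feat_spat_r = D(x_spec, x_spat)
    logits_fake, _, _ = D(g_spec.detach(), g_spat.detach())
    y_fake = torch.full_like(y, num_classes)
    L_real = F.cross_entropy(logits_real, y, label_smoothing=eps)
    L_fake = F.cross_entropy(logits_fake, y_fake)
    L_D = L_real + L_fake

    # Generator loss: push fakes toward the true class, plus feature matching (L_s).
    logits_g, feat_spec_g, feat_spat_g = D(g_spec, g_spat)
    L_c = F.cross_entropy(logits_g, y)
    L_s = F.mse_loss(feat_spec_g, feat_spec_r.detach()) + \
          F.mse_loss(feat_spat_g, feat_spat_r.detach())
    L_G = L_c + lam * L_s
    return L_G, L_D
```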
4) Procedure of TBGAN
As shown in Table 1, the specific procedure of the TBGAN method consists of virtual sample generation, joint spectral-spatial feature extraction, and ground object classification; a simplified outline is sketched below.
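This pseudocode-style outline reuses the hypothetical `tbgan_losses` helper above and the schedule of three discriminator updates per generator update described later in the complexity discussion; all names (`train_loader`, `opt_d`, `opt_g`, `num_epochs`, `test_spec`, `test_spat`) are placeholders.

```python
import torch

for epoch in range(num_epochs):
    for x_spec, x_spat, y in train_loader:
        # Step 1: virtual sample generation and discriminator updates (three per iteration).
        for _ in range(3):
            _, L_D = tbgan_losses(D, G_spec, G_spat, x_spec, x_spat, y, num_classes)
            opt_d.zero_grad()
            L_D.backward()
            opt_d.step()

        # Step 2: joint update of the two generators against the current discriminator.
        L_G, _ = tbgan_losses(D, G_spec, G_spat, x_spec, x_spat, y, num_classes)
        opt_g.zero_grad()
        L_G.backward()   # gradients reaching D are cleared at the next zero_grad
        opt_g.step()

# Step 3: ground object classification with the trained discriminator.
with torch.no_grad():
    logits, _, _ = D(test_spec, test_spat)
    pred = logits[:, :num_classes].argmax(dim=1)  # drop the reserved "fake" column
```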
Experiments
To demonstrate the classification performance of the proposed TBGAN, experiments are conducted on the Pavia University, Salinas, and Indian Pines datasets. In the experiments, 10% of the labeled samples are randomly selected for training, and the remainder is used for testing. Besides, the per-class accuracy, average accuracy (AA), overall accuracy (OA), and Kappa coefficient are employed as indicators for measuring the classification results.
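For reference, these indicators can be computed from a confusion matrix as in the generic sketch below; this is not the authors' evaluation code.

```python
import numpy as np

def classification_metrics(conf):
    """Per-class accuracy, OA, AA, and Kappa from a square confusion matrix."""
    conf = np.asarray(conf, dtype=np.float64)
    total = conf.sum()
    per_class_acc = np.diag(conf) / conf.sum(axis=1)              # recall per class
    oa = np.diag(conf).sum() / total                               # overall accuracy
    aa = per_class_acc.mean()                                      # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return per_class_acc, oa, aa, kappa

# Example with a toy 3-class confusion matrix.
per_cls, oa, aa, kappa = classification_metrics([[50, 2, 1], [3, 45, 2], [0, 4, 43]])
```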
A. Data Description
1) Pavia University Dataset
The Pavia University dataset was captured over the city of Pavia, Italy, by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor during a flight campaign in 2003. The imaging wavelength of ROSIS ranges from 430 to 860 nm, and 103 spectral bands are retained after removing 12 bands severely affected by noise. This dataset contains 610 × 340 pixels and nine land-cover classes.
2) Salinas Dataset
The Salinas dataset was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Salinas Valley. The imaging wavelength range of AVIRIS is from 400 to 2500 nm, and 204 bands remain after eliminating the water-absorption bands. This dataset has a size of 512 × 217 pixels and contains 16 land-cover classes.
3) Indian Pines Dataset
The Indian Pines dataset was also collected by the AVIRIS sensor, with a size of 145 × 145 pixels and 16 land-cover classes.
B. Experimental Setting
To evaluate the performance of the proposed TBGAN model, comparative experiments are designed against six representative HSI classification methods, including RBF-SVM [12], RF [9], LSTM [14], SSUN [16], M3D-DCNN [20], and DCGAN [35]. Meanwhile, exploratory experiments are additionally conducted on two reduced versions of TBGAN that contain only the spectral branch or only the spatial branch, named TB-SPE and TB-SPA, respectively. For the comparison models, hyper-parameters such as the gamma of RBF-SVM are set following the corresponding publications.
Besides, the experiments are conducted with the PyTorch backend on an NVIDIA 1080Ti GPU (one card, 11-GB memory, CUDA version 11.0). Since the models may be influenced by random initialization, the mean and standard deviation of the classification results over ten runs are taken as the final experimental basis.
C. Classification Results
Tables 7–9 present the classification results of the nine methods on the Pavia University, Salinas, and Indian Pines datasets, respectively. Each table records, from top to bottom, the mean per-class accuracy, AA, OA, and Kappa coefficient over ten runs, as well as the standard deviations of the latter three metrics. As can be seen from Tables 7–9, the deep learning methods generally achieve better classification performance than the traditional machine learning methods by exploiting hierarchical features. Furthermore, the GAN-based methods can generate additional training samples, which benefits network training and allows them to achieve higher accuracy than the other deep learning methods. Among the GAN-based methods, TBGAN exceeds TB-SPE and TB-SPA, which demonstrates the benefit of exploiting joint spectral-spatial features. DCGAN and TB-SPA achieve encouraging results, which can be attributed to the PCA transformation that partly introduces spectral information. By virtue of the multiscale connections and the two-branch structure, TBGAN obtains the best classification results among the nine methods. For the Pavia University dataset, the OA of TBGAN is 5.67%, 1.78%, and 0.13% higher than that of LSTM, M3D-DCNN, and SSUN, respectively. For the Salinas dataset, TBGAN attains the best per-class accuracy for 12 classes, 7 of which reach 100%, and its ten-run average OA reaches 99.98%. In addition, TBGAN also performs well on the imbalanced Indian Pines dataset; for example, it achieves 96.80% accuracy for the Grass-pasture-mowed class with only 3 training samples.
In addition to the quantitative comparisons in Tables 7–9, a qualitative comparison is provided by creating classification maps for each method on the three HSI datasets. As exhibited in Figs. 6–8, the classification maps obtained by TBGAN are closer to the ground truths and contain fewer outliers than those of the other methods, which further confirms the effectiveness of the proposed method. Moreover, because the input of TB-SPA is a 3D cube of spatial neighborhoods rather than a single spectrum, its classification maps rely mainly on spatial context.
Classification maps of different models on the Pavia University dataset. (a) Ground truth, (b) RBF-SVM, (c) RF, (d) LSTM, (e) M3D-DCNN, (f) SSUN, (g) TB-SPE, (h) TB-SPA, and (i) TBGAN.
Classification maps of different models on the Salinas dataset. (a) Ground truth, (b) RBF-SVM, (c) RF, (d) LSTM, (e) M3D-DCNN, (f) SSUN, (g) TB-SPE, (h) TB-SPA, and (i) TBGAN.
Classification maps of different models on the Indian Pines dataset. (a) Ground truth, (b) RBF-SVM, (c) RF, (d) LSTM, (e) M3D-DCNN, (f) SSUN, (g) TB-SPE, (h) TB-SPA, and (i) TBGAN.
D. Model Complexity
To assess the complexity of the proposed TBGAN, Table 10 presents the number of parameters (Params) and floating-point operations (FLOPs) of the seven deep learning methods. The results show that TBGAN has fewer parameters than SSUN and DCGAN, but its actual computation is slower than that of the other models due to the two-branch structure.
For a more comprehensive evaluation, the running time of the nine methods on each dataset is provided in Table 11. Generally speaking, shallow machine learning models are more efficient than deep learning algorithms. More significantly, the four GAN-based models take more time during the training stage than the other deep learning models because both the generator and the discriminator need to be trained. In particular, the proposed TBGAN and its sub-models TB-SPE and TB-SPA all adopt a training strategy of updating the discriminator three times for every generator update. TBGAN requires a longer training time than TB-SPE and TB-SPA, probably because it needs to extract joint spectral-spatial features and update two generators in each training iteration.
Discussion
Relevant experiments are carried out to explore the impacts of several significant influencing factors, namely the patch size, the hyper-parameter $\lambda$, the self-attention mechanism, and the number of training samples.
A. Impacts of the Patch Sizes
Obviously, the performance of TBGAN is susceptible to the patch size. Larger patches may contain redundant information, resulting in lower classification accuracy and heavier computation; in contrast, smaller patches may provide insufficient spatial features for training the model, leading to misclassification. In the experiments, four spatial neighborhoods of different sizes are compared.
B. Optimal Choice of Hyper-Parameter $\lambda$ in $L_{G}$
The hyper-parameter $\lambda$ in (2) balances the classification term $L_{c}$ and the feature-matching term $L_{s}$ in the generator loss $L_{G}$.
C. Advantages of Self-Attention Mechanism
To capture the long-term dependencies in the spectral sequences, the self-attention mechanism is embedded in the spectral generator $G_{\textrm{spec}}$.
D. Sensitivity to the Number of Training Samples
To investigate the sensitivity of the different classification methods to the number of training samples, 10%, 9%, 8%, 7%, and 6% of the labeled samples are successively selected from the three datasets. As shown in Fig. 9, the classification accuracy of all nine methods declines to varying degrees as the training samples are reduced. As is well known, deep learning methods require extensive training samples to optimize their parameters, and insufficient samples tend to cause overfitting, thus reducing the classification accuracy. In contrast, the four GAN-based models, by generating real-like samples, can alleviate the overfitting caused by the reduction of training samples. Specifically, as the ratio of training samples decreases from 10% to 6%, the OA of TBGAN on the three datasets declines by only 0.2%, 0.09%, and 2.3%, respectively, noticeably less than that of the other methods.
OA of RBF-SVM, RF, LSTM, M3D-DCNN, SSUN, TB-SPE, TB-SPA, and TBGAN with different ratios of training samples on (a) Pavia University dataset, (b) Salinas dataset, and (c) Indian Pines dataset.
Conclusion
This article proposes a novel TBGAN model for HSI classification. Specifically, two generators are devised in TBGAN to produce real-like spectral and spatial data, respectively, which alleviates the small-sample problem. Furthermore, the spectral generator is integrated with a self-attention mechanism, improving its ability to model long-term dependencies. For the multi-classification task, an elaborate two-branch discriminator is designed in TBGAN to extract the spectral and spatial features more thoroughly. It is particularly worth mentioning that multiscale connections are placed between the discriminator and the two generators to improve the training stability and the classification capability. Meanwhile, a feature-matching term is added to the loss function to make the training process more stable. The experimental results demonstrate that TBGAN delivers superior classification performance and shows lower sensitivity to the number of training samples, exhibiting great potential for classification with small sample sizes. In future research, more innovative strategies are expected to be developed in GAN-based supervised frameworks to further improve HSI classification performance.
ACKNOWLEDGMENT
The authors would like to thank the editor and the anonymous reviewers for their efforts and constructive comments, which have greatly improved the technical quality and presentation of this study.