Introduction
Remote sensing technology for Earth observation is widely used in land census, environmental protection, and disaster monitoring. The classification of remote sensing images is the basis of, and an important technical support for, remote sensing applications. However, with the continuous improvement in the spatial resolution of remote sensing images, the phenomena of “different objects with the same spectrum” and “the same object with different spectra” have become more serious, and the details of object features are more complicated than before. These conditions raise the difficulty of classifying remote sensing images with high spatial resolution. Meanwhile, the effect of supervised classification is closely related to the quality and quantity of samples, but the production of samples is time consuming and laborious. This situation poses a challenge to the classification of remote sensing images with high spatial resolution under limited samples.
Generative adversarial networks (GANs) [1] have attracted widespread attention from researchers in recent years. A GAN is a deep learning model that generates the desired output through a game between two neural networks. It is inspired by a two-player, zero-sum game, in which the sum of the players' payoffs is zero or a constant: one party gains what the other loses. The two parties in a GAN are the generation model (G) and the discriminant model (D). Random noise z, for example drawn from a Gaussian distribution, is input to the generator G, which outputs “fake” data of the same size as the training image. The discriminant model D, a binary classifier, estimates the similarity between the “fake” data generated by G and the real data and feeds this signal back to G. The generation and discriminant models compete with each other during training; hence, the “fake” data produced by the generator eventually suffice to deceive the discriminator into outputting a high probability. GANs show strong capabilities in the field of image generation. On the basis of the GAN model, scholars have proposed various structures for different image applications to improve the quality of generated images. The main variants include: improving the stability of GAN training and the quality of generated results through the fusion of multiple network structures [2]; modifying the loss function to alleviate unstable GAN training and poor diversity of generated images [3], [4]; adding an autoencoder as a classifier and matching the loss distribution of the autoencoder on the basis of the Wasserstein distance, so that the generator and discriminator are balanced to improve the quality and diversity of generated images [5]–[7]; and building a perceptual loss function combining adversarial loss and content loss to generate realistic images at four-times magnification for superresolution [8], [9]. All of the aforementioned variants generate images by unsupervised learning, so the category, target color, and texture of the generated images are random. In contrast, conditional GAN (CGAN) can generate images of a designated category by adding condition variables to the generator and discriminator [10]. Many scholars have used these networks to achieve improved applications in different fields [11]–[17]. Nevertheless, most of these applications are based on nonpixel-level image generation. Isola et al. proposed an end-to-end CGAN based on label conditions, named pix2pix, which can generate multiclass pixel-level images [18]. The images generated using pix2pix, however, are inadequate in texture features and insufficiently accurate in boundary features. Texture, spectral, and edge information is important in the application of remote sensing images with high spatial resolution; pix2pix therefore has defects in generating such images and cannot meet the needs of the corresponding applications. Centimeter-level spatial resolution remote sensing images have extremely complex spatial, textural, and spectral characteristics. For this type of data, classification robustness is mainly improved by learning from a large number of training samples, so the number of training samples is an important factor restricting the accuracy of the classification of remote sensing images with high spatial resolution.
The collection and production of samples require excessive manpower and material resources.
Therefore, in view of the abovementioned two problems, we propose a CGAN based on boundary and edge features (ECGAN) for generating remote sensing images with high spatial resolution in this article. The generated images are used for data augmentation to solve the problem of insufficient training samples. The key contributions of this article are as follows.
An end-to-end ECGAN, which adds interclass boundary and intraclass edge feature factors into condition variables to improve the accuracy of the texture and edge information of generated images, is designed. The boundary feature focuses on improving the boundary accuracy of the generated images and reducing the “false” edges that do not exist. The edge feature focuses on improving the accuracy of the internal texture of the generated images.
An objective function combining cross-entropy loss and multilevel feature loss with L1 loss is designed to train the discriminator and generator. The multilevel feature loss with L1 distance minimizes the differences between the features of real and generated images at different spatial scales, thereby improving the detail features of the generated images.
A new augmentation method for remote sensing images based on ECGAN is proposed. We classify high spatial resolution remote sensing images with a semantic segmentation method under different data augmentation methods. Compared with traditional augmentation methods, the images generated by ECGAN have more diverse spatial and spectral information. The proposed method can improve the training stability and classification accuracy of the semantic segmentation network.
The rest of this article is organized into four sections. Section II introduces the proposed methods and network architecture. Results and discussion are presented in Section III, which includes the data set we used to validate the network architecture and model parameters. We draw the conclusion in Section IV.
Methodology
A. Condition Variables With Boundary and Edge Features
In this article, boundary feature refers to the rough features between the categories (heterogeneous) and edge feature refers to the detail features within the categories (homogeneous). The texture and spatial information of remote sensing images with high spatial resolution, especially aerial remote sensing images, are extremely complex, causing the generation of such images to be challenging. Therefore, in this article, boundary feature factors are added to enhance the distinction among different objects, and edge feature factors are added to enhance the details and texture inside the objects.
A boundary feature map is obtained using the difference between bounded and unbounded label data, and an edge feature map is obtained using an edge detection algorithm (see Fig. 1). Three typical edge detection operators are the Laplace [19], Sobel [20], and Canny [21] operators. Among them, the Laplace operator is sensitive to noise and is thus rarely used to detect edges; the Sobel operator works well for images with gradual gray-level changes and considerable noise, but its edge localization is inadequately accurate; the Canny operator is insusceptible to noise and can detect real weak edges, using two different thresholds to identify strong and weak edges. Therefore, in this article, we extract edge features by using the Canny detection operator.
(a) High-resolution remote sensing image. (b) Boundary feature map, which is obtained from the difference between bounded and unbounded label data. (c) Edge feature map, which is obtained by the Canny operator.
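Since the exact construction of the boundary feature map from bounded and unbounded label data is implementation specific, the following Python sketch realizes the described difference as a morphological gradient of the label map (dilation minus erosion), which marks interclass boundaries; the function name, kernel size, and use of OpenCV are illustrative assumptions rather than the article's implementation.

```python
import cv2
import numpy as np

def extract_boundary_feature(label_map, kernel_size=3):
    """Approximate the boundary feature map as the difference between a
    dilated ("bounded") and an eroded ("unbounded") version of the label map.

    label_map is assumed to be a single-channel uint8 array of class indices.
    """
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(label_map, kernel)
    eroded = cv2.erode(label_map, kernel)
    # Pixels where dilation and erosion disagree lie on interclass boundaries.
    boundary = (dilated != eroded).astype(np.uint8) * 255
    return boundary
```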
The Canny edge detection algorithm is divided into the following five steps (a code sketch follows the list).
1) A Gaussian filter is used to smooth the image and filter out the noise. The generating equation for a Gaussian filter kernel of size 3 × 3 is given by the following formula (σ is the standard deviation of the Gaussian function)
\begin{equation*}
{h_{ij}} = \frac{1}{{2\pi {\sigma ^2}}}\exp \left[ { - \frac{{{{(i - 2)}^2} + {{(j - 2)}^2}}}{{2{\sigma ^2}}}} \right];\quad 1 \leq i,j \leq 3
\end{equation*}
\begin{equation*}
H = \left[ {\begin{array}{rcl} {{h_{11}}}&{{h_{12}}}&{{h_{13}}}\\ {{h_{21}}}&{{h_{22}}}&{{h_{23}}}\\ {{h_{31}}}&{{h_{32}}}&{{h_{33}}} \end{array}} \right].\tag{1}
\end{equation*}
After Gaussian filtering, the brightness value of the center pixel e is
\begin{equation*}
e = H*A = \left[ {\begin{array}{rcl} {{h_{11}}}&{{h_{12}}}&{{h_{13}}}\\ {{h_{21}}}&{{h_{22}}}&{{h_{23}}}\\ {{h_{31}}}&{{h_{32}}}&{{h_{33}}} \end{array}} \right]*\left[ {\begin{array}{rcl} a&b&c\\ d&e&f\\ g&h&i \end{array}} \right]\tag{2}
\end{equation*}
where * denotes the convolution operation.
2) The gradient intensity and direction of each pixel in the image are calculated. In this article, they are calculated with the Sobel operator [i.e., (3)]; the calculation process is shown in (4):
\begin{align*}
{S_x} &= \left[ {\begin{array}{rcl} { - 1}&0&1\\ { - 2}&0&2\\ { - 1}&0&1 \end{array}} \right],\quad {S_y} = \left[ {\begin{array}{rcl} 1&2&1\\ 0&0&0\\ { - 1}&{ - 2}&{ - 1} \end{array}} \right]\tag{3}\\
{G_x} &= {S_x}*A\\
{G_y} &= {S_y}*A\\
G &= \sqrt {G_x^2 + G_y^2} \\
\theta &= \arctan \left( {{G_y}/{G_x}} \right)\tag{4}
\end{align*}
3) Nonmaximum suppression is applied to eliminate the spurious responses caused by edge detection. The gradient intensity of the current pixel is compared with those of the two pixels along the positive and negative gradient directions. If the gradient intensity of the current pixel is the largest of the three, the pixel is retained as an edge point; otherwise, the pixel is suppressed.
4) Dual-threshold (high and low threshold) detection is applied to determine real and potential edges.
5) Edge detection is completed by suppressing the potential edges that are not real, i.e., weak edges not connected to strong edges.
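As a minimal illustration of the five steps above, the following Python sketch uses OpenCV, whose cv2.Canny routine performs steps 2)–5) internally while the Gaussian smoothing of step 1) is applied separately; the kernel size and hysteresis thresholds here are illustrative assumptions rather than the article's settings.

```python
import cv2

def extract_edge_feature(image_path, low_thresh=50, high_thresh=150):
    """Return a binary edge feature map for one image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Step 1: Gaussian smoothing with a 3x3 kernel (sigma chosen by OpenCV).
    blurred = cv2.GaussianBlur(gray, (3, 3), 0)
    # Steps 2-5: gradient computation, nonmaximum suppression, and
    # dual-threshold hysteresis are performed inside cv2.Canny.
    edges = cv2.Canny(blurred, low_thresh, high_thresh)
    return edges
```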
B. Loss Function Considering Multiscale and Multilevel Features
The objective of a conditional GAN can be expressed as
\begin{align*}
{L_{cGAN}}(G,D) = & {E_{x\sim{p_{data}}(x)}}[\log D(x,y)] \\
&+ {E_{z\sim{p_z}(z)}}[\log (1 - D(G(y,z)))]\tag{5}
\end{align*}
Research has shown that combining the CGAN loss function with an L2 distance term can enhance the effect of image generation [22]. However, the distribution of the data we use is often multimodal, whereas the L2 loss effectively fits the data with a unimodal Gaussian, which makes the generated image smooth and indistinct. In response to this problem, the L1 distance loss [i.e., (6)] is used to replace the L2 distance loss in this article to reduce the edge blurring of the generated image
\begin{equation*}
{L_{L1}}(G) = {E_{x,y,z}}\left[ {{{\left\| {x - G(y,z)} \right\|}_1}} \right].\tag{6}
\end{equation*}
The L1 distance loss is calculated between the real and fake features of each layer in the discriminator, and the losses of all layers are combined to obtain the final multilevel feature loss.
We suppose that the edge information contained in features at different scales and levels is useful. Therefore, a new objective function is proposed by combining the L1 distance with a multiscale and multilevel feature loss. As shown in Fig. 2, the objective function fully considers the features of different levels, thereby minimizing the difference between the features of real and generated images and enabling the generator to produce an image that is highly similar to the real image. The multiscale and multilevel objective function is shown as follows:
\begin{equation*}
{L_{f\_m}}(G,D) = \frac{1}{n}\sum\nolimits_{i = 1}^n {{E_{x,y,z}}\left[ {{{\left\| {D{f_{(i)}}(x) - D{f_{(i)}}(G(y,z))} \right\|}_1}} \right]} \tag{7}
\end{equation*}
We combine the multiscale and multilevel loss function with the CGAN loss function, and the final ECGAN loss function in this article is
\begin{equation*}
{G^ * } = \arg \mathop {\min }\limits_G \mathop {\max }\limits_D {L_{cGAN}}(G,D) + \lambda {L_{f\_m}}(G,D).\tag{8}
\end{equation*}
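The following PyTorch sketch shows one way to compute the losses in (5)–(8), assuming a discriminator that returns its per-layer feature maps alongside the patch-wise output (a matching discriminator sketch appears in Section II-C); the helper name and the value of λ are illustrative assumptions, not the article's implementation.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
l1 = nn.L1Loss()

def ecgan_losses(D, real_img, fake_img, cond, lam=10.0):
    """Compute the discriminator and generator losses for one batch.

    D(img, cond) is assumed to return (patch_probs, [feat_1, ..., feat_n]),
    i.e., patch-wise real/fake probabilities plus per-layer feature maps.
    """
    # Adversarial (cross-entropy) terms of (5).
    real_prob, real_feats = D(real_img, cond)
    fake_prob, _ = D(fake_img.detach(), cond)
    d_loss = bce(real_prob, torch.ones_like(real_prob)) + \
             bce(fake_prob, torch.zeros_like(fake_prob))

    # Multiscale, multilevel feature loss of (7): average per-layer L1
    # distance between the features of the real and generated images.
    fake_prob_g, fake_feats_g = D(fake_img, cond)
    f_m = torch.stack([l1(ff, fr.detach())
                       for fr, ff in zip(real_feats, fake_feats_g)]).mean()

    # Generator objective of (8): fool the discriminator plus lambda * L_f_m.
    g_loss = bce(fake_prob_g, torch.ones_like(fake_prob_g)) + lam * f_m
    return d_loss, g_loss
```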
C. Network Architecture
The ECGAN model constructed in this article consists of two parts: generator and discriminator networks.
1) Generator Network
In the field of image generation, the input structure of the generator network is required to be roughly aligned with the output structure, and many previous solutions [23]–[26] have used encoder–decoder networks to construct the generator. The U-Net network, with its symmetric structure of skip connections [27], [28], has become the most commonly used generator network. The network combines low-level features with high-level features. As shown in Fig. 3, skip connections are added between layers i and n−i (n is the total number of layers), and each skip connection connects only the channels between the two layers.
This article uses a generator network structure based on U-Net, as shown in Fig. 4, which is composed of seven downsampling convolutional layers and seven upsampling convolutional layers. Downsampling is realized using max pooling layers, and upsampling is realized using upsampling layers with skip connections. In the last layer, the upsampling convolutional layer uses the tanh [29] activation function to output the image, and the other convolutional layers use the leaky ReLU [30] activation function.
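A condensed PyTorch sketch of such a skip-connected generator follows, reduced to three levels instead of the article's seven for brevity; the class name and channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MiniUNetGenerator(nn.Module):
    """Three-level U-Net-style generator (the article uses seven levels)."""

    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        act = nn.LeakyReLU(0.2, inplace=True)
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), act)
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), act)
        self.bott = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, padding=1), act)
        self.pool = nn.MaxPool2d(2)                  # downsampling via max pooling
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        # Decoder convs take upsampled features concatenated with skips.
        self.dec2 = nn.Sequential(
            nn.Conv2d(base * 4 + base * 2, base * 2, 3, padding=1), act)
        self.dec1 = nn.Conv2d(base * 2 + base, out_ch, 3, padding=1)

    def forward(self, x):                            # x: even H and W assumed
        e1 = self.enc1(x)                            # level-1 features (skip)
        e2 = self.enc2(self.pool(e1))                # level-2 features (skip)
        b = self.bott(self.pool(e2))                 # bottleneck
        d2 = self.dec2(torch.cat([self.up(b), e2], dim=1))    # skip connection
        out = self.dec1(torch.cat([self.up(d2), e1], dim=1))  # skip connection
        return torch.tanh(out)                       # tanh output, as above
```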
2) Discriminator Network
We focus on the generation of remote sensing images with high spatial resolution, which have complex texture and spatial details. To highlight the high-frequency information of the generated image, this article draws on the idea of PatchGAN [18]: the image in the discriminator is divided into multiple patches, the authenticity of each patch is determined, and the authenticity responses of all patches are averaged as the final output of the discriminator D. The process of the patch-based discriminator is shown in Fig. 5(a). In this way, the local high-frequency information of the image can be considered; this high-frequency information corresponds to the texture features of the remote sensing image. The discriminator (D) network is composed of five convolutional layers. As shown in Fig. 5(b), the first four convolutional layers are each followed by batch normalization [31] and a leaky ReLU function. The last convolutional layer is connected to a sigmoid function to generate the distribution of “0” and “1.”
(a) Process of the patch-based discriminator. The input, which includes the fake image, real image, and condition variables, is divided into multiple N × N patches (N = 70 in this article), and each patch is input into the discriminator to distinguish the difference. Finally, these per-patch responses are averaged to judge the authenticity of the generated image. (b) Network of the discriminator.
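A PyTorch sketch of a five-layer patch-based discriminator in this spirit follows; it also returns the per-layer features assumed by the loss sketch in Section II-B. Channel counts, kernel sizes, and the condition channel count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Five-layer patch-based discriminator: the first four convolutions are
    followed by batch normalization and leaky ReLU, the last by a sigmoid."""

    def __init__(self, img_ch=3, cond_ch=5, base=64):
        super().__init__()
        def block(ci, co):
            return nn.Sequential(
                nn.Conv2d(ci, co, 4, stride=2, padding=1),
                nn.BatchNorm2d(co),
                nn.LeakyReLU(0.2, inplace=True))
        self.blocks = nn.ModuleList([
            block(img_ch + cond_ch, base),
            block(base, base * 2),
            block(base * 2, base * 4),
            block(base * 4, base * 8)])
        self.final = nn.Sequential(
            nn.Conv2d(base * 8, 1, 4, padding=1), nn.Sigmoid())

    def forward(self, img, cond):
        x = torch.cat([img, cond], dim=1)   # concatenate image and conditions
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)                 # per-layer features for loss (7)
        return self.final(x), feats         # one probability per patch
```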
3) Flowchart
The flowchart of the training and testing processes of the proposed method for generating a remote sensing image is illustrated in Fig. 6. Above the broken line in Fig. 6 is the training process of ECGAN. First, random noise and the condition variables (label, boundary features, and edge features) are input to the generator to obtain a fake image with the same size as the real image. Then, the real image is concatenated with the fake image and the condition variables to form the input of the discriminator, which identifies the input as true or false. The adversarial loss considering multiscale and multilevel features is acquired by calculating the difference between the real and fake images in each layer of the discriminator. Meanwhile, the multiscale adversarial and cross-entropy losses are back-propagated using gradient descent, and the network parameters are continually updated until the training process is completed. Below the broken line in Fig. 6 is the testing process of ECGAN. In comparison with the training network, the testing network is relatively simple because the testing process neither updates the network parameters nor requires the discriminator to perform true-or-false discrimination. Therefore, the testing process only needs to input the label, boundary feature map, and edge feature map of the test image into the trained generator to generate the image we need.
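Putting the sketches above together, one training iteration corresponding to the upper half of Fig. 6 might look as follows; the optimizer settings, the data loader, and the form of the noise channel are illustrative assumptions.

```python
import torch

# Hypothetical instances of the generator/discriminator sketched earlier;
# cond is assumed to stack label, boundary, and edge maps into 5 channels.
G = MiniUNetGenerator(in_ch=6, out_ch=3)      # 5 condition + 1 noise channel
D = PatchDiscriminator(img_ch=3, cond_ch=5)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

for real_img, cond in loader:                 # loader is assumed to exist
    noise = torch.randn_like(cond[:, :1])     # one random-noise channel
    fake_img = G(torch.cat([cond, noise], dim=1))

    # Discriminator step: distinguish real from fake (cross-entropy loss).
    d_loss, _ = ecgan_losses(D, real_img, fake_img, cond)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: adversarial loss plus multilevel feature loss.
    _, g_loss = ecgan_losses(D, real_img, fake_img, cond)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```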
Flowchart of the generation of a high-resolution remote sensing image based on ECGAN.
Results and Discussion
A. Experimental Data
Experiments are performed on the Potsdam and Vaihingen 2-D dataset images of ISPRS to assess the performance of the proposed ECGAN method. The Potsdam and Vaihingen datasets, displayed in Figs. 7 and 8, respectively, were both acquired in Germany. While Potsdam shows a typical historic city with large building blocks, narrow streets, and a dense settlement structure, Vaihingen is a relatively small village with many detached buildings and small multistory buildings. Both datasets contain the same six most common land cover categories, labeled in different colors: impervious surfaces (white); buildings (blue); low vegetation (cyan); trees (green); cars (yellow); and backgrounds (red). Potsdam has a spatial resolution of 0.05 m and contains four multispectral bands: red (R), green (G), blue (B), and near-infrared (NIR). Vaihingen has a spatial resolution of 0.09 m and contains three multispectral bands: near-infrared (NIR), red (R), and green (G).
Experiment data 1. (Left) Potsdam dataset containing 38 patches (of the same size, 6000 × 6000). (Right) Data provided by each patch. (a) True image. (b) DSM. (c) Ground truth.
Experiment data 2. (Left) Vaihingen dataset containing 33 patches (of different sizes). (Right) Data provided by each patch. (a) True image. (b) DSM. (c) Ground truth.
B. Evaluation Method and Indicator
1) Evaluation Method
For a long time, evaluating the quality of images produced by generative models has been challenging. Since the proposal of GAN, many methods have been proposed to evaluate the quality of generated images. These methods fall mainly into two categories. One evaluates image quality through statistics of the information entropy and feature divergence distance between generated and real images, such as the Fréchet inception distance [32], kernel MMD [33], and Wasserstein distance [34]. This type of method, based on statistical feature extraction, can evaluate image features according to their presence or absence but not the relative spatial position of the features. The other evaluates the generated data through the classification scores of specific functions, such as the inception score [35]. This type of method [36]–[39] evaluates on the basis of a specific pretrained model, without considering the effect of real data and lacking an authenticity evaluation of the generated images.
In this article, the DeepLab v3 network architecture [40], [41] is used to evaluate the scores of generated images, and we propose two evaluation methods: GAN-test and GAN-train. The GAN-test method trains the DeepLab v3 network on real images and performs classification tests on the images generated using ECGAN. The GAN-train method uses the images generated by ECGAN as training samples for the DeepLab v3 network and performs classification tests on the real images.
2) Evaluation Indicator
In order to quantitatively compare and estimate the capabilities of the proposed models, the overall accuracy (OA), Kappa coefficient [42], and mean intersection over union (MIoU) are used as performance measurements.
The OA refers to the ratio of the number of correctly classified samples to the total number of samples. The Kappa coefficient is an index measuring the degree of agreement between two images; the closer the coefficient is to 1, the better the classification effect. The MIoU is a typical measure for semantic segmentation, evaluated by calculating the ratio of the intersection to the union of the ground truth and the predicted values.
They can be calculated by (9)–(11), respectively:
\begin{align*}
{\rm{OA}} &= \frac{{\sum\nolimits_{i = 0}^n {{x_{ii}}} }}{N} \times 100\% \tag{9}\\
K &= \frac{{N\sum\nolimits_{i = 0}^n {{x_{ii}}} - \sum\nolimits_{i = 0}^n {({x_{i + }}{x_{ + i}})} }}{{{N^2} - \sum\nolimits_{i = 0}^n {({x_{i + }}{x_{ + i}})} }}\tag{10}\\
{\rm{MIoU}} &= \frac{1}{{n + 1}}\sum\nolimits_{i = 0}^n {\frac{{{x_{ii}}}}{{\sum\nolimits_{j = 0}^n {{x_{ij}}} + \sum\nolimits_{j = 0}^n {{x_{ji}}} - {x_{ii}}}}}.\tag{11}
\end{align*}
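As a concrete companion to (9)–(11), the following Python sketch computes the three indicators from a confusion matrix; the function name and the matrix convention are illustrative assumptions.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Compute OA, Kappa, and MIoU from an (n+1)x(n+1) confusion matrix cm,
    where cm[i, j] counts pixels of true class i predicted as class j."""
    N = cm.sum()
    diag = np.diag(cm)
    oa = diag.sum() / N                                   # eq. (9)
    row, col = cm.sum(axis=1), cm.sum(axis=0)             # x_{i+}, x_{+i}
    kappa = (N * diag.sum() - (row * col).sum()) / \
            (N ** 2 - (row * col).sum())                  # eq. (10)
    iou = diag / (row + col - diag)                       # per-class IoU
    miou = iou.mean()                                     # eq. (11)
    return oa, kappa, miou
```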
C. Behavior and Analysis of Condition Variables Based on Edge Feature
1) Visual Analysis
Considering the limitation of GPU memory, for the Potsdam dataset, each patch was cut into 36 images of size 1024 × 1024, of which 400 images are used for training and another 400 for testing in our experiment. For the Vaihingen dataset, each patch was cut into 30 images of size 512 × 512, of which 450 images are used for training and another 450 for testing. In this article, the Adam optimizer with an adaptive learning rate is used to replace the traditional stochastic gradient descent optimization algorithm, and the total number of iterations is set to 500.
We compare the generated results under three different condition variables: using label data only; using label data and boundary features; and using label data, boundary features, and edge features. Figs. 9 and 10 show the generation results of the three condition variables on the Potsdam and Vaihingen images, respectively. As depicted in Figs. 9 and 10, the images generated using only label data as condition variables have a large number of “false” edges, resulting in fuzzy boundaries and insufficient texture; the images generated by adding a boundary feature map eliminate the “false” edge information and enhance the boundaries, but the detailed texture information is still blurred; and the images generated by adding boundary and edge feature maps simultaneously to the condition variables closely restore the true images. In the last case, the texture information and details of the images are rich and realistic, and the edges of the features are clearly visible. For example, small targets attached to a roof, a zebra crossing on the road, and vegetation texture details are clear and realistic, almost perfectly reproducing the real image features.
Generation Results of Potsdam images with three different condition variables. (a) Label data. (b) Real remote sensing image. (c) Image generated using only the label condition variable. (d) Image generated using label and boundary feature condition variables. (e) Image generated using label, boundary feature, and edge feature condition variables.
Generation Results of Vaihingen images with three different condition variables. (a) Label data. (b) Real remote sensing image. (c) Image generated using only the label condition variable. (d) Image generated using label and boundary feature condition variables. (e) Image generated using label, boundary feature, and edge feature condition variables.
2) Quality Evaluation
We apply the two evaluation methods of “GAN-test” and “GAN-train” to analyze the classification accuracy of the real images and the images generated using the different condition variables in Section III-C1 and then to conduct quality evaluation.
In the GAN-test experiment, 5000 real images are used for training, and 100 images are selected separately from the data generated under each of the three condition variables and from the real data as testing data. In the GAN-train experiment, 5000 images are selected from the data generated under each of the three condition variables and from the real data, respectively, for separate training, and 100 real images are used for testing.
Figs. 11 and 12 qualitatively illustrate the GAN-test and GAN-train classification results of the images generated under the three condition variables on the Potsdam dataset. As depicted in Fig. 11, the GAN-test classification results of the images generated using only label condition variables are poor: the misclassification phenomenon is serious, and the number of noise categories is excessive. After the boundary feature condition variable is added, the result is significantly improved, but many misclassifications still remain. The classification results of the images generated by adding boundary and edge feature condition variables simultaneously are closest to the ground truth map; the misclassification phenomenon is reduced, and the noise information after classification is significantly reduced. As depicted in Fig. 12, with the addition of boundary and edge feature condition variables, the GAN-train classification effect gradually improves, and the misclassification phenomenon gradually decreases. For instance, the category of trees is difficult to recover in the classification results when only label condition variables are used, and most trees are mistakenly classified as buildings and low vegetation. After boundary features are added, the classification effect of trees is greatly improved, and the misclassification of buildings and low vegetation is significantly reduced; however, the classification of impervious surfaces and cars remains poor. After edge features are added, except for the background noise category, all categories present enhanced classification results, which not only reduce the misclassification of buildings, low vegetation, and trees but also accurately identify impervious surfaces and cars. For the Vaihingen dataset, the proposed method with boundary and edge feature condition variables also achieves the best GAN-test and GAN-train results, as shown in Figs. 13 and 14.
GAN-test classification results of Potsdam images with different condition variables. (a) Remote sensing image with high spatial resolution. (b) Ground truth. (c) GAN-test results of the generated images with label condition variables. (d) GAN-test results of the generated images with label and boundary feature condition variables. (e) GAN-test results of the generated images with label, boundary feature, and edge feature condition variables.
GAN-train classification results of Potsdam images with different condition variables. (a) Remote sensing image with high spatial resolution. (b) Ground truth. (c) GAN-train results of the generated images with label condition variables. (d) GAN-train results of the generated images with label and boundary feature condition variables. (e) GAN-train results of the generated images with label, boundary feature, and edge feature condition variables.
GAN-test classification results of Vaihingen images with different condition variables. (a) Remote sensing image with high spatial resolution. (b) Ground truth. (c) GAN-test results of the generated images with label condition variables. (d) GAN-test results of the generated images with label and boundary feature condition variables. (e) GAN-test results of the generated images with label, boundary feature, and edge feature condition variables.
GAN-train classification results of Vaihingen images with different condition variables. (a) Remote sensing image with high spatial resolution. (b) Ground truth. (c) GAN-train results of the generated images with label condition variables. (d) GAN-train results of the generated images with label and boundary feature condition variables. (e) GAN-train results of the generated images with label, boundary feature, and edge feature condition variables.
Tables I–IV list the GAN-test and GAN-train scores (OA, Kappa, and MIoU) of the generated images for the Potsdam and Vaihingen datasets, where the best scores are marked in bold. For the Potsdam data, as given in Table I, the GAN-test OA, Kappa, and MIoU of the images generated by adding boundary and edge feature condition variables simultaneously are 73.4%, 0.655, and 0.453, respectively, the highest of the three condition variables and 16.2%, 0.192, and 0.132 higher than those of the images generated using only the label condition variable. These scores are the closest to those of the real images. In particular, the classification accuracy of impervious surfaces, buildings, and trees is greatly improved compared with that of the images generated using the two other condition variables. As given in Table II, the GAN-train OA, Kappa, and MIoU of the images generated by adding boundary and edge feature condition variables simultaneously are 75.0%, 0.672, and 0.537, respectively, also the best of the three condition variables. In particular, the MIoU even exceeds the classification result of the real images by 0.016. For the Vaihingen data, the proposed method with the edge condition also achieves the highest scores. Table III shows that its GAN-test scores are 6.5%, 0.092, and 0.064 higher than those of the label-only condition. As given in Table IV, the GAN-train OA, Kappa, and MIoU of the proposed method with the edge condition are 90.4%, 0.867, and 0.641, respectively, which not only are the best of the three condition variables but also exceed the scores of the real images (83.8%, 0.786, and 0.586). In particular, the accuracy of the impervious surface and tree categories of the proposed method is greatly improved, 3.2% and 18.4% higher than that of the real images, respectively.
D. Behavior and Analysis of Different Loss Functions
1) Visual Analysis
The images generated using two loss functions are evaluated through GAN-test by using the same training samples of the Potsdam dataset. The loss functions are the CGAN loss function with L1 distance (L1+CGAN) and the CGAN loss function that considers multiscale and multilevel features (f_m+CGAN). The generated images are shown in Fig. 15. They restore the real images well in terms of brightness, color, and texture characteristics; in particular, the texture of trees is clear and intricate.
Generated results of different objective functions. (a) Ground truth. (b) True remote sensing image. (c) Generated images with the objective function combining L1 and CGAN (L1+CGAN). (d) Generated images with the CGAN objective function considering multiscale and multilevel features (f_m+CGAN). The red oval highlights that (d) has richer texture detail features than (c), especially for vegetation, building surfaces, and impervious surfaces.
2) Quality Evaluation
Table V gives the OA, Kappa, and MIoU scores of the images generated using the two loss functions. The images generated using the CGAN loss function that considers multiscale and multilevel features show a significant improvement over the images generated using the combination of L1 and CGAN (except for the slightly lower accuracy of the car category). Moreover, the problem that the generative model is insensitive to background noise information is alleviated, and a good classification capability is shown for the background noise category, which is improved by 23.2%. In terms of overall scoring, the OA increases by 1.6%, the Kappa coefficient by 0.025, and the MIoU by 0.063.
E. Analysis of Sample Augmentation Experiment
The traditional sample augmentation methods mainly include image rotation (multiangle), image scaling, and randomly adding noise. These methods mainly expand the data by changing the spatial scale and imaging angle of the original image, which improves the classification accuracy to a certain extent. In this article, the traditional method of image augmentation is divided into three steps (a code sketch follows the list):
1) the images are randomly rotated by one of three angles (90°, 180°, 270°);
2) random Gaussian noise is added to the images; and
3) random Gaussian noise is added to the image first, and then the image is randomly rotated by one of the three angles.
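A minimal Python sketch of these three steps follows; the noise standard deviation and the random-choice logic are illustrative assumptions.

```python
import numpy as np

def add_gaussian_noise(image, rng, sigma=10.0):
    """Add zero-mean Gaussian noise and clip to the valid pixel range."""
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def traditional_augment(image, rng=None):
    """Apply one of the three traditional augmentation steps at random."""
    if rng is None:
        rng = np.random.default_rng()
    step = rng.integers(3)
    k = int(rng.integers(1, 4))              # 1, 2, or 3 quarter turns
    if step == 0:                            # step 1): rotation only
        return np.rot90(image, k=k).copy()
    if step == 1:                            # step 2): noise only
        return add_gaussian_noise(image, rng)
    # step 3): noise first, then rotation
    return np.rot90(add_gaussian_noise(image, rng), k=k).copy()
```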
We combine ECGAN image generation with the traditional augmentation method to obtain a new sample augmentation method. Based on the DeepLab v3 segmentation network, the proposed method is analyzed and compared with the traditional augmentation methods. Two experiments are performed separately on the two datasets in this section. For the Potsdam dataset, first, 4800 images of size 1024 × 1024 augmented using the traditional methods are selected from 15 patch images as training samples, and the other 23 patch images are used for testing; second, from the same 15 patch images, 2400 images of size 1024 × 1024 augmented using the traditional methods and 2400 images of size 1024 × 1024 generated using ECGAN are selected as training samples, and the other 23 patch images are used for testing. Taking a patch image of size 6000 × 6000 as an example, as shown in Fig. 16, the proposed augmentation method achieves an excellent segmentation result. In the classification results of the proposed method, the background noise category (red) is obviously reduced, and the impervious surface class (white) and building class (blue) are also distinguished more effectively. For the Vaihingen dataset, first, 8000 images of size 512 × 512 augmented using the traditional methods are selected from 18 patch images as training samples, and the other 15 patch images are used for testing; second, from the same 18 patch images, 4000 images of size 512 × 512 augmented using the traditional methods and 4000 images of size 512 × 512 generated using ECGAN are selected as training samples, and the other 15 patch images are used for testing. Taking a patch image of size 2659 × 2575 as an example, as shown in Fig. 17, the proposed augmentation method also achieves excellent segmentation results, obtaining the best segmentation for the white category (impervious surface) and red category (background and noise).
Classification results of Potsdam image with different means of augmentation. (a) Truth remote sensing image. (b) Ground truth. (c) Classification results using traditional sample augmentation methods (Tra_ext). (d) Classification results of the combination of traditional sample and ECGAN ((Tra+ECGAN)_ext) augmentation methods.
Classification results of Vaihingen image with different means of augmentation. (a) Truth remote sensing image. (b) Ground truth. (c) Classification results using traditional sample augmentation methods (Tra_ext). (d) Classification results of the combination of traditional sample and ECGAN ((Tra+ECGAN)_ext) augmentation methods.
Tables VI and VII list the per-category accuracy, OA, Kappa, and MIoU of the two augmentation methods on the Potsdam and Vaihingen datasets, respectively. For the Potsdam dataset, the OA, Kappa, and MIoU of the proposed augmentation method are 4.1%, 0.054, and 0.058 higher than those of the traditional augmentation methods. In particular, the accuracy of the “impervious surface” and “low vegetation” categories of the proposed method is significantly improved, by 8.4% and 7.7%, respectively. For the Vaihingen dataset, the scores of the proposed augmentation method also improve significantly, with the OA, Kappa, and MIoU being 3.4%, 0.044, and 0.045 higher than those of the traditional methods. The accuracy of the “impervious surface,” “tree,” and “car” categories is significantly improved.
Conclusion
This article presents an end-to-end ECGAN for generating remote sensing images with high spatial resolution. In the proposed model, interclass boundary and intraclass edge feature factors are added to condition variables to highlight the texture features of the generated image. In addition, a loss function combining multilevel feature and cross-entropy loss is designed to minimize the difference between the features of real and generated images. The effectiveness of the proposed method for generating images is verified via experiments by using Potsdam and Vaihingen 2-D dataset images of ISPRS. The conclusions are as follows.
1) The discriminator of ECGAN proposed in this article can fully consider multiscale and multilevel hierarchical features.
2) ECGAN can generate remote sensing images with high spatial resolution that are highly similar to real images, which have clear space and texture, close colors, and accurate boundaries.
3) ECGAN is an effective method for the sample augmentation of remote sensing image data with high spatial resolution, which maintains the accuracy of supervised classification under the condition of limited samples and alleviates the unsatisfactory classification effect that arises when supervised classification samples are insufficient.