Introduction
Building footprint generation is of great importance to urban planning and monitoring, land use analysis, and disaster management. High-resolution satellite imagery, which provides abundant detailed ground information, has become a major data source for building footprint generation. Because buildings vary widely and are often complex, generating building footprints manually requires significant time and cost (see Fig. 1). As a result, automatic building footprint generation not only minimizes the human role in producing large-scale maps but also greatly reduces time and cost.
Previous studies on building footprint generation can be grouped into four categories: 1) edge-based; 2) region-based; 3) index-based; and 4) classification-based methods. In edge-based methods, the regular shapes and line segments of buildings are used as the most distinguishable features for recognition [1]. Region-based methods identify building regions through image segmentation [2]. Index-based methods use a number of building feature indices to describe the characteristics of buildings, which indicate the possible presence of buildings [3]. Classification-based methods, which combine spectral information with spatial features, are among the most widely used approaches since they can provide more stable and generalized results than the other three categories.
Traditional classification-based methods consist of two steps: feature extraction and classification. For the classification step, the support vector machine (SVM) and random forest (RF) are two popular approaches in the remote sensing (RS) domain. However, an SVM consumes too many resources when used for big data applications and large-area classification problems, and multiple features must be engineered before the RF classifier can be used efficiently. Recent advances in traditional classification methods, e.g., [4] and [5], show promising results.
Over the past few years, the most popular and efficient classification approach has been deep learning (DL) [6], which has the computational capability for big data. DL methods combine feature extraction and classification and use multiple processing layers to learn good feature representations automatically from the input data. Therefore, DL usually generalizes better than other classification-based methods. In terms of particular DL architectures, several impressive convolutional neural network (CNN) structures, such as ResNet [7] and U-Net [8], have already been widely explored for RS tasks. However, since the goal of CNNs is to learn a parametric translation function from a data set of input-output examples, considerable manual effort is needed to design effective losses between predicted and ground truth pixels. To address this problem, generative adversarial networks [9] were recently proposed, which learn a mapping from input to output images and try to classify whether the output image is real or fake.
In this regard, one of the motivations of this letter was to explore the potential of generative adversarial networks (GANs) in building footprint generation by comparing their performance with other CNN structures. However, GANs also have their own limitations: 1) there is no control over the modes of data being generated and 2) the training is delicate and unstable. Therefore, several studies have proposed alternatives to traditional GANs, such as conditional GANs (CGANs) [10] and Wasserstein GANs (WGANs) [11]. To direct the data generation process and improve the stability of training, we propose combining a CGAN, a WGAN, and a gradient penalty term for building footprint generation, a combination that is exploited for the first time in the remote sensing community.
The proposed building footprint generation method is described in Section II. In Section III, the details of the data sets and the experimental results are presented and analyzed. The final conclusions follow in Section IV.
Methodology
A. Review of GANs
GANs were first proposed in [9] and consist of two neural networks: a generator G and a discriminator D. Their objective function is \begin{equation*} \mathcal {L}_{\text {GAN}}= E_{p_{x}}[\log D(x)] + E_{p_{z}}[\log (1- D(G(z)))]\tag{1}\end{equation*}
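As a minimal numerical sketch of (1), assuming the discriminator outputs are already available as probabilities in (0, 1), the two expectations can be estimated by averaging over a batch. The function name `gan_value` and the sample values are illustrative only and do not come from the experiments in this letter.

```python
import math

def gan_value(d_real, d_fake):
    """Batch estimate of Eq. (1).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot separate real from generated samples
# outputs 0.5 everywhere, which gives log(0.5) + log(0.5).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes the value toward 0 from below.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator tries to increase this value, while the generator tries to decrease it by making D(G(z)) large.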
To address the lack of control over the modes of data being generated in GANs, Mirza et al. [10] extended GANs to a conditional model, where both the generator and the discriminator are conditioned on certain extra information y, with the objective function \begin{equation*} \mathcal {L}_{\text {CGAN}}= E_{p_{x}}[\log D(x|y)] + E_{p_{z}}[\log (1- D(G(z|y)))].\tag{2}\end{equation*}
To improve the stability of GAN training and avoid problems such as mode collapse, Arjovsky et al. [11] proposed WGANs, which use an alternative cost function derived from an approximation of the Wasserstein distance. Compared with the original GANs, WGANs are more likely to provide gradients that are useful for updating the generator.
B. Proposed Method
In this letter, we want to exploit the advantages of both CGANs and WGANs. Therefore, we propose conditional Wasserstein generative adversarial networks (CWGANs), which impose control on the modes of data being generated and also achieve more stable training. The objective function of CWGANs is given by \begin{equation*} \mathcal {L}_{\text {CWGAN}} = E_{p_{x}}[D(x|y)] - E_{p_{z}}[D(G(z|y))]\tag{3}\end{equation*}
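The following sketch estimates (3) from batches of discriminator scores. It assumes the scores D(x|y) and D(G(z|y)) have already been computed and are unbounded real numbers (no logarithm or sigmoid, in contrast to (1) and (2)). The function names and the sample scores are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def cwgan_objective(scores_real, scores_fake):
    """Batch estimate of Eq. (3): E[D(x|y)] - E[D(G(z|y))]."""
    return mean(scores_real) - mean(scores_fake)

def discriminator_loss(scores_real, scores_fake):
    # The discriminator maximizes Eq. (3), so a minimizer is given its negative.
    return -cwgan_objective(scores_real, scores_fake)

def generator_adversarial_loss(scores_fake):
    # Only the second term of Eq. (3) depends on the generator;
    # minimizing Eq. (3) over G means minimizing -E[D(G(z|y))].
    return -mean(scores_fake)

# Illustrative scores for two real and two generated footprints.
scores_real = [1.5, 0.5]
scores_fake = [-1.0, 0.0]

objective = cwgan_objective(scores_real, scores_fake)   # 1.0 - (-0.5)
d_loss = discriminator_loss(scores_real, scores_fake)
g_loss = generator_adversarial_loss(scores_fake)
```

Writing both updates as losses to minimize is a common implementation convention and is an assumption here, not a detail stated in the letter.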
However, due to the use of weight clipping in WGANs, CWGANs may still generate low-quality samples or fail to converge in some settings. Therefore, we used an alternative to weight clipping: the addition of a gradient penalty term [12], which penalizes the norm of the discriminator gradient with respect to its input and can be written as \begin{equation*} \mathcal {L}_{\text {GP}} = \lambda _{1} E_{p_{x,z}}[(||\nabla D (\alpha x+(1-\alpha) G(z|y))||_{2}-1)^{2}]\tag{4}\end{equation*}
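To make (4) concrete, the sketch below evaluates the penalty for a toy discriminator. It makes several assumptions that are not from the letter: samples are flat lists of floats, the condition y is held fixed inside the `critic` callable, the gradient is approximated by central finite differences as a stand-in for automatic differentiation, and the default `lam1=10.0` is only a placeholder value.

```python
import math
import random

def numerical_gradient(f, point, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def gradient_penalty(critic, real_batch, fake_batch, lam1=10.0, rng=None):
    """Batch estimate of Eq. (4) on random interpolates of real and generated samples."""
    rng = rng or random.Random()
    total = 0.0
    for x, g in zip(real_batch, fake_batch):
        alpha = rng.random()  # one interpolation weight per sample pair
        x_hat = [alpha * xi + (1 - alpha) * gi for xi, gi in zip(x, g)]
        grad = numerical_gradient(critic, x_hat)
        norm = math.sqrt(sum(c * c for c in grad))
        total += (norm - 1.0) ** 2
    return lam1 * total / len(real_batch)

def make_linear_critic(w):
    """Hypothetical critic D(x) = w . x, whose gradient is w everywhere."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

real = [[1.0, 0.0], [0.5, 0.5]]
fake = [[0.2, 0.9], [0.0, 1.0]]
rng = random.Random(0)

# Gradient norm 1: the penalty vanishes (up to numerical error).
gp_unit = gradient_penalty(make_linear_critic([0.6, 0.8]), real, fake, rng=rng)
# Gradient norm 5: the penalty is lam1 * (5 - 1)^2.
gp_steep = gradient_penalty(make_linear_critic([3.0, 4.0]), real, fake, rng=rng)
```

The two linear critics show the intended behavior of the term: it is zero when the gradient norm equals 1 and grows quadratically as the norm moves away from 1.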
To keep the generator output close to the ground truth and to decrease blurring, a traditional L1 loss is added \begin{equation*} \mathcal {L}_{L_{1}} = \lambda _{2} E_{p_{x,z}}[||x-G(z|y)||_{1}]\tag{5}\end{equation*}
\begin{equation*} \mathcal {L} = \arg \min \limits _{G} \max \limits _{D} \mathcal {L}_{\text {CWGAN}} + \mathcal {L}_{\text {GP}} + \mathcal {L}_{L_{1}}.\tag{6}\end{equation*}
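As a small worked example of (5) and of how the three terms in (6) are summed, the sketch below evaluates them for given inputs. It only computes the value of the objective for fixed networks and does not perform the min-max optimization. The names, the flat-list image representation, and the weight `lam2=2.0` are illustrative assumptions.

```python
def l1_term(real_batch, fake_batch, lam2):
    """Batch estimate of Eq. (5): lam2 * E[ ||x - G(z|y)||_1 ].

    The L1 norm is taken as the sum of absolute pixel differences,
    following Eq. (5) literally; averaging over pixels instead would
    only rescale lam2.
    """
    total = 0.0
    for x, g in zip(real_batch, fake_batch):
        total += sum(abs(xi - gi) for xi, gi in zip(x, g))
    return lam2 * total / len(real_batch)

def objective_value(cwgan_term, gp_term, l1):
    """Value of the sum inside Eq. (6) for already computed terms."""
    return cwgan_term + gp_term + l1

# One ground-truth mask and one generated mask, flattened to three pixels.
truth = [[1.0, 0.0, 1.0]]
generated = [[0.8, 0.1, 0.5]]

l1 = l1_term(truth, generated, lam2=2.0)   # 2.0 * (0.2 + 0.1 + 0.5)
total = objective_value(cwgan_term=1.5, gp_term=0.25, l1=l1)
```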
C. Network Architectures
The network architecture used in this letter to generate building footprints from satellite imagery is shown in Fig. 2.
We used the U-Net as the generator architecture. It is an encoder–decoder network with skip connections to concatenate all channels at layer
As for the discriminator architecture, the PatchGAN proposed in [13] is exploited to model a high-frequency structure. This network tries to classify whether each patch in an image is real or fake. With the discriminator running convolutionally across the image, the ultimate output of
Experiments
A. Description of Data sets
In this letter, we chose two study areas in Germany: Munich and Berlin. We used PlanetScope satellite imagery with three bands (R, G, and B) and a spatial resolution of 3 m to test our proposed method. The corresponding building footprints were downloaded from OpenStreetMap (OSM). We processed the imagery using a
B. Experimental Setup
The number of filters in the first convolution layer was 64 for both the generator and the discriminator. The downsampling factor was 2 in both the discriminator and the encoder of the generator. In the decoder of the generator, deconvolutions were performed with an upsampling factor of 2. All convolutions and deconvolutions had a kernel size of
C. Results and Analysis
In this letter, we evaluated inference performance quantitatively using three metrics: overall accuracy (OA), F1 score, and intersection over union (IoU). Specifically, the F1 and IoU metrics are defined as follows:\begin{align*} \textrm {F1}=&\frac {2\times \textrm {precision}\times \textrm {recall}}{\textrm {precision}+\textrm {recall}}\tag{7}\\ \textrm {IoU}=&\frac {\textrm {TP}}{\textrm {TP}+\textrm {FP}+\textrm {FN}}\tag{8}\end{align*}
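The metrics in (7) and (8) can be computed from pixel-wise confusion counts, as in the sketch below. It assumes binary masks flattened to non-empty lists (1 for building, 0 for background) and the standard definitions precision = TP/(TP+FP), recall = TP/(TP+FN), and OA = (TP+TN)/(all pixels). The function names and the toy masks are illustrative.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for two binary masks of equal length."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def building_metrics(pred, truth):
    """Return OA, F1 (Eq. 7), and IoU (Eq. 8) for the building class."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    oa = (tp + tn) / (tp + fp + fn + tn)
    return {"OA": oa, "F1": f1, "IoU": iou}

# Toy example with six pixels: TP = 2, FP = 1, FN = 1, TN = 2.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]
scores = building_metrics(pred, truth)
```

In this toy case precision and recall are both 2/3, so F1 is 2/3, IoU is 2/4, and OA is 4/6. IoU is never larger than F1 because it counts each error against the overlap without the factor of 2 that F1 gives to true positives.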
Fig. 3 shows the visual results of one patch with different depths compared to the ground truth. As can be seen from Fig. 3, a large number of roofs are omitted by the network with
Comparison of results generated by U-Net structure with different depths. (a) Depth (
Second, we have chosen different coefficients (
Visualized comparison of different networks and coefficients
Training and inference time of different methods. (a) Training time (in seconds). (b) Inference time (in milliseconds).
When the coefficient of
Finally, we applied the selected coefficient of
Fig. 6 presents a section of the entire Munich test area. The building footprint generated by the proposed method is shown in red, overlaid on an optical image.
Section of the entire Munich test area. Red: building footprint generated by the proposed method, overlaid on an optical image.
Conclusion
GANs, which have recently been proposed, provide a way to learn deep representations without extensively annotated training data. This research aimed to explore the potential of GANs for building footprint generation and to improve their accuracy by modifying the objective function. Specifically, we proposed two novel network architectures (CWGAN and CWGAN-GP) that integrate CGAN and WGAN, as well as a gradient penalty term, which can direct the data generation process and improve the stability of training. The proposed method consists of two networks: 1) the U-Net architecture in the generator and 2) the PatchGAN in the discriminator. PlanetScope satellite imagery of Munich and Berlin was used to evaluate the capability of the proposed approaches. The experimental results confirm that the proposed methods can significantly improve the quality of building footprint generation compared to existing networks (e.g., CGAN, U-Net, and ResNet-DUC). In addition, it should be noted that the stability of the proposed CWGAN-GP method removes the need for nearly all hyperparameter tuning.
ACKNOWLEDGMENT
The authors would like to thank Planet for providing the data sets.