Introduction
With the significant progress of various earth observation missions, a large amount of high-resolution data has been acquired, providing rich and diverse information about the earth. Automated high-resolution earth observation image interpretation has a wide range of applications, such as flight management, urban planning, and water-body monitoring [1]–[7]. However, the automatic interpretation of high-resolution remote sensing images is still challenging due to the complex backgrounds and diverse objects in remote sensing images [8].
The 2020 Gaofen Challenge covers the two main tasks in automated interpretation: object detection and recognition, and semantic segmentation. The main purpose of object detection and recognition is to obtain the categories and locations of objects in an image. In the field of remote sensing image interpretation, object detection and recognition is important for many rigid objects, such as airplanes, ships, and bridges. Common object detection and recognition algorithms fall into two categories: anchor-based algorithms and anchor-free algorithms. Anchor-based algorithms include one-stage methods and two-stage methods. The two-stage methods first generate proposals using a region proposal network (RPN) and then classify and locate objects from the candidate region proposals [9]–[12]. Compared with two-stage object detection algorithms, one-stage algorithms do not need to generate region proposals; instead, they predict classification and localization directly. One-stage methods are more efficient than two-stage methods due to their simple structures [13]–[17]. Recently, anchor-free object detection methods have been proposed in several works. For example, CornerNet [18] and CenterNet [19] regard an object as a pair of keypoints. Anchor-free methods have few hyperparameters and can also achieve relatively good performance [18], [20], [21].
Compared to object detection, semantic segmentation needs to obtain the category of each pixel in an image. Common semantic segmentation methods consist of encoder–decoder models and dilation-based models. For instance, the fully convolutional network [22], UNet [23], and SegNet [24] use the encoder–decoder structure to exploit the high-level feature maps. DeepLab [25] and ENet [26] adopt atrous convolutions to enlarge the receptive field of filters and aggregate multiscale context information. The aforementioned methods have made great progress in the field of image processing. However, automated high-resolution earth observation image interpretation is challenging due to the inherent characteristics of remote sensing scenes [27]–[29]. More specifically, remote sensing images typically cover large and often complex scenes with diverse backgrounds and a wide variety of objects exhibiting large differences in size. Some object categories even show high intracategory and low intercategory variations, making the interpretation even more challenging [28], [30].
To promote the development of this domain, the 2020 Gaofen Challenge on automated high-resolution earth observation image interpretation serves to bring together researchers from both computer vision and earth observation domains to discuss cutting-edge technologies on image interpretation and their applications. It is an international competition, which is hosted by the China High-Resolution Earth Observation Conference Committee and the Aerospace Information Research Institute, Chinese Academy of Sciences and technically cosponsored by the IEEE Geoscience and Remote Sensing Society and the International Society for Photogrammetry and Remote Sensing.
We set six tracks in the 2020 Gaofen Challenge to meet different application requirements. Tracks 1, 2, and 3 aim to promote the research of object detection and recognition in optical images and synthetic aperture radar (SAR) images. Specifically, fine-grained airplane detection, bridge detection, and ship detection tasks are set in these tracks. The other three tracks focus on semantic segmentation in optical images and SAR images with respect to object categories, such as water body, road, tree, building, vehicles, and land.
To satisfy the high-resolution earth observation system construction requirements of major national scientific and technological projects, the images used in the 2020 Gaofen Challenge are collected from the Gaofen-2 and Gaofen-3 satellites. Specifically, we use Gaofen-2 optical satellite data with 0.8–4 m resolution for the airplane detection, bridge detection, and water-body segmentation tasks, and Gaofen-3 SAR data with 1–5 m resolution for the ship detection and polarimetric SAR semantic segmentation tasks. To obtain high-quality data, we invited more than 100 experts, who took more than three months to prepare the dataset. Finally, a large-scale and challenging dataset with various categories and a tremendous number of object instances has been published for the 2020 Gaofen Challenge.
The rest of this article is organized as follows. We introduce the relevant details about the organization and dataset of the challenge in Section II. The overall information and results of the participants are discussed in Section III. We report the methods proposed by the winning teams of each track in Sections IV–IX. Finally, Section X concludes the 2020 Gaofen Challenge.
Data of the 2020 Gaofen Challenge
Data from Chinese Gaofen satellites are provided for all six tracks of the 2020 Gaofen Challenge. The data used in the challenge include multiscale, multiview, multiresolution optical remote sensing images and SAR images, which are all collected from the Gaofen-2 and Gaofen-3 satellites with resolutions ranging from 1 to 4 m and from 1 to 5 m, respectively. The data, containing more than 10 000 images, are annotated by more than 100 experts over three months. Some images and corresponding ground truth labels of each track are shown in Figs. 1 and 2. Details of the data provided for the 2020 Gaofen Challenge Tracks 1 to 6 are presented in the following.
Sample images and ground truths of object detection and recognition tasks. (a) Airplane detection. (b) Ship detection. (c) Bridge detection.
Sample images and ground truths of semantic segmentation tasks. (a) Semantic segmentation in optical images. (b) Water-body segmentation in optical images. (c) Semantic segmentation in fully polarimetric SAR images.
Data for Track 1 (airplane detection and recognition in optical images) are provided by the Gaofen-2 satellite. The scenes include the main civil airports in the world, such as Sydney Airport, Beijing Capital International Airport, Shanghai Pudong International Airport, Hong Kong Airport, Tokyo International Airport, and many more. The data contain 3000 satellite images with a spatial resolution of 0.8 m. Each image is of the size 1000 × 1000 pixels and contains ten categories of airplanes (i.e., Boeing 737, Airbus A321, Airbus A330, Boeing 747, Boeing 777, Boeing 787, Airbus A220, COMAC ARJ21, Airbus A350, and other) exhibiting a wide variety of orientations and scales.
Data for Track 2 (ship detection in SAR images) are collected from the Gaofen-3 satellite. The dataset contains 1000 SAR images with a spatial resolution ranging from 1 to 5 m. Each image is of the size 1000 × 1000 pixels and includes ships exhibiting a wide variety of orientations and scales. The scenes include the main civil ports in the world, such as Victoria Harbour, Port of Sanya, Incheon Port, etc.
Data for Track 3 (automatic bridge detection in optical satellite images) are provided by the Gaofen-2 satellite with the resolution ranging from 1–4 m. Each image contains at least one bridge. There are 3000 images with different sizes ranging from 667 × 667 to 1001 × 1001 pixels in the bridge dataset.
Data for Track 4 (semantic segmentation in optical satellite images) are provided by the Gaofen-2 satellite with 0.8 m resolution. Each image is annotated with respect to nine categories of ground objects at the pixel level, including road, building, shrub and tree, lawn, land, water body, vehicle, impervious ground, and others. There are 1800 images with the size ranging from 512 to 5000 pixels.
Data for Track 5 (automatic water-body segmentation in optical satellite images) are provided by the Gaofen-2 satellite with the resolution ranging from 1 to 4 m, covering rivers and lakes over large areas. There are 2500 images with a size ranging from 492 to 2000 pixels in the water-body dataset.
Data for Track 6 (semantic segmentation in fully polarimetric SAR images) are provided by the Gaofen-3 satellite with 1–3 m resolution, containing four polarization modes (i.e., HH, VV, HV, and VH). Six categories, including water body, building, industrial area, lawn, land, and others, are annotated at the pixel level for each image. There are 1200 images with the size ranging from 512 to 1500 pixels.
The aforementioned datasets are provided for the training set, preliminary test set, and final test set of the 2020 Gaofen Challenge. More information about the distribution of the data provided for the different tracks is shown in Fig. 3 and Table I.
Organization, Submissions, and Results
Six independent and distinctive tracks were organized in the 2020 Gaofen Challenge. Considering the practical applications, three of the six tracks addressed the task of object detection and recognition, and the remaining three addressed the task of semantic segmentation, as described in Sections III-A–III-F. For the tracks on object detection and recognition (Tracks 1–3), the mean Average Precision (mAP) [31] with an Intersection over Union (IoU) threshold of 0.5 is used to evaluate the results. For a given ground truth and the predicted result, TP, FP, and FN are determined according to an IoU threshold of 0.5. Then, the precision and recall are calculated as
\begin{align*}
\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}} \tag{1}
\\
\text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}. \tag{2}
\end{align*}
Following the Pascal VOC 2012 protocol, the AP of each class is calculated from precision and recall, and then the mAP is obtained by averaging the APs over all classes. For the tracks on semantic segmentation (Tracks 4–6), the frequency weighted IoU (FWIoU) [32] is used as the accuracy evaluation indicator, and it is calculated as follows:
\begin{equation*}
\text{FWIoU} = \frac{1}{\sum \nolimits _{i=0}^{N}\sum \nolimits _{j=0}^{N}s_{ij}}\sum \nolimits _{i=0}^{N}\frac{\sum \nolimits _{j=0}^{N}s_{ij}s_{ii}}{\sum \nolimits _{j=0}^{N}s_{ij}+\sum \nolimits _{j=0}^{N}s_{ji}-s_{ii}} \tag{3}
\end{equation*}
where $s_{ij}$ denotes the number of pixels of class $i$ that are predicted as class $j$, and $N+1$ is the number of classes.
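For clarity, the sketch below shows how FWIoU can be computed from a confusion matrix in NumPy. It is a minimal illustration of (3), not the official evaluation code of the challenge.

```python
import numpy as np

def fwiou(confusion: np.ndarray) -> float:
    """Frequency weighted IoU from an (N+1) x (N+1) confusion matrix,
    where confusion[i, j] counts pixels of class i predicted as class j."""
    freq = confusion.sum(axis=1)                      # pixels per ground-truth class
    intersection = np.diag(confusion)                 # correctly classified pixels
    union = confusion.sum(axis=1) + confusion.sum(axis=0) - intersection
    iou = np.where(union > 0, intersection / np.maximum(union, 1), 0.0)
    return float((freq * iou).sum() / confusion.sum())

# toy example with three classes
cm = np.array([[50, 2, 1],
               [3, 30, 2],
               [0, 1, 10]])
print(fwiou(cm))
```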
In addition to the accuracy of image interpretation, the inference time and the quality of technical reports are also taken into account in the final results. The final score is defined as
\begin{equation*}
\text{score} = 70\% \cdot \text{accuracy} + 20\% \cdot \text{speed} + 10\% \cdot \text{report}. \tag{4}
\end{equation*}
Section III-G shows baseline solutions achieved for the 2020 Gaofen Challenge, whereas the participating and winning methods are analyzed in Sections III-H and III-I, respectively.
A. Track 1: Airplane Detection and Recognition in Optical Images
Track 1 is dedicated to the detection and recognition of airplanes in optical satellite images. For each image in the dataset, there is an XML file with the same name for describing annotation information, such as the image coordinates and object information of airplanes. Each airplane instance in the images is annotated by the corresponding category information and location with an oriented bounding box [33].
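As an illustration of how such annotations can be consumed, the sketch below parses one annotation file into (category, polygon) pairs. The tag names (`object`, `possibleresult/name`, `points/point`) and the comma-separated "x,y" point format are assumed placeholders, not the official Gaofen annotation schema.

```python
import xml.etree.ElementTree as ET

def load_annotation(xml_path):
    """Parse one annotation file into (category, polygon) pairs.
    Tag names and the "x,y" point format are illustrative assumptions."""
    root = ET.parse(xml_path).getroot()
    instances = []
    for obj in root.iter("object"):
        category = obj.findtext("possibleresult/name", default="unknown")
        # each oriented box is stored as a list of (x, y) corner coordinates
        polygon = [tuple(float(v) for v in pt.text.split(","))
                   for pt in obj.iter("point")]
        instances.append((category, polygon))
    return instances
```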
B. Track 2: Ship Detection in SAR Images
Track 2 is dedicated to the detection of ships in SAR images, where the goal is to locate the ships in SAR images. In each image, the coordinates of the ships are described in a predefined format. As in Track 1, each XML file corresponds to one image, but the ships are annotated with horizontal bounding boxes rather than oriented ones.
C. Track 3: Automatic Bridge Detection in Optical Satellite Images
The goal for Track 3 is to locate bridges in large-scale optical satellite images. The labeling format is similar to Track 2, and the coordinates of each bridge are given as horizontal bounding boxes.
D. Track 4: Semantic Segmentation in Optical Satellite Images
Track 4 is dedicated to semantic segmentation in optical satellite images. In this case, a pair of images is provided for each scene, as shown in Fig. 2(a). One is the original optical satellite image, and the other is an image annotated with the ground truth, whose size is the same as that of the satellite image. In the ground truth images, different categories are marked with different RGB values at the pixel level.
E. Track 5: Automatic Water-Body Segmentation in Optical Satellite Images
To detect water bodies in remote sensing images, the 2020 Gaofen Challenge set up Track 5, whose purpose is to locate the water body in optical satellite images at the pixel level. As in Track 4, the original optical satellite images and the ground truth images are provided for water-body segmentation.
F. Track 6: Semantic Segmentation in Fully Polarimetric SAR Images
In addition to the track for semantic segmentation in optical satellite images, a semantic segmentation track for SAR images was also set up. Its goal is to classify the ground objects in SAR satellite images at the pixel level. As shown in Fig. 2(c), the dataset format is the same as for Tracks 4 and 5.
G. Baseline Solutions
Classic object detection and semantic segmentation networks are used as baseline solutions for each track. A two-stage object detection method, Faster RCNN [10] based on ResNet-50, is used for the object detection tracks. Faster RCNN is a detector with good performance, which generates region proposals through an RPN and performs classification and bounding box regression after Region-of-Interest (RoI) pooling. For Track 1, an additional angle regression branch is added to realize rotated box regression. For the semantic segmentation tracks, we use DeepLab V3 [34] based on ResNet-50 as the baseline solution. DeepLab V3 improves the atrous spatial pyramid pooling (ASPP) structure and uses multiple scales to obtain better segmentation results.
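A minimal sketch of such baselines using torchvision is shown below. It is not the organizers' exact baseline implementation; the constructors assume torchvision 0.13 or later, and the numbers of classes are only illustrative.

```python
import torch
import torchvision

# Detection baseline: Faster R-CNN with a ResNet-50 FPN backbone.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=11)   # e.g., 10 airplane categories + background

# Segmentation baseline: DeepLab V3 with a ResNet-50 backbone.
segmenter = torchvision.models.segmentation.deeplabv3_resnet50(
    weights=None, num_classes=9)    # e.g., 9 land-cover categories in Track 4

detector.eval()
segmenter.eval()
with torch.no_grad():
    det_out = detector([torch.rand(3, 512, 512)])            # list of dicts: boxes, labels, scores
    seg_out = segmenter(torch.rand(1, 3, 512, 512))["out"]   # (1, 9, 512, 512) class logits
```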
H. Participation
There are 701 teams from 253 affiliations, with 2023 competitors, joining the 2020 Gaofen Challenge. The competitors come from more than 20 countries, including China, England, Germany, France, Japan, Australia, Singapore, India, Sweden, etc. There were 1584 track registrations in total, of which 860 (54%) were for the object detection tracks and 724 (46%) for the semantic segmentation tracks. It can be seen that the popularity of the object detection tracks and the semantic segmentation tracks is similar, indicating that both are widely studied in the field of the automated interpretation of high-resolution earth observation data. In total, there were 5719 submissions for all tracks. The specific numbers of submissions for each track are shown in Fig. 4.
I. Best-Performing Approaches and Discussion
The top six teams of each track were awarded winning places. In this article, we mainly introduce the methods of champion teams. The brief introduction of the champion teams for Tracks 1–6 is as follows.
First place in Track 1: The Detect AI team; Chen Yan, Wenxuan Shi, Tao Qu, Chu He, and Dingwen Wang from Wuhan University, China; with attention mechanism and deformable convolution based on Faster RCNN.
First place in Track 2: The challenger_nriet team; Guo Jie, Zhuang Long, Xie Cong, and Zheng Ping from the Nanjing Research Institute of Electronics Technology, China; with a spatial pyramid pooling (SPP) layer and an adaptively spatial feature fusion (ASFF) module integrated into a YOLOv3-based detector.
First place in Track 3: The MDIPL-lab team; Yuxuan Sun, Wei Li, Wei Wei, and Lei Zhang from Northwestern Polytechnical University, China; with ResNet50 and HRNet-w32 on Faster RCNN.
First place in Track 4: The BUCT Tu Xiang Jie Yi Xiao Fen Dui team; Fei Ma, Jun Ni, Ruirui Li, Yingbing Liu, Feixiang Zhang, and Fan Zhang from the Beijing University of Chemical Technology, China; with an ensemble of ResNet101-V2 on DeepLab V3+ [35]–[37].
First place in Track 5: The Wu Da Ti Shui Gao Fen Dui team; Bo Dang, Jintao Li, Tianyi Gao, and Yansheng Li from Wuhan University, China; with multistructure deep segmentation network [38]–[41].
First place in Track 6: The BUCT Tu Xiang Jie Yi Xiao Fen Dui team; Fei Ma, Jun Ni, Ruirui Li, Yingbing Liu, Feixiang Zhang, and Fan Zhang from the Beijing University of Chemical Technology, China; with an ensemble of conditional random field (CRF) with DeepLab V3+ [42].
Looking at the overall trend, the methods used by the winning teams are all improvements and extensions of well-established models. The methods used by the champion teams of each track are described in detail in Sections IV–IX.
First Place in the Airplane Detection and Recognition in Optical Images: Detect AI
In this section, we introduce the winning method proposed for airplane detection and recognition in optical images. Airplane detection is one of the most common applications of rotation detection. The high similarity between airplane types increases the difficulty of fine-grained detection. To solve this problem, the Detect AI team proposes a rotation detection method based on an attention mechanism. First, they use the attention mechanism to extract the texture features of the aircraft in the feature representation stage for classification and add a deformable convolutional network (DCN) to extract the irregular structural features of the aircraft. In addition, the Detect AI team uses several common training techniques that do not add extra inference time.
The Detect AI team first evaluated several representative rotation detectors, including R2CNN [43], RRPN [44], and the RoI transformer [45], and built their final framework on top of the best-performing one.
A. Deformable Convolutional Network
It is challenging to acquire the structural features of airplanes by common convolution because of their irregular shapes. A common convolutional neural network samples features at fixed positions on a regular square grid, which cannot capture the structural characteristics of the airplanes.
To solve the aforementioned problems, this section introduces deformable convolution, which adds two-dimensional offsets to the convolution and pooling operations so that the sampling grid can adapt to the irregular shapes of the airplanes [47]. Specifically, the offsets of the convolutional kernel are predicted by an additional convolutional layer on the feature map, while the offsets of the pooling bins are predicted from the feature map together with the RoI. Since the offset branches are simple, lightweight layers, the extra parameters and computation are small, and the whole model can be trained end to end through gradient backpropagation.
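A minimal sketch of this idea using torchvision's DeformConv2d operator is shown below; it illustrates how a small auxiliary convolution predicts the 2-D offsets consumed by the deformable convolution, and it is not the Detect AI team's implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """A 3x3 deformable convolution: an auxiliary conv predicts per-location
    (dx, dy) offsets that let the sampling grid deviate from the regular square."""

    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=padding)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=padding)

    def forward(self, x):
        offsets = self.offset_conv(x)         # (B, 2*k*k, H, W)
        return self.deform_conv(x, offsets)   # features sampled at the offset positions

feat = torch.rand(1, 64, 32, 32)
print(DeformBlock(64, 64)(feat).shape)        # torch.Size([1, 64, 32, 32])
```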
B. Orientation-Sensitive Regression
The Detect AI team first adopts active rotating filters (ARFs) to learn orientation information. An ARF is rotated several times during convolution to produce feature maps with explicitly encoded orientation channels. Pooling over these orientation channels yields orientation-invariant features, from which object classification benefits, whereas bounding box (bbox) regression requires orientation-sensitive features. The team therefore applies orientation pooling to obtain invariant features for classification and keeps the un-pooled orientation-sensitive features for bbox regression.
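The orientation pooling step can be sketched as follows: given features in which each base channel carries several orientation responses (as produced by rotated filters), taking the maximum over orientations gives an orientation-invariant descriptor. This is a generic illustration of the concept with an assumed channel layout, not the team's ARF code.

```python
import torch

def orientation_pooling(features: torch.Tensor, n_orientations: int) -> torch.Tensor:
    """Max-pool over orientation channels.
    features: (B, C * n_orientations, H, W). The maximum over orientations is
    orientation-invariant and suits classification, while the un-pooled
    responses stay orientation-sensitive for box regression."""
    b, ck, h, w = features.shape
    c = ck // n_orientations
    return features.view(b, c, n_orientations, h, w).max(dim=2).values

responses = torch.rand(2, 256 * 8, 64, 64)          # 256 base features x 8 orientations
print(orientation_pooling(responses, 8).shape)      # torch.Size([2, 256, 64, 64])
```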
C. Experiment
There are 1000 images with ground-truth labels in the training data, and the size of each image is 1024 × 1024 pixels. The data contain ten types of airplane samples. The Detect AI team first divides the training set into two parts: 800 images are used for training, and 200 images are used for validation. They randomly rotate the training images and expand the training set to five times its original size. They apply automatic contrast augmentation and mixup [48] to the dataset, which greatly expands the training samples. At the same time, they collect airplane images from public remote sensing datasets as training data for pretraining. The data sources are mainly DOTA [59], UCAS-AOD [49], NWPU VHR-10 [50], and RSOD-Dataset [51]. A total of 7449 images containing airplanes are collected for pretraining, and the model is tested on the competition data.
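Mixup in its generic image-level form is sketched below; how the team adapted the blended targets to rotated-box detection is not detailed in their report, so the soft classification targets used here are an assumption for illustration only.

```python
import numpy as np
import torch

def mixup(images: torch.Tensor, targets: torch.Tensor, alpha: float = 0.2):
    """Classic mixup: blend random pairs of samples and their soft targets.
    images: (B, C, H, W); targets: (B, num_classes) one-hot or soft labels."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_images, mixed_targets

imgs = torch.rand(8, 3, 256, 256)
labels = torch.eye(10)[torch.randint(0, 10, (8,))]
mixed_imgs, mixed_labels = mixup(imgs, labels)
```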
The proposed method is verified on the test set and compared with the rotation detection baselines mentioned above.
D. Discussion
Airplane detection and recognition play an important role in both military and civilian fields. The Detect AI team analyzes the characteristics of optical airplane remote sensing images and carries out research on their object characteristics. According to the existing problems and challenges, they improve and optimize an existing detection framework. On the one hand, they use attention and DCN to learn the texture features and irregular shape features of the airplanes. On the other hand, they propose a new orientation-sensitive bbox regression method, with which the bbox of the object is regressed more accurately.
First Place in the Ship Detection in SAR Images: Challenger_nriet
In this section, we introduce the winning method proposed for ship detection in SAR images. There are a few particular challenges for SAR ship detection, which are analyzed in the following.
A large number of small objects. Compared with natural scenes, there are many small objects in remote sensing imagery. The SAR images provided by the official website are acquired from the Gaofen-3 satellite with a spatial resolution ranging from 1 to 5 m. This means that a 20-m ship occupies only 4–20 pixels in the provided SAR images.
Rotation invariance. Objects in satellite imagery may have any orientation. For example, a ship can sail at any angle on the sea.
Insufficient training data. Compared with optical images, it is more difficult to obtain SAR images [52]. Therefore, the number of available SAR images is less than that of optical images.
Wide range of aspect ratios. Ships may have a relatively large aspect ratio in satellite images compared with most other objects. Therefore, anchor-based CNN methods have difficulty setting anchors that cover ships with different aspect ratios [53], [54].
A. Baseline Model
To address these challenges, the challenger_nriet team adopts YOLOv3 [55] as the baseline model. Ships have a large range of aspect ratios in SAR images compared with general objects in optical images. Thus, the nine anchors in the YOLOv3 model cannot cover the scales and aspect ratios of ships in SAR images very well. Therefore, they use guided anchoring to adjust the shape of the anchors to fit the desired shapes.
As shown in Fig. 7, a spatial pyramid pooling (SPP) layer added in the YOLOv3 model can combine local and global features, making features contain richer information and have stronger representation power.
Furthermore, they use the ASFF [56] module to filter conflicting information and mitigate the inconsistency between the different scales output by the feature pyramid network (FPN) of YOLOv3. An extra IoU loss [57] is added to the original smooth L1 loss for more accurate bounding box regression.
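The extra IoU term can be sketched as the generic 1 − IoU loss for axis-aligned boxes shown below; how it is weighted against the smooth L1 term is not specified in the team's report.

```python
import torch

def iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """1 - IoU loss for axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    return (1.0 - iou).mean()
```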
The baseline model is trained with the 300 training images downloaded from the official website for 100 epochs. The proposed model is trained using stochastic gradient descent (SGD) [58] algorithms with the cosine learning rate schedule from 0.001 to 0.00001. The values of weight decay and momentum are 0.0005 and 0.9, respectively.
B. Bells and Whistles
In this part, we introduce the bells and whistles used to improve the model's performance.
1) Data Augmentation
They add the SAR-Ship-Dataset [59] to train their model. The image size of the SAR-Ship-Dataset is 256 × 256 pixels. Therefore, they randomly select 2 × 2, 3 × 3, or 4 × 4 images, stitch them together, and rescale the stitched images to a size of 1000 × 1000 pixels. Fig. 8 shows the results of this data augmentation. They also apply mirroring, cropping, distorting, and random-affine transformations for data augmentation. Moreover, the challenger_nriet team adds the HRSID dataset [60] to the training set to train the model.
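A minimal sketch of this stitching augmentation is given below; it assumes equally sized grayscale chips, uses OpenCV only for resizing, and omits the corresponding shift-and-scale transform of the box annotations.

```python
import random
import numpy as np
import cv2  # OpenCV, assumed available for resizing

def stitch_patches(chips, out_size=1000):
    """Stitch k x k equally sized SAR chips (k in {2, 3, 4}) into one mosaic
    and rescale it to out_size x out_size pixels."""
    k = random.choice([2, 3, 4])
    selected = random.sample(chips, k * k)
    rows = [np.concatenate(selected[i * k:(i + 1) * k], axis=1) for i in range(k)]
    mosaic = np.concatenate(rows, axis=0)
    return cv2.resize(mosaic, (out_size, out_size))

chips = [np.random.rand(256, 256).astype(np.float32) for _ in range(16)]
print(stitch_patches(chips).shape)   # (1000, 1000)
```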
2) Finer-Grained Features and Denser Grid
Many ships in the SAR images are relatively small compared with objects in natural scenes. As a result, they remove stage 5, which has a stride of 32 in the YOLOv3 backbone network. Instead, they output stage 2, stage 3, and stage 4 to detect ships in different scales. To keep the depth of output features consistent, they add more convolutional layers with shortcut connections in stage 2. Finally, they get finer-grained features while still keeping enough semantic information.
3) Multiscale Training
As shown in Fig. 9, the challenger_nriet team adopts multiscale training with random cropping. First, they randomly crop image patches from the images in the dataset, where the scale of the cropped patches is randomly sampled from 384, 416, 448, 480, 512, 544, 576, 608, and 640 pixels; the cropped patches are then rescaled to a fixed size of 512 × 512 pixels for training. They only keep those patches that contain ships.
4) Scale-Aware Loss Function
To focus more on small ships, the challenger_nriet team sets different weights on the loss function according to the size of the ships. The weights of the L1 loss and IoU loss for small ships are larger than those for large ships. The weights are calculated as
\begin{equation*}
\text{weight} = {\begin{cases}1, & \text{if } \frac{(w \cdot h)}{(W \cdot H)}>0.01 \\
3-\frac{(200 \cdot w \cdot h)}{(W \cdot H)}, & \text{otherwise} \end{cases}} \tag{5}
\end{equation*}
where $w$ and $h$ denote the width and height of a ship and $W$ and $H$ denote the width and height of the image.
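The weighting rule in (5) translates directly into a small function, shown below; the interpretation of $w$, $h$ as the ship size and $W$, $H$ as the image size follows the where-clause above.

```python
def scale_aware_weight(w: float, h: float, W: float, H: float) -> float:
    """Loss weight from (5): ships covering more than 1% of the image get
    weight 1; smaller ships get a weight that grows linearly up to 3."""
    ratio = (w * h) / (W * H)
    return 1.0 if ratio > 0.01 else 3.0 - 200.0 * ratio

print(scale_aware_weight(20, 10, 1000, 1000))   # 0.02% of the image -> weight 2.96
```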
5) Multiple Weights Fusion
Challenger_nriet team trains the model for 100 epochs, then averages the weights from epochs of 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, and 100 to achieve the final network weights for more robust testing.
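A hedged sketch of such checkpoint averaging in PyTorch is shown below; the checkpoint file names are illustrative, floating-point parameters are averaged, and integer buffers (e.g., BatchNorm counters) are copied from the first checkpoint.

```python
import torch

def average_checkpoints(paths):
    """Average the weights saved at several epochs into one state dict."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key, ref in state_dicts[0].items():
        if ref.is_floating_point():
            averaged[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        else:
            averaged[key] = ref   # integer buffers are copied as-is
    return averaged

# usage sketch (file names are illustrative):
# model.load_state_dict(average_checkpoints([f"epoch_{e}.pth" for e in range(50, 101, 5)]))
```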
6) Deeper Network and More Training Epochs
They add more convolutional layers in stage 2 and train the models for more epochs (200 epochs). This strategy keeps finer-grained features while still capturing enough semantic information.
7) Multiscale Testing
They adopt three scales for testing, 800 × 800, 1056 × 1056, and 1248 × 1248 pixels. They first obtain the network outputs of each scale. Then, they concatenate them together and perform nonmaximum suppression (NMS) to get the final detection results. Besides, they change the NMS threshold from 0.65 to 0.55.
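The multiscale testing procedure can be sketched as follows; `detect_fn` is a placeholder that runs the detector on the image resized to s × s and returns boxes and scores in the resized coordinates, and the boxes are mapped back before NMS.

```python
import torch
from torchvision.ops import nms

def multiscale_test(detect_fn, image, scales=(800, 1056, 1248), iou_thr=0.55):
    """Run the detector at several square test scales, map the boxes back to
    the original resolution, concatenate everything, and apply NMS."""
    h, w = image.shape[-2:]
    all_boxes, all_scores = [], []
    for s in scales:
        boxes, scores = detect_fn(image, s)
        boxes = boxes * torch.tensor([w / s, h / s, w / s, h / s])  # back to original scale
        all_boxes.append(boxes)
        all_scores.append(scores)
    boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```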
C. Results and Discussion
Challenger_nriet team reports the detection results on the preliminary test set downloaded from the official website of the contest. Details are demonstrated in Table III. The detection results of ships are shown in Fig. 10.
The model achieves 59.95% mAP on the test dataset provided in phase 3 of the contest. Finally, to cover a wider range of testing scales, they instead adopt three scales of 736 × 736, 1056 × 1056, and 1344 × 1344 pixels for testing, which finally yields 60.58% mAP on the phase 3 test dataset.
First Place in the Automatic Bridge Detection in Optical Satellite Images: MDIPL-lab
Bridge detection aims at automatically detecting and locating bridges in remote sensing images. As a branch of the object detection task, many detection methods for natural scenes can also be used for bridge detection. For example, Faster-RCNN [61], a representative two-stage detector, achieves stable performance across different tasks. Therefore, the proposed method is built on Faster-RCNN. To cope with the small dataset and single scene type, ResNet50+DCN and HRNet-W32 are adopted as the backbone networks in this method. In view of the large variation of object orientations and the complicated illumination conditions in remote sensing images, we adopt horizontal flipping and random 90° rotation for data augmentation.
A. Model Structure
Faster RCNN+FPN and random horizontal flipping with probability P = 0.5 are used as the benchmark method. Considering that some of the bridges lie along the diagonal of the target box and that most of the area inside such boxes is river, we used DCNv2 to extract features more effectively and focus on the informative regions.
It was observed that the captured scenes in the bridge dataset are relatively monotonous, while the size of the bridges varies greatly and there are many small bridges. Therefore, we chose the backbone network HRNet-W32, which is more advantageous in integrating multiscale features compared with ResNet50 and ResNet101. At the same time, due to the relatively monotonous scenes, deeper networks, such as ResNet101, do not bring significant improvement over ResNet50.
Integrating multiple different models has proven to be a relatively effective way to improve accuracy. For the experiments on the test dataset, they tried NMS, SoftNMS [62], VOTE, and NMS using IoF instead of IoU, and finally chose SoftNMS as the integration method.
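A generic Gaussian Soft-NMS sketch is given below for reference; it is not the exact MMDetection implementation used by the team, and the sigma and score-threshold values are common defaults rather than the team's settings.

```python
import torch
from torchvision.ops import box_iou

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian Soft-NMS: overlapping boxes have their scores decayed by
    exp(-IoU^2 / sigma) instead of being discarded outright."""
    scores = scores.clone()
    keep = []
    idx = torch.arange(len(scores))
    while idx.numel() > 0:
        best = int(scores[idx].argmax())
        cur = idx[best]
        keep.append(int(cur))
        idx = torch.cat([idx[:best], idx[best + 1:]])
        if idx.numel() == 0:
            break
        iou = box_iou(boxes[cur].unsqueeze(0), boxes[idx]).squeeze(0)
        scores[idx] *= torch.exp(-(iou ** 2) / sigma)    # decay instead of removing
        idx = idx[scores[idx] > score_thr]
    return keep
```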
B. Data and Training Strategy
Reasonable data augmentation can artificially control the prior distribution of the scenes and increase the amount of data, which is another strategy to improve the performance of the model. To simulate the change of camera rotation angle during data acquisition and enhance the diversity of the data, we added random rotations in multiples of 90° (i.e., 0°, 90°, 180°, and 270°) to the training images.
A multiscale method is introduced in the training and testing process to address the large differences in object size. At the same time, considering that the images in the dataset have two resolutions of 1001 × 1001 and 668 × 668 pixels, the image size is randomly scaled between 600 and 1200 pixels during training.
C. Experiment
All experiments are conducted on the object detection framework MMDetection [63]. There are a total of 2000 images in the dataset. Since some consecutive images are taken of the same scene, the first 667 images are selected as the validation set to avoid data duplication. Twelve epochs are trained in each experiment. The initial learning rate is 0.00125, the batch size is usually 4 or 8, and the learning rate is decreased by a factor of 10 at the 8th and 11th epochs. SGD with a momentum of 0.9 and a weight decay of 0.0001 is used as the optimizer. A horizontal flip and a subsequent 90° rotation are each applied randomly during training.
On the baseline model, the results using different methods and strategies are shown in Table IV. The team adds the DCNv2, multiscale training, and data enhancement strategy. The detection results of bridges are shown in Fig. 11.
The different models and their integration effects are shown in Table V. HRW32 represents HRNet-W32, and Aug represents the augmentation strategy. Finally, the integration of the two models using SoftNMS yielded slightly better results than either VOTE or NMS.
D. Discussion
This method is aimed at automatic bridge detection in remote sensing imagery. On the basis of Faster-RCNN, it adjusts the backbone network selection and the data augmentation strategy according to the characteristics of the single-scene dataset, and finally integrates two models to obtain 83.6% AP.
First Place in the Semantic Segmentation in Optical Images: BUCT
In this section, we introduce the winning method proposed for the semantic segmentation in optical images. The method proposed by the team is a deep semantic segmentation network combined with multiscale spatial features. The purpose is to obtain features of different scales and use the regional features of superpixels to combine global information to improve the performance of segmentation. This method first uses ResNet101-V2 as the backbone network of Deeplab V3+ [37] to extract image features and then uses two subnetworks of “pixel-level semantic segmentation” and “superpixel-level semantic segmentation based on boundary feature enhancement” for semantic segmentation. The framework is shown in Fig. 12.
A. Data Preprocessing
Due to the complexity of the object categories in the dataset, the BUCT team augments the existing data using the following methods.
Overlap cropping: The training images are cropped into a fixed size with overlap.
Spatial transformation: It includes horizontal and vertical flipping, random rotation at any angle with the image center as the origin, scaling outward or inward with a certain proportion, random cropping, and shifting in the X or Y direction (or both).
Random noise addition: Gaussian noise is randomly added to the data to prevent the CNN from learning useless high-frequency features, thereby reducing the probability of overfitting.
B. Deep Semantic Segmentation Network Combined With Multiscale Spatial Features
1) Feature Extraction Based on ResNet101
By applying the activation function on the residual branch (the preactivation design of ResNet-V2), information propagates more directly through ResNet101 in both the forward and backward passes. This allows the network to achieve better results and alleviates the vanishing gradient problem.
2) Pixel-Level Semantic Segmentation Based on Deeplab V3+
The pixel-level feature classification subnetwork uses the DeepLab V3+ network to segment objects at the pixel level. The feature maps from the first four convolutional blocks of ResNet101 are sent to the ASPP module to capture local and global information at different proportions. The ASPP output is then upsampled and concatenated with the low-level features from the encoder, and finally the pixel-level loss is computed. DeepLab is a method that combines deep convolutional neural networks (DCNNs) and probabilistic graphical models (dense CRFs). The DCNN uses atrous convolution to expand the receptive field and counteract the resolution reduction caused by downsampling or pooling, while the dense CRF accounts for the mutual influence between adjacent pixels.
3) Superpixel-Level Semantic Segmentation Branch Based on Edge Feature Enhancement
The segmentation results of DeepLab V3+ at object edges are not very good; there is strong segmentation noise and the edges are fuzzy. As a result, the BUCT team uses a high-precision end-to-end superpixel generation method. This method can be implemented with a deep convolutional network that is trained together with the semantic segmentation network, so the segmentation accuracy is greatly improved. Then, using the pixel-to-superpixel association matrix and the ground truth, the superpixel-level loss function is calculated. The goal of this loss function is to ensure that the labels of pixels belonging to the same superpixel are as consistent as possible. In addition, when calculating the loss function, the BUCT team uses the ground truth to give more weight to the pixels near the edges of the objects.
At the superpixel generation stage, the traditional simple linear iterative clustering method has a nondifferentiable assignment step and therefore cannot be embedded in a convolutional neural network. As a result, the BUCT team adopts a differentiable linear iterative clustering method, which models the soft association between pixel $p$ (with feature $I_p$) and superpixel $i$ (with center $S_i^{t-1}$ at iteration $t-1$) as
\begin{equation*}
Q_{pi}^{t}=e^{-D(I_{p},S_{i}^{t-1})}=e^{-\Vert I_{p}-S_{i}^{t-1}\Vert ^2} \tag{6}
\end{equation*}
The superpixel centers are then updated at iteration $t$ as
\begin{equation*}
S_{i}^{t}=\frac{1}{Z_{i}^{t}} \sum _{p=1}^{n} Q_{p i}^{t} I_{p}. \tag{7}
\end{equation*}
where $Z_{i}^{t}$ is a normalization constant. The superpixel-level reconstruction of the ground-truth label matrix $G$ is then obtained as
\begin{equation*}
G^{*}=Q_{\text{row }} Q_{\text{col }}^{\top } G. \tag{8}
\end{equation*}
where $Q_{\text{row}}$ and $Q_{\text{col}}$ denote the association matrix $Q$ normalized along its rows and columns, respectively.
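The sketch below illustrates (6)–(8) for a flattened image. The row and column normalizations of $Q$ follow the where-clause above and are assumptions about the exact definitions; this is not the BUCT team's code.

```python
import torch

def soft_association(pixel_feat: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Soft pixel-to-superpixel association from (6): Q[p, i] = exp(-||I_p - S_i||^2).
    pixel_feat: (n_pixels, d); centers: (n_superpixels, d)."""
    return torch.exp(-torch.cdist(pixel_feat, centers) ** 2)

def update_centers(Q: torch.Tensor, pixel_feat: torch.Tensor) -> torch.Tensor:
    """Center update from (7): S_i = (1 / Z_i) * sum_p Q[p, i] * I_p."""
    Z = Q.sum(dim=0, keepdim=True).clamp(min=1e-8)      # normalization constants
    return (Q.t() @ pixel_feat) / Z.t()

def reconstruct_labels(Q: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Label reconstruction from (8): G* = Q_row (Q_col^T G), with Q_row
    row-normalized and Q_col column-normalized (assumed definitions).
    G: (n_pixels, n_classes) one-hot ground truth."""
    Q_row = Q / Q.sum(dim=1, keepdim=True).clamp(min=1e-8)
    Q_col = Q / Q.sum(dim=0, keepdim=True).clamp(min=1e-8)
    return Q_row @ (Q_col.t() @ G)
```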
C. Loss Function
In the training process, the pixel-level segmentation branch outputs the predicted pixel label matrix $P$, while $G$ denotes the corresponding ground-truth label matrix. The total loss combines a pixel-level term and a superpixel-level term.
1) Pixel-Level Loss Function
The pixel-level loss can be described as the cross-entropy loss between the predicted label and the ground truth, which is defined as
\begin{equation*}
{{\mathcal L}_{\text{pixel}}} = {\mathcal L}(G,P). \tag{9}
\end{equation*}
2) Superpixel-Level Loss Function
We calculate the region-level loss between the superpixel reconstruction result $G^*$ and the ground truth $G$, which can be defined as
\begin{equation*}
\begin{aligned} {{\mathcal L}_{\text{region}}} &= {W_{\text{over}}}{\mathcal L}(G,{G^*}) \\
& = {W_{\text{over}}}{\mathcal L}(G,{Q_{\text{row}}}Q_{\text{col}}^ \top G). \end{aligned} \tag{10}
\end{equation*}
where $W_{\text{over}}$ is a weight map that emphasizes pixels near object boundaries. The total loss is the sum of the superpixel-level and pixel-level terms
\begin{equation*}
\begin{aligned} {\mathcal L} &= {{\mathcal L}_{\text{region}}} + {{\mathcal L}_{\text{pixel}}} \\
& = {W_{\text{over}}}{\mathcal L}(G,{Q_{\text{row}}}Q_{\text{col}}^ \top G){{\kern1.0pt}} + {{\kern1.0pt}} {{\kern1.0pt}} {\mathcal L}(G,P). \end{aligned} \tag{11}
\end{equation*}
D. Implementation Details
For DeepLab V3+, the number of convolutional kernels in each convolutional layer is set to 256, and the atrous rates of the ASPP module are set to [6, 12, 18]. The initial learning rate and batch size are set to 0.007 and 5, respectively.
E. Results and Discussion
The experiment uses multiple models to analyze the performance, including U-Net [23], D-LinkNet [64], DeepLab-V3, DeepLab-V3+, and various forms of DeepLab-V3+. The effect of data expansion and augmentation is also analyzed. The models use ResNet50 and ResNet101 as backbone networks in the experiments. All experiments use 400 images as the validation set and the remaining images as the training set. The experimental results are shown in Table VI.
As a single network, DeepLab-V3+ has better feature extraction capabilities than U-Net, D-LinkNet, and DeepLab-V3. As a result, the segmentation accuracy obtained with DeepLab-V3+ is the highest. As shown in Table VI, for each model, the segmentation accuracy improves after data augmentation. For the backbone network, the segmentation accuracy of the models using ResNet101 is higher than that of the models using ResNet50. In addition, the accuracy of the multinetwork segmentation model is higher than that of the single-network segmentation model, but its inference time is longer [65].
Considering both inference time and segmentation accuracy, the DeepLab-V3+ network with a ResNet101 backbone achieves the best overall semantic segmentation performance.
First Place in the Automatic Water-Body Segmentation in Optical Images: WHU
In this section, we introduce the winning method designed for automatic water-body segmentation in optical images. The WHU team proposes a water-body extraction method based on spatial consistency boundary optimization and rotation consistency constraints in a multistructure segmentation network. The method integrates three network architectures with different characteristics, namely a large receptive field, high-resolution representation, and reduction of the information loss caused by pooling. Thus, the noise and missing points caused by accidental errors can be reduced. The fully connected CRF is used for postprocessing of the predicted results, and a weighted fusion of the postprocessing results with the original network predictions is performed. In the testing stage, the original image and the images rotated by 90°, 180°, and 270° are predicted separately, and the predictions are rotated back and fused to enforce rotation consistency.
A. Multistructure Deep Segmentation Network
This method trains three deep segmentation networks with different characteristics, as shown in Fig. 13. The training set is processed using data augmentation methods, such as rotation and stretching, and the focal loss function [16] is used for water-body segmentation.
1) Context Encoder Network (CE-Net)
CE-Net was first used in 2-D medical image segmentation [66]. A context extraction module is added to the traditional encoder–decoder structure to capture higher level features and preserve spatial information for semantic segmentation, thus reducing the loss of information caused by pooling and convolution. Its network structure mainly includes a feature encoder module, a context information extraction module, and a decoder module. ResNet-152 is used as the fixed feature extractor. The context information extraction module consists of a dense atrous convolution (DAC) module and a residual multikernel pooling (RMP) module, whereas the decoder uses convolutional layers and transposed convolutional layers. At the same time, weights pretrained on the ImageNet dataset are used to accelerate network convergence. The DAC module enlarges the receptive field, and its parallel structure reduces the conflict between segmentation and image details. The RMP module uses pooling kernels of different scales to segment water bodies of various sizes.
2) Deep Segmentation Network With Dense Convolutional Pooling (CEWI-Net)
Inspired by CE-Net and Inception V1 [67], the WHU team proposes a deep segmentation network based on dense convolutional pooling (CEWI-Net), which adds a dense convolutional pooling block (DCP block) to the encoder–decoder structure. This module is composed of convolutional layers with three kernel scales (1 × 1, 3 × 3, and 5 × 5) and a maximum pooling layer. Each layer in the module can learn both "sparse" and "not sparse" characteristics, which gives the module a multiscale advantage. At the same time, they use a 1 × 1 convolutional layer to reduce the channel dimension so as to reduce the number of network parameters and accelerate convergence while ensuring accuracy.
3) Deep Segmentation Network of Multiscale Object Context (HR-Net)
In general, existing methods encode the input image into a low-resolution representation and then recover the high-resolution representation. Instead, HR-Net [68] starts from a high-resolution subnetwork and gradually adds high-to-low-resolution subnetworks over four stages, connecting the subnetworks of the four resolutions in parallel. Information is repeatedly exchanged across the parallel multiresolution subnetworks throughout the whole network to perform repeated multiscale fusion. Finally, the low-resolution outputs are bilinearly upsampled to obtain the high-resolution output.
Given the complex types of water bodies and the relationships between ground objects in high-resolution remote sensing images, the object context representation (OCR) based on the high-resolution representation is introduced [69]. It is difficult to segment water bodies from single pixels alone, whereas OCR can effectively exploit context information. OCR combines the category information of water-body and non-water-body regions to weight each pixel and concatenates the result with the original feature to obtain the final representation of each pixel.
4) Optimization Loss Function
Water bodies in remote sensing images mainly include rivers, lakes, and ponds with different scales and shapes, which bring different difficulties to the deep semantic segmentation network. Focal loss [70] is used as the loss function of network optimization to address the problem of an unbalanced number of difficult and easy samples in the training images. The calculation method is as follows:
\begin{align*}
L_{\text{Focal}} ={}& \frac{1}{N}\sum _{i = 1}^{N} \Big[ -\alpha y_i^{\prime} \left({1 - {y_i}} \right)^{\gamma} \log \left({{y_i}} \right) \\
& - (1 - \alpha)\left({1 - y_i^{\prime}} \right) y_i^{\gamma} \log \left({1 - {y_i}} \right) \Big] \tag{12}
\end{align*}
where $y_i$ is the predicted water-body probability of the $i$th pixel, $y_i^{\prime}$ is the corresponding ground-truth label, $\alpha$ is the class-balancing factor, and $\gamma$ is the focusing parameter.
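A direct PyTorch transcription of (12) is shown below; the alpha and gamma values are the common defaults, not necessarily those used by the WHU team.

```python
import torch

def binary_focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss following (12). pred holds probabilities in (0, 1),
    target holds 0/1 water-body labels."""
    pred = pred.clamp(eps, 1.0 - eps)
    pos = -alpha * target * (1.0 - pred) ** gamma * torch.log(pred)
    neg = -(1.0 - alpha) * (1.0 - target) * pred ** gamma * torch.log(1.0 - pred)
    return (pos + neg).mean()
```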
B. Spatial Consistency Boundary Optimization and Rotation Consistency Constraints Based on Multistructure Segmentation Network
The testing phase includes the comprehensive prediction of rotation consistency from multiple angles, spatial consistency boundary optimization, and the voting of the multistructure segmentation network, as shown in Fig. 14.
Structure diagram of multistructure deep segmentation network of water-body extraction based on spatial consistency boundary optimization and rotation consistency constraints.
1) Comprehensive Prediction of Rotation Consistency From Multiple Angles
In remote sensing images, water bodies are characterized by diverse types, various scales, and complex spatial relations, which restricts the consistency of regional predictions and the completeness of the extraction results. The idea is to improve the accuracy of water-body extraction and reduce misclassification and holes by synthesizing the prediction of the original image with the predictions of the image rotated to three additional angles.
The concrete structure of the method is shown in Fig. 15. First, the original image and the images rotated by 90°, 180°, and 270° are fed into the trained segmentation network separately. The resulting prediction maps are then rotated back to the original orientation and averaged as
\begin{equation*}
{P_D} = ({P_0} + {P_{90}} + {P_{180}} + {P_{270}})/4. \tag{13}
\end{equation*}
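A minimal PyTorch sketch of this test-time rotation averaging is shown below; the sigmoid output follows the implementation details given later in this section, and a square input is assumed so that 90° rotations preserve the image shape.

```python
import torch

def rotation_consistent_predict(model, image):
    """Test-time augmentation from (13): predict on the image rotated by
    0, 90, 180, and 270 degrees, rotate each probability map back, and average.
    image: (1, C, H, W) with H == W."""
    preds = []
    for k in range(4):
        rotated = torch.rot90(image, k, dims=(2, 3))
        with torch.no_grad():
            prob = torch.sigmoid(model(rotated))             # (1, 1, H, W) water probability
        preds.append(torch.rot90(prob, -k, dims=(2, 3)))     # rotate back
    return sum(preds) / 4.0
```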
Comprehensive prediction of consistent rotation (red in the prediction maps indicates the water, blue indicates the background, and colors between red and blue represent the confidence score.)
Here, $\delta(\cdot)$ denotes the sigmoid function applied to the network output $w_{ij}$ at pixel $(i, j)$, $P_{\mathrm{CRF}_{ij}}$ is the corresponding probability after fully connected CRF postprocessing, and $\beta$ is the weight that fuses the network prediction with the CRF result
\begin{align*}
P_{D_{ij}} &= \delta ({w_{ij}}) \tag{14} \\
P_{ij} &= \beta \cdot P_{D_{ij}} + (1 - \beta)\cdot P_{\mathrm{CRF}_{ij}} \tag{15}
\end{align*}
2) Spatially Consistent Boundary Optimization
This step uses the fully connected CRF to postprocess the segmentation results of the network, and a weighted fusion of the processed results with the original network predictions is carried out to recover the boundary details of the prediction. The structure of the algorithm is shown in Fig. 16.
3) Multistructure Deep Segmentation Network Voting
It integrates three network architectures with different characteristics, and the prediction results are voted pixel by pixel to obtain the final water-body automatic extraction results. It uses CE-Net to reduce pooling information loss, CEWI-Net with multiscale characteristics, and HR-Net with high-resolution representation and spatial context relationships.
C. Implementation Details
In the experiment, the mean and standard deviation of the optical images are used to normalize the images. The sigmoid activation function limits the output value to the range [0, 1], indicating the probability of the water-body prediction. The team selects Adam [71] as the optimization method and sets the learning rate and batch size to 0.0001 and 4, respectively.
D. Results and Discussion
To validate the performance of this method, the WHU team compared their method with (1) U-Net [23]; (2) CE-Net; (3) CEWI-Net; (4) HR-Net; (5) the network only using multistructure voting mechanism without spatial consistency boundary optimization; (6) the network without the comprehensive prediction of rotation consistency from multiple angles.
Fig. 17 shows examples of the automatic water-body extraction results on the test set of Gaofen-2 high-resolution optical images. From the results, a single segmentation network frequently misses or misclassifies water bodies. The spatially consistent boundary optimization, i.e., the weighted fusion with the fully connected CRF, refines the predicted boundaries. As shown in Table VII, the multistructure deep segmentation network can integrate the characteristics of the three networks to improve the extraction accuracy for different types of water bodies. The comprehensive prediction with rotation consistency synthesizes diverse spatial information from multiple angles, thereby improving the reliability of the water-body prediction.
First Place in the Semantic Segmentation in Fully Polarimetric SAR: BUCT
In this section, we introduce the winning method proposed for semantic segmentation in fully polarimetric SAR images. The method proposed by the team consists of a set of fully polarimetric SAR image preprocessing methods and a multiscale deep network collaborating with superpixel constraints. This method uses DeepLab V3+ for pixel-level classification and simultaneously extracts local gradient ratio patterns (LGRPs) from the original fully polarimetric SAR image, then performs weighted K-means [72] clustering to generate superpixels. Under the constraints of the superpixels, the classification loss function is further optimized to improve the segmentation performance. The framework of the method is shown in Fig. 18.
A. Data Preprocessing
The dataset used in this method is divided into two parts: one is the Gaofen-3 fully polarimetric SAR training dataset provided by the organizers, and the other is fully polarimetric SAR data collected by the team. The BUCT team has augmented the existing data as follows.
Overlap cropping: They crop the original image to a fixed size with overlap.
Spatial transformation: It includes horizontal and vertical flipping, random rotation of the image at any angle with the center as the origin, scaling of the image at a certain ratio, random cropping, and shifting.
Adding noise: Gamma-distributed speckle noise with different numbers of looks is fitted and added to the images, thereby enriching the training samples.
Polarization simulation: They perform polarization simulation for the specific objects, and then obtain the HH, HV, and VH channels of the simulation data.
B. Pixel-Level Semantic Segmentation: DeepLab V3+
The BUCT team uses DeepLab V3+ for the semantic segmentation of the fully polarimetric SAR image. In the DeepLab V3+ network, feature extraction is performed on the input image through the backbone network to obtain low-level and high-level features. In the encoding stage, the high-level features go through the ASPP module, which includes a 1 × 1 convolution, three atrous convolutional layers with different atrous rates (6, 12, 18), a global average pooling layer, and an upsampling layer. The outputs of the five branches are then concatenated, and the number of channels is adjusted through a 1 × 1 convolution. In the decoding stage, the low-level features are dimensionally adjusted by a 1 × 1 convolution (output stride = 4), and the encoder output is upsampled by a factor of 4 (the output stride changes from 16 to 4). Then, the features are concatenated, a 3 × 3 convolution is applied, and the result is upsampled by a factor of 4 to obtain the dense prediction. All upsampling layers in the decoder use bilinear interpolation.
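The ASPP head described above can be sketched as follows; this is a compact illustration of the five-branch structure with rates 6/12/18, not the team's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """ASPP: a 1x1 conv, three 3x3 atrous convs with rates 6/12/18, and
    image-level pooling; the five outputs are concatenated and fused by 1x1 conv."""

    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * 5, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

print(ASPP(2048)(torch.rand(1, 2048, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])
```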
C. Superpixel Segmentation Technology for Fully Polarimetric SAR Image
Due to geometric distortion and speckle noise in fully polarimetric SAR images, it is difficult to find an effective method to generate superpixels with high boundary adherence, compactness, and low computational cost. This method adopts a superpixel generation algorithm with linear feature clustering and edge constraints for SAR images [35]. There are three stages. First, the BUCT team extracts the LGRP of each pixel, a feature that is strongly robust to coherent speckle noise. The LGRP feature is defined as
\begin{equation*}
{\rm {LGRP}}_{P,R}\left({{g_c}} \right) = \sum \limits _{p = 0}^{P - 1} {s\left({{G_{{\rm {ratio}}}}\left({{g_p}} \right) - \overline{{G_{{\rm {ratio}}}}\left({{g_c}} \right)} } \right){2^p}} \tag{16}
\end{equation*}
Second, a two-dimensional Gaussian weighting window is used to compute the edge strength map (ESM) of the SAR image, where the window is defined as
\begin{equation*}
{\rm {GW}}(x,y) = \frac{1}{{\sqrt{2\pi } {\sigma _x}\sqrt{2\pi } {\sigma _y}}}\exp \left({ - \left({\frac{{{x^2}}}{{2\sigma _x^2}} + \frac{{{y^2}}}{{2\sigma _y^2}}} \right)} \right). \tag{17}
\end{equation*}
In addition to the ESM, the Gaussian window also yields the edge direction map and the edge map of the SAR image. Finally, an improved superpixel generation strategy based on normalized cuts (Ncuts) is adopted, which uses distance metrics that consider both spatial proximity and feature similarity. In this strategy, the BUCT team approximates the similarity using a positive semidefinite kernel function instead of traditional feature-based algorithms. The optimal clustering can be obtained by combining weighted K-means with the Ncuts objective, thereby effectively reducing the computational cost. The weighted local K-means clustering function is denoted as
\begin{equation*}
{\Phi _{{\rm {K}} - {\rm {means}}}} = \sum \limits _{k = 1}^K {\sum \limits _{u \in \omega (k)} w } (u){\left\Vert {\Psi (u) - {m_k}} \right\Vert ^2}. \tag{18}
\end{equation*}
where $\omega(k)$ denotes the set of pixels assigned to cluster $k$, $w(u)$ is the weight of pixel $u$, $\Psi(u)$ is its feature vector, and $m_k$ is the cluster center. The Ncuts objective is given by
\begin{equation*}
{\Phi _{{\rm {Ncuts}}}} = \frac{1}{K}\sum \nolimits _{k = 1}^K {\frac{{\sum \nolimits _{u \in \omega (k)} {\sum \nolimits _{v \in \omega (k)} W } (u,v)}}{{\sum \nolimits _{u \in \omega (k)} {\sum \nolimits _{v \in V} W } (u,v)}}}. \tag{19}
\end{equation*}
The pairwise similarity between pixels $u$ and $v$ combines feature similarity and spatial proximity as
\begin{equation*}
\hat{W}(u,v) = {\hat{W}_f}(u,v) + {\beta _{{\rm {adp}}}} \cdot {\hat{W}_s}(u,v). \tag{20}
\end{equation*}
The variation coefficient is used to learn the tradeoff factor between spatial proximity and feature similarity during linear feature clustering, which helps to adaptively adjust the shape and scale of superpixels according to image uniformity. The coefficient of variation is calculated as follows:
\begin{equation*}
{\beta _{{\rm {adp}}}} = 1 - \frac{1}{2}\left[ {{\rm {CoV}}\left({{x_u},{y_u}} \right) + {\mathop {\rm CoV}\nolimits } \left({{x_v},{y_v}} \right)} \right]. \tag{21}
\end{equation*}
The superpixel generation method used here has the following characteristics.
The structure of the image can be maintained well because of edge information and Ncuts strategy.
The method is not sensitive to the coherent speckle noise.
The method has higher computational efficiency.
The shape and compactness of the superpixels can adapt to the complexity of the image.
D. Results and Discussion
In this competition, the BUCT team uses U-Net, D-LinkNet, and DeepLab V3+ for pixel-level semantic segmentation and applies a CRF as postprocessing after the U-Net and D-LinkNet networks, denoted as U-Net+CRF and D-LinkNet+CRF, respectively.
In terms of data augmentation, the BUCT team applies operations such as rotation, cropping, and stitching to the original images, enhancing the sensitivity of the model to image edges. The specific transformations are shown in Fig. 19. Fig. 19(a) is the image A-10 in the training dataset, and Fig. 19(b) is the corresponding colored label map. Fig. 19(c)–(g) shows the images obtained by rotating A-10 at different angles; Fig. 19(h)–(l) shows the images formed after cropping and then stitching. Performing the same operations on all the original images yields the augmented dataset. Fig. 20 shows grayscale images of the four polarization channels of image A-10. Adding these images to the training can enhance the model's sensitivity to edges and improve the overall accuracy. However, during training, it is found that the model does not perform well enough on the edges of rivers and on small objects.
Image augmentation results of the proposed method. (a) Original image. (b) Label image. (c)–(g) Images obtained by rotating the original image at different angles. (h)–(l) Images obtained by cropping and stitching the original image.
The results of the different methods are shown in Table VIII. From the table, for a single network, DeepLab V3+ has good feature extraction performance. The BUCT team attempted to use a CRF as postprocessing, but the accuracy did not improve because the CRF overlooked some small objects. It is obvious that the performance of the models improves after data augmentation, reflecting the importance of the amount of data. To obtain higher accuracy, the BUCT team tried to parallelize dual networks in DeepLab V3+. However, the accuracy was still slightly lower than that of DeepLab V3+ alone, and the inference time was also longer.
Conclusion
The development of earth observation programs and accessible high-resolution data can provide abundant information about the earth and promote various applications. Due to the insufficient amount of annotated data and the complex backgrounds, the automated interpretation of such data remains a great challenge. Therefore, highly advanced techniques are urgently needed.
To enhance the academic development in this field, the 2020 Gaofen Challenge focuses on the automated high-resolution earth observation image interpretation for optical and SAR images. More than 10 000 images from Gaofen-2 and Gaofen-3 satellites are annotated for this challenge. Complex background, various scales, and fine-grained types make the 2020 Gaofen Challenge more difficult.
The 2020 Gaofen Challenge is arranged in six tracks according to different application requirements. Tracks 1–3 aim to promote the development of object detection and recognition in optical and SAR images. Tracks 4–6 focus on semantic segmentation in optical and SAR images.
The 2020 Gaofen Challenge has attracted 701 teams from 253 affiliations with 2023 competitors. The competitors come from more than 20 countries, including China, England, Germany, France, Japan, Australia, Singapore, India, Sweden, etc. All winners use deep-learning-based methods for image interpretation.
Although many excellent algorithms have emerged in the challenge, the exploration of earth observation technology must continue. After the challenge, the datasets remain accessible for further research.
In the future, we will continue to promote this event and hope it can help the earth observation community develop deep-learning-based methods. We will be dedicated to improving the professional level of the Gaofen Challenge. For the data, we will continue to build larger scale high-resolution multisource datasets and enhance the quality of the annotations. After the challenge, we will provide a repository to share datasets and codes for competitors. For the tracks in the challenge, we will set up more tracks combined with practical applications in the field of remote sensing. For the competitors, we will encourage more foreign scholars to participate in the competition to make it more international. Moreover, we will improve the evaluation system to obtain more authoritative and fair results. With the improvement of the Gaofen Challenge, we hope that more and more scholars from all over the world will participate in the challenge.
ACKNOWLEDGMENT
The authors would like to thank the IEEE Geoscience and Remote Sensing Society for the support, especially to Prof. P. Gamba, Prof. Jun Li, and the Image Analysis and Data Fusion Technical Committee for their valuable comments. They would also like to thank the International Society for Photogrammetry and Remote Sensing, especially to Prof. C. Toth and Prof. S. Hinz for their great support.