Introduction
As sensor and aerospace remote sensing technology have improved, the quality and quantity of remote sensing images have increased greatly. Researchers can now conveniently acquire remote sensing images with high spatial or spectral resolution. These images improve researchers' ability to understand image content, especially its semantic meaning (the high-level features of remote sensing images). Semantic analysis is a difficult and important part of remote sensing analysis, and object detection is the basic task it requires. Consequently, the problem of object detection has attracted considerable research attention and has been extensively studied. The difference between object detection and object localization is subtle: object detection focuses on detecting the presence of entire objects, whereas object localization has the stricter requirement that objects be located accurately. In this paper, we propose an accurate object localization framework for remote sensing images.
Currently, accurate localization has received little attention in the remote sensing field; the majority of studies focus on detection rather than localization (and the two processes are sometimes confused). Object detection in remote sensing images faces far more challenges than in natural images because of the more complex background information such images contain. Remote sensing images offer information about the texture, shape, and structure of ground objects, and they can be used for precise object identification. However, in addition to providing ample information for object detection, they also present information-redundancy problems. Moreover, because of noise interference, weather, illumination intensity, and other factors, object detection in remote sensing images remains a difficult problem.
In this paper, we focus on accurate localization of detected objects rather than simple object detection; for this reason, we describe this work as object localization. We tackle the feature-extraction problem for object detection in remote sensing images using convolutional neural network (CNN) models. A CNN relies on its specific layer structure to learn the essential features of input images, thus avoiding the effort of designing a feature-extraction strategy, and CNN models have a wide range of applications. CNNs with deeper layer structures tend to have better learning abilities; accordingly, our feature-extraction strategy is based on deep CNN models that can describe objects in remote sensing images. Finally, we propose a new object localization framework for remote sensing images that can detect and locate objects accurately.
The rest of this paper is organized as follows. Section II reviews related works on object detection and applications of CNN in remote sensing images. Section III presents the details of the proposed object localization framework. Section IV discusses the object detection experiments for the proposed object localization framework, and Section V analyzes the detection performance of the proposed object localization framework. Finally, Section VI concludes this paper.
Related Work
Object detection in remote sensing images has been widely researched in recent years. Many researchers have used local features, such as the scale-invariant feature transform (SIFT) [1], histograms of oriented gradients (HOG) [2], and saliency [3], [4]. These local features have a certain invariance, but this invariance must be enhanced to handle the varied object orientations in remote sensing images. Cheng et al. [5] introduced a rotation-invariant layer on top of existing CNN architectures to achieve rotation invariance, and the proposed rotation-invariant CNN model achieves significantly improved performance. Xiao et al. [6] used the elliptic Fourier transform (EFT) to improve the invariance of HOG features. Whereas local features are low-level image features, object detection is part of high-level semantic analysis, which is more closely aligned with actually understanding image content; image understanding is the ultimate goal that image processing researchers constantly pursue. Cheng and Han [7] reviewed recent progress in object detection in remote sensing images and proposed two promising research directions, namely, deep learning-based feature representation and weakly supervised learning-based geospatial object detection.
As image processing theory has developed, many studies have focused on the mid-level features of remote sensing images. The most popular mid-level feature is the part-based model [8]–[12]. The main idea behind part-based models is that objects consist of several visually important parts; therefore, the object detection task can be decomposed into processes that detect these parts. To acquire the semantic information of images, some studies have applied semantic models to extract semantic information from remote sensing images [13]–[15].
Recently, deep learning models have received increased attention, the most popular being CNN models. A CNN does not need handcrafted features, and it requires fewer parameters than other networks because it shares weights within each filter. A CNN model can learn the essential features of input images based on its specific network structure, and CNNs have been widely used in object classification, object detection, speech recognition, and so on. Zhong [16] proposed a large-patch CNN for scene classification of high-spatial-resolution imagery, which contains a large patch sampling layer used to generate hundreds of possible scene patches for feature learning.
Many theoretical studies concerning CNNs have been conducted [17]–[29]. As computer technology has advanced, deeper and more efficient CNN models have been proposed. AlexNet, developed by Krizhevsky et al. [30], was a groundbreaking CNN architecture and the winning model in the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). GoogleNet [31] was a winner of ILSVRC-2014; the main hallmark of this architecture was its improved utilization of the computing resources inside the network, achieved through a carefully crafted design that allowed the depth and width of the network to increase while keeping the computational budget constant. The VGG models proposed by Simonyan and Zisserman [32] were used to investigate the relationship between the depth of a convolutional network and its accuracy in a large-scale image recognition setting. SPP-net [33] can generate a fixed-length representation regardless of image size or scale, thus eliminating the requirement for a fixed-size input image. ResNet [34] was a winner of ILSVRC-2015 and COCO-2015; its layers are reformulated as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions, to ease the training of networks substantially deeper than those used previously.
In the field of remote sensing, many object detection studies use CNN models pretrained on large data sets [35]–[39]. Wu et al. [40] altered the typical CNN proposed by Lecun et al. [41] to create a model for aircraft detection, obtaining the candidate object regions using the BING technique. Zhang et al. [38] used trained CNN models to extract surrounding features that were combined with local features (HOG) to describe oil tanks and then applied gradient orientation to select candidate regions from satellite images. Jiang et al. [42] used graph-based superpixel segmentation to extract a set of image patches and then trained a CNN to classify these patches into vehicles and nonvehicles. Zhu et al. [43] used CNN features from combined layers to perform orientation-robust aerial object detection. Ding et al. [44] investigated the capabilities of a CNN model combined with data augmentation operations in SAR target recognition. Sevo and Avramovic [45] proposed a novel two-stage approach for CNN training and implemented a network-based method for automatic content-based object detection in high-resolution aerial images. Salberg [37] extracted features from a pretrained deep CNN and used them for automatic detection of seals in aerial remote sensing images. Zhang et al. [46] constructed an iterative weakly supervised learning framework to automatically mine and augment the training data set from the original image, combining a candidate region proposal network and a localization network to extract proposals and locate aircraft in large-scale very high resolution (VHR) images.
In this paper, we use a suitable feature extractor based on the CNN model to extract the essential features of objects from remote sensing images. The method used to obtain the candidate object regions is crucial for object detection. The common sliding window method performs an exhaustive search, which is time-consuming; moreover, the window can have only one size at a time, so the localization precision of the sliding window technique is low. Therefore, we propose a new object localization framework to address the localization problem.
Proposed Framework
The proposed object localization framework follows a pipeline approach. First, when dealing with a test image, we use a selective search algorithm to generate category-independent candidate regions. Then, all these candidate regions are sent to a combined model consisting of two dimension-reduction CNNs; the class label and classification score for each candidate region are obtained by averaging the outputs of the two CNN models. Finally, we perform an accurate object localization process on these classified regions: we propose the unsupervised score-based bounding box regression (USB-BBR) method to improve box localization precision after applying nonmaximum suppression (NMS). In this section, we present our design for each procedure. The proposed object localization framework is shown in Fig. 1.
Proposed object localization framework. (a) Test image. (b) Selective search method produces most of the candidate object regions from the test image and adds some candidate regions extracted from low region density areas, where the selective search method generates few regions. (c) CNN is applied to extract the features from these candidate regions. (d) Classification results of the regions. We used two approaches to obtain the classification results of candidate regions: one is a single model strategy and the other is a model combination strategy that averages the outputs of two CNN models. (e) Classification results of the candidate regions. (f) Final detection results after the accurate object localization process.
A. Region Proposal
Traditionally, a sliding window technique has been used for object detection; however, the sliding window technique is an exhaustive search method and is computationally expensive. Recently, Uijlings et al. [47] proposed a selective search algorithm that produces object regions by taking the underlying image structure into account. The selective search algorithm yields a completely class-independent set of locations. It also generates fewer locations, which simplifies the problem because the sample variability is lower. More importantly, it frees up computational power, which can then be utilized for more robust machine learning techniques and more powerful appearance models.
Objects can occur at any scale within an image because of the diverse means of acquiring images. Moreover, images at the same scale may have different sizes. Therefore, we collect images that share approximately the same scale, with the aim of acquiring all the similar objects at one scale; we can then apply the selective search algorithm to handle objects of different sizes within an image. Furthermore, the boundaries of some objects are less clear than those of others. Selective search applies a diverse set of strategies to deal with different object sizes, lighting conditions, and other imaging cases. These strategies make the selective search method stable, robust, and independent of the object class.
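As a concrete illustration, a minimal sketch of generating proposals with selective search is given below. It assumes the OpenCV contrib implementation (opencv-contrib-python) rather than the original code of Uijlings et al. [47]; the cap of 1500 regions mirrors the setting used in our detection experiments.

```python
import cv2  # requires opencv-contrib-python for the ximgproc module

def propose_regions(image_path, max_regions=1500):
    """Generate class-independent candidate regions with selective search."""
    img = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()  # trades some proposal quality for speed
    rects = ss.process()              # each rect is (x, y, w, h)
    return rects[:max_regions]
```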
The quality of region proposals directly influences the CNN detection result and the accuracy of object localization. Consequently, we analyzed the influence of the region proposals generated by selective search on our data set. Hosang et al. [48] provided a deep analysis of ten different object-proposal methods and found that the recall of the selective search method is approximately 0.83 at an intersection-over-union (IoU) of 0.5 on the ImageNet 2013 validation set when there are 1000 proposals per image. Their experiments also indicate that the greater the number of candidates, the higher the recall. While the recall value may differ for remote sensing images, in any image classification task it is a key factor that influences both the CNN detection result and the accuracy of object localization.
B. Feature Extraction
A CNN model consists of convolution layers, pooling layers, and fully connected layers. A convolution layer has several filters and generates different feature maps by applying these filters to local receptive fields in the maps of the previous layer or the input. The filter size can vary from layer to layer; small square filters (e.g., 3×3 or 5×5) are typical.
In this paper, the CNN models chosen to extract features are AlexNet and GoogleNet, owing to their superior performance. To retain more information for backpropagation, we add a 64-D inner-product layer before the last inner-product layer of both networks: for AlexNet, we reduce the dimension of the second fully connected layer from 4096 to 64, and for GoogleNet, we add a 64-D layer after the last convolutional layer. Moreover, the two CNNs are combined to detect objects simultaneously, and the result of the combined model is the average of the outputs of the two CNN models.
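The model combination itself is straightforward; the sketch below (our illustration, not the authors' code) averages the per-class softmax outputs of the two networks.

```python
import numpy as np

def combined_scores(probs_alexnet, probs_googlenet):
    """Average the per-class softmax outputs of the two CNN models.

    Both inputs have shape (num_regions, num_classes); the combined label
    for each region is the argmax of the mean scores.
    """
    probs = (np.asarray(probs_alexnet) + np.asarray(probs_googlenet)) / 2.0
    return probs.argmax(axis=1), probs.max(axis=1)
```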
To perform feature extraction, we first extract image patches from the candidate regions generated by the region generation process. Then, we normalize the image patches to the fixed input size required by each network.
C. Accurate Object Localization
To produce the optimum bounding box for locating each object, we propose a two-stage accurate object localization method: NMS followed by USB-BBR. The NMS method is widely used to tackle the redundancy of bounding boxes, but it does not solve the accurate localization problem, because it lacks the ability to perform an integrated optimization of the remaining higher-quality boxes. Our experiments show that NMS retains extra boxes that do not locate the object accurately. In view of this situation, we propose the USB-BBR method, which optimizes the bounding boxes.
Given all the scored regions in an image, we apply a greedy NMS algorithm (independently for each class) that rejects a region when its IoU with a higher-scoring selected region exceeds a given threshold.
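For reference, the standard greedy NMS procedure we apply can be sketched as follows (the 0.5 IoU threshold here is illustrative; the threshold is a tunable parameter).

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much,
    and repeat on the remainder.

    boxes: (N, 4) float array of [x1, y1, x2, y2]; scores: (N,) array.
    Returns the indices of the retained boxes.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```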
The NMS method is mainly used to eliminate region overlap; however, it may leave several regions for one object, and some of those regions may have little overlap with the ground truth, which reduces detection precision and recall. We want to replace these regions with one optimal bounding box to enhance the localization precision of object detection, and we use the USB-BBR method to reduce the localization errors. All the scored regions are allocated into different groups, where each group corresponds to one object to be detected. For a set of scored candidate regions, the grouping is performed greedily in descending score order (Algorithm 1).
Through this grouping process, each object to be detected is associated with a group of scored regions. Our goal is to regress the regions of each group $I_{k}$ into one optimal bounding box, formulated as \begin{equation} L(I_{k}) = \operatorname{argmin} \sum_{i=1}^{n} u_{i} c_{i}^{T} c_{i} \end{equation}
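The symbols $u_{i}$ and $c_{i}$ are not defined in the surviving text, so the following is our reading rather than the authors' derivation: if $u_{i}$ is a score-derived weight for the $i$th region in a group and $c_{i} = b - b_{i}$ is the offset between the sought box $b$ and that region's box $b_{i}$, then (1) is a weighted least-squares problem with the closed-form solution \begin{equation*} \min_{b} \sum_{i=1}^{n} u_{i}\,\lVert b - b_{i} \rVert^{2} \;\Longrightarrow\; b^{*} = \frac{\sum_{i=1}^{n} u_{i} b_{i}}{\sum_{i=1}^{n} u_{i}} \end{equation*} that is, the optimal box is the score-weighted average of the grouped boxes.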
The first-iteration regression result of each group is then used as the input for the next iteration of grouping and regression.
Algorithm 1 USB-BBR
Input: the full set of classified regions R with their classification scores
Set: the region grouping set G = ∅
Set: the overlap ratio threshold t to its initial value
while the grouping has not converged do
    sort the elements of R in descending score order
    while R is not empty do
        get the area of each region in R
        get the first (highest-scoring) element r of R
        remove r from R and start a new group g = {r}
        get the remaining elements of R
        for each remaining region r_i in R do
            get the overlap area between r and r_i
            if the overlap ratio of r and r_i exceeds t then
                remove r_i from R and append it to g
            end if
        end for
        add g to G
    end while
    update the elements of R with the regressed box of each group in G
end while
obtain the final region grouping set G
for each group g_k in G do
    append the regressed optimal bounding box of g_k to the detection result
end for
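The Python sketch below mirrors the structure of Algorithm 1 as we reconstruct it: greedy grouping of score-sorted boxes by overlap, followed by a per-group regression. It is our illustration, not the authors' code; the grouping runs for a single pass, the 0.3 overlap threshold is illustrative, and the regression uses the score-weighted average suggested by our reading of (1) above.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def usb_bbr(boxes, scores, overlap_threshold=0.3):
    """Group scored boxes by overlap, then regress each group to one box."""
    remaining = list(np.argsort(scores)[::-1])  # highest score first
    final_boxes = []
    while remaining:
        seed = remaining.pop(0)                 # highest-scoring box left
        group, rest = [seed], []
        for j in remaining:
            if box_iou(boxes[seed], boxes[j]) >= overlap_threshold:
                group.append(j)                 # same object as the seed
            else:
                rest.append(j)
        remaining = rest
        # Score-weighted average of the group's coordinates; see (1).
        w = scores[group] / scores[group].sum()
        final_boxes.append((w[:, None] * boxes[group]).sum(axis=0))
    return np.array(final_boxes)
```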
Fig. 2 shows the USB-BBR process. After sorting the classified regions in descending score order, the algorithm groups overlapping regions and regresses each group into a single optimal bounding box.
Experiments and Results
A. Data Set
Because of the lack of public data sets intended for object detection in remote sensing images, we collected 2326 images downloaded from Google Earth and Tianditu [49]. We labeled the objects in these images with four categories: oil tank, aircraft, overpass, and playground. The image resolution for each class is listed in Table I. Both panchromatic and multispectral sensors are represented because of the various sources of the image data. This diversity makes the data set a comprehensive performance challenge.
Using a CNN differs from using other machine learning methods in that CNNs need ample training samples to obtain good learning abilities. In addition to the size of the training data set, its quality is also critical: a poor-quality training data set, even one containing a large amount of data, impairs the learning ability of a CNN. To address the object diversity in remote sensing images and the difficulty of collecting data for some object classes, we augmented all the positive samples with translation, scale, and rotation transforms. The details of these transforms are listed in Table II. The rotation transform typically uses 90°, 180°, and 270° rotations; however, when there are few instances of a class, the rotation angle decreases to increase the number of rotations, enlarging the data set.
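A minimal sketch of such augmentation, written with OpenCV and illustrative parameter values (the actual angles, scales, and shifts are those in Table II), is shown below.

```python
import cv2
import numpy as np

def augment(patch, angles=(90, 180, 270), scales=(0.9, 1.1), shift=4):
    """Generate augmented copies of a positive sample by rotation,
    scaling, and translation; parameter values here are illustrative."""
    h, w = patch.shape[:2]
    out = []
    for a in angles:  # rotation about the patch center
        M = cv2.getRotationMatrix2D((w / 2, h / 2), a, 1.0)
        out.append(cv2.warpAffine(patch, M, (w, h)))
    for s in scales:  # isotropic rescaling
        out.append(cv2.resize(patch, None, fx=s, fy=s))
    for dx, dy in [(shift, 0), (-shift, 0), (0, shift), (0, -shift)]:
        M = np.float32([[1, 0, dx], [0, 1, dy]])  # translation
        out.append(cv2.warpAffine(patch, M, (w, h)))
    return out
```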
The data set includes two parts: positive samples and negative samples. The collection of positive samples likewise has two parts. First, we randomly selected 40% of the ground-truth boxes of the collected images as positive samples for each class. For oil tanks and aircraft, the translation and scale transforms were used to enlarge the number of samples. For overpasses and playgrounds, in addition to these two transforms, the rotation transform was used for data augmentation.
As the second source of positive samples, we applied the selective search method to produce candidate regions from each image and then computed the IoU between each region and the ground-truth box. When the IoU with a ground-truth region chosen during the first collection process was greater than or equal to 0.5, the region became a positive sample; when the IoU was less than 0.3, the region became a negative sample. This second source of positive samples was used to enhance the adaptability of the CNN models: because selective search generates almost all the candidate regions, and these regions do not necessarily surround the entire object, adding positive samples produced by selective search keeps the training process consistent with the detection process. The number of regions from this second source is large; however, there were fewer overpasses than examples of the other classes, which is why the overpass samples required the rotation transform for data augmentation. The negative samples were also obtained through the selective search algorithm by computing the IoU for each class.
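The labeling rule for the selective search regions can be summarized in a few lines. This sketch reuses the box_iou helper defined earlier and treats regions falling between the two thresholds as ambiguous (our inference: such regions are simply not used for training).

```python
def label_candidate(region, gt_boxes, pos_iou=0.5, neg_iou=0.3):
    """Assign a selective search region to the positive or negative set."""
    best = max((box_iou(region, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_iou:
        return "positive"
    if best < neg_iou:
        return "negative"
    return None  # ambiguous overlap: excluded from training
```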
The entire data set was divided into a training data set and a validation data set at a ratio of 5:1 (training to validation). The number of positive samples for each class is 12000, of which 2000 are validation samples; the number of negative samples for each class is 48000, of which 8000 are validation samples. The ratio of positive samples drawn from the first and second sources is 5:1, with the samples from the second source used to enhance the adaptability of the CNN models. All the CNN data set samples were resized to the input sizes required by the networks.
Fig. 3 shows positive and negative training samples. From top to bottom, these samples show an oil tank, aircraft, overpass, and playground.
We evaluated the object detection performance on the test data set. The sizes of the test images ranged from
B. Experiment Procedures
The experiment included two procedures, as shown in Fig. 4: the training process and the detection process. The training process used the GPU and the Compute Unified Device Architecture (CUDA) to improve speed; its end products are the trained CNN models. The detection process was used to detect objects in test images. It has three main tasks: generate the candidate regions from each test image, extract the features, and obtain the classification results for the candidate regions using the trained CNN models. At the end of this process, the localization precision of the classified regions is enhanced by applying the accurate localization process, which includes both NMS and USB-BBR, as described in the preceding section.
We tested three types of models in this paper: retrained models (AlexNet, GoogleNet, and AlexNet + GoogleNet), fine-tuned models (AlexNet-finetune, GoogleNet-finetune, and AlexNet-finetune + GoogleNet-finetune), and dimension-reduction models (AlexNet-DR, GoogleNet-DR, and AlexNet-DR + GoogleNet-DR). There are two ways to initialize network weights in the training process: random initialization with small values, and fine-tuned initialization using CNN models trained on a large data set. The training process was performed using the open-source Caffe framework [50].
We used small patches to train our CNN models through the backpropagation algorithm, employing the GPU and CUDA. We set the learning rate to 0.01 and the batch size to 256 for AlexNet and 32 for GoogleNet. Moreover, tricks such as local response normalization, momentum, overlapping pooling, and dropout were used in these networks to improve their properties. The arguments for CNN training are listed in Table V. The training process was run on a RedHat Linux server with an Nvidia GTX Titan X GPU with 12 GB of RAM.
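A minimal sketch of how such a training run might be driven through Caffe's Python interface is given below; the file names are placeholders, and the solver file would carry the Table V settings (e.g., base_lr: 0.01).

```python
import caffe

caffe.set_mode_gpu()   # training used CUDA on a GTX Titan X
caffe.set_device(0)

# "solver.prototxt" and the weights file below are placeholder names.
solver = caffe.SGDSolver('solver.prototxt')

# Fine-tuning: copy weights pretrained on ImageNet before solving.
solver.net.copy_from('bvlc_alexnet.caffemodel')
solver.solve()
```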
During the detection process, we ran the selective search method on the test images to extract approximately 1500 proposed regions. Next, we warped the candidate regions and passed them through the trained CNN model in a forward propagation to obtain object features and, finally, classification results. Given all the scored regions in an image, we applied the accurate object localization process to improve the detection performance. First, greedy NMS was applied (for each class independently) to reject regions that have an IoU overlap greater than a given threshold with a higher-scoring selected region; this greatly decreases the number of overlapping boxes. Second, the USB-BBR method was applied to reduce localization errors and obtain the final detection results.
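Putting the pieces together, the detection process can be sketched as follows. The net_forward callable is a placeholder for the CNN step: it is assumed to warp each proposal to the network input size and return per-region boxes, scores, and class labels for the non-background regions.

```python
def detect(image_path, net_forward, nms_iou=0.5, usb_threshold=0.3):
    """Region proposal -> CNN classification -> NMS -> USB-BBR."""
    rects = propose_regions(image_path)      # selective search proposals
    boxes, scores, labels = net_forward(image_path, rects)
    results = {}
    for cls in set(labels):                  # post-process each class independently
        idx = [i for i, l in enumerate(labels) if l == cls]
        b, s = boxes[idx], scores[idx]
        keep = greedy_nms(b, s, nms_iou)     # drop overlapping boxes
        results[cls] = usb_bbr(b[keep], s[keep], usb_threshold)
    return results
```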
The arguments for the USB-BBR method are the initial overlap ratio threshold
We used recall and precision to evaluate the detection results, obtaining the ground-truth areas by manual annotation. Recall is the number of detected objects divided by the total number of actual objects (ground truth), while precision indicates the fraction of detected objects that are correct. We use the widely used IoU criterion to evaluate the results of our object localization framework: if the IoU is greater than or equal to the threshold, the region is a true positive; otherwise, it is a false positive. The IoU threshold for oil tanks, aircraft, and playgrounds is 0.5; for overpasses, it is 0.4 because of the uncertainty of the manually labeled ground truth.
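The evaluation protocol can be made concrete with a short sketch (ours, reusing the box_iou helper from above): each ground-truth box may be matched at most once, and unmatched detections count as false positives.

```python
def evaluate(detections, ground_truth, iou_threshold=0.5):
    """Recall and precision for one class on one image.

    A detection is a true positive if it overlaps an unmatched
    ground-truth box with IoU >= iou_threshold (0.4 for overpasses).
    """
    matched = [False] * len(ground_truth)
    tp = 0
    for det in detections:
        for k, gt in enumerate(ground_truth):
            if not matched[k] and box_iou(det, gt) >= iou_threshold:
                matched[k] = True
                tp += 1
                break
    recall = tp / len(ground_truth) if len(ground_truth) else 0.0
    precision = tp / len(detections) if len(detections) else 0.0
    return recall, precision
```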
C. Experiment Results
The baseline values for object detection based on our framework are listed in Table VII. These results are from the AlexNet-DR and GoogleNet-DR model combination. Compared with the retrained and fine-tuned models, the dimension-reduction model performs better. Moreover, the results indicate that fine-tuned weight initialization improves both detection recall and precision and that the model combination yields better detection performance than a single model. In terms of computing cost, the running time of the combination model is approximately twice that of a single model. The size of test image we used is
The overall performance of the feature extraction method used in this paper was compared with that of two other methods, namely, the local binary pattern histogram Fourier feature (LBP-HF) [51] and EFT-HOG [6]. LBP-HF combines the discrete Fourier transform and the LBP to obtain a rotation-invariant feature. The arguments for LBP-HF are the radius and the number of sampling points, which determine the circular region used; in our experiment, the radius was set to 3 and the number of sampling points to 24. EFT-HOG uses a circular HOG (C-HOG) feature rather than the typical rectangular HOG (R-HOG) feature, because the C-HOG feature has better rotational invariance. To further strengthen the rotation invariance of the C-HOG descriptors, the authors mapped the features to the Cartesian coordinate system and then performed an EFT.
There are several optional EFT-HOG arguments: the numbers of annuli, cells, and bins, the angle, and the number of adopted elliptic Fourier coefficients divided by the total number of coefficients. These are denoted as
After extracting features from the training data set, the feature descriptors from LBP-HF and EFT-HOG were used to train a support vector machine (SVM) classifier. The kernel function for SVM training is linear, and fivefold cross-validation was used to avoid overfitting. The trained SVM classifier can then classify the candidate object regions, which are the same as those used for the proposed CNN-based method. The classified candidate regions also underwent the accurate object localization process to obtain the final detection result. The performance comparisons between LBP-HF, EFT-HOG, and the CNN-based method are listed in Table IX; the CNN-based method performs better than the handcrafted features. Fig. 5 shows a portion of the comparison of the detection results of the CNN, LBP-HF, and EFT-HOG methods. From top to bottom are the results of detecting the oil tank, aircraft, overpass, and playground classes; Fig. 5(a) shows the detection results of the CNN method, Fig. 5(b) those of the LBP-HF method, and Fig. 5(c) those of the EFT-HOG method. As the results show, the LBP-HF method suffers from many false and missed detections. The EFT-HOG method achieves better detection performance than LBP-HF but worse than the CNN method.
Comparison of the detection results for the CNN, LBP-HF, and EFT-HOG methods. From left to right, the results are for oil tanks, aircraft, overpasses, and playgrounds. (a) Detection results of the CNN method. (b) Detection results of the LBP-HF method. (c) Detection results of the EFT-HOG method.
The detection performance of the proposed CNN-based localization framework is significantly better than that of the LBP-HF and EFT-HOG detection frameworks. The latter two feature extraction methods enhance invariance through human design but still perform worse than the CNN. These results also indicate that a CNN can learn better-quality features than those designed by humans. A CNN needs only to be trained on the data set to obtain intrinsic features; it does not require human intervention and can conveniently extract features when the quality and quantity of the data set meet the needs of the employed CNN models. This drastically reduces the work and difficulty involved in feature extraction.
Fig. 6 shows examples of detection failures of the proposed localization framework. The yellow boxes denote false negatives; the red boxes are positives detected by the proposed framework; FP denotes a false positive. For the oil tanks, many false negatives have low gray values, which may be difficult to distinguish from the background. As for aircraft, objects of other classes that appear similar to aircraft may be mistakenly classified as aircraft. In analyzing the failed detections for overpasses, we found that overpass regions could not easily be determined to be correct, because the manually labeled ground truth was too subjective: if a classified positive region has a low IoU value, that region is evaluated as a false positive. The playground detection results show the same situation; other objects that have similar color compositions or shapes may also be misclassified as playgrounds.
Examples of detection failures for the proposed localization framework. The yellow boxes denote false negatives, and the red boxes are positives detected by the proposed framework. FP: false positive.
We found that the recall and precision for overpass detection were noticeably lower than those for the other three classes, but this result is not caused by the CNN's inability to learn overpass features. Instead, the lower detection result for overpasses is caused by the manually labeled ground truth: overpass regions are subjective, so it is difficult to ensure objectively correct regions for them. The detection evaluation index in our experiment is strict and consequently yields poor detection results for overpasses. When we relaxed the detection evaluation index slightly, the detection performance for overpasses increased.
Table X shows the detection results for overpasses using this altered evaluation index. Fig. 7 shows a portion of the detection results with different evaluation index values: Fig. 7(a) with IoU = 0.5, Fig. 7(b) with IoU = 0.4, and Fig. 7(c) with IoU = 0.3. For each result, the blue box is the ground truth, while the green and red boxes are false positives and true positives, respectively, as judged by the evaluation index. Regions evaluated as false positives at IoU = 0.5 are judged as true positives by the other two evaluation indexes, indicating that choosing a less rigorous evaluation index increases the recall and precision for overpass detection. In our main experiment, we used the stricter evaluation index to measure the performance of the proposed localization framework. The experimental results demonstrate that the proposed localization framework achieves better detection performance than the compared methods and has good localization precision.
Detection results for overpasses using different evaluation indexes. (a) Detection result when IoU = 0.5. (b) Detection result when IoU = 0.4. (c) Detection result when IoU = 0.3. For each result, the blue box shows the ground truth, the green box a false positive, and the red box a true positive, as judged by the evaluation index.
Analysis
In this section, we analyze the effects of the proposed framework on detection performance. For the proposed object localization framework, we used a large data set to train the CNN models and applied model fine-tuning to initialize the weights. We also used a model combination method when classifying candidate regions and the accurate object localization process to enhance the localization precision. We now systematically analyze the performance effects of these strategies.
A. Size of Training Data Set
It is known that CNN models need ample training data to learn the essential features of a task. However, creating a large number of labeled examples involves a great deal of effort and time, and some classes have few samples available for collection, so in many cases the training data set is insufficient. Still, the training data set is the decisive factor in the performance of a CNN: if either its quantity or its quality is poor, the CNN model will have poor learning ability even when a good network architecture is adopted. Therefore, we first analyze the effects of different training data set sizes on detection performance.
In our detection experiments, we used four object classes: oil tank, aircraft, overpass, and playground. We chose the oil tank and playground classes specifically to analyze the effects of different training data set sizes on detection performance. The numbers of positive samples of oil tanks and playgrounds in the different-sized training data sets are 500, 1000, 3000, 5000, and 11000. The number of negative samples in each training data set was four times the number of positive samples, and the ratio of training samples to validation samples was 5:1. We trained the GoogleNet CNN model on these different-sized data sets, with both random-value and fine-tuning weight initialization. All these CNN models were then applied to extract features and obtain classification results, and we compared the recall and precision of each model's detection results to analyze the effects of training data set size.
Tables XI and XII show the oil tank and playground detection results, respectively, for models trained on data sets of different sizes. These results indicate that CNN models require ample samples to achieve good learning ability. However, there is a peak point at which recall or precision reaches its highest value; when the data set grows beyond that point, recall or precision decreases.
B. Feature Extraction and Classification
The common method for obtaining class labels of candidate regions is to perform forward propagation in the CNN model. In our experiment, in addition to this method, we also used a model combination to obtain the class labels of candidate regions. The model combination is used to reduce the false detection rate by using a feature combination from two different CNN models. The model combination is similar to performing forward propagations twice, and it obtains the classification result from the outputs of two CNN models.
In our experiments, we tested three types of models: a retrained model, a fine-tuned model, and a dimension-reduction model, as shown in Tables XIII–XV. These results show that weight initialization plays a crucial role in the learning ability of CNN models and that the model combination strategy can improve precision to a certain extent. The detection results of the fine-tuned model are listed in Table XIV. An analysis of the detection performance in Tables XIII–XV shows that fine-tuning yields better detection results, increasing recall and precision to a certain extent; that is, the fine-tuned model improves the learning performance of the CNN models. In addition, the detection precision of a combination model is higher than that of a single model.
Beyond the fine-tuning strategy and the model combination strategy, we also explored three dimension-reduction models: AlexNet-DR, GoogleNet-DR, and AlexNet-DR + GoogleNet-DR, all of which add a 64-D inner-product layer before the last inner-product layer. The parameters are initialized from models pretrained on the ImageNet data set. When applying the fine-tuning method, the learning rate is also decreased by a factor of 0.1. The results of the dimension-reduction models are listed in Table XV.
C. Region Proposal and Unsupervised Score-Based Bounding Box Regression
In the first stage, we use the selective search method to obtain a smaller quantity of high-quality object regions. The performance of the region proposal stage is profoundly affected by the number of object locations and by the IoU criterion. To evaluate the performance of the region proposals on our test data set, we took all object locations generated by selective search into account and used an IoU of 0.4 for overpasses and 0.5 for the other three classes to calculate recall. The results are listed in Table XVI. As the table shows, the recall of the region proposals gradually decreases through the pipeline while the precision increases, indicating that the CNN detection and our localization process improve the accuracy of the object locations. The decrease in recall is expected, because after CNN detection and NMS, fewer boxes are predicted as objects of interest. When USB-BBR is applied, the precision improves because the region locations are optimized and the number of bounding boxes, particularly boxes on false positive regions, decreases over the entire detection framework.
In addition to the region proposals, we also tested the recall of the boxes detected by the GoogleNet-finetune model and by the GoogleNet-finetune model with NMS. We obtained the classification results of the candidate regions through CNN forward propagation; these results contain many overlapping regions. The NMS method was used to eliminate smaller regions whose IoU with a higher-scoring region exceeds a threshold, which considerably decreases the number of overlapping regions. However, even after NMS, the results still did not meet our needs, because each object was often covered by several regions.
We hoped to obtain an optimal bounding box to locate each object more precisely. To deal with the problem of several regions corresponding to one object, we applied the USB-BBR algorithm after NMS. The detection results of the GoogleNet-finetune model with the USB-BBR method are also listed in Table XVI. Compared with the result of the CNN with NMS alone, the CNN with USB-BBR greatly increases the localization precision, because the process optimizes several regions into one region; consequently, the location accuracy is much higher, especially for overpasses and playgrounds. As Table XVI shows, the recalls of the detection results with USB-BBR are lower than those of the GoogleNet-finetune model without USB-BBR; this is expected, because recall and precision constrain each other. The numbers of detected regions with and without USB-BBR for the GoogleNet-finetune model are listed in Table XVII, where the average number of regions for each class is the total region number divided by the total number of test objects. The average numbers of overpass and playground regions after USB-BBR are greatly decreased, which explains the reduced detection recall values for these two classes.
Fig. 8 shows a portion of the comparative detection results with and without the USB-BBR method. The first column shows the detection results without the bounding box regression method; the second column shows the detection results with USB-BBR. As Fig. 8 shows, USB-BBR yields better localization precision; therefore, the USB-BBR method can increase the localization precision of detection.
Comparative detection results both with and without USB-BBR. The first column shows the detection results without USB-BBR. The second column shows the detection results with USB-BBR.
Conclusion
In this paper, we proposed a CNN-based object localization framework for remote sensing images. The framework uses CNN models to extract object features and obtain classification results. In the first stage, we use a selective search method to generate the major part of the candidate object regions. In the second stage, we design a dimension-reduction model, using trained models to initialize the network weights, and then use it to extract features and classify objects into different categories; we also tested a retrained model and a fine-tuned model. In the third stage, we propose the new USB-BBR algorithm as part of the accurate object localization process to obtain better localization precision, using NMS to decrease the number of overlapping regions; the USB-BBR method helps to obtain an optimal bounding box for each group of classified regions. In addition, we investigated the influence of different training data set sizes, weight initialization methods, and model combinations on detection performance. These results can help guide other researchers in obtaining good results. The experiments indicate that the proposed localization framework is both simple and robust. In future work, we will continue to enhance this framework and improve its detection and localization performance.