Introduction
Small unmanned aerial vehicles (SUAVs) have become a research hotspot owing to their potential to revolutionize commercial industries, the public domain, and the military [1]–[6]. However, because of their portability and maneuverability, SUAVs can carry dangerous items (e.g., explosives and firearms), posing a serious threat to public security.
The detection and monitoring of SUAV targets is the basic prerequisite for defense against such attacks [2]. Therefore, effective monitoring of SUAV targets is urgently needed. In particular, there is a huge demand for a reliable SUAV early-warning and monitoring system to protect high-value targets. Real-time detection of SUAV targets is the key technology in SUAV surveillance systems [7]–[10].
Compared with radar imaging and visible-light imaging, infrared SUAV target detection is a feasible technical path with extra benefits, e.g., strong anti-interference, long detection range, and all-weather functionality [11]–[14]. To meet the requirements of applications such as fixed-area security monitoring, an infrared SUAV target detection algorithm must deliver both high processing speed and high detection accuracy; the trade-off between the two is the main problem in these algorithms.
When hovering in a complex background, the infrared signature of an SUAV target suffers strong noise interference and a reduced signal-to-noise ratio, which makes target detection extremely difficult. Owing to their powerful feature extraction and learning abilities, deep convolutional neural networks (CNNs) [15]–[18] can extract features from complex images and represent them hierarchically. Therefore, CNN-based methods outperform traditional target-detection algorithms in accurately detecting targets under deformation, occlusion, blur, and multi-scale changes in complex backgrounds, as proven in the ImageNet Challenges since 2012 [15]. The existing mature CNN-based target-detection methods can be divided into two groups: high-precision double-stage methods (R-CNN [19], Fast R-CNN [20], and Faster R-CNN [21]) and high-speed single-stage methods (SSD [22], Yolo [23], Yolo 9000 [24], and Yolo v3 [25]). Although double-stage detection methods [19]–[21] excel in detection accuracy (especially the regression accuracy of the bounding box), their two-stage architecture based on Regions-of-Interest (RoI) extraction and bounding-box refinement limits their speed, so real-time requirements cannot be met. Single-stage detection methods [22]–[25] benefit from a single-step prediction strategy that yields rapid detection, but their accuracy is slightly inferior. Neither group solves the trade-off between processing speed and detection accuracy.
In addition, because of the small size of the feature maps used for detection, these popular detectors share a common problem: poor detection precision for small targets, which limits the overall accuracy. In the existing literature on machine-vision tasks such as image classification, semantic segmentation, and face recognition, multi-scale methods have been used to address the small-target problem [26]–[31]. For example, Saxena et al. proposed a fabric that embeds an exponentially large number of architectures and performs well in image classification on the MNIST and CIFAR10 datasets [26]. Huang et al. proposed multi-scale dense networks for resource-efficient image classification; to facilitate high-quality classification early on, they use a two-dimensional multi-scale network architecture that maintains coarse- and fine-level features throughout the network [27]. Cascading a super-resolution module is an interesting way to increase target signature resolution, proven effective by Zhang et al. [28]; however, as an additional pixel-level algorithm, cascading it with the detector greatly reduces operating speed because of the excessive computation. In target detection, the baseline methods Faster R-CNN [21] and Yolo v3 [25] use feature maps of different resolutions to identify objects of various sizes; in particular, the feature pyramid network used in Faster R-CNN contains three sets of feature maps of different resolutions, on which anchor boxes of different sizes and quantities are paved to detect objects. In contrast to these general solutions for small targets, we focus on infrared SUAV target detection for anti-UAV systems. Because we target this specific application, prior information can be used to design the network structure and training method, achieving better performance than the baseline methods on this task, which has not been exploited in the literature. Anti-UAV is an emerging field, and we address it with infrared sensors and deep-learning methods. The purpose of this paper is to verify the feasibility of this technical path and to provide a practicable model and training method for reference by other researchers.
To achieve a balance between detection accuracy and computational efficiency, we propose a single-stage detection network that combines a densely paved pre-defined box prediction strategy with the geometric characteristics of the targets, which determine the scales of the pre-defined boxes. Specifically, at the top level, multi-scale prediction is performed by densely and laterally connecting shallow-layer features with deep-layer features to improve the sensitivity to small targets.
On the other hand, the training of the detector is a key factor that significantly affects performance. As data-driven algorithms, CNN-based methods learn features from large-scale datasets automatically. Benefiting from large-scale publicly available datasets such as ImageNet, VOC, and COCO, CNN-based methods achieve impressive performance on visible images. However, for infrared images, only a few large-scale public datasets are available. Furthermore, in infrared target-detection tasks, the location and category of each target must be manually labeled, which leads to larger workloads and higher costs. Therefore, when applied to infrared SUAV surveillance systems, deep CNNs have to be trained on a small dataset, which may reduce generalization. In this case, we use data augmentation and data balancing to enhance generalization and adopt a weighted augmentation approach. Specifically, differences in the geometric characteristics of the targets are used to balance the distribution of the training data, which is proven effective by experiments.
The contributions of this paper can be summarized as follows: (1) We developed an SUAV surveillance system using an infrared sensor and a deep learning-based real-time detector to protect high-value objects, which has not been previously exploited. As no training and test datasets are publicly available, we built our own infrared SUAV target dataset as the benchmark. (2) We balanced accuracy and speed in SUAV detection and proposed a real-time SUAV target detector in which lateral connections based on multi-scale feature fusion and densely arranged pre-defined boxes respectively improve the detection sensitivity to small targets and the accuracy of location prediction. (3) To solve the poor generalization caused by insufficient and unbalanced training samples, we explored the impact of data quantity and proportionality on small-sample training of the SUAV target detector and proposed a weighted augmentation method to achieve data balancing. The experimental results show that this approach improves the robustness and average accuracy of the algorithm.
Method of Multi-Scale Feature Fusion of Deep Residual Networks
A. Design of the Network Structure
To achieve a balance between the detection speed and precision, and improve the detection accuracy of small targets, we propose a deep residual network-based single-stage detector with multi-scale feature fusion and sliding pre-defined box searching. The network structure is shown in Fig.1.
The network can be divided into four parts: (1) input layer, (2) feature extraction and fusion module, (3) detection module, and (4) output layer (loss layer). Each detection proceeds in four steps. First, feature extraction is performed on the input image using the residual network ResNet-50 [32] with batch normalization (BN) [33] and random dropout [15], which alleviate gradient diffusion and allow the network to go deeper. Second, three sets of feature maps of different sizes are merged into three scales via dense lateral connections; these fusion features of different granularities improve the detection sensitivity to small-scale targets. Third, a single-stage multi-scale detection method based on densely arranged pre-defined boxes is adopted to achieve fast detection. Fourth, we use independent flow paths with convolutional layers to simultaneously predict the class label and regress the location offsets. In the training stage, we calculate the loss function in the loss layer and use backpropagation to train the model.
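For concreteness, the following is a minimal PyTorch sketch of this four-part pipeline. The module layout, variable names, and head design are our illustrative assumptions rather than the authors' released code; the dense lateral fusion of Section II-B is stubbed out here and sketched separately below.

```python
# Minimal sketch of the four-part pipeline (illustrative, not the authors' code).
import torch.nn as nn
import torchvision

class SUAVDetector(nn.Module):
    def __init__(self, num_boxes=5):
        super().__init__()
        # (1)-(2) Backbone: a ResNet-50 trunk as the feature extractor
        # (BN is built in; dropout and the extra residual block are omitted).
        backbone = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.res_blocks = nn.ModuleList([backbone.layer1, backbone.layer2,
                                         backbone.layer3, backbone.layer4])
        # (3) Two flow paths: one scores each pre-defined box (target vs.
        # background), the other regresses its (x, y, w, h) offsets.
        self.cls_head = nn.Conv2d(2048, num_boxes * 2, kernel_size=3, padding=1)
        self.loc_head = nn.Conv2d(2048, num_boxes * 4, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.stem(x)
        features = []
        for block in self.res_blocks:
            x = block(x)
            features.append(x)  # keep multi-scale maps for lateral fusion
        # Dense lateral fusion (Section II-B) would combine the last three
        # maps here; this stub applies the heads to the deepest map only.
        scores = self.cls_head(features[-1])
        offsets = self.loc_head(features[-1])
        return scores, offsets
```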
B. Multi-Scale Feature Map Fusion Via Dense Lateral Connections
In deep neural networks, shallower layers can extract local features and contextual information because of their smaller receptive fields, whereas deeper layers with larger receptive fields learn more abstract semantic information. As in Faster R-CNN [21], if only the semantic features extracted by the last feature layer are used in detection, targets with larger imaging areas achieve superior test results. However, these deeper features are less sensitive to the size, position, and orientation of the targets, leading to poor performance in small-target detection. Moreover, whereas the size of the input image is $512\times512$ pixels, an SUAV target typically occupies only a small region of the frame.
Fig.2 shows the distributed heat map of the target bounding-box sizes in all training samples. The sizes of the targets are relatively fixed and mainly concentrated in a narrow range.
In order to improve the sensitivity to small targets, we applied a detection algorithm based on multi-scale feature fusion via lateral connections to fully combine the context features and semantic features in the detection stage, which is shown in Fig.3.
In the vertically linked feature extraction network, we extracted the feature maps output by the last three residual blocks, Res-4, Res-5, and Res-6, with pixel resolutions of $32\times32$, $16\times16$, and $8\times8$, respectively.
These three resolutions are used as the scale benchmarks of the merged feature maps. The deepest feature map, $Feature_{1}$ ($8\times8$), is used directly as the first scale: \begin{equation*} Scale_{I}=Feature_{1}.\tag{1}\end{equation*}
The $8\times8$ map $Feature_{1}$ is then upsampled to $16\times16$ and fused with $Feature_{2}$ by a weighted average to form the second scale: \begin{align*} Feature_{1}^{\prime }=&Upsampling_{8 \times 8}^{16 \times 16}\left ({Feature_{1}}\right), \tag{2}\\ Scale_{II}=&\frac { Feature_{1}^{\prime }+\alpha Feature _{2}}{\alpha +1}.\tag{3}\end{align*}
Finally, we upsampled $Feature_{1}$ and $Feature_{2}$ to $32\times32$ and fused them with $Feature_{3}$ to form the third scale: \begin{align*} Feature_{1}^{\prime \prime }=&Upsampling_{8 \times 8}^{32 \times 32}(Feature_{1}), \tag{4}\\ Feature_{2}^{\prime }=&Upsampling_{16 \times 16}^{32 \times 32}(Feature_{2}), \tag{5}\\ Scale_{III}=&\frac {Feature_{1}^{\prime \prime }+\beta Feature_{2}^{\prime }+\gamma Feature_{3}}{\beta +\gamma +1}.\tag{6}\end{align*}
The weighting coefficients $\alpha$, $\beta$, and $\gamma$ control the relative contributions of the feature maps in the fusion.
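Eqs. (1)–(6) reduce to upsampling the coarser maps and taking weighted averages. The following is a minimal sketch of that fusion, assuming the three maps already share a channel count (in practice a 1×1 convolution would typically align channels); the function name and default weights are illustrative assumptions:

```python
# Sketch of the weighted fusion in Eqs. (1)-(6), assuming the three maps
# share a channel count; alpha/beta/gamma values are placeholders.
import torch.nn.functional as F

def fuse_scales(f1, f2, f3, alpha=1.0, beta=1.0, gamma=1.0):
    """f1: 8x8 (deepest), f2: 16x16, f3: 32x32 maps, each (N, C, H, W)."""
    scale_1 = f1                                                  # Eq. (1)
    f1_16 = F.interpolate(f1, size=f2.shape[-2:], mode="nearest")  # Eq. (2)
    scale_2 = (f1_16 + alpha * f2) / (alpha + 1)                   # Eq. (3)
    f1_32 = F.interpolate(f1, size=f3.shape[-2:], mode="nearest")  # Eq. (4)
    f2_32 = F.interpolate(f2, size=f3.shape[-2:], mode="nearest")  # Eq. (5)
    scale_3 = (f1_32 + beta * f2_32 + gamma * f3) / (beta + gamma + 1)  # Eq. (6)
    return scale_1, scale_2, scale_3
```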
This operation enriches fine-grained features with high-level semantic features, reflecting the relationship between high-level semantics and low-level details. The ablative studies presented in Section IV verify the superiority of this approach in SUAV target detection.
Faster R-CNN [21] and Yolo v3 [25] use feature maps of different resolutions to identify objects of various sizes (e.g., the feature pyramid network in Faster R-CNN [21]). Unlike these baseline detectors, the proposed method is a dedicated detector for infrared SUAV targets, so large-scale object detection is not required. However, in infrared SUAV detection, we believe that small-sized feature maps (e.g., the $8\times8$ output of Res-6) remain valuable: although too coarse to localize small targets directly, their high-level semantic features improve detection when fused into the larger-scale maps via the lateral connections.
C. Single-Stage Prediction Based on Densely Paved Pre-Defined Boxes
For classic double-stage methods such as Faster R-CNN [21], although the region proposal network (RPN) [21] contributes greatly to the improvement of detection accuracy, further improvement of computational efficiency has been a bottleneck. During the evolution of R-CNN [19]–[21], the performance and operating speed of RoI extraction improved continuously, yet the speed remains far behind that of single-stage detection methods (e.g., SSD [22] and YOLO [23]–[25]).
Therefore, based on the single-stage prediction, we combined the idea of anchor box prediction with the results of the target statistical analysis and designed a sliding window-based candidate region search structure as shown in Fig.4.
Fig. 4. An illustration of the sliding-window-based search over densely paved pre-defined boxes.
As opposed to Faster R-CNN [21] and Yolo v3 [25], which scan anchor boxes at different strides over the image, we predict multiple candidate regions simultaneously at each pixel of every fusion feature map; for the finest $32\times32$ fusion map, this corresponds to a sliding stride of 16 pixels on the $512\times512$ input image.
The dimensions of these N pre-defined boxes correspond to the statistical results of the target geometric features in the training data so that the best match between the predicted box and the ground truth can be achieved (discussed in detail in Sections IV-A and IV-B). For the prediction boxes generated according to these rules, we used two flow paths, each with two convolutional layers, to simultaneously predict the class label and regress the location. The top K results with the highest scores were selected as outputs. Finally, we obtained the positions and confidence of the targets after non-maximum suppression of the three-scale prediction results. In this stage, several suitable pre-defined candidate boxes were used instead of the offset fine-tuning of the first stage of double-stage detectors; i.e., the additional intermediate layers for proposal location regression were removed (two convolutional layers in Faster R-CNN [21]). Specifically, as a dedicated SUAV detector, the prior information of the target (geometric features) is used when selecting feature maps and paving pre-defined boxes. This operation replaces the region proposal of double-stage detection; experimental results show that, because the densely paved pre-defined boxes have a higher IoU with the ground truth, our method achieves accuracy similar to that of a double-stage detector using only one positional regression, while the speed is greatly improved.
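As a sketch of the paving scheme (not the authors' code), the following generates N pre-defined boxes centered at every cell of a fusion map; the $512\times512$ input size and 16-pixel stride follow the description above, while the five (w, h) shapes are placeholders standing in for the cluster-derived sizes of Section IV:

```python
# Sketch of densely paved pre-defined boxes: N boxes centered on every
# pixel of a fusion feature map, i.e., every 16 input pixels for the
# 32x32 map on a 512x512 frame. Box shapes below are dummy values.
import numpy as np

def pave_boxes(feat_size, stride, box_shapes):
    """Return (feat_size * feat_size * N, 4) boxes as (cx, cy, w, h)."""
    boxes = []
    for iy in range(feat_size):
        for ix in range(feat_size):
            cx, cy = (ix + 0.5) * stride, (iy + 0.5) * stride
            for (w, h) in box_shapes:
                boxes.append((cx, cy, w, h))
    return np.array(boxes, dtype=np.float32)

# e.g., five cluster-derived (w, h) shapes paved on the 32x32 map:
shapes = [(12, 8), (20, 10), (28, 14), (40, 18), (56, 24)]  # placeholders
anchors = pave_boxes(feat_size=32, stride=16, box_shapes=shapes)
```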
D. Loss Function
The loss function consists of two parts, the classification loss $L_{cls}$ and the localization loss $L_{loc}$: \begin{equation*} L(x, y, w, h, p)=L_{cls}+L_{loc}.\tag{7}\end{equation*}
The classification loss is the cross-entropy over the predicted class probabilities: \begin{equation*} L_{cls}\left ({p_{i}, y_{i}}\right)=-\sum _{i} \ln p_{i}^{y_{i}},\tag{8}\end{equation*} where $p_{i}^{y_{i}}$ denotes the predicted probability that sample $i$ belongs to its ground-truth class $y_{i}$.
The localization loss adopts the smooth $L_{1}$ function applied to the offset differences: \begin{align*} L_{loc}\left ({t_{i}, t_{i}^{*}}\right)=&\sum _{i} \sum _{s \in \{x, y, w, h\}} L_{1}\left ({{t}_{i}^{s}-{t}_{i}^{s^{*}}}\right), \tag{9}\\ L_{1}({x})=&\begin{cases}{0.5 {x}^{2},} & {|{x}| \leq 1} \\ {|{x}|-0.5,} & {\text {else},}\end{cases}\tag{10}\end{align*}
where $t_{i}$ and $t_{i}^{*}$ are the predicted and ground-truth offset vectors: \begin{align*} t_{i}=&\left ({t_{i}^{x}, t_{i}^{y}, t_{i}^{w}, t_{i}^{h}}\right), \tag{11}\\ t_{i}^{*}=&\left ({t_{i}^{x^{*}}, t_{i}^{y^{*}}, t_{i}^{w^{*}}, t_{i}^{h^{*}}}\right).\tag{12}\end{align*}
The specific definitions of $t_{i}$ and $t_{i}^{*}$ are \begin{align*} \begin{cases}t_{i}^{x}=\dfrac {x-x_{a}}{w_{a}} \\ t_{i}^{y}=\dfrac {y-y_{a}}{h_{a}} \\ t_{i}^{w}=\log \left ({\dfrac {w}{w_{a}}}\right) \\ t_{i}^{h}=\log \left ({\dfrac {h}{h_{a}}}\right), \end{cases} \tag{13}\\ \begin{cases}t_{i}^{x^{*}}=\dfrac {x^{*}-x_{a}}{w_{a}} \\ t_{i}^{y^{*}}=\dfrac {y^{*}-y_{a}}{h_{a}} \\ t_{i}^{w^{*}}=\log \left ({\dfrac {w^{*}}{w_{a}}}\right) \\ t_{i}^{h^{*}}=\log \left ({\dfrac {h^{*}}{h_{a}}}\right),\end{cases}\tag{14}\end{align*} where $(x, y, w, h)$ and $(x^{*}, y^{*}, w^{*}, h^{*})$ denote the predicted box and the ground-truth box, respectively, and $(x_{a}, y_{a}, w_{a}, h_{a})$ denotes the matched pre-defined box.
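The parameterization in Eqs. (13)–(14) and the smooth-$L_1$ loss of Eqs. (9)–(10) are standard; a compact NumPy rendering for reference (function names are ours):

```python
# Reference implementation of the offset encoding (Eqs. (13)-(14)) and
# the smooth-L1 localization loss (Eqs. (9)-(10)).
import numpy as np

def encode_offsets(box, anchor):
    """box, anchor: (x, y, w, h). Returns t = (tx, ty, tw, th)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def smooth_l1(x):
    """Eq. (10): quadratic near zero, linear elsewhere."""
    x = np.abs(x)
    return np.where(x <= 1.0, 0.5 * x ** 2, x - 0.5)

def loc_loss(pred_boxes, gt_boxes, anchors):
    """Eq. (9): smooth-L1 summed over the four offset components."""
    total = 0.0
    for p, g, a in zip(pred_boxes, gt_boxes, anchors):
        total += smooth_l1(encode_offsets(p, a) - encode_offsets(g, a)).sum()
    return total
```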
Dataset Analysis and Weighted Augmentation Method
A. Dataset
The dataset used in this work contains 3800 manually annotated long-wave infrared images of SUAV targets. Each training sample contains 1–4 SUAV targets, and the training set contains 5138 targets in total.
This dataset contains various flight scenarios, e.g., hovering attitude, slow cruising, and high-speed maneuvering. The distance from the targets to the infrared detector ranges from 20 m to 800 m, and the probe pitch angle varies over the range of ±45°. The experimental scene includes a variety of backgrounds (e.g., buildings and towers, mountains, clouds, and sky), and different temperature and weather conditions. The temperature ranges between 12°C and 32°C. Some typical images of the dataset and the corresponding positional annotation are shown in Fig.5.
Fig. 5. Infrared SUAV target dataset. (a) Long-wave infrared images against sky, cloud, and mountain backgrounds. (b) Long-wave infrared images at different elevation angles. (c) Long-wave infrared images of an urban environment under different detection distances and weather conditions.
To optimize the network parameters, we analyze the geometric features of the labeled dataset. The geometric distribution of SUAV targets is shown in Fig.6.
Fig. 6. Histograms of the SUAV target geometric features. (a) Aspect-ratio histogram of the targets. (b) Distribution of the long side $W$.
The validation set used in this work contains two parts. Validation Set 1 (VS-1) is a surveillance video in five different scenes: buildings, iron tower, woodland, mountains, and clouds and sky. The total length of the surveillance video is 218 s, including 5450 frames, and there is only one target in each frame. Validation Set 2 (VS-2) is a manually annotated image set in the same five different scenes, which contains 1200 labeled frames.
B. Weighted Augmentation Method
The training images containing infrared SUAV targets are difficult to obtain in large quantities; therefore, data augmentation was adopted in this study to enhance the generalization of the detector. As shown in Fig.7, the distribution of the geometric features of the targets in the training set is not uniform. As the data balance of the training set has a great influence on the CNN-based model, a weighted augmentation method is proposed here.
For each target, we calculated an augmentation weight $W_{i}$ from the histogram statistics of its geometric features: \begin{equation*} W_{i}\left ({n_{i}^{w}, n_{i}^{\frac {w}{h}}}\right)= \frac {N\cdot (K+1)}{2}\cdot \left ({\frac {1}{L_{w} \cdot n_{i}^{w}}+\frac {1}{L_{\frac {w}{h}}\cdot n_{i}^{\frac {w}{h}}}}\right)-1,\tag{15}\end{equation*} where $n_{i}^{w}$ and $n_{i}^{w/h}$ are the populations of the long-side and aspect-ratio histogram bins into which target $i$ falls.
Generally, the weight is inversely proportional to how frequently a target's geometry appears in the training set, so targets in sparsely populated bins are amplified more times during augmentation.
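Under that reading of Eq. (15), a sketch of the weight computation might look as follows; the bin counts, the interpretation of $L_{w}$ and $L_{w/h}$ as the numbers of histogram bins, and the default value of $K$ are our assumptions:

```python
# Sketch of the weighting in Eq. (15), under the assumption that n_w and
# n_ratio are the populations of the long-side and aspect-ratio histogram
# bins a target falls into, so rare geometries get large weights.
import numpy as np

def target_weights(long_sides, ratios, n_bins_w=20, n_bins_r=20, K=3):
    N = len(long_sides)
    hist_w, edges_w = np.histogram(long_sides, bins=n_bins_w)
    hist_r, edges_r = np.histogram(ratios, bins=n_bins_r)
    # bin index of each target (clip keeps the max value in the last bin)
    idx_w = np.clip(np.digitize(long_sides, edges_w) - 1, 0, n_bins_w - 1)
    idx_r = np.clip(np.digitize(ratios, edges_r) - 1, 0, n_bins_r - 1)
    n_w, n_r = hist_w[idx_w], hist_r[idx_r]
    # Eq. (15): inverse-frequency weight scaled by the dataset size N and
    # the augmentation ratio K (symbol meanings as we read them).
    return N * (K + 1) / 2 * (1 / (n_bins_w * n_w) + 1 / (n_bins_r * n_r)) - 1
```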
Experiments
In this section, we present the results of the experiments conducted to demonstrate the effectiveness of the proposed method. First, we conducted the ablative studies to verify the proposed approach and optimize the detection model. Second, we optimized the network using the geometric analysis of the targets, data balancing, and data augmentation. Finally, we analyzed the detection results and compared them with the results of the state-of-the-art methods.
A. Framework Optimization
1) Ablation Studies on Multi-Scale Feature Map Fusion
In order to explore the effect of fusion feature maps of different scales on the detection of SUAV targets, an ablation study was conducted on the collocation of the feature maps of different scales. We adopt the Average Precision (AP) to evaluate the performance of the proposed method. AP is the area under the precision–recall curve, which shows how precision changes with recall and reflects the overall performance. Precision is the ratio of true positives (TPs) to the total number of detected targets, whereas recall is the ratio of TPs to the total number of ground-truth targets: \begin{align*} {Precision}=&\frac {\textit {TP}}{\textit {TP}+\textit {FP}}, \tag{16}\\ {Recall}=&\frac {\textit {TP}}{\textit {TP}+\textit {FN}},\tag{17}\end{align*} where FP and FN denote false positives and false negatives, respectively.
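For reference, AP as the area under the precision–recall curve can be computed from confidence-ranked detections as in the following sketch (step integration; evaluation details such as the IoU matching rule are omitted):

```python
# Sketch of AP from Eqs. (16)-(17): sort detections by confidence,
# accumulate TP/FP, and integrate precision over recall stepwise.
import numpy as np

def average_precision(confidences, is_true_positive, num_ground_truth):
    order = np.argsort(-np.asarray(confidences))
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    precision = tp_cum / (tp_cum + fp_cum)   # Eq. (16)
    recall = tp_cum / num_ground_truth       # Eq. (17)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)               # area under the P-R curve
        prev_r = r
    return ap
```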
In the experiment, the number of bounding boxes was uniformly set to 5. Experimental results are presented in Table 1.
First, to verify the effectiveness of the dense lateral connection approach, we set the detection on the original feature maps as a control (settings 2 and 3). The AP values of the models that use the fusion feature maps (settings 4 and 5) increased by 1.5 and 4.3, respectively.
Second, settings 1, 4, 5, 6, 7, 8, and 9 demonstrate the detection results with different numbers of fusion feature maps. Setting 9, which used all three scales of feature maps for detection, achieved the best results; combinations that include the large feature maps outperform those that rely only on the coarser ones, confirming the value of fine-grained features for small targets.
Third, setting 10 verified the effect of the lateral connection bypass on the model. The AP value was slightly improved compared to setting 9 without bypass, which indicates that high-level semantic features have a positive impact on the detection of small targets.
2) Parameter Selection of Pre-Defined Boxes
The matching degree between the sliding pre-defined boxes and the ground truth plays an important role in target detection; thus, we used the geometric analysis of the SUAV targets to select the sizes of the anchor boxes accurately. The intersection-over-union (IoU) reflects the matching degree and is defined as \begin{equation*} IoU=\frac {s_{i} \cap s_{i}^{*}}{s_{i} \cup s_{i}^{*}},\tag{18}\end{equation*} where $s_{i}$ and $s_{i}^{*}$ denote the areas of the predicted box and the ground-truth box, respectively.
We clustered the ground-truth box sizes with an IoU-based distance metric: \begin{equation*} D(box, center)=\frac {1}{IoU(box,center)}.\tag{19}\end{equation*}
The average IoU values from all ground truths to their closest cluster centers were compared for different numbers of cluster centers.
As the number of cluster centers increases, the average IoU to the closest center rises, but the gain gradually saturates; balancing matching quality against computation, we set the number of pre-defined boxes to 5.
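A sketch of this size clustering, assuming a k-means-style loop with the distance of Eq. (19) and shapes compared as co-centered rectangles, in the spirit of the dimension clusters of Yolo 9000 [24]:

```python
# Sketch of size clustering with the IoU distance of Eq. (19). Boxes are
# compared by shape only (as if sharing a center), so IoU reduces to an
# overlap of (w, h) rectangles; centers are refined k-means style.
import numpy as np

def shape_iou(wh, centers):
    inter = np.minimum(wh[0], centers[:, 0]) * np.minimum(wh[1], centers[:, 1])
    union = wh[0] * wh[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def cluster_box_sizes(wh_array, k=5, iters=100, seed=0):
    wh_array = np.asarray(wh_array, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centers = wh_array[rng.choice(len(wh_array), k, replace=False)]
    for _ in range(iters):
        # minimizing D = 1 / IoU is the same as maximizing IoU
        assign = np.array([np.argmax(shape_iou(wh, centers)) for wh in wh_array])
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh_array[assign == j].mean(axis=0)
    return centers
```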
B. Data Balancing and Augmentation
1) Training and Testing of Original Data
After optimizing the network parameters, we trained the model on the original data and used it as the baseline for the subsequent comparison experiments.
We processed 16 images per batch. If the IoU between a prior box and a ground-truth box was larger than 0.7, we marked that prior box as a positive sample; the total number of positive samples did not exceed 64. Negative samples were randomly selected among boxes whose IoU was less than 0.2, and the sum of positive and negative samples was 128. We used stochastic gradient descent to optimize the training. The initial learning rate was set to 0.001 and the number of iterations to 46,000; after 22,000 iterations, the learning rate was reduced tenfold.
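A minimal sketch of this positive/negative sampling rule (array shapes, function name, and the random generator are our assumptions):

```python
# Sketch of per-batch sample selection: boxes with IoU > 0.7 against a
# ground truth become positives (capped at 64); negatives are drawn from
# boxes with IoU < 0.2 to fill a 128-box training batch.
import numpy as np

def select_training_samples(max_ious, rng, max_pos=64, batch=128):
    """max_ious: per pre-defined box, its highest IoU over all ground truths."""
    pos = np.where(max_ious > 0.7)[0]
    if len(pos) > max_pos:
        pos = rng.choice(pos, max_pos, replace=False)
    neg_pool = np.where(max_ious < 0.2)[0]   # assumed large enough here
    neg = rng.choice(neg_pool, batch - len(pos), replace=False)
    return pos, neg

# e.g.: pos, neg = select_training_samples(ious, np.random.default_rng(0))
```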
In the testing stage, the detection results were filtered with a confidence threshold of 0.6, and non-maximum suppression was then applied to remove overlapping results with IoU values larger than 20%. We tested the baseline model on the two validation sets. The test results on VS-1 show that 4663 frames were detected correctly (confidence > 60%), and the detection rate was 82.95%. The test results on VS-2 show that the AP value reaches 49.8 and the number of false negatives is 167 when the output confidence threshold is set to 0.6.
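The test-time filtering amounts to confidence thresholding followed by greedy non-maximum suppression; a reference sketch with the thresholds stated above:

```python
# Sketch of test-time filtering: keep detections scoring above 0.6, then
# greedy non-maximum suppression with a 0.2 IoU overlap threshold.
import numpy as np

def box_iou(a, b):
    """a, b: (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def filter_detections(boxes, scores, conf_thr=0.6, iou_thr=0.2):
    scores = np.asarray(scores)
    order = [i for i in np.argsort(-scores) if scores[i] > conf_thr]
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep  # indices of retained detections
```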
2) Weighted Augmentation
By analyzing the detection results of the baseline, we discovered that SUAVs in large maneuvering flight are difficult to detect. There are three reasons for the missed detections: (1) High-speed maneuvering causes dynamic blur, leading to the loss of details. (2) Most missed targets are concentrated in the cool zone of the distributed heat map of target sizes; i.e., the training samples corresponding to such flight attitudes are insufficient. (3) In the training samples, there are few targets of the same size as the missed ones; the influence of those few samples is diluted during clustering, finally leading to unreasonable anchor boxes. To solve the sample-imbalance problem, the weighted augmentation method was applied to the original training data. Fig.8 shows the geometric distribution of the SUAV targets after data balancing. Compared with the original data shown in Fig.6, the statistical histograms of the geometric features of the targets in the training set are much smoother; in other words, the quantity of SUAV targets in various flight postures is more uniform in the training set.
Fig. 8. Histograms of the SUAV target geometric features after data balancing. (a) Aspect-ratio histogram of the targets. (b) Distribution of the long side.
To achieve a better overlap between the pre-defined boxes and the contours of SUAV targets under various flight postures, the criterion for setting the pre-defined box sizes was changed from clustering by target-size quantity in the training samples to clustering by target-size category. The distribution of the anchor boxes was more discrete after the adjustment, as shown in Fig.9.
Fig. 9. Anchor-box size distribution (the blue asterisks represent the scales of the targets in the training set, the green asterisks the anchor boxes before adjustment, and the red asterisks the anchor boxes after adjustment).
To verify the effectiveness of the data-balancing method based on weighted augmentation, to find an appropriate proportion of data augmentation, and to explore the effect of the pre-defined boxes on performance, ablation studies were conducted on VS-2.
First, we compared the weighted augmentation method with the traditional direct augmentation method. Table 3 shows the AP values and false negatives of the two methods at different augmentation ratios. Under all ratios, the weighted augmentation method outperforms the traditional one, increasing the AP value by 3.42%. When the output confidence threshold is 0.6, our method reduces the false negatives by 15.76%. Because traditional direct augmentation simply duplicates the data, it has no significant influence on the statistical distribution of the original training samples, and the amplified samples remain unbalanced.
Second, we used the adjusted pre-defined boxes to retrain the network on the augmented data at different ratios. The experimental results are shown in Table 4. On the one hand, 1:3 is the most appropriate augmentation ratio and larger ratios lead to performance degradation, which is consistent with the results in Table 3. On the other hand, compared with the original pre-defined boxes, the adjusted ones bring an average gain of 1.78 AP, and the false negatives decrease by 17.12%.
Based on this optimized configuration, we evaluated the model on VS-1. The final result showed that the target was detected in 4982 frames, a detection rate of 91.83%, which is 8.88% higher than that without data augmentation. The output confidence after data augmentation is shown in Fig.10(b). Compared with the model trained without balanced data, shown in Fig.10(a), the global average confidence increased by 3.6% and the false-negative rate (the highest confidence falling below the threshold) dropped by 27%.
Fig. 10. Comparison of experimental results before and after data balancing and augmentation. (a) Output confidence without data balancing (the orange line represents the average value, and the red line represents the output threshold). (b) Output confidence after data balancing.
Fig.11 shows the test results of the proposed method for several consecutive frames of VS-1. The red crosses mark the center coordinates of the targets, and the green bounding boxes show the output locations. These results show that tiny SUAV targets under large maneuvering conditions are successfully detected.
Based on these conclusions, the follow-up experiments adopt the same settings and the augmented training set.
C. Comparison and Evaluation
To evaluate the proposed method, we trained the state-of-the-art detectors with the same augmented training set and tested all methods on VS-2. The IoU threshold was set to 50% and 70%, respectively. The accuracy and efficiency of the methods were measured using the AP value and the average frames per second (FPS). The experiments were run on a workstation equipped with an NVIDIA GeForce TITAN Xp GPU and an Intel Core i7-6800K CPU.
The experimental results are presented in Table 5, which shows the comparison between the proposed method and other state-of-the-art detectors [21], [22], [24], [25], [32], [35]. AP-50 and AP-70 represent the AP values when the overlap threshold is 50% and 70%, respectively. The detection speed was evaluated using FPS and the average computation time (milliseconds) of each frame. Table 6 shows the AP-50 values of the object detection methods in five manually annotated video clips of different scene backgrounds on the validation set VS-2.
With respect to accuracy, the proposed method is close to Faster R-CNN (VGG backend) [21], [36] and slightly inferior to Retina-Net-101 [35]. However, the detection speed of the proposed method is comparable to those of the fastest single-stage detectors, Yolo v2 [24] and Yolo v3 [25], and faster than the others. The detection accuracy of Faster R-CNN [21] is not prominent because its sliding anchor boxes are defined in a relatively simple way that does not incorporate the statistical characteristics of the training samples; the IoU between its anchor boxes and the ground-truth boxes is therefore small, which limits performance. Retina-Net [35] with a ResNet-101 [32] backend has excellent detection accuracy, especially at an overlap threshold of 50%, but its detection speed is lower than that of the proposed method. As an excellent single-stage detector, SSD [22] has an advantage in computational efficiency but a disadvantage in detection accuracy for small targets because it uses smaller feature maps for detection. Yolo v3 [25] is currently the best single-stage detector, achieving faster detection speeds; however, its detection accuracy is lower than that of our method. Compared with Yolo v3, our method uses a deeper feature extraction network and, in the detection stage, larger-scale feature maps with a reasonable multi-scale feature fusion operation. Specifically, unlike in Yolo v3, low-resolution feature maps are upsampled and normalized to the large $32\times32$ size before fusion.
Fig.12 shows the AP-FPS curves achieved by various algorithms. The horizontal axis represents the average FPS, and the vertical axis represents the value of AP-50 and AP-70. In summary, the proposed method shows satisfactory performance both in accuracy and speed.
Fig. 12. AP-FPS achieved by our method and other state-of-the-art methods. (a) AP-50. (b) AP-70.
Discussion
Ablative studies were carried out on the collocation of feature maps of different scales and the quantity of pre-defined search boxes. Compared with the original feature maps, the models using the fused feature maps achieved clearly higher AP values, and the configuration that fuses all three scales performed best.
The experimental results of the weighted augmentation approach showed that, compared with the traditional direct augmentation method, the proposed method made the histograms of the target geometric features smoother after augmentation and the data distribution more uniform; an average gain of 4.79% was obtained at an augmentation ratio of 1:3. The experimental results on VS-1 showed that, compared with the model trained without balanced data, the global average confidence increased by 3.6% and the false-negative rate dropped by 27%.
To demonstrate the superiority of the proposed method, we trained the state-of-the-art CNN-based detectors with the same augmented training set and tested them on VS-2. In terms of detection accuracy, our method achieved the highest AP in two scenes (mountains and clouds) of the test set VS-2, ranked second in one scene (buildings), and third in the other two (iron tower and woodland). The average AP of our method is close to those of the double-stage detectors Faster R-CNN (VGG backend) [21], [36] and Retina-Net-101 [35] but superior to the single-stage detectors Yolo v3 [25], Yolo v2 [24], and SSD [22]. In terms of execution time, our method achieved 48 ms per detection, much faster than all the double-stage detectors and the single-stage detector SSD [22], reaching the same level as Yolo v3 [25]. In summary, the proposed method achieves a good balance between accuracy and speed while reaching an advanced level in both.
There are still aspects of the proposed method that can be improved. First, false alarms may occur when SUAV targets hover in front of complex backgrounds, which limits further performance gains. Second, infrared SUAV datasets labeled with locations and classes are labor-intensive to produce and difficult to obtain. Hence, in addition to data augmentation, unsupervised learning and transfer learning are worth exploring for further performance gains.
Conclusion
In this study, we explored a new SUAV surveillance system using an infrared sensor and a deep learning-based real-time detector, which has not been previously exploited. To solve the poor detection accuracy on small infrared SUAV targets, we proposed a multi-scale feature map fusion method via dense lateral connections. To meet the real-time requirements, we adopted single-stage prediction based on densely paved pre-defined boxes and improved the selection strategy of the sliding pre-defined boxes by incorporating the target geometric features in the training phase. Specifically, we explored the impact of data quantity and proportionality in small-sample training of the SUAV target detector and proposed a weighted augmentation method to achieve data balancing. Compared with the traditional augmentation method, the proposed method made the data distribution more uniform and improved the robustness of the model.
In summary, this paper devised a deep learning-based infrared SUAV detection system for protecting high-value objects. We established a complete set of training and test datasets and studied the design and training of the real-time SUAV detector in detail, successfully improving the SUAV surveillance system.