Introduction
Small unmanned aerial vehicles (SUAVs) have become a research hotspot owing to their potential to revolutionize commercial industries, the public domain, and the military [1]–[6]. However, because of their portability and maneuverability, SUAVs can carry dangerous items (e.g., explosives and firearms), posing a serious threat to public security.
The detection and monitoring of SUAV targets is the basic prerequisite for defense against such attacks [2]. Therefore, effective monitoring of SUAV targets is urgently needed. In particular, there is a huge demand for a reliable SUAV early-warning and monitoring system to protect high-value targets. Real-time detection of SUAV targets is the key technology in SUAV surveillance systems [7]–[10].
Compared with radar imaging and visible-light imaging, infrared SUAV target detection is a feasible technical path with extra benefits, e.g., strong anti-interference, long detection range, and all-weather functionality [11]–[14]. To meet the requirements of applications such as fixed-area security monitoring, an infrared SUAV target detection algorithm must deliver both high processing speed and high detection accuracy; the trade-off between the two is the main problem in these algorithms.
When hovering in a complex background, the infrared signature of an SUAV target suffers strong noise interference and a reduced signal-to-noise ratio, which makes target detection extremely difficult. Owing to their powerful feature extraction and learning abilities, deep convolutional neural networks (CNNs) [15]–[18] can extract features from complex images and represent them hierarchically. Therefore, CNN-based methods outperform traditional target-detection algorithms in accurately detecting targets under deformation, occlusion, blur, and multi-scale changes in complex backgrounds, as proven in the ImageNet Challenges since 2012 [15]. The existing mature CNN-based target-detection methods can be divided into two groups: high-precision double-stage methods (R-CNN [19], Fast R-CNN [20], and Faster R-CNN [21]) and high-speed single-stage methods (SSD [22], Yolo [23], Yolo 9000 [24], and Yolo v3 [25]). Although double-stage detection methods [19]–[21] excel in detection accuracy (especially the regression accuracy of the bounding box), their two-stage architecture based on Regions-of-Interest (RoI) extraction and bounding-box refinement limits their speed, so real-time requirements cannot be met. Single-stage detection methods [22]–[25] benefit from a single-step prediction strategy that yields rapid detection, but their accuracy is slightly inferior. Neither group solves the trade-off between processing speed and detection accuracy.
In addition, because of the small size of the feature maps used for detection, these popular detectors share a common problem: poor detection precision for small targets, which limits the overall accuracy. In the existing literature on machine-vision tasks such as image classification, semantic segmentation, and face recognition, multi-scale methods have been used to address the small-target problem [26]–[31]. For example, Saxena et al. proposed a fabric that embeds an exponentially large number of architectures and performs well in image classification on the MNIST and CIFAR10 datasets [26]. Huang et al. proposed multi-scale dense networks for resource-efficient image classification; to facilitate high-quality classification early on, they use a two-dimensional multi-scale network architecture that maintains coarse- and fine-level features throughout the network [27]. Cascading a super-resolution module is an interesting way to increase target signature resolution, proven effective by Zhang et al. [28]; however, as an additional pixel-level algorithm, cascading it with the detector greatly reduces operating speed because of the excessive computation. In target detection, the baseline methods Faster R-CNN [21] and Yolo v3 [25] use feature maps of different resolutions to identify objects of various sizes; in particular, the feature pyramid network used in Faster R-CNN contains three sets of feature maps of different resolutions, on which anchor boxes of different sizes and quantities are paved to detect objects. In contrast to these general solutions for small targets, we focus on infrared SUAV target detection for anti-UAV systems. Because we target this specific application, prior information can be used to design the network structure and training method, achieving better performance than the baseline methods on this task, which has not been exploited in the literature. Anti-UAV is an emerging field, and we address it with infrared sensors and deep-learning methods. The purpose of this paper is to verify the feasibility of this technical path and to provide a practicable model and training method for reference by other researchers.
To achieve a balance between detection accuracy and computational efficiency, we propose a single-stage detection network that combines a densely paved pre-defined box prediction strategy with the geometric characteristics of the targets, which determine the scales of the pre-defined boxes. Specifically, at the top level, multi-scale prediction is performed by densely and laterally connecting shallow-layer features with deep-layer features to improve the sensitivity to small targets.
On the other hand, the training of the detector is a key factor that significantly affects performance. As data-driven algorithms, CNN-based methods learn features from large-scale datasets automatically. Benefiting from large-scale publicly available datasets such as ImageNet, VOC, and COCO, CNN-based methods achieve impressive performance on visible images. However, for infrared images, only a few large-scale public datasets are available. Furthermore, in infrared target-detection tasks, the location and category of each target must be manually labeled, which leads to larger workloads and higher costs. Therefore, when applied to infrared SUAV surveillance systems, deep CNNs have to be trained on a small dataset, which may reduce generalization. In this case, we use data augmentation and data balancing to enhance generalization and adopt a weighted augmentation approach. Specifically, differences in the geometric characteristics of the targets are used to balance the distribution of the training data, which is proven effective by experiments.
The contributions of this paper can be summarized as follows: (1) We developed an SUAV surveillance system using an infrared sensor and a deep learning-based real-time detector to protect high-value objects, which has not been previously exploited. As no training and test datasets are publicly available, we built our own infrared SUAV target dataset as the benchmark. (2) We balanced accuracy and speed in SUAV detection and proposed a real-time SUAV target detector in which lateral connections based on multi-scale feature fusion and densely arranged pre-defined boxes respectively improve the detection sensitivity to small targets and the accuracy of location prediction. (3) To solve the poor generalization caused by insufficient and unbalanced training samples, we explored the impact of data quantity and proportionality on small-sample training of the SUAV target detector and proposed a weighted augmentation method to achieve data balancing. The experimental results show that this approach improves the robustness and average accuracy of the algorithm.
Method of Multi-Scale Feature Fusion of Deep Residual Networks
A. Design of the Network Structure
To achieve a balance between the detection speed and precision, and improve the detection accuracy of small targets, we propose a deep residual network-based single-stage detector with multi-scale feature fusion and sliding pre-defined box searching. The network structure is shown in Fig.1.
The network can be divided into four parts: (1) input layer, (2) feature extraction and fusion module, (3) detection module, and (4) output layer (loss layer). Each detection proceeds in four steps. First, feature extraction is performed on the input image using the residual network ResNet-50 [32] with batch normalization (BN) [33] and random dropout [15], which alleviate gradient diffusion and allow the network to go deeper. Second, three sets of feature maps of different sizes are merged into three scales via dense lateral connections; these fusion features of different granularities improve the detection sensitivity to small-scale targets. Third, a single-stage multi-scale detection method based on densely arranged pre-defined boxes is adopted to achieve fast detection. Fourth, we use independent flow paths with convolutional layers to simultaneously predict the class label and regress the location offsets. In the training stage, we calculate the loss function in the loss layer and use backpropagation to train the model.
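For concreteness, the following is a minimal PyTorch sketch of this four-part pipeline. The module layout, variable names, and head design are our illustrative assumptions rather than the authors' released code; the dense lateral fusion of Section II-B is stubbed out here and sketched separately below.

```python
# Minimal sketch of the four-part pipeline (illustrative, not the authors' code).
import torch.nn as nn
import torchvision

class SUAVDetector(nn.Module):
    def __init__(self, num_boxes=5):
        super().__init__()
        # (1)-(2) Backbone: a ResNet-50 trunk as the feature extractor
        # (BN is built in; dropout and the extra residual block are omitted).
        backbone = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.res_blocks = nn.ModuleList([backbone.layer1, backbone.layer2,
                                         backbone.layer3, backbone.layer4])
        # (3) Two flow paths: one scores each pre-defined box (target vs.
        # background), the other regresses its (x, y, w, h) offsets.
        self.cls_head = nn.Conv2d(2048, num_boxes * 2, kernel_size=3, padding=1)
        self.loc_head = nn.Conv2d(2048, num_boxes * 4, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.stem(x)
        features = []
        for block in self.res_blocks:
            x = block(x)
            features.append(x)  # keep multi-scale maps for lateral fusion
        # Dense lateral fusion (Section II-B) would combine the last three
        # maps here; this stub applies the heads to the deepest map only.
        scores = self.cls_head(features[-1])
        offsets = self.loc_head(features[-1])
        return scores, offsets
```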
B. Multi-Scale Feature Map Fusion Via Dense Lateral Connections
In deep neural networks, shallower layers can extract local features and contextual information because of their smaller receptive fields, whereas deeper layers with larger receptive fields learn more abstract semantic information. As in Faster R-CNN [21], if only the semantic features extracted by the last feature layer are used in detection, targets with larger imaging areas achieve superior test results. However, these deeper features are less sensitive to the size, position, and orientation of the targets, leading to poor performance in small-target detection. Moreover, whereas the size of the input image is $512\times512$ pixels, an SUAV target typically occupies only a small region of the frame.
Fig.2 shows the distributed heat map of the target bounding-box sizes in all training samples. The sizes of the targets are relatively fixed and mainly concentrated in a narrow range.
In order to improve the sensitivity to small targets, we applied a detection algorithm based on multi-scale feature fusion via lateral connections to fully combine the context features and semantic features in the detection stage, which is shown in Fig.3.
In the vertically linked feature extraction network, we extracted the feature maps output by the last three residual blocks, Res-4, Res-5, and Res-6, with pixel resolutions of $32\times32$, $16\times16$, and $8\times8$, respectively.
These three resolutions are used as the scale benchmarks of the merged feature maps. The deepest feature map, $Feature_{1}$ ($8\times8$), is used directly as the first scale: \begin{equation*} Scale_{I}=Feature_{1}.\tag{1}\end{equation*}
The $8\times8$ map $Feature_{1}$ is then upsampled to $16\times16$ and fused with $Feature_{2}$ by a weighted average to form the second scale: \begin{align*} Feature_{1}^{\prime }=&Upsampling_{8 \times 8}^{16 \times 16}\left ({Feature_{1}}\right), \tag{2}\\ Scale_{II}=&\frac { Feature_{1}^{\prime }+\alpha Feature _{2}}{\alpha +1}.\tag{3}\end{align*}
Finally, we upsampled $Feature_{1}$ and $Feature_{2}$ to $32\times32$ and fused them with $Feature_{3}$ to form the third scale: \begin{align*} Feature_{1}^{\prime \prime }=&Upsampling_{8 \times 8}^{32 \times 32}(Feature_{1}), \tag{4}\\ Feature_{2}^{\prime }=&Upsampling_{16 \times 16}^{32 \times 32}(Feature_{2}), \tag{5}\\ Scale_{III}=&\frac {Feature_{1}^{\prime \prime }+\beta Feature_{2}^{\prime }+\gamma Feature_{3}}{\beta +\gamma +1}.\tag{6}\end{align*}
The weighting coefficients $\alpha$, $\beta$, and $\gamma$ control the relative contributions of the feature maps in the fusion.
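Eqs. (1)–(6) reduce to upsampling the coarser maps and taking weighted averages. The following is a minimal sketch of that fusion, assuming the three maps already share a channel count (in practice a 1×1 convolution would typically align channels); the function name and default weights are illustrative assumptions:

```python
# Sketch of the weighted fusion in Eqs. (1)-(6), assuming the three maps
# share a channel count; alpha/beta/gamma values are placeholders.
import torch.nn.functional as F

def fuse_scales(f1, f2, f3, alpha=1.0, beta=1.0, gamma=1.0):
    """f1: 8x8 (deepest), f2: 16x16, f3: 32x32 maps, each (N, C, H, W)."""
    scale_1 = f1                                                  # Eq. (1)
    f1_16 = F.interpolate(f1, size=f2.shape[-2:], mode="nearest")  # Eq. (2)
    scale_2 = (f1_16 + alpha * f2) / (alpha + 1)                   # Eq. (3)
    f1_32 = F.interpolate(f1, size=f3.shape[-2:], mode="nearest")  # Eq. (4)
    f2_32 = F.interpolate(f2, size=f3.shape[-2:], mode="nearest")  # Eq. (5)
    scale_3 = (f1_32 + beta * f2_32 + gamma * f3) / (beta + gamma + 1)  # Eq. (6)
    return scale_1, scale_2, scale_3
```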
This operation enriches fine-grained features with high-level semantic features, reflecting the relationship between high-level semantics and low-level details. The ablative studies presented in Section IV verify the superiority of this approach in SUAV target detection.
Faster R-CNN [21] and Yolo v3 [25] use feature maps of different resolutions to identify objects of various sizes (e.g., the feature pyramid network in Faster R-CNN [21]). Unlike these baseline detectors, the proposed method is a dedicated detector for infrared SUAV targets, so large-scale object detection is not required. However, in infrared SUAV detection, we believe that small-sized feature maps (e.g., the $8\times8$ output of Res-6) remain valuable: although too coarse to localize small targets directly, their high-level semantic features improve detection when fused into the larger-scale maps via the lateral connections.
C. Single-Stage Prediction Based on Densely Paved Pre-Defined Boxes
For classic double-stage methods such as Faster R-CNN [21], although the region proposal network (RPN) [21] contributes greatly to the improvement of detection accuracy, further improvement of computational efficiency has been a bottleneck. During the evolution of R-CNN [19]–[21], the performance and operating speed of RoI extraction improved continuously, yet the speed remains far behind that of single-stage detection methods (e.g., SSD [22] and YOLO [23]–[25]).
Therefore, based on the single-stage prediction, we combined the idea of anchor box prediction with the results of the target statistical analysis and designed a sliding window-based candidate region search structure as shown in Fig.4.
Fig. 4. An illustration of the sliding-window-based search over densely paved pre-defined boxes.
As opposed to Faster R-CNN [21] and Yolo v3 [25], which scan anchor boxes at different strides over the image, we predict multiple candidate regions simultaneously at each pixel of every fusion feature map; for the finest $32\times32$ fusion map, this corresponds to a sliding stride of 16 pixels on the $512\times512$ input image.
The dimensions of these N pre-defined boxes correspond to the statistical results of the target geometric features in the training data so that the best match between the predicted box and the ground truth can be achieved (discussed in detail in Sections IV-A and IV-B). For the prediction boxes generated according to these rules, we used two flow paths, each with two convolutional layers, to simultaneously predict the class label and regress the location. The top K results with the highest scores were selected as outputs. Finally, we obtained the positions and confidence of the targets after non-maximum suppression of the three-scale prediction results. In this stage, several suitable pre-defined candidate boxes were used instead of the offset fine-tuning of the first stage of double-stage detectors; i.e., the additional intermediate layers for proposal location regression were removed (two convolutional layers in Faster R-CNN [21]). Specifically, as a dedicated SUAV detector, the prior information of the target (geometric features) is used when selecting feature maps and paving pre-defined boxes. This operation replaces the region proposal of double-stage detection; experimental results show that, because the densely paved pre-defined boxes have a higher IoU with the ground truth, our method achieves accuracy similar to that of a double-stage detector using only one positional regression, while the speed is greatly improved.
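As a sketch of the paving scheme (not the authors' code), the following generates N pre-defined boxes centered at every cell of a fusion map; the $512\times512$ input size and 16-pixel stride follow the description above, while the five (w, h) shapes are placeholders standing in for the cluster-derived sizes of Section IV:

```python
# Sketch of densely paved pre-defined boxes: N boxes centered on every
# pixel of a fusion feature map, i.e., every 16 input pixels for the
# 32x32 map on a 512x512 frame. Box shapes below are dummy values.
import numpy as np

def pave_boxes(feat_size, stride, box_shapes):
    """Return (feat_size * feat_size * N, 4) boxes as (cx, cy, w, h)."""
    boxes = []
    for iy in range(feat_size):
        for ix in range(feat_size):
            cx, cy = (ix + 0.5) * stride, (iy + 0.5) * stride
            for (w, h) in box_shapes:
                boxes.append((cx, cy, w, h))
    return np.array(boxes, dtype=np.float32)

# e.g., five cluster-derived (w, h) shapes paved on the 32x32 map:
shapes = [(12, 8), (20, 10), (28, 14), (40, 18), (56, 24)]  # placeholders
anchors = pave_boxes(feat_size=32, stride=16, box_shapes=shapes)
```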
D. Loss Function
The loss function consists of two parts, the classification loss $L_{cls}$ and the localization loss $L_{loc}$: \begin{equation*} L(x, y, w, h, p)=L_{cls}+L_{loc}.\tag{7}\end{equation*}
The classification loss is the cross-entropy over the predicted class probabilities: \begin{equation*} L_{cls}\left ({p_{i}, y_{i}}\right)=-\sum _{i} \ln p_{i}^{y_{i}},\tag{8}\end{equation*} where $p_{i}^{y_{i}}$ denotes the predicted probability that sample $i$ belongs to its ground-truth class $y_{i}$.
The localization loss adopts the smooth $L_{1}$ function applied to the offset differences: \begin{align*} L_{loc}\left ({t_{i}, t_{i}^{*}}\right)=&\sum _{i} \sum _{s \in \{x, y, w, h\}} L_{1}\left ({{t}_{i}^{s}-{t}_{i}^{s^{*}}}\right), \tag{9}\\ L_{1}({x})=&\begin{cases}{0.5 {x}^{2},} & {|{x}| \leq 1} \\ {|{x}|-0.5,} & {\text {else},}\end{cases}\tag{10}\end{align*}
where $t_{i}$ and $t_{i}^{*}$ are the predicted and ground-truth offset vectors: \begin{align*} t_{i}=&\left ({t_{i}^{x}, t_{i}^{y}, t_{i}^{w}, t_{i}^{h}}\right), \tag{11}\\ t_{i}^{*}=&\left ({t_{i}^{x^{*}}, t_{i}^{y^{*}}, t_{i}^{w^{*}}, t_{i}^{h^{*}}}\right).\tag{12}\end{align*}
The specific definitions of $t_{i}$ and $t_{i}^{*}$ are \begin{align*} \begin{cases}t_{i}^{x}=\dfrac {x-x_{a}}{w_{a}} \\ t_{i}^{y}=\dfrac {y-y_{a}}{h_{a}} \\ t_{i}^{w}=\log \left ({\dfrac {w}{w_{a}}}\right) \\ t_{i}^{h}=\log \left ({\dfrac {h}{h_{a}}}\right), \end{cases} \tag{13}\\ \begin{cases}t_{i}^{x^{*}}=\dfrac {x^{*}-x_{a}}{w_{a}} \\ t_{i}^{y^{*}}=\dfrac {y^{*}-y_{a}}{h_{a}} \\ t_{i}^{w^{*}}=\log \left ({\dfrac {w^{*}}{w_{a}}}\right) \\ t_{i}^{h^{*}}=\log \left ({\dfrac {h^{*}}{h_{a}}}\right),\end{cases}\tag{14}\end{align*} where $(x, y, w, h)$ and $(x^{*}, y^{*}, w^{*}, h^{*})$ denote the predicted box and the ground-truth box, respectively, and $(x_{a}, y_{a}, w_{a}, h_{a})$ denotes the matched pre-defined box.
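The parameterization in Eqs. (13)–(14) and the smooth-$L_1$ loss of Eqs. (9)–(10) are standard; a compact NumPy rendering for reference (function names are ours):

```python
# Reference implementation of the offset encoding (Eqs. (13)-(14)) and
# the smooth-L1 localization loss (Eqs. (9)-(10)).
import numpy as np

def encode_offsets(box, anchor):
    """box, anchor: (x, y, w, h). Returns t = (tx, ty, tw, th)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def smooth_l1(x):
    """Eq. (10): quadratic near zero, linear elsewhere."""
    x = np.abs(x)
    return np.where(x <= 1.0, 0.5 * x ** 2, x - 0.5)

def loc_loss(pred_boxes, gt_boxes, anchors):
    """Eq. (9): smooth-L1 summed over the four offset components."""
    total = 0.0
    for p, g, a in zip(pred_boxes, gt_boxes, anchors):
        total += smooth_l1(encode_offsets(p, a) - encode_offsets(g, a)).sum()
    return total
```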
Dataset Analysis and Weighted Augmentation Method
A. Dataset
The dataset used in this work contains 3800 manually annotated long-wave infrared images of SUAV targets. Each training sample contains 1–4 SUAV targets, and the training set contains 5138 targets in total.
This dataset contains various flight scenarios, e.g., hovering attitude, slow cruising, and high-speed maneuvering. The distance from the targets to the infrared detector ranges from 20 m to 800 m, and the probe pitch angle varies over the range of ±45°. The experimental scene includes a variety of backgrounds (e.g., buildings and towers, mountains, clouds, and sky), and different temperature and weather conditions. The temperature ranges between 12°C and 32°C. Some typical images of the dataset and the corresponding positional annotation are shown in Fig.5.
Fig. 5. Infrared SUAV target dataset. (a) Long-wave infrared images against sky, cloud, and mountain backgrounds. (b) Long-wave infrared images at different elevation angles. (c) Long-wave infrared images of an urban environment under different detection distances and weather conditions.
To optimize the network parameters, we analyze the geometric features of the labeled dataset. The geometric distribution of SUAV targets is shown in Fig.6.
Fig. 6. Histograms of the SUAV target geometric features. (a) Aspect-ratio histogram of the targets. (b) Distribution of the long side $W$.
The validation set used in this work contains two parts. Validation Set 1 (VS-1) is a surveillance video in five different scenes: buildings, iron tower, woodland, mountains, and clouds and sky. The total length of the surveillance video is 218 s, including 5450 frames, and there is only one target in each frame. Validation Set 2 (VS-2) is a manually annotated image set in the same five different scenes, which contains 1200 labeled frames.
B. Weighted Augmentation Method
The training images containing infrared SUAV targets are difficult to obtain in large quantities; therefore, data augmentation was adopted in this study to enhance the generalization of the detector. As shown in Fig.7, the distribution of the geometric features of the targets in the training set is not uniform. As the data balance of the training set has a great influence on the CNN-based model, a weighted augmentation method is proposed here.
For each target, we calculated an augmentation weight $W_{i}$ from the histogram statistics of its geometric features: \begin{equation*} W_{i}\left ({n_{i}^{w}, n_{i}^{\frac {w}{h}}}\right)= \frac {N\cdot (K+1)}{2}\cdot \left ({\frac {1}{L_{w} \cdot n_{i}^{w}}+\frac {1}{L_{\frac {w}{h}}\cdot n_{i}^{\frac {w}{h}}}}\right)-1,\tag{15}\end{equation*} where $n_{i}^{w}$ and $n_{i}^{w/h}$ are the populations of the long-side and aspect-ratio histogram bins into which target $i$ falls.
Generally, the weight is inversely proportional to how frequently a target's geometry appears in the training set, so targets in sparsely populated bins are amplified more times during augmentation.
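Under that reading of Eq. (15), a sketch of the weight computation might look as follows; the bin counts, the interpretation of $L_{w}$ and $L_{w/h}$ as the numbers of histogram bins, and the default value of $K$ are our assumptions:

```python
# Sketch of the weighting in Eq. (15), under the assumption that n_w and
# n_ratio are the populations of the long-side and aspect-ratio histogram
# bins a target falls into, so rare geometries get large weights.
import numpy as np

def target_weights(long_sides, ratios, n_bins_w=20, n_bins_r=20, K=3):
    N = len(long_sides)
    hist_w, edges_w = np.histogram(long_sides, bins=n_bins_w)
    hist_r, edges_r = np.histogram(ratios, bins=n_bins_r)
    # bin index of each target (clip keeps the max value in the last bin)
    idx_w = np.clip(np.digitize(long_sides, edges_w) - 1, 0, n_bins_w - 1)
    idx_r = np.clip(np.digitize(ratios, edges_r) - 1, 0, n_bins_r - 1)
    n_w, n_r = hist_w[idx_w], hist_r[idx_r]
    # Eq. (15): inverse-frequency weight scaled by the dataset size N and
    # the augmentation ratio K (symbol meanings as we read them).
    return N * (K + 1) / 2 * (1 / (n_bins_w * n_w) + 1 / (n_bins_r * n_r)) - 1
```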
Experiments
In this section, we present the results of the experiments conducted to demonstrate the effectiveness of the proposed method. First, we conducted the ablative studies to verify the proposed approach and optimize the detection model. Second, we optimized the network using the geometric analysis of the targets, data balancing, and data augmentation. Finally, we analyzed the detection results and compared them with the results of the state-of-the-art methods.
A. Framework Optimization
1) Ablation Studies on Multi-Scale Feature Map Fusion
In order to explore the effect of fusion feature maps of different scales on the detection of SUAV targets, an ablation study was conducted on the collocation of the feature maps of different scales. We adopt the Average Precision (AP) to evaluate the performance of the proposed method. AP is the area under the precision–recall curve, which shows how precision changes with recall and reflects the overall performance. Precision is the ratio of true positives (TPs) to the total number of detected targets, whereas recall is the ratio of TPs to the total number of ground-truth targets: \begin{align*} {Precision}=&\frac {\textit {TP}}{\textit {TP}+\textit {FP}}, \tag{16}\\ {Recall}=&\frac {\textit {TP}}{\textit {TP}+\textit {FN}},\tag{17}\end{align*} where FP and FN denote false positives and false negatives, respectively.
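For reference, AP as the area under the precision–recall curve can be computed from confidence-ranked detections as in the following sketch (step integration; evaluation details such as the IoU matching rule are omitted):

```python
# Sketch of AP from Eqs. (16)-(17): sort detections by confidence,
# accumulate TP/FP, and integrate precision over recall stepwise.
import numpy as np

def average_precision(confidences, is_true_positive, num_ground_truth):
    order = np.argsort(-np.asarray(confidences))
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    precision = tp_cum / (tp_cum + fp_cum)   # Eq. (16)
    recall = tp_cum / num_ground_truth       # Eq. (17)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)               # area under the P-R curve
        prev_r = r
    return ap
```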
In the experiment, the number of bounding boxes was uniformly set to 5. Experimental results are presented in Table 1.
First, to verify the effectiveness of the dense lateral connection approach, we set the detection on the original feature maps as a control (settings 2 and 3). The AP values of the models that use the fusion feature maps (settings 4 and 5) increased by 1.5 and 4.3, respectively.
Second, settings 1, 4, 5, 6, 7, 8, and 9 demonstrate the detection results with different numbers of fusion feature maps. Setting 9, which used all three scales of feature maps for detection, achieved the best results; combinations that include the large feature maps outperform those that rely only on the coarser ones, confirming the value of fine-grained features for small targets.
Third, setting 10 verified the effect of the lateral connection bypass on the model. The AP value was slightly improved compared to setting 9 without bypass, which indicates that high-level semantic features have a positive impact on the detection of small targets.
2) Parameter Selection of Pre-Defined Boxes
The matching degree between the sliding pre-defined boxes and the ground truth plays an important role in target detection; thus, we used the geometric analysis of the SUAV targets to select the sizes of the anchor boxes accurately. The intersection-over-union (IoU) reflects the matching degree and is defined as \begin{equation*} IoU=\frac {s_{i} \cap s_{i}^{*}}{s_{i} \cup s_{i}^{*}},\tag{18}\end{equation*} where $s_{i}$ and $s_{i}^{*}$ denote the areas of the predicted box and the ground-truth box, respectively.
We clustered the ground-truth box sizes with an IoU-based distance metric: \begin{equation*} D(box, center)=\frac {1}{IoU(box,center)}.\tag{19}\end{equation*}
The average IoU values from all ground truths to their closest cluster centers were compared for different numbers of cluster centers.
As the number of cluster centers increases, the average IoU to the closest center rises, but the gain gradually saturates; balancing matching quality against computation, we set the number of pre-defined boxes to 5.
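A sketch of this size clustering, assuming a k-means-style loop with the distance of Eq. (19) and shapes compared as co-centered rectangles, in the spirit of the dimension clusters of Yolo 9000 [24]:

```python
# Sketch of size clustering with the IoU distance of Eq. (19). Boxes are
# compared by shape only (as if sharing a center), so IoU reduces to an
# overlap of (w, h) rectangles; centers are refined k-means style.
import numpy as np

def shape_iou(wh, centers):
    inter = np.minimum(wh[0], centers[:, 0]) * np.minimum(wh[1], centers[:, 1])
    union = wh[0] * wh[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def cluster_box_sizes(wh_array, k=5, iters=100, seed=0):
    wh_array = np.asarray(wh_array, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centers = wh_array[rng.choice(len(wh_array), k, replace=False)]
    for _ in range(iters):
        # minimizing D = 1 / IoU is the same as maximizing IoU
        assign = np.array([np.argmax(shape_iou(wh, centers)) for wh in wh_array])
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh_array[assign == j].mean(axis=0)
    return centers
```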
B. Data Balancing and Augmentation
1) Training and Testing of Original Data
After optimizing the network parameters, we trained the model on the original data and used it as the baseline for the subsequent comparison experiments.
We processed 16 images per batch. If the IoU between a prior box and a ground-truth box was larger than 0.7, we marked that prior box as a positive sample; the total number of positive samples did not exceed 64. Negative samples were randomly selected among boxes whose IoU was less than 0.2, and the sum of positive and negative samples was 128. We used stochastic gradient descent to optimize the training. The initial learning rate was set to 0.001 and the number of iterations to 46,000; after 22,000 iterations, the learning rate was reduced tenfold.
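A minimal sketch of this positive/negative sampling rule (array shapes, function name, and the random generator are our assumptions):

```python
# Sketch of per-batch sample selection: boxes with IoU > 0.7 against a
# ground truth become positives (capped at 64); negatives are drawn from
# boxes with IoU < 0.2 to fill a 128-box training batch.
import numpy as np

def select_training_samples(max_ious, rng, max_pos=64, batch=128):
    """max_ious: per pre-defined box, its highest IoU over all ground truths."""
    pos = np.where(max_ious > 0.7)[0]
    if len(pos) > max_pos:
        pos = rng.choice(pos, max_pos, replace=False)
    neg_pool = np.where(max_ious < 0.2)[0]   # assumed large enough here
    neg = rng.choice(neg_pool, batch - len(pos), replace=False)
    return pos, neg

# e.g.: pos, neg = select_training_samples(ious, np.random.default_rng(0))
```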
In the testing stage, the detection results were filtered with a confidence threshold of 0.6, and non-maximum suppression was then applied to remove overlapping results with IoU values larger than 20%. We tested the baseline model on the two validation sets. The test results on VS-1 show that 4663 frames were detected correctly (confidence > 60%), and the detection rate was 82.95%. The test results on VS-2 show that the AP value reaches 49.8 and the number of false negatives is 167 when the output confidence threshold is set to 0.6.
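The test-time filtering amounts to confidence thresholding followed by greedy non-maximum suppression; a reference sketch with the thresholds stated above:

```python
# Sketch of test-time filtering: keep detections scoring above 0.6, then
# greedy non-maximum suppression with a 0.2 IoU overlap threshold.
import numpy as np

def box_iou(a, b):
    """a, b: (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def filter_detections(boxes, scores, conf_thr=0.6, iou_thr=0.2):
    scores = np.asarray(scores)
    order = [i for i in np.argsort(-scores) if scores[i] > conf_thr]
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep  # indices of retained detections
```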
2) Weighted Augmentation
By analyzing the detection results of the baseline, we discovered that SUAVs in large maneuvering flight are difficult to detect. There are three reasons for the missed detections: (1) High-speed maneuvering causes dynamic blur, leading to the loss of details. (2) Most missed targets are concentrated in the cool zone of the distributed heat map of target sizes; i.e., the training samples corresponding to such flight attitudes are insufficient. (3) In the training samples, there are few targets of the same size as the missed ones; the influence of those few samples is diluted during clustering, finally leading to unreasonable anchor boxes. To solve the sample-imbalance problem, the weighted augmentation method was applied to the original training data. Fig.8 shows the geometric distribution of the SUAV targets after data balancing. Compared with the original data shown in Fig.6, the statistical histograms of the geometric features of the targets in the training set are much smoother; in other words, the quantity of SUAV targets in various flight postures is more uniform in the training set.
Fig. 8. Histograms of the SUAV target geometric features after data balancing. (a) Aspect-ratio histogram of the targets. (b) Distribution of the long side.
To achieve a better overlap between the pre-defined boxes and the contours of SUAV targets under various flight postures, the criterion for setting the pre-defined box sizes was changed from clustering by target-size quantity in the training samples to clustering by target-size category. The distribution of the anchor boxes was more discrete after the adjustment, as shown in Fig.9.
Fig. 9. Anchor-box size distribution (the blue asterisks represent the scales of the targets in the training set, the green asterisks the anchor boxes before adjustment, and the red asterisks the anchor boxes after adjustment).
To verify the effectiveness of the data-balancing method based on weighted augmentation, to find an appropriate proportion of data augmentation, and to explore the effect of the pre-defined boxes on performance, ablation studies were conducted on VS-2.
First, we compared the weighted augmentation method with the traditional direct augmentation method. Table 3 shows the AP values and false negatives of the two methods at different augmentation ratios. Under all ratios, the weighted augmentation method outperforms the traditional one, increasing the AP value by 3.42%. When the output confidence threshold is 0.6, our method reduces the false negatives by 15.76%. Because traditional direct augmentation simply duplicates the data, it has no significant influence on the statistical distribution of the original training samples, and the amplified samples remain unbalanced.
Second, we used the adjusted pre-defined boxes to retrain the network on the augmented data at different ratios. The experimental results are shown in Table 4. On the one hand, 1:3 is the most appropriate augmentation ratio and larger ratios lead to performance degradation, which is consistent with the results in Table 3. On the other hand, compared with the original pre-defined boxes, the adjusted ones bring an average gain of 1.78 AP, and the false negatives decrease by 17.12%.
Based on this optimized configuration, we evaluated the model on VS-1. The final result showed that the target was detected in 4982 frames, a detection rate of 91.83%, which is 8.88% higher than that without data augmentation. The output confidence after data augmentation is shown in Fig.10(b). Compared with the model trained without balanced data, shown in Fig.10(a), the global average confidence increased by 3.6% and the false-negative rate (the highest confidence falling below the threshold) dropped by 27%.
Fig. 10. Comparison of experimental results before and after data balancing and augmentation. (a) Output confidence without data balancing (the orange line represents the average value, and the red line represents the output threshold). (b) Output confidence after data balancing.
Fig.11 shows the test results of the proposed method for several consecutive frames of VS-1. The red crosses mark the center coordinates of the targets, and the green bounding boxes show the output locations. These results show that tiny SUAV targets under large maneuvering conditions are successfully detected.
Based on these conclusions, the follow-up experiments adopt the same settings and the augmented training set.
C. Comparison and Evaluation
To evaluate the proposed method, we trained the state-of-the-art detectors with the same augmented training set and tested all methods on VS-2. The IoU threshold was set to 50% and 70%, respectively. The accuracy and efficiency of the methods were measured using the AP value and the average frames per second (FPS). The experiments were run on a workstation equipped with an NVIDIA GeForce TITAN Xp GPU and an Intel Core i7-6800K CPU.
The experimental results are presented in Table 5, which shows the comparison between the proposed method and other state-of-the-art detectors [21], [22], [24], [25], [32], [35]. AP-50 and AP-70 represent the AP values when the overlap threshold is 50% and 70%, respectively. The detection speed was evaluated using FPS and the average computation time (milliseconds) of each frame. Table 6 shows the AP-50 values of the object detection methods in five manually annotated video clips of different scene backgrounds on the validation set VS-2.
With respect to accuracy, the proposed method is close to Faster R-CNN (VGG backend) [21], [36] and slightly inferior to Retina-Net-101 [35]. However, the detection speed of the proposed method is comparable to those of the fastest single-stage detectors, Yolo v2 [24] and Yolo v3 [25], and faster than the others. The detection accuracy of Faster R-CNN [21] is not prominent because its sliding anchor boxes are defined in a relatively simple way that does not incorporate the statistical characteristics of the training samples; the IoU between its anchor boxes and the ground-truth boxes is therefore small, which limits performance. Retina-Net [35] with a ResNet-101 [32] backend has excellent detection accuracy, especially at an overlap threshold of 50%, but its detection speed is lower than that of the proposed method. As an excellent single-stage detector, SSD [22] has an advantage in computational efficiency but a disadvantage in detection accuracy for small targets because it uses smaller feature maps for detection. Yolo v3 [25] is currently the best single-stage detector, achieving faster detection speeds; however, its detection accuracy is lower than that of our method. Compared with Yolo v3, our method uses a deeper feature extraction network and, in the detection stage, larger-scale feature maps with a reasonable multi-scale feature fusion operation. Specifically, unlike in Yolo v3, low-resolution feature maps are upsampled and normalized to the large $32\times32$ size before fusion.
Fig.12 shows the AP-FPS curves achieved by various algorithms. The horizontal axis represents the average FPS, and the vertical axis represents the value of AP-50 and AP-70. In summary, the proposed method shows satisfactory performance both in accuracy and speed.
Fig. 12. AP-FPS achieved by our method and other state-of-the-art methods. (a) AP-50. (b) AP-70.
Discussion
Ablative studies were carried out on the collocation of feature maps of different scales and the quantity of pre-defined search boxes. Compared with the original feature maps, the models using the fused feature maps achieved clearly higher AP values, and the configuration that fuses all three scales performed best.
The experimental results of the weighted augmentation approach showed that, compared with the traditional direct augmentation method, the proposed method made the histograms of the target geometric features smoother after augmentation and the data distribution more uniform; an average gain of 4.79% was obtained at an augmentation ratio of 1:3. The experimental results on VS-1 showed that, compared with the model trained without balanced data, the global average confidence increased by 3.6% and the false-negative rate dropped by 27%.
To demonstrate the superiority of the proposed method, we trained the state-of-the-art CNN-based detectors with the same augmented training set and tested them on VS-2. In terms of detection accuracy, our method achieved the highest AP in two scenes (mountains and clouds) of the test set VS-2, ranked second in one scene (buildings), and third in the other two (iron tower and woodland). The average AP of our method is close to those of the double-stage detectors Faster R-CNN (VGG backend) [21], [36] and Retina-Net-101 [35] but superior to the single-stage detectors Yolo v3 [25], Yolo v2 [24], and SSD [22]. In terms of execution time, our method achieved 48 ms per detection, much faster than all the double-stage detectors and the single-stage detector SSD [22], reaching the same level as Yolo v3 [25]. In summary, the proposed method achieves a good balance between accuracy and speed while reaching an advanced level in both.
There are still aspects of the proposed method that can be improved. First, false alarms may occur when SUAV targets hover in front of complex backgrounds, which limits further performance gains. Second, infrared SUAV datasets labeled with locations and classes are labor-intensive to produce and difficult to obtain. Hence, in addition to data augmentation, unsupervised learning and transfer learning are worth exploring for further performance gains.
Conclusion
In this study, we explored a new SUAV surveillance system using an infrared sensor and a deep learning-based real-time detector, which has not been previously exploited. To solve the poor detection accuracy on small infrared SUAV targets, we proposed a multi-scale feature map fusion method via dense lateral connections. To meet the real-time requirements, we adopted single-stage prediction based on densely paved pre-defined boxes and improved the selection strategy of the sliding pre-defined boxes by incorporating the target geometric features in the training phase. Specifically, we explored the impact of data quantity and proportionality in small-sample training of the SUAV target detector and proposed a weighted augmentation method to achieve data balancing. Compared with the traditional augmentation method, the proposed method made the data distribution more uniform and improved the robustness of the model.
In summary, this paper devised a deep learning-based infrared SUAV detection system for protecting high-value objects. We established a complete set of training and test datasets and studied the design and training of the real-time SUAV detector in detail, successfully improving the SUAV surveillance system.