Journals & Magazines >IEEE Access >Volume: 12

NFE-YOLO: A Lightweight and Efficient Detection Network for Low, Slow, and Small Drones

Framework for NFE-YOLO: The new high-resolution feature maps and their associated layers are integrated into the neck section. The output from these feature maps is utili...

Abstract:

With the rapid advancement of drone technology, there has been a significant increase in the demand for detecting “Low, Slow, and Small” (LSS) drones. However, existing d...Show More

Metadata

Abstract:

With the rapid advancement of drone technology, there has been a significant increase in the demand for detecting “Low, Slow, and Small” (LSS) drones. However, existing deep learning networks often overlook critical details of such small targets during feature extraction, leading to reduced detection accuracy. Furthermore, edge inference devices typically have limited computational power, which hampers real-time processing capabilities. To address these challenges, we propose a lightweight and efficient drone detection network called NFE-YOLO. This network introduces an efficient positive traffic channel attention module known as EOrthoNet and enhances small target detection capabilities by improving the model’s neck structure. Specifically, the neck incorporates partial Convolution and C3Faster modules to reduce model size while the EOrthoNet channel attention module improves feature extraction accuracy. To further enhance the model’s generalization ability, we present a cross-scenario visible-light “Low, Slow, and Small” drone dataset referred to as UESTC Anti-UAV. This dataset comprises a total of 10,099 images that have all been manually annotated with precision to address the issue of insufficient sample diversity in existing datasets. Experimental results demonstrate a significant improvement in detection accuracy across multiple datasets. NFE-YOLO achieves a mAP50 value of 0.987 on the UESTC Anti-UAV dataset representing a 2.3% improvement over YOLOv8n with a model size of only 3.72MB. The proposed method effectively enhances detection accuracy for LSS drones and is crucial for the development of lightweight drone detection models.

Framework for NFE-YOLO: The new high-resolution feature maps and their associated layers are integrated into the neck section. The output from these feature maps is utili...

Published in: IEEE Access ( Volume: 12)

Page(s): 175458 - 175471

Date of Publication: 05 November 2024

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2024.3492058

Contents

SECTION I.

Introduction

In recent years, the drone industry has undergone rapid growth, with new applications and business models emerging continuously [1]. Currently, drones are widely used across various industries due to their advantages, such as low cost, high flexibility, ease of operation, and compact size [2]. However, the widespread use of drones has also brought about certain issues [3]. Notably, the “low, slow, and small” targets represented by quadrotor drones pose increasing threats to low-altitude safety. The illegal flight of drones in urban environments seriously endangers public safety, with incidents of drones stealing information or causing harm occurring frequently [4]. Due to their excellent stealth characteristics, powerful penetration capabilities, low acquisition cost, and diverse usage methods, small drones present significant challenges for detection and identification [5].

Currently, common drone detection technologies include radar, electro-optical systems, radio frequency (RF) spectrum analysis, and acoustic methods [6]. However, ‘low, slow, and small’ drones present challenges for traditional radar systems due to their weak electromagnetic reflections. Despite the large coverage and effectiveness in adverse weather conditions, radar systems are often cost-prohibitive for some applications and may face issues with target identification accuracy in dense environments [7]. RF spectrum detection identifies drones by detecting anomalous signals but is sensitive to interference and may struggle in complex electromagnetic environments [8]. Acoustic detection analyzes sound signals generated by drones; however, this method requires high accuracy in noise localization and its effectiveness diminishes in noisy environments. In recent years, deep convolutional neural networks (CNNs) based on computer vision have become standard tools for tasks such as classification, segmentation, and object detection [9], [10], [11], [12]. Vision-based detection systems offer advantages such as low cost, clear visibility, and rapid response. As a result of these advantages image detection technology has become one of the most commonly used methods for low-altitude drone detection. Compared to other methods vision-based detection uses high-resolution images to provide detailed target features enhancing identification accuracy [13]. Despite rapid advancements in visual detection technology traditional visual methods still face challenges in extracting features from ‘low’, ‘slow’, and’small’ drones. When drones are at a distance their image size is reduced requiring more semantic information which limits the effectiveness of visual-based detections particularly when dealing with complex backgrounds.

To address these challenges and accommodate resource constraints, such as computational power, we propose an enhanced NFE-YOLO network based on YOLOv8 to improve drone detection performance in complex backgrounds. This network optimizes the neck structure and incorporates attention mechanisms to enhance focus on regions of interest (ROIs) in images, thereby reducing information loss during the target detection process. Additionally, we introduce Partial Convolution (PConv) and C3Faster modules to compress the neck network and achieve high-precision detection performance with lower computational resource requirements. Furthermore, we have compiled a dataset that encompasses various emerging categories of “low, slow, and small” drones within intricate environments.

To enhance the accuracy of object detection, this paper introduces an efficient positive traffic channel attention mechanism (EOrthoNet) and integrates it into the neck structure of the model. This enhancement significantly improves feature extraction effectiveness, enabling NFE-YOLO to more accurately identify important regions within images. Specifically, EOrthoNet maximizes the retention of pixel-level attributes and spatial information in feature maps, allowing the network to focus more on critical features during feature extraction. This approach not only enhances the comprehensiveness and stability of features but also improves their discriminative power, thereby optimizing the accuracy of object detection.
In order to improve the accuracy of recognizing small drone targets, this paper introduces a specialized small target detection head. This enhancement aims to minimize the loss of feature information related to small targets and prioritize the detection of very small targets, thereby significantly enhancing drone detection accuracy. Furthermore, by removing detection heads in the network that are less sensitive to small drone targets, a more lightweight model is achieved without compromising detection accuracy. This optimization not only strengthens the capability to detect small targets but also enhances the overall efficiency and practicality of the model.
To enhance feature extraction efficiency and multi-scale adaptability, this paper replaces the original C2f module in the network’s neck with the C3 module. This modification provides a lighter feature extraction capability and improves capturing rich semantic and contextual information. Specifically, replacing the bottleneck structure in the C3 module with FastBottleneck reduces information loss during long-distance feature transmission. Additionally, using P-Convolution instead of traditional convolution further reduces redundant computations and memory access, achieving more efficient spatial feature extraction.
To advance LSS drone detection technology development, this paper constructs a high-quality visible-light “Low, Slow, and Small” (LSS) drone dataset. The dataset includes numerous novel drone samples covering various complex environments and diverse scenarios involving drones. By systematically increasing samples collected under different backgrounds and conditions, we significantly expand coverage compared to existing datasets. This dataset not only provides rich empirical data for related experiments but also establishes a solid foundation for research and practical application of LSS drone detection technology.

SECTION II.

Related Work

A. Real-Time Object Detectors

Deep learning-based object recognition methods are primarily classified into two-stage algorithms and one-stage algorithms. Two-stage algorithms involve the division of object recognition into candidate region generation, object classification, and bounding box refinement. In contrast, one-stage algorithms treat the entire image as a candidate region and directly obtain the object class and bounding box information through regression. When comparing the two types of algorithms, it is generally observed that one-stage methods are faster, while two-stage methods offer higher accuracy. Common two-stage deep learning algorithms used in drone detection include RCNN [12], Fast RCNN [14], Faster RCNN [15], Mask RCNN [16], SPPNet [17], and Pyramid Networks [18]. On the other hand, one-stage deep learning algorithms include YOLO series algorithms, CenterNet [19], MobileNet [20], RetinaNet [21], DETR [22], and SSD [23]. Some studies have developed custom algorithm models based on current research achievements in convolutional neural networks to address practical needs in drone detection. In tasks involving small object recognition such as drones, shallow neural network algorithms often demonstrate better performance.

The rapid advancement of computer vision technology has led to notable improvements in drone detection algorithms. Hu et al. [24] were among the first to implement the YOLOv3 [25] algorithm for drone detection, enhancing detection accuracy by modifying the number of anchor boxes. Zhai and Zhang [26] further advanced the detection of small targets through the application of multi-scale prediction techniques. Javan et al. [27] enhanced the YOLOv4 network by incorporating additional convolutional layers to facilitate the extraction of more precise features. Delleji et al. [28] introduced instance enhancement strategies and hyperparameter optimization to bolster the performance of YOLOv5 in detecting small drones. Zhu et al. [29] augmented the detection capabilities of YOLOv5 in complex environments by integrating transformer detection heads and the Convolutional Block Attention Module (CBAM). Zhao et al. [30] combined Transformer modules with YOLOv5 to improve the accuracy and robustness of small target detection. Lastly, Ma et al. [31] proposed the LA-YOLO network, which enhances detection accuracy in low-altitude scenarios by incorporating attention mechanisms and normalized Wasserstein distances.

In the context of optimizing models for edge computing devices, Zhang et al. [32] enhanced the anchor boxes of YOLOv3 utilizing the K-means algorithm, thereby achieving real-time detection through the application of pruning techniques. Howard et al. [33] developed a lightweight model based on MobileNet, which effectively balances speed and accuracy. Sun et al. [34] introduced TIBNet, which minimizes computational overhead by employing a compact backbone network. Wang et al. [35] advanced detection accuracy through the implementation of deformable convolution in conjunction with YOLOX. Zhou et al. [36] proposed VDTNet, which enhances both detection performance and speed by incorporating Spatial Pyramid Pooling (SPP) modules and backbone attention mechanisms. Li et al. [37] improved the detection capabilities of YOLOv8s through the integration of Bi-PAN-FPN and GhostblockV2 architectures. Furthermore, sophisticated algorithms such as EfficientDet [38], RetinaNet, and DETR have augmented detection accuracy and flexibility via compound scaling, focal loss, and self-attention mechanisms. Additionally, PP-YOLO [39] and GhostNet [40] have achieved enhancements in detection speed and accuracy through the optimization of network structures. Lastly, NAS-FPN [41] feature pyramid networks effectively consolidate multi-scale features, thereby improving object detection accuracy.

B. Attention Mechanisms

Drawing from the concept of focus in human behavior, visual attention aims to emphasize important parts of an image. Research indicates that selecting these parts is crucial for distinguishing specific aspects of an image. Current studies leverage these attributes to enhance network performance. Highways Networks introduced a simple yet effective gating mechanism that facilitates the flow of information in deep neural networks, which can be considered a form of attention mechanism.

Built on the ResNet backbone [42], SENet [43] introduced channel attention through the squeeze-and-excitation (SE) structure, sparking a wave of research into improving channel attention processes. The SE module enhances network performance by incorporating attention mechanisms on feature channels. This module learns the importance of each feature channel automatically and uses this information to amplify important features while suppressing less relevant ones. The SE module consists of two parts: Squeeze and Excitation. First, the Squeeze operation performs global pooling on each feature map, averaging it into a single scalar value, resulting in a $1\times 1\times$ C feature map, where C is the number of channels. Next, the Excitation operation learns the correlation between the C channels using weights, first reducing dimensionality and then expanding back to C channels. This not only reduces computation but also increases non-linearity within the network. Finally, the output of the Excitation module—representing the importance of each channel—is applied to the original features via multiplicative weighting, thus improving detection accuracy. DANet [44] integrates spatial attention modules with channel attention to effectively model long-range contextual dependencies. NLNet [45] incorporates query-specific global context into the attention module for each query location. Building on NLNet and SENet, GCNet [46] introduces GC blocks to maintain global context awareness while capturing channel interdependencies. In contrast, CBAM [47] replaces global average pooling (GAP) with global max pooling, and GSoPNet [48] introduces global second-order pooling, which, despite higher computational costs, is more effective. ECANet [49] captures cross-channel interactions without unnecessary dimensionality reduction by reshaping the channel attention architecture. FcaNet [50], on the other hand, extracts channel information using residual frequencies based on the relationship between GAP and discrete cosine transform (DCT) initial frequencies. Finally, WaveNet [51] proposes the use of discrete wavelet transform for channel information extraction, while OrthoNet [52] validates the effectiveness of orthogonal filters in contributing to attention mechanisms primarily through the orthogonality of DCT kernels.

C. Drone Datasets

In the field of computer vision, the development of robust models heavily relies on high-quality datasets. Therefore, there has been a consistent emphasis on establishing comprehensive datasets for drone detection. This overview provides an analysis of several existing and relatively complete drone datasets:

The Real World [53], Det-Fly [54], MIDGARD [55], USC-Drone [56], and DUT Anti-UAV [57]. These five datasets contain 56,821, 13,271, 8,775, 18,778, and 10,000 images respectively. The Real World dataset features a wide variety of drones and environments but has relatively low image resolution as all data are obtained from YouTube videos. Additionally, most of the data in this dataset are captured from horizontal and upward perspectives which limits its effectiveness for detecting drones from a downward view. The Det-Fly dataset addresses the limitation of single-view drone data by capturing drones from various perspectives including upward, downward and horizontal views. However it only includes one type of drone which limits its applicability to other drone types. Similarly to Det-Fly dataset, the MIDGARD dataset and USC-Drone dataset also contain only one type of drone with a rich set of environments but share the limitation of having a narrow range shooting perspective. In summary, while each these drone datasets have their own advantages and limitations, there is an urgent need for widely recognized drone dataset that overcomes these drawbacks to advance the development technology in detecting drones.

SECTION III.

Uestc Anti-Uav Benchmark

With the continuous advancement of deep learning technology, researchers are increasingly focusing on algorithmic innovations. However, the performance and effectiveness of algorithms heavily depend on high-quality datasets. Thus, constructing and optimizing datasets play a crucial role in advancing the application and research of deep learning technologies. The construction of drone datasets faces several challenges: (1) Unlike everyday objects, drones are not commonly encountered, making data collection more challenging. Obtaining high-quality drone data requires specialized drones of various shapes and specific filming locations, which increases the difficulty of acquiring drone data. (2) Drone detection has emerged as a distinct research field due to its unique challenges. Unlike other objects, drones exhibit significant scale variations. Effective detection requires data from various scales, necessitating the collection of drone data at different distances. (3) Drone detection can be categorized into ground-to-air and air-to-air detection, with each perspective introducing different levels of detection difficulty. Whether a dataset includes drones in various poses is a critical factor influencing drone detection performance. (4) The diversity of backgrounds is crucial for enhancing the generalization capability of drone detection. Thus, data needs to be collected in various environments to ensure robustness across different scenarios. To support the development of drone detection technology, we propose a new drone detection dataset named UESTC Anti-UAV. This dataset is divided into three subsets: the training set, validation set, and test set.

A. Dataset Splitting

Our UESTC Anti-UAV detection dataset is divided into three subsets: the training set, validation set, and test set. All images in the dataset are manually and precisely annotated. Detailed information about the images and objects is provided in Table 1. Specifically, the dataset contains a total of 10,099 images. Among these, the training set comprises 8,079 images, the validation set includes 1,009 images, and the test set consists of 1,011 images. Considering that each image may contain multiple objects, the dataset includes a total of 10,333 objects, with 8,264 objects in the training set,1029 objects in the validation set and 1040 objects in the test set.

TABLE 1 Detailed Attributes of the UESTC Anti-UAV Dataset

B. Dataset Characteristics

Compared to general object detection datasets such as COCO [58], ILSVRC [59], and PASCAL VOC [60], the most distinctive feature of our proposed drone detection dataset is its relatively large proportion of small objects. Additionally, drones predominantly operate in outdoor environments, resulting in complex backgrounds within our dataset. This complexity increases the difficulty of detecting drones. The characteristics of our proposed dataset are analyzed from the following aspects.

1) Image Resolution

The dataset includes images of various resolutions. The largest image size in the detection dataset is $3840\times 2160$ pixels, and the smallest image size is $1280\times 668$ pixels. There is no substantial difference between these sizes. The variety in image resolutions enables the model to adapt to images of different dimensions and helps prevent overfitting.

2) Objects and Backgrounds

In order to enhance the diversity of drone objects and mitigate the risk of model overfitting, we have curated a selection comprising over 20 distinct types of DJI drones, in addition to small drones from various other brands. as illustrated in Figure 1. The dataset is further enriched with a variety of scene information. Recognizing that drones primarily operate in outdoor environments, our dataset encompasses backgrounds that reflect a wide range of outdoor settings. These include open areas such as fields, rural landscapes, and grasslands; urban environments featuring buildings, construction sites, and city parks; rugged terrains like mountains, cliffs, canyons, and forests; aquatic environments including coastlines, lakes, and rivers; as well as extreme conditions found in deserts, glaciers, and arid regions. Moreover, the dataset incorporates a spectrum of lighting conditions (e.g., bright sunlight, overcast skies, sunsets, and nighttime) and various weather scenarios (e.g., clear days, cloudy conditions, rain, snow, fog, and strong winds). Examples of these diverse conditions are presented in Figure 2. The intricate backgrounds and significant variations in outdoor lighting present in the dataset are essential for the development of robust and high-performance drone detection models.

FIGURE 1.

A comprehensive overview of the different types of drones represented in our dataset.

Show All

FIGURE 2.

Examples of the detection images and their corresponding annotations from our study.

Show All

3) Object Size

Drones are generally characterized by their small dimensions, while the outdoor environments in which they operate are vast. Consequently, our dataset comprises a significant proportion of small objects. We computed the object area ratio for the entire image and subsequently generated a histogram to illustrate the scale distribution, as presented in Table 1 and Figure 3. For the dataset encompassing the training, testing, and validation sets, the average object area ratio is approximately 0.032, with a minimum object area ratio of 4.23e-05 and a maximum ratio reaching up to 0.95 of the entire image. The majority of objects are small, with sizes that account for less than 0.05 of the total image area. In comparison to objects found in conventional detection datasets, these smaller objects present greater challenges for detection and are more susceptible to issues such as missed detections. Additionally, Table 1 and Figure 3 provide insights into the aspect ratios of the objects within our dataset, which range from a minimum of 0.43 to a maximum of 7.23.

FIGURE 3.

Aspect ratio and scale distribution of the UESTC Anti-UAV dataset.

Show All

4) Object Position

Figure 4 illustrates the distribution of object positions in relation to the center of the image through a scatter plot. A significant concentration of objects is observed in the central region of the image. The degree of object movement exhibits variability across different sets, with both horizontal and vertical movements being uniformly distributed. These scatter plots provide insights into the spatial distribution of drone targets within the images. By maintaining this uniform distribution, we ensure that the dataset sufficiently captures a range of potential drone positions that may be encountered in practical applications, thereby improving the model’s ability to generalize.

FIGURE 4.

Position distribution of the UESTC Anti-UAV dataset. Set a serves as the training set, set b functions as the validation set, and set c is designated as the test set.

Show All

C. Dataset Challenges

Through the analysis of the dataset characteristics described in the previous section, several challenges in drone detection have been identified. The main challenges include the small size of objects, complex backgrounds or backgrounds similar to the objects, and significant variations in lighting conditions. Additional issues such as object blurring, rapid motion, camera movement, and objects appearing outside the field of view are also common. Figure 2 illustrates examples from the detection dataset that reflect these challenges.

SECTION IV.

Materials and Methods

A. YOLOv8

The YOLOv8 model architecture is primarily divided into three parts: the backbone network, the feature enhancement network, and the detection head, as shown in Figure 1. The backbone network is responsible for extracting features from the input image. It utilizes techniques such as depthwise separable convolutions and residual connections to enhance computational efficiency and reduce the number of parameters. The feature enhancement network employs the concept of PA-FPN (Path Aggregation Network - Feature Pyramid Network) to integrate feature maps from different stages of the backbone network, thereby improving the feature representation capability. The detection head generates the final detection results, including the location and category of objects. YOLOv8’s detection head utilizes an Anchor-Free mechanism and incorporates multi-scale features for prediction [61].

The YOLOv8 series consists of five variants: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, each designed to meet different application needs. While they share the same architecture, differences in width and depth lead to varying performance and speed. Compact models like YOLOv8n are resource-efficient and fast, making them suitable for mobile deployment and high FPS scenarios. On the other hand, larger models such as YOLOv8x offer improved performance at the expense of increased complexity and computational demands. To address high FPS requirements while simplifying deployment, we have enhanced the smallest model YOLOv8n to minimize resource usage while maintaining real-time detection capabilities. In this paper, we have made several improvements to enhance object detection accuracy and reduce model size in our enhanced version of YOLOv8. By refining the neck structure, we reduced the model size while increasing accuracy. The C3Faster module replaced the C2f module in the network neck; P convolutions replaced traditional convolutions; these changes lowered the model’s computational overhead and made it more compact and efficient. Additionally, the integrated EOrthoNet attention mechanism embedded in the network neck enhances the model’s stability, generalization, and robustness. Overall, these improvements enable our enhanced version of YOLOv8 to perform well in small-object drone detection tasks with higher accuracy and efficiency while maintaining good stability and robustness.

B. EOrthoNet Channel Attention Module

The channel attention mechanism is considered a powerful technique for enhancing overall network performance with minimal computational overhead. It was first introduced in Squeeze-and-Excitation Networks (SE-Nets) and can be integrated as an additional module into any existing network architecture. The channel attention block acts as a computational unit designed to extract and emphasize important features within the feature tensor $X \in R^{C} \times R^{H} \times R^{W}$ , where C denotes the number of channels, and H and W represent the height and width, respectively. The channel attention mechanism computes a vector $A \in R^{c}$ to highlight the most relevant channels in X, resulting in an output of $A \odot X$ . The computation formula for this module is:

$\begin{equation*} (A \odot X) _ { c, h, w } = A _ { c } X _ { c, h, w } \tag {1}\end{equation*}$ View Source

Various researchers have proposed different variants of channel attention, each with its own method for calculating the attention vector A. The Discrete Cosine Transform (DCT) utilized in FcaNet [50] possesses a crucial property: orthogonality [52]. In this study, we leverage this orthogonality to enhance the channel attention mechanism. Specifically, we begin by randomly selecting filters

$(C, H, W)$

with appropriate dimensions and then applying the Gram-Schmidt process to ensure their orthogonality. These filters are denoted as

$K \in R ^{ C } \times R ^{ H } \times R ^{ W }$

. Our compression process is outlined as follows:

$\begin{equation*} F _ { E O r t h o } (X) _ { c } = \sum _ { h = 1 } ^ { H } \sum _ { n = 1 } ^ { W } K _ { c, h, w } X _ { c, h, W } \tag {2}\end{equation*}$

View Source

Subsequently, we conduct local information interaction along the channel dimension through a one-dimensional convolution (1D Convolution) operation [49]. The size k of this convolution kernel is typically determined dynamically based on the number of channels C, and it is an odd number.

$\begin{equation*} E = \sigma (C o n v~1~D (F _ { E o r t h o } (X))) \tag {3}\end{equation*}$

View Source

The vector E resulting from the one-dimensional convolution serves as a scaling factor to reweight the original feature map. Similar to other approaches, we define channel attention as follows:

$\begin{equation*} A (X) = E (F _ { E O r t h o } (X)) \tag {4}\end{equation*}$

View Source

Our method comprises two stages, illustrated in Figure 6. In the first stage, we initialize random filters whose size matches that of the feature layer. Subsequently, we employ the Gram-Schmidt process to ensure their orthogonality. In the second stage, these filters are utilized to extract compressed vectors and cross-channel interactions are applied to obtain the attention vector. By multiplying the attention vector with input features, we compute weighted output features and add a residual component.

FIGURE 5.

YOLOv8 Architecture Diagram, Including Backbone, Neck, and Head Units.

Show All

FIGURE 6.

The framework of effective positive communication attention is presented.

Show All

C. Improvement of Neck Structure

YOLOv8 utilizes feature maps at three different scales for result prediction. When the input image size is $640\times 640$ , the dimensions of these feature maps are $80\times 80$ , $40\times 40$ , and $20\times 20$ , respectively. However, in large field-of-view scenarios where drones appear relatively small, the detection capability for small targets is limited. Specifically, detection accuracy declines significantly when the drone target size is smaller than $8\times 8$ . To enhance the performance of YOLOv8 on drone datasets, several modifications were made. Firstly, we omitted the high-resolution feature output from the first convolutional layer as these high-resolution features are generally not necessary for most object detection tasks but are crucial for detecting small targets [18]. We connected the high-resolution feature maps from the B2 layer of the backbone network to the P2 layer, increasing the total number of sampling layers from three to four to mitigate loss of spatial information. In the original YOLOv8 model, different target sizes necessitate feature maps at different scales. To improve detection of small objects like drones, we introduced an additional down-sampling process in the feature pyramid. This modification reduces excessive expansion of receptive field through multi-scale fusion and enriches target information thereby enhancing model’s capability to detect small targets [62]. Although this adjustment did not result in a significant increase in model performance it made model more efficient and practical for real-world applications [63]. The modified neck structure is illustrated in Figure 7.

FIGURE 7.

Neck improvement enhancement program.

Show All

D. C3Faster

Compared to the C2f module, the C3 module offers several notable advantages. Firstly, the C3 module features a more streamlined structure with fewer parameters, making it easier to implement and optimize. This also reduces computational overhead and memory usage in resource-constrained environments. Secondly, the C3 module incorporates residual connections, which not only enhance the training stability of deep networks but also improve feature fusion capabilities, enabling better capture of multi-scale information. Moreover, the C3 module has been extensively validated in YOLOv5 [64], demonstrating outstanding performance and reliability in practical applications. Lastly, since the C3 module has been widely adopted in existing YOLO architectures, it exhibits good compatibility and can be integrated into existing systems with minimal modifications.

To further enhance performance, we replaced the C2f module with C3 and substituted the bottleneck layer in the C3 module with FastBottleneck. Additionally, we introduced partial convolutions [65] in the neck network to replace traditional convolutions. The structures of C3Faster and FastBottleneck are illustrated in Figures 8A and 8B respectively. In the C3Faster module, FastBottleneck is repeated n times, and the outputs are concatenated at the end. The FastBottleneck structure includes a PConv2d layer situated between two $1\times 1$ convolution layers. This design features an inverted residual block where PConv2d is placed between two convolution layers to process and transform feature maps using the $s p l i t \_ c a t$ method to mix and handle channel data. The two $1\times 1$ convolution layers positioned before and after the PConv2d layer adjust the number of channels: The first $1\times 1$ convolution layer maps from c1 to expanded $C_{ }$ while The second $1\times 1$ convolution layer maps from $C_{ }$ back to output channel count c2 A shortcut connection is present between the input and output feature maps, allowing for the direct addition of input features to output features, provided that the input and output channel counts are the same. This design facilitates the reuse of input features, thereby enhancing network training stability and performance.

FIGURE 8.

C3Faster in NFE-YOLO.(A) C3Faster Module.(B) FastBottleneck Bottleneck Module.(C) PConv Module.

Show All

The FastBottleneck architecture, based on Partial Convolutions (PConv), represents an efficient network design. As shown in Figure 2C, partial convolutions offer similar feature map transformations as standard convolutions but with fewer parameters, optimizing computational efficiency and model accuracy. Many research efforts have focused on reducing Floating Point Operations Per Second (FLOPs) to design faster neural networks. However, a reduction in FLOPs does not necessarily correspond to a decrease in latency due to low efficiency in Floating Point Operations Per Second (FLOPS), largely attributed to frequent memory accesses by operators, especially in deep convolutions. Therefore, we introduce a novel local convolution (PConv) that extracts spatial features more efficiently by reducing redundant computations and memory accesses.

E. NFE-YOLO

To enhance drone target detection accuracy, achieve a lightweight model, and reduce deployment costs, we introduce the efficient detection model NFE-YOLO based on YOLOv8n. Figure 5 illustrates the overall structure of the network. Compared to YOLOv8, we have made optimizations including Neck Structure Improvement: connecting high-resolution feature maps from the first C2F layer to the neck via a 4x downsampling output layer focusing on small objects; removal of the 32x downsampling output layer in the neck further optimizing for drone detection and lightweight requirements. To improve the accuracy of drone target detection, achieve a lightweight model, and reduce deployment costs, we propose the efficient detection model NFE-YOLO based on YOLOv8n. The overall structure of the network is illustrated in Figure 5. In comparison to YOLOv8, several optimizations have been made: (1) Neck Structure Improvement: The high-resolution feature maps from the first C2F layer are connected to the neck via a 4x downsampling output layer, focusing the model on small objects. The 32x downsampling output layer in the neck has been removed to further optimize the model for drone detection and lightweight requirements. (2) Replacement of C2f with C3Faster: The C2f module in the neck has been replaced with the C3Faster module, which replaces C2f modules with C3 modules in the neck network. This change enables more efficient feature extraction and comprehensive spatial context processing. The FastBottleneck structure is used to optimize the bottleneck, substituting traditional convolutions with PConvs, enhancing model performance, reducing computational overhead, and making the model more compact and efficient. (3) Integration of EOrthoNet Mechanism: The EOrthoNet channel attention mechanism is integrated into the neck of the network. By incorporating EOrthoNet, the stability, generalization, and robustness of the system are significantly improved, thereby enhancing the accuracy and consistency of drone target recognition. (4) Addition of a Small Object Detection Head: A small object detection head has been added to the detection model to better accommodate the size variations of drones during flight, improving detection accuracy and robustness. This small object detection head enables more precise identification of drone features, significantly increasing detection accuracy while reducing computational requirements and improving model validation performance.

SECTION V.

Experiment

We will begin by providing a detailed description of the experiments, followed by an assessment of the effectiveness of our methods and modular network for detecting “low, slow, and small” drone targets on both public and custom datasets.

A. Dataset and Its Preprocessing

This paper selects the Real World, DUT Anti-UAV, and UESTC Anti-UAV datasets for experimental validation. In computer vision tasks, selecting the appropriate dataset is crucial for developing robust models. Therefore, there is a continuous proposal of datasets for drone detection. The following describes several relatively comprehensive drone datasets:Real World: This dataset consists of 56,821 images with a resolution of $640\times 480$ pixels. DUT Anti-UAV: This dataset comprises 10,000 images - 5,200 in the training set, 2,200 in the test set, and 2,600 in the validation set. It includes 10,109 detection objects, with 5,243 objects in the training set; 2,245 objects in the test set; and 2,621 objects in the validation set. The dataset contains a significant proportion of small objects. UESTC Anti-UAV: This dataset covers most drone models available on the market flying primarily in outdoor environments under various weather and lighting conditions.

For our experiments we used standardized sample sizes of $640\times 640$ to balance real-time processing and accuracy requirements. This size not only allows effective deployment of the model on edge devices but also retains essential image information. Our hardware configuration includes a computer equipped with an AMD Ryzen 5 3400G processor (3.70 GHz),16 GB RAM, and two Nvidia Tesla T4 GPUs with 16 GB VRAM each. We employed PyTorch1.11.0 and Torchvision0.11.O as deep learning frameworks, and used Ultralytics YOLOv8 version8.21 along with its official pretrained weights for training. Table 1 details the main parameter settings used during training which incorporated image translation, scaling, horizontal flipping, and mosaic data augmentation techniques along with their respective probabilities.

All network models undergo varying numbers of iterations during training, depending on the dataset and the convergence of the network model.

B. Comparison of Attention Mechanisms

In order to validate the performance of the efficient orthogonal channel attention module EOrthoNet, we conducted evaluations on both the publicly available Real World dataset and our custom UESTC Anti-UAV dataset. We compared the results with those of SENet, FcaNet, and OrthoNet, presenting detailed experimental results in Table 3. Based on comparative experiments, EOrthoNet demonstrates significant improvements in performance. On the Real World dataset, EOrthoNet achieves up to a 0.9% increase in mAP50 and up to a 1.1% increase in mAP75. On the UESTC Anti-UAV dataset, it yields up to a 0.7% improvement in mAP50 and up to a 2.4% improvement in mAP75. These results indicate a particularly notable enhancement in detection accuracy for small targets, indirectly validating the effectiveness of the EOrthoNet channel attention mechanism in enhancing “low, slow, and small” UAV detection performance.

TABLE 2 Training Parameter Settings

TABLE 3 Results of Drone Detection Tasks Using Different Methods on the Real World and UESTC Anti-UAV Datasets

C. Model Comparison Experiments

In addition to testing on publicly available DUT Anti-UAV and Real World datasets, we also evaluated NFE-YOLO network on UESTC Anti-UAV dataset by comparing it with single-stage and two-stage detection methods as well as other networks proposed by researchers to analyze its performance in detecting “low, slow, small” UAVs.

For this study, YOLOv8s and YOLOv8m were selected due to their expanded model architectures and additional parameters compared to baseline model YOLOv8n, to enhance evaluation. As shown in Table 4, YOLOv8m achieved a performance of 95.9% on the DUT Anti-UAV dataset, surpassing the baseline model YOLOv8n. However, YOLOv8m has more parameters and a relatively lower FPS. In contrast, our proposed NFE-YOLO demonstrates higher accuracy, a smaller model size, and faster detection speed, showcasing its superiority in UAV target detection.

TABLE 4 Comparison of Experimental Results

The results of NFE-YOLO demonstrate its ability to maintain high accuracy even with a smaller model size, indicating a significant advantage in balancing FPS and model precision. These findings are practically important as they confirm the effectiveness of NFE-YOLO in recognizing small UAV targets in complex scenarios with both high precision and real-time FPS performance, thereby enhancing the effectiveness of UAV detection in real-world applications.

D. Ablation Experiment

To validate the proposed improvements, this section presents the experimental process through ablation studies. The evaluation results based on different metrics (accuracy, recall, mAP50, mAP75, GFLOPS, model size, and FPS) are shown in Table 5 and Figure 10. The table outlines the impact of various methods: the baseline model (Y), the design of the neck network (N), the addition of small UAV target detection, the introduction of C3Faster lightweight module and P-convolution (F), and integration of EOrtho attention mechanism (E). The table sequentially defines the baseline model Y, improved models Y+N, Y+N+F, and Y+N+F+E while quantitatively discussing changes in five evaluation metrics across these models.

TABLE 5 Ablation Study Results for Various Metrics

FIGURE 9.

Framework for NFE-YOLO: The new high-resolution feature maps and their associated layers are integrated into the neck section. The output from these feature maps is utilized for a new head dedicated to small object detection.

Show All

FIGURE 10.

Comparison of mAP50, FPS, and Model Size Before and After Structural Improvement.

Show All

In the experimental results, although the baseline model achieves a high frame rate (FPS) of 285.7, its accuracy is relatively low. In contrast, the refined model, while reducing the number of parameters, incurs a slight increase in inference time. The optimized model achieves an FPS of 204.1, meeting the real-time deployment requirements. According to the experimental results, adjusting the model structure has proven to be a key factor in improving drone detection accuracy. These structural adjustments led to a 0.8% increase in mAP50 and a 3.2% improvement in mAP75, highlighting the necessity of finer feature map outputs and a more compact object detection head. Meanwhile, removing the large object detection head reduced the model size by 33.7%, decreasing the overall weight. By replacing the original C2f module and traditional convolution with the C3Faster module and P convolution in the neck network, we reduced GFLOPS by 14.29% to 10.8 while maintaining model accuracy and FPS. Compared to the Y+N model, the optimized model size was reduced to 3.83 MB. These experimental results demonstrate that without sacrificing accuracy, reducing the number of parameters makes the model more compact and lowers computational complexity, thereby enhancing overall performance.

We improved the model accuracy by incorporating EOrthoNet into the Y+N+F model. This resulted in mAP50 and mAP75 reaching [Values]. This demonstrates that integrating an efficient channel attention mechanism (EOrthoNet) into the network can effectively utilize multi-scale information, allowing the model to better handle multi-scale input data and positively impact bounding box regression. These enhancements not only improved the model’s accuracy but also maintained a high FPS.

In applications requiring real-time processing, a decrease in FPS may result in an inability to respond promptly, thereby affecting safety and decision-making capabilities. Furthermore, in resource-constrained devices, a decline in FPS could imply that application requirements are not met, limiting the practical applicability of the model. In user interaction systems, lower FPS may affect fluidity, leading to a subpar user experience, which in turn impacts the system’s acceptance. In multi-object detection tasks, reduced FPS may diminish the capability to detect and track rapidly moving targets, affecting the overall performance of the task. In complex environments, a decrease in FPS could lead to detection failures or reduced accuracy, compromising the reliability of the model in practical applications. Therefore, to explore the trade-offs between FPS and detection accuracy more thoroughly, future research should focus on this issue, particularly in scenarios where a decline in FPS may have severe or negligible impacts on application contexts. Such an analysis will contribute to a more comprehensive understanding of the different aspects of model performance.

SECTION VI.

Conclusion

This study addresses the challenge of detecting “Low, Slow, Small” drones by introducing and evaluating a lightweight and efficient drone detection model, NFE-YOLO. Module comparisons and drone detection were performed using the Real World, DUT Anti-UAV, and UESTC Anti-UAV datasets. In comparison to the baseline YOLOv8 model, our method includes several enhancements:(a) The network’s neck structure was improved by integrating a new detection head specifically for small drone detection. This enhancement increased detection accuracy while removing insensitive detection heads, thus reducing the model’s parameters.(b) The EOrthoNet, an efficient orthogonal attention mechanism variant of ECANet, was introduced into the neck. This channel attention mechanism utilizes orthogonal squeezed filters to generate effective channel attention, maximizing the extraction of relevant feature information.(c) The original C2f modules and traditional convolutions were replaced with C3Faster modules and P-convolutions to reduce computational and model complexity. These improvements aim to enhance the model’s accuracy in detecting “Low, Slow, Small” drones while reducing model complexity. As a result, a highly efficient and reliable lightweight model is proposed that achieves high detection accuracy with a model size of only 3.72 MB. This makes it suitable for embedded devices and mobile terminals in practical applications within this field of research. We have introduced the UESTC Anti-UAV dataset for drone detection, which comprises 10,333 drone objects from 10,099 images. The dataset is divided into training, testing, and validation subsets. In our study, we compared our proposed model with several baseline YOLOv8 models and the latest drone detection models from the DUT Anti-UAV public dataset. Our method outperformed YOLOv8n with a 1.9% increase in mAP50, while also having a smaller model size and GFLOPS. Experimental results demonstrate that our method meets the requirements for high-speed and precise drone detection. In summary, the proposed drone detection model demonstrates excellent real-time performance and accuracy, meeting the demands for rapid drone detection in real-world scenarios. The model maintains high processing speed even with limited computational resources, making it suitable for deployment in embedded systems and applications. The model employed in this study has certain limitations across multiple aspects. Firstly, the current dataset lacks sufficient diversity, which may lead to suboptimal performance of the model in specific environments or scenarios. Secondly, the model’s adaptability when faced with unseen data still requires further validation. Therefore, future work should focus on expanding the dataset to encompass a broader range of application contexts and conditions, thereby enhancing the model’s generalization ability. Additionally, improving the model architecture to increase its efficiency and accuracy represents another important research direction. This may involve exploring new technologies or optimizing existing algorithms to strengthen the model’s performance in complex tasks. Ultimately, these efforts will contribute to enhancing the model’s practicality and reliability, promoting its prospects for real-world applications.

References is not available for this document.

NFE-YOLO: A Lightweight and Efficient Detection Network for Low, Slow, and Small Drones

Abstract:

Metadata

Abstract:

Introduction