Introduction
Jaboticaba bears sweet-sour, juicy fruits whose rich taste resembles mangosteen and pineapple, and it is rich in vitamins, amino acids, and polyphenols. Anthocyanins and other phenolic compounds can also be extracted from the fruit skin and leaves for research on and treatment of diseases such as diabetes [1], [2], [3]. Owing to its distinctive growth characteristics, high utilization value, and favorable cultivation economics, it has gradually developed into a new tropical fruit crop in China. Despite its value as an agricultural product, however, the Jaboticaba industry is still in its infancy and has not yet reached a large scale [4]. Traditional manual identification methods are time-consuming, labor-intensive, and strongly affected by weather and environment. In recent years, UAV (unmanned aerial vehicle) imaging has been increasingly applied in agriculture, particularly in crop identification, where it shows clear advantages. UAV imaging provides high-resolution, frequently updated images with operational flexibility and low flight cost. Compared with traditional remote sensing, UAV low-altitude photography is less affected by weather, offers flexible operation, and delivers large-scale, high-precision, high-overlap images, which improves the reliability of subsequent image processing. These advantages offer new technical means for agricultural production, making crop identification and management more efficient and accurate; UAV imaging has already shown great potential in crop growth monitoring, pest control, and farmland area statistics.
Drawing on the successful experience of UAV technology in identifying other crops, researchers have begun to focus on methods for identifying Jaboticaba based on UAV images. The unique growth characteristics and industrial development needs of Jaboticaba require more efficient and accurate identification methods. The advantages of UAV imaging technology provide new possibilities for the identification of Jaboticaba, promoting large-scale planting and intelligent management of the Jaboticaba industry.
Orthophoto of the Jaboticaba orchard, located in Nanxiong City, Guangdong Province.
Currently, research on identification methods based on UAV images mainly focuses on deep learning models [5], including LeNet-5 [6], CaffeNet [7], AlexNet [8], VGG [9], YOLO [10], and R-FCN [11]. For example, Pantazi et al. used artificial neural networks to identify crops with an accuracy of 98% [12]. Kumarapu et al. precisely evaluated bridge deformations using UAV imagery [13]. Singh et al. utilized UAV photography for real-time vehicle monitoring [14]. The contributions of Sharma's UAV image stitching and recognition techniques to feature alignment further demonstrate the feasibility of UAV-based image recognition [15].
YOLO, as a deep learning object detection algorithm, detects both the category and location of objects in a single pass. Compared to traditional methods, it offers shorter computation times, lower costs, and broader applicability, and many scholars have applied the YOLO algorithm in related fields [16]. For example, Zhang et al. combined a style generative adversarial network with adaptive discriminator augmentation and an improved YOLOv7 to detect grape black rot and black spot disease, achieving a lesion detection accuracy of 94.1%, an improvement of 5.7 percentage points over the original algorithm [17]. Wang et al. used MPDIoU as the loss function of YOLOv8n and added a small-object detection layer and SCConv to alleviate missed detections of camellia fruits while maintaining detection speed [18]. Li et al. proposed a mulberry trunk identification model based on an improved YOLOv5 to replace the manual positioning currently required by mulberry leaf picking equipment, improving model accuracy by 16.9% [19]. Wu et al. proposed the EUAVDet-n model, which contains only 1.34M parameters and achieves over 20 fps on the Jetson Nano, offering a better balance between detection accuracy and inference speed for edge-based UAV applications [20]. Chen et al. proposed the EL-YOLO model, which improves mAP50 by 12.4% and 1.3% on the VisDrone2019-DET and AI-TOD datasets, respectively, outperforms YOLOv8s by 2.8% and 10.7% in mAP50, and runs at 24 fps and 35 fps for the Small and Nano variants, respectively, making it promising for UAV-based small object detection [21].
Due to the limitations in computational resources and memory space on UAV platforms, the parameter size of deep learning models must be effectively controlled. Traditional object detection models, although performing well in terms of accuracy, have high parameter sizes and complex computational demands that hinder their practical application on UAV platforms. To address the challenge of limited computational capacity, this study proposes an efficient and lightweight object detection model, CRE-YOLO. By incorporating the Cross-Scale Feature Fusion Module (CCFM), RepConv and Depthwise Convolution Block (RepDWBlock), and Efficient Channel Attention (ECA) mechanisms, the CRE-YOLO model significantly reduces parameter size while improving detection accuracy and processing speed. Compared to the traditional YOLOv5, the CRE-YOLO model is specifically designed for UAV platforms, allowing for efficient operation under constrained computational and memory resources.
The experimental results show that the application of CRE-YOLO in Jaboticaba orchards achieves high detection accuracy and speed, enabling fast and accurate identification of Jaboticaba trees. The successful implementation of this model provides technical support for the intelligent management of agricultural UAV platforms, contributing to the scalability and smart management of Jaboticaba cultivation, enhancing agricultural productivity, and promoting precision agriculture.
Related Work
A. Sample Augmentation and Annotation
To ensure the accuracy and richness of the data, this study manually annotated the image data for samples with an occlusion area greater than or equal to 50%, resulting in 549 representative data samples. To enhance the dataset’s diversity and simulate real-world conditions, data augmentation techniques were applied during preprocessing. These included horizontal flips, random rotations, and HSV color enhancements, which were specifically designed to simulate variations in lighting and noise. The use of these augmentations ensured the robustness of the model under different environmental conditions. As a result, the augmented dataset now consists of 1098 Jaboticaba images, providing a comprehensive representation of various real-world scenarios.
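As an illustration, a minimal augmentation pipeline of this kind can be written with Albumentations; the rotation range, HSV shift limits, and the dummy image below are assumed values for demonstration, not the exact settings used in this study.

```python
import numpy as np
import albumentations as A

# Assumed parameter values; the study's actual rotation range and HSV limits
# may differ.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.Rotate(limit=30, p=0.5),
        A.HueSaturationValue(hue_shift_limit=15, sat_shift_limit=25,
                             val_shift_limit=25, p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Dummy image tile and one YOLO-format box (x_center, y_center, w, h), normalized
image = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)
bboxes = [[0.52, 0.48, 0.10, 0.12]]
out = augment(image=image, bboxes=bboxes, class_labels=[0])
aug_image, aug_bboxes = out["image"], out["bboxes"]
```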
Orthophoto identification diagram of Jaboticaba Orchard, showing labeled sections for data annotation and model evaluation.
The model’s ability to handle complex conditions, such as overlapping objects and varying lighting, is significantly enhanced by both the data augmentation techniques and the robust architecture of the network. For overlapping objects, the model benefits from the diverse dataset, which includes samples with occlusions greater than 50%, allowing it to learn how to detect partially visible objects. Additionally, data augmentations such as random rotations and horizontal flips simulate different perspectives, helping the model handle cases where objects may overlap or appear from various angles. Regarding varying lighting conditions, the HSV color enhancements applied during preprocessing allow the model to learn to recognize objects under different lighting intensities, shadows, and color variations. These strategies ensure that the model remains robust to real-world environmental factors, enabling accurate object detection even in challenging conditions.
B. Data Visualization
The distribution of dataset labels is a valuable tool for describing the sample distribution of different categories within the dataset. As shown in Fig. 3, the distribution of labels includes:
The upper left corner shows the distribution of label samples across categories; since this study involves single-target recognition, all samples are labeled as Jaboticaba trees.
The upper right corner shows the centroid positions, displaying the horizontal and vertical coordinates of the object centroids, which helps reveal any spatial bias in the detection dataset.
The lower left corner shows the distribution of object centroid positions, highlighting where samples are concentrated in the image.
The lower right corner shows the object size distribution, with the horizontal axis representing object width and the vertical axis representing object height.
Dataset label diagram, showing annotated labels for the Jaboticaba orchard images used in the study.
These plots provide important insights into the usability and reliability of the dataset and show that the samples in this study are balanced across orientations, which supports multi-directional recognition. Interpreting the chart in this way helps to better understand the characteristics and distribution of the dataset.
Methods
To enhance the applicability of the YOLOv5 algorithm for aerial detection of Jaboticaba trees, several optimizations are required. First, a lightweight backbone is used for feature extraction so that the algorithm can be deployed on the target device, while richer features are extracted through an improved Neck network. To this end, this study integrates the ECA channel attention mechanism, which extracts features from the input image effectively with few additional parameters and thereby improves inference speed. To handle targets of different scales, the original feature pyramid structure of YOLOv5 is retained, using feature maps from multiple layers to detect targets of varying sizes. To enrich the feature maps with semantic information, the CCFM feature fusion structure is adopted in the Neck, replacing the original FPN+PAN feature fusion structure of the YOLOv5 detector, as shown in Fig. 4. Additionally, because a self-made dataset is used, an anchor-free form is adopted instead of the built-in COCO anchors.
Framework diagram of the improved CRE-YOLO model, with ECA and RepDWBlock modules added to enhance feature extraction and model performance.
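To make the Neck modification concrete, the sketch below illustrates a CCFM-style top-down fusion path in PyTorch: the deeper feature map is upsampled, concatenated with the adjacent shallower map, and passed through a fusion block (a plain convolution here as a stand-in for the RepDWBlock described later). The channel widths and the omitted bottom-up path are assumptions for illustration only, not the exact CRE-YOLO configuration.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Stand-in fusion unit (replaced by RepDWBlock in this paper):
    a 3x3 convolution that merges concatenated multi-scale features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU())

    def forward(self, x):
        return self.conv(x)

class CCFMTopDown(nn.Module):
    """Illustrative cross-scale fusion (top-down path only): upsample the
    deeper map, concatenate it with the shallower one, then fuse."""
    def __init__(self, c3=128, c4=256, c5=512):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse_p4 = FusionBlock(c4 + c5, c4)
        self.fuse_p3 = FusionBlock(c3 + c4, c3)

    def forward(self, p3, p4, p5):
        p4 = self.fuse_p4(torch.cat([p4, self.up(p5)], dim=1))
        p3 = self.fuse_p3(torch.cat([p3, self.up(p4)], dim=1))
        return p3, p4, p5  # multi-scale maps passed to the detection heads

# Feature maps at strides 8, 16, 32 for a 640x640 input
p3, p4, p5 = (torch.randn(1, 128, 80, 80),
              torch.randn(1, 256, 40, 40),
              torch.randn(1, 512, 20, 20))
outs = CCFMTopDown()(p3, p4, p5)
```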
A. ECA Module
In recent years, channel attention mechanisms introduced into convolutional blocks have attracted widespread interest and shown great potential for performance improvement. One representative method is SENet [22], which learns the channel attention of each convolutional block and brings significant performance gains to various deep CNN architectures. The ECANet method [23] also draws inspiration from SENet, enhancing feature representation by dynamically adjusting the importance of channels in the CNN model. The channel attention mechanism is designed to enhance the ability of convolutional neural networks (CNNs) to capture relationships between different channels.
In traditional CNNs, the feature maps of each channel are considered independent, and all channels are deemed equally important. However, in real-world images, different channels may exhibit different relationships and importance levels. The ECA attention mechanism solves this problem by adaptively adjusting channel importance, allowing the network to more effectively focus on task-related information. Studies have shown that this approach is very effective in improving the model’s ability to represent key channel features.
In the proposed network, the main purpose of introducing ECA is to enhance the model's ability to express key channel features. By assigning a weight to each channel and exchanging information between channels through a one-dimensional convolution with kernel size k, ECA allows the network to better capture relationships between different channels and thereby improves its representation ability. By adjusting the parameters that determine the kernel size k, the extent of local cross-channel interaction can be controlled.
In this study’s network, after introducing the ECA module, the channel relationships of the model were better captured and expressed, enhancing feature representation ability. ECA generates the weight of each channel through one-dimensional convolution and a sigmoid function, and these weights are used to weight the channel features, improving model accuracy and inference speed. Furthermore, by adjusting the hyperparameters of ECA, the concentration and reference point of attention distribution can be controlled, further optimizing the model’s attention to different channel features.
By integrating the ECA module, the proposed network reduces the number of parameters while maintaining model accuracy and significantly improving target detection speed. The introduction of the ECA mechanism not only enhances the model’s representation ability but also improves its performance in handling targets of different scales, making the improved model perform better in practical applications.
Specifically, the basic idea behind ECA channel attention is to assign a weight to each channel. Information exchange between channels is achieved through shared learning parameters and a one-dimensional convolution with kernel size k, and the convolution results are mapped to values between 0 and 1 by a sigmoid function, which serve as attention weights. The channel weights are generated by a one-dimensional convolution of size K as follows:\begin{equation*} \omega =\sigma (C1D_{K}(y)) \tag {1}\end{equation*}
where y denotes the aggregated channel descriptor obtained by global average pooling, C1D_K denotes a one-dimensional convolution with kernel size K, and σ denotes the sigmoid function.
Since the channel dimension is usually a power of two, the kernel size k can be related to the channel dimension C through an exponential mapping:\begin{equation*} C=\phi (k)=2^{(\gamma k-b)}. \tag {2}\end{equation*}
Given the channel dimension C, the kernel size k can therefore be determined adaptively as\begin{equation*} k=\psi (C)=\left |\frac {\log _{2}(C)}{\gamma }+\frac {b}{\gamma }\right |_{odd} \tag {3}\end{equation*}
Parameters γ and b control how the kernel size k scales with the channel dimension C (the values γ = 2 and b = 1 recommended in [23] are commonly used), and the subscript odd in (3) indicates that k is rounded to the nearest odd number.
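A minimal PyTorch sketch of this mechanism, following Eqs. (1)-(3) with the γ = 2, b = 1 defaults from [23], is given below; the exact integration point and hyperparameters in CRE-YOLO may differ.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention sketch: channel weights come from a 1-D
    convolution of kernel size k over the globally pooled channel descriptor,
    followed by a sigmoid (Eq. 1); k adapts to the channel count (Eq. 3)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Eq. (3): kernel size adapted to the channel dimension, forced odd
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        n, c, _, _ = x.shape
        y = self.avg_pool(x).view(n, 1, c)                 # (N, 1, C) descriptor
        y = self.sigmoid(self.conv(y)).view(n, c, 1, 1)    # Eq. (1): weights in (0, 1)
        return x * y                                       # reweight channel features

# Example: apply ECA to a 256-channel feature map
feat = torch.randn(1, 256, 40, 40)
out = ECA(256)(feat)
```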
B. Feature Fusion Module
Traditional Feature Pyramid Networks (FPN) [24] have certain limitations, including feature information loss and ambiguity, which lead to incomplete information transmission and reduced feature resolution. Both the complexity and the accuracy of traditional FPN structures leave room for improvement.
According to Zhao et al. [25], their proposed Cross-Scale Feature Fusion Module (CCFM), based on neural networks, addresses this problem. The authors apply self-attention to higher-level features with richer semantic concepts, capturing relationships between conceptual entities in an image, which helps subsequent modules detect and recognize targets. Conversely, because lower-level features lack semantic concepts and their interactions with higher-level features risk redundancy and confusion, scale interactions within lower-level features are unnecessary.
CCFM uses a structure similar to PANet, inserting Fusion Blocks composed of convolution layers into the fusion path to merge features of adjacent scales. In this study, the Fusion Block is replaced with the RepDWBlock module, in which re-parameterized convolution (RepConv) reduces the number of parameters while preserving fusion performance [26].
This module achieves efficient, flexible, and robust feature extraction capabilities by combining depthwise separable convolution (DWConv), RepConv modules, residual connections (Shortcut), and structural reparameterization techniques. DWConv significantly reduces computational and parameter quantities, improving computational efficiency. The RepConv module uses a multi-branch structure to extract rich feature information during training and simplifies it to a single convolution layer during inference, enhancing inference speed. Residual connections alleviate gradient vanishing problems, promoting gradient flow, and improving model training efficiency and stability. Furthermore, by stacking multiple RepConv modules, the module depth can be flexibly adjusted to meet different task requirements. Structural reparameterization techniques separate usage during training and inference stages, balancing high performance and efficiency.
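The sketch below illustrates one way such a block could be composed in PyTorch, assuming the RepVGG-style multi-branch design described above; the actual channel widths, activation functions, and branch-fusion details of the authors' RepDWBlock may differ. After training, the three RepConv branches can be folded into a single 3x3 convolution following the standard structural re-parameterization procedure [26].

```python
import torch
import torch.nn as nn

class RepConv(nn.Module):
    """RepVGG-style unit: 3x3 + 1x1 + identity branches during training,
    collapsible into one 3x3 convolution for inference."""
    def __init__(self, channels):
        super().__init__()
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.identity = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # Training-time forward: sum of the three branches, then activation.
        return self.act(self.branch3x3(x) + self.branch1x1(x) + self.identity(x))

class RepDWBlock(nn.Module):
    """Hypothetical RepDWBlock: depthwise separable convolution followed by
    stacked RepConv units and a residual (shortcut) connection."""
    def __init__(self, in_ch, out_ch, n=1):
        super().__init__()
        # Depthwise separable convolution: depthwise 3x3 then pointwise 1x1
        self.dw = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.SiLU(),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.SiLU())
        self.reps = nn.Sequential(*(RepConv(out_ch) for _ in range(n)))
        self.add = in_ch == out_ch  # shortcut only when shapes match

    def forward(self, x):
        y = self.reps(self.dw(x))
        return x + y if self.add else y
```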
Experimentation
A. Experimental Environment
YOLOv5 is a mainstream algorithm in the field of object detection, dynamically adjusting network structure and depth according to the application area. It is divided into four versions: v5s, v5m, v5l, and v5x. Considering the limited computing power and memory space of UAV platforms, this study chooses the YOLOv5s version, which has the fewest parameters, for the comparison experiments.
The experimental platform is an Intel(R) Core(TM) i7-14700KF CPU with an NVIDIA GeForce RTX 2080 Ti (22 GB) GPU. The experiments use PyTorch 2.2.2 and CUDA 12.1. The training parameters are shown in Table 2.
B. Evaluation Metrics
The evaluation metrics used in the experiments are precision, recall, mean average precision (mAP), the number of parameters (Params), and frames per second (FPS).\begin{align*}Precision& =\frac {TP}{TP+FP} \tag {4}\\ Recall& =\frac {TP}{TP+FN} \tag {5}\end{align*}
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
AP refers to the area under the PR curve, which is the average precision across different recall points. The calculation formula is as follows:\begin{equation*} AP = \int _{0}^{1} p(r) \, \mathrm {d}r \tag {6}\end{equation*}
AP: Average Precision, which is the area under the precision-recall curve.
p(r): Precision at recall value r.
The integral from 0 to 1 averages the precision over all recall levels, yielding the overall average precision.
mAP@0.5 refers to the mAP value when the Intersection over Union (IoU) threshold is set to 0.5. It is calculated by computing the Average Precision (AP) of each class and averaging over classes. The formula is as follows:\begin{equation*} mAP = \frac {1}{K} \sum _{i=1}^{K} AP_{i} \tag {7}\end{equation*}
mAP: Mean Average Precision, which is the average of the Average Precision (AP) values for all classes.
K: The total number of classes in the dataset.
AP_i: The Average Precision for class i, which is calculated by integrating the precision-recall curve for that class.
The formula computes the mean of the AP values for all K classes to obtain the mAP value, which is a common evaluation metric in object detection tasks.
mAP@0.95 (i.e., mAP@0.5:0.95) refers to the average mAP value when the IoU threshold ranges from 0.5 to 0.95 in steps of 0.05. FPS measures the inference speed:\begin{equation*}\mathrm {FPS}=\frac {1}{t} \tag {8}\end{equation*}
where t is the inference time per image.
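As a numerical illustration of Eqs. (6) and (7), the sketch below computes AP by integrating a precision-recall curve (using the common monotone-envelope interpolation) and averages per-class APs into mAP; the input values are made up for demonstration.

```python
import numpy as np

def average_precision(recall, precision):
    """Eq. (6): area under the precision-recall curve, computed numerically
    after making precision monotonically non-increasing."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Envelope: precision at recall r is the max precision at any recall >= r
    p = np.maximum.accumulate(p[::-1])[::-1]
    return np.trapz(p, r)

def mean_average_precision(ap_per_class):
    """Eq. (7): mAP is the mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

# Toy single-class example with illustrative values
recall = np.array([0.1, 0.4, 0.7, 0.9])
precision = np.array([1.0, 0.9, 0.8, 0.6])
ap = average_precision(recall, precision)
print(ap, mean_average_precision([ap]))
```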
C. Results and Discussion
From Table 1, it can be seen that the CRE-YOLO model proposed in this study significantly reduces computational complexity compared to the original YOLOv5s model. The number of parameters is reduced by 54%, greatly optimizing memory usage during inference. Compared with YOLOv5s, the proposed model shows a slight improvement in Precision but a decrease in Recall; overall, its accuracy improves, although its coverage is somewhat narrower than that of the original model. In terms of mAP, the proposed model shows a clear improvement: mAP@50 reaches 0.971, 0.7% higher than the original model, and mAP@95 reaches 0.603, 2.5% higher. In terms of inference speed, the proposed model exceeds YOLOv5s, processing 387 frames per second, 96 frames more than the original, a 24% speed increase.
From Table 3, it can be seen that in terms of FPS the proposed model is far behind Faster-RCNN. However, considering the tight memory constraints of UAV platforms and other edge devices, the proposed model has 109 times fewer parameters than Faster-RCNN, and among the lower-parameter models it is the best performer. In terms of mAP@50, CRE-YOLO also achieves the highest score, 1% higher than the second-best model, indicating a slight advantage in accuracy.
As shown in Fig. 7, the proposed model significantly reduces missed detections and improves small-object detection thanks to the cross-scale feature fusion module. Overall, the proposed model shows advantages in both accuracy and inference speed. The CRE-YOLO algorithm is a fast, high-precision algorithm for single-target tree recognition, suitable for UAV platforms.
D. Application in Jaboticaba Orchards
The model introduced in this paper provides a feasible solution for aerial identification of jaboticaba trees, having successfully identified over 13,000 trees. This capability holds significant implications for precision agriculture, allowing farmers to monitor and manage their orchards more effectively.
For the development of subsequent systems, this study proposes a numbering management system, as shown in Fig. 8. Each identified jaboticaba tree is assigned a unique identification number and entered into a database, providing a reliable solution for precision agriculture. This system can facilitate better resource allocation, improve yield tracking, and enhance overall orchard management efficiency. The ability to quickly and accurately identify trees opens the door to implementing advanced agricultural practices, such as targeted fertilization and pest control, ultimately contributing to sustainable farming practices.
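A minimal sketch of how such a numbering scheme could be implemented is given below: detections are assigned sequential IDs and stored in SQLite. The table schema, field names, and coordinate convention are assumptions for illustration and not the authors' actual system.

```python
import sqlite3

def register_trees(detections, db_path="jaboticaba_trees.db"):
    """Assign each detected tree a sequential ID and store its image-space
    bounding box and confidence in a small SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trees ("
        "tree_id INTEGER PRIMARY KEY, x_center REAL, y_center REAL, "
        "width REAL, height REAL, confidence REAL)")
    for i, (x, y, w, h, conf) in enumerate(detections, start=1):
        conn.execute(
            "INSERT OR REPLACE INTO trees VALUES (?, ?, ?, ?, ?, ?)",
            (i, x, y, w, h, conf))
    conn.commit()
    conn.close()

# Example: two detections in (x_center, y_center, width, height, confidence) form
register_trees([(512.0, 340.5, 46.0, 48.0, 0.93),
                (880.2, 415.7, 50.5, 52.1, 0.88)])
```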
Comparison of results diagram, showing a comparison between the improved CRE-YOLO model and the original YOLOv5 model in terms of key performance metrics.
Conclusion
In response to the time-consuming and labor-intensive issues of manual management in Jaboticaba orchards, this paper presents the following key points:
Advantages of Drone Remote Sensing Technology: The adoption of drone remote sensing technology combined with deep learning methods can efficiently collect high-quality data, enabling the automatic identification and extraction of Jaboticaba trees, thus laying the data and theoretical foundation for the future intelligent management of the orchard.
Innovation of the CRE-YOLO Algorithm: This study generates orthophotos from high-resolution drone imagery and proposes the CRE-YOLO algorithm to address the limited computing power of current drone platforms and the insufficient accuracy of object recognition.
Model Optimization and Performance Improvement: Experimental results indicate that the effective integration of the CCFM structure and ECA attention mechanism in CRE-YOLO reduces the model parameter count by 54% relative to the original structure, significantly alleviating the computing-power demands on drone platforms. In addition, the recognition accuracy (mAP@95) increases by 2.5%, meeting the requirements of practical applications.
Future Research Directions: This study plans to integrate other edge devices and datasets of different crops to further explore models with strong generalization capabilities, providing feasible solutions for the application of drones in agriculture.
This paper emphasizes the potential and practical significance of drone technology in agricultural management. It not only provides a feasible solution for improving the management efficiency and production effectiveness of Jaboticaba orchards but also offers valuable insights and references for the intelligent transformation of other agricultural sectors. However, real-world implementation of drone-based systems often faces challenges, including varying environmental conditions, occlusions, and lighting issues that can affect the accuracy of data collection and model performance. In this study, we addressed these challenges by employing data augmentation techniques and optimizing the model architecture, such as incorporating the ECA and RepDWBlock modules, to ensure robust performance under different conditions. Additionally, our model’s lightweight design, with significantly fewer parameters compared to traditional models, demonstrates excellent performance on drone platforms where computational resources are limited. This lightweight architecture not only enhances real-time processing capabilities but also enables efficient deployment in field applications. With the continuous advancement of technology and deepening applications, we anticipate that the integration of drone and deep learning solutions will drive the modernization of the entire agricultural industry, achieving higher levels of resource utilization and sustainable development.