Introduction
The exponential growth of the global population, together with urbanization and industrialization, has given rise to environmental and societal concerns such as plastic pollution, traffic congestion, pedestrian safety hazards, harmful emissions, and infrastructure degradation. Remote sensing technology offers a cost-effective and efficient means of monitoring these problems, combining imaging and digital detection to support sustainable development [1].
Object detection in remote sensing imagery is a crucial component of remote sensing technology. It employs specialized algorithms to identify objects of interest within these images, ranging from waste to vehicles, aircraft, and pedestrians. The rapid advancement of wireless networks has made such object detection techniques readily applicable to environmental monitoring, thereby fostering sustainable environmental development [2]. For instance:
Disaster Monitoring: For sudden natural disasters, such as floods [2], snowstorms [3], and forest fires [4], remote sensing image object detection technology enables quicker responses and more accurate assessments of disaster situations.
Natural Resource Monitoring: This technology allows real-time detection of vegetation and other natural environments, facilitating timely responses to soil erosion and land desertification, thus supporting the sustainable use of resources [5].
Environmental Pollution Monitoring: In cases of illegal discharges from sewage or chemical plants, remote sensing image object detection technology can promptly identify and address these violations, contributing to sustainable environmental development [6].
Urban Monitoring: Remote sensing image object detection technology can identify and track vehicles, pedestrians, and other objects in urban areas, improving traffic management, enhancing pedestrian safety, and supporting infrastructure maintenance efforts [7], [8].
Remote sensing images, captured from an aerial perspective, present a unique challenge as objects within them are oriented in various directions. Generic object detection models, including those based on transformers [9], one-stage detectors [10], [11], and two-stage detectors [12], often struggle with angle generalization, rendering them ineffective at processing the angular information of these objects. When these models employ horizontal bounding boxes to identify objects oriented in multiple directions, they tend to capture excessive irrelevant background [13], [14], further complicating the detection process. Research has demonstrated that two-stage detectors generally outperform other approaches on autonomous aerial vehicle (AAV) datasets [15], underscoring the importance of suitable backbones for feature extraction. However, the existing backbones, which are primarily designed for natural images, may not be optimal for AAV images used in environmental monitoring [16], [17]. This misalignment leads to imprecise localization and an increased likelihood of false detections, particularly in dense scenes. To address these challenges, specialized oriented object detection models for remote sensing images have been developed. Nonetheless, many of these methods face significant limitations in effectively detecting objects, especially when dealing with numerous small- and medium-sized targets. The following sections will review some of the most prominent generic object detection methods as well as specific approaches tailored for AAV object detection, aiming to provide a comprehensive understanding of the current landscape and the advancements needed to enhance detection accuracy and efficiency in remote sensing applications.
Background Review
In the early days of object detection in aerial imagery, the lack of learned image representations meant that detection algorithms relied on handcrafted features [18]. Architectural advances, such as AlexNet [19], LSTM [20], and ResNets [21], together with large-scale datasets, such as ImageNet, PASCAL Visual Object Classes (VOC), and Microsoft Common Objects in Context (MS-COCO) [22], led to influential families of object detectors: transformers [9], [23]; single-stage detectors, such as you-only-look-once (YOLO) and its successors [24], [25]; and two-stage detectors, such as the region convolutional neural network (R-CNN) family [12], [15], [26]. The YOLOv5 architecture adopts a cross-stage partial backbone named CSP-Darknet53 and computes a multitask loss at each prediction layer [25]. Transformer architectures, which excel at modeling long-range dependencies, can capture the relationships between objects in an image and the spatial context in which they appear [27], [28]. The detection transformer (DETR) [23] directly predicts object bounding boxes and class probabilities without hand-designed components such as anchor generation and nonmaximum suppression.
While their performance on datasets dominated by large objects is remarkably high, object detectors, such as YOLOv5, Mask R-CNN (MRCNN), and DETR, perform poorly on AAV data [15], [29]. A primary reason for this drop is the uneven distribution of object sizes; in addition, the high capture altitude makes it difficult to extract the features on which the later stages of these models depend. One machine learning paradigm that can help in this setting is multitask learning (MTL) [30]. MTL allows a model to perform multiple tasks simultaneously, sharing information between tasks to improve their joint performance [31]. Soft parameter sharing, used in natural language processing (NLP) [32], maintains a separate network for each task and regularizes the distance between the task networks' parameters. Hard parameter sharing [33], [34] uses hidden layers shared among all tasks while keeping a few task-specific output layers [35]. HydraNet [33] is a hard-parameter-sharing architecture designed to reduce inference cost; it incorporates MTL through gating and branching, selecting branches trained on smaller subsets of classes and combining their salient features for the prediction output.
For detecting small objects in AAV datasets, the shared representation obtained through hard parameter sharing and the regularization inherent in MTL models are the two main advantages proposed to improve performance [36]. Learning multiple tasks simultaneously allows the model to learn more general, transferable features and to better handle small objects [37]. This is particularly beneficial when the tasks are related and share underlying structure, as the model can exploit the shared features to improve its performance on each task [38]. Our main contributions in this article include the following.
Adaptive Feature Extraction: The proposed method significantly enhances small object detection accuracy by integrating HydraNet into MRCNN. Through an adaptive branching network (ABN), the model learns to select the most suitable module for each image, focusing on the most relevant features. This yields more precise feature representations and improved detection performance, addressing the limitations of traditional methods that struggle with small objects.
Dynamic Architecture for Adaptive Object Detection: The dynamic architecture introduced enables the model to adaptively switch between feature extraction modules based on each image’s characteristics. This flexibility improves detection accuracy and robustness, allowing the model to handle diverse datasets effectively. By dynamically adjusting its architecture, the model better copes with variations in object size, shape, and appearance.
Scalable Application of HydraNet for Object Detection: This article presents a novel application of HydraNet [33] in object detection, significantly improving accuracy, especially for small objects. The method is scalable, generalizable to other frameworks, and easily integrated into existing pipelines. This success opens new research avenues, leveraging HydraNet's adaptive feature extraction for challenging detection tasks.
This article is organized as follows. A summary of AAV object detection results and motivation is detailed in Section III. The proposed architecture is described in Section IV. The experimental results, comparisons, and analysis are described in Section V. Finally, conclusions are presented in Section VI.
AAV Object Detection Comparisons
To study the effect of object scale on detection performance on AAV datasets, we analyze performance over the three object-size ranges defined by MS-COCO [22]: small objects cover an area of less than $32^{2}$ pixels, medium objects between $32^{2}$ and $96^{2}$ pixels, and large objects more than $96^{2}$ pixels. The MS-COCO dataset contains 1.5M object instances in 80 categories with an approximately even distribution (roughly 33% each) of large, medium, and small objects. In contrast, in the Aerial-Cars dataset, small objects form the majority, 80% of all objects [15]. This skewed distribution in AAV datasets poses difficulties for detectors trained on MS-COCO or similar datasets. Moreover, the pixels belonging to small objects account for less than 5% of the total pixel count, further complicating feature extraction.
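As a concrete illustration of these thresholds, the following minimal Python sketch (the function name is ours, not from the paper) assigns the MS-COCO size category from a bounding-box area:

    def coco_size_bucket(area_px: float) -> str:
        """Assign the MS-COCO size category from an object's pixel area."""
        if area_px < 32 ** 2:
            return "small"     # area < 32^2
        if area_px < 96 ** 2:
            return "medium"    # 32^2 <= area < 96^2
        return "large"         # area >= 96^2

    # Example: a 20 x 15-pixel car in an aerial image counts as "small".
    print(coco_size_bucket(20 * 15))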
Two-stage object detection frameworks, such as MRCNN, have consistently demonstrated superior performance compared to single-stage detectors, particularly in terms of accuracy and precision. In a two-stage approach, the first stage generates region proposals, which are then refined and classified in the second stage. This separation of tasks allows the model to focus more precisely on the most promising regions, improving the detection of small and difficult-to-classify objects. The iterative refinement of object proposals in two-stage detectors results in fewer false positives and more accurate bounding box predictions, making them the preferred choice in applications where accuracy is critical [39].
MRCNN, in particular, excels at detecting and segmenting small objects within images. Its refinement of object proposals through the two-stage process is especially advantageous for small object detection, as it allows more precise localization by generating accurate bounding boxes. In addition, the segmentation mask further enhances MRCNN's ability to distinguish small objects from their backgrounds, making it effective in scenarios where small object detection is crucial [12]. For these reasons, we adopt this architecture and extend it to better detect small objects in AAV-based datasets, addressing its limitations on such objects.
Hydra-Mask-RCNN
In this section, we describe the proposed Hydra-MRCNN (HMRCNN) architecture. Following the conventional MRCNN [12], we employ the feature pyramid network (FPN) as the stem of the HydraNet and propose an ABN module that incorporates the branch, gate, and combine units of the HydraNet [33], as shown in Fig. 1.
Fig. 1. HydraNet template [33], which consists of five parts: a Stem trained on all inputs; Branches, each trained on a subset of the classes so as to specialize for different inputs; on the right, the Gate, which chooses which branches' outputs are considered at inference; and the Combiner, which aggregates the features through a linear projection and weighted summation before the Prediction output.
The resulting HMRCNN minimizes the computational resources required and enhances the object detection capabilities of this model on AAV images at all scales by enriching feature maps. Fig. 2(a) shows the complete architecture of the HMRCNN.
Fig. 2. (a) HMRCNN architecture: the ABN is applied to every output feature map of the FPN. Each enriched feature map is then sent to the RPN, which predicts object bounding boxes and classes. In the second stage, ROI proposals and the corresponding enriched features are used to predict the final classes and bounding boxes. (b) Block diagram of the ABN, in which branches are selected and combined to form enriched feature maps.
A. Feature Pyramid Network
MRCNN adopts an FPN [40], [41] with a ResNet [21] backbone, which enables the model to detect objects of different sizes in a single image. An input image passes through a sequence of residual stages that convert it into multichannel feature maps $C_{i}$
\begin{equation*} C_{i}=F\left ({{x_{i},W_{i}}}\right)+x_{i} \tag {1}\end{equation*}
where $x_{i}$ is the input to stage $i$, $W_{i}$ denotes the stage weights, and $F$ is the residual mapping. The FPN then builds a top-down pathway with lateral connections, producing the pyramid feature maps
\begin{align*} P_{i-1}=& {\mathrm { conv}}\left ({{W_{i-1},C_{i-1}}}\right)+{\mathrm { upsample}}\left ({{2,P_{i} }}\right) \\& \qquad i=2,3,4,5. \tag {2}\end{align*}
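A minimal PyTorch sketch of the top-down pathway in (2) is given below; the channel widths follow a standard ResNet50, and the 1x1 lateral convolutions stand in for $\mathrm{conv}(W_{i-1}, C_{i-1})$, both of which are our assumptions rather than details fixed by the paper.

    import torch.nn as nn
    import torch.nn.functional as F

    class FPNTopDown(nn.Module):
        """Sketch of the FPN top-down pathway, eq. (2)."""
        def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
            super().__init__()
            self.lateral = nn.ModuleList(
                [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
            )

        def forward(self, c_maps):
            # c_maps = [C2, C3, C4, C5] from the residual stages, eq. (1)
            p = [None] * len(c_maps)
            p[-1] = self.lateral[-1](c_maps[-1])            # P5
            for i in range(len(c_maps) - 2, -1, -1):        # P4, P3, P2
                up = F.interpolate(p[i + 1], scale_factor=2, mode="nearest")
                p[i] = self.lateral[i](c_maps[i]) + up      # eq. (2)
            return p                                        # [P2, P3, P4, P5]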
B. Adaptive Branching Network
We propose an ABN, which incorporates a novel branch selection method depicted in Fig. 2(b). The ABN takes the FPN outputs and produces an enriched feature pyramid with the same number of channels as the FPN. The ABN comprises branching, gating, and combiner modules based on HydraNet to select the branches relevant to a given input. The ABN is applied to the set of four feature maps $\{P_{2}, P_{3}, P_{4}, P_{5}\}$ produced by the FPN. Each branch is built from convolutional blocks with batch normalization
\begin{align*} BN(x)=& \frac {x-E[x]}{\sqrt {{\mathrm { var}}(x)}} \tag {3}\\ CB(x)=& {\mathrm { ReLU}}\left ({BN\left ({{\mathrm { conv}}(x)}\right)}\right) \tag {4}\\ b(x)=& {\mathrm { ReLU}}\left ({x+CB\left ({CB\left ({CB(x)}\right)}\right)}\right) \tag {5}\end{align*}
where $BN$ is batch normalization (3), $CB$ is a convolutional block (4), and $b$ is the resulting residual branch (5).
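As an illustration of (3)-(5), the following PyTorch sketch implements one branch; the 3x3 kernel size and the channel width are our assumptions, as the paper does not fix them here.

    import torch.nn as nn

    class ConvBlock(nn.Module):
        """CB(x) = ReLU(BN(conv(x))), eqs. (3)-(4)."""
        def __init__(self, channels=256):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.bn(self.conv(x)))

    class Branch(nn.Module):
        """b(x) = ReLU(x + CB(CB(CB(x)))), eq. (5): three conv blocks
        with a residual connection around them."""
        def __init__(self, channels=256):
            super().__init__()
            self.blocks = nn.Sequential(*[ConvBlock(channels) for _ in range(3)])
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(x + self.blocks(x))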
1) Gating Mechanism
Our proposed gate module comprises two key components: 1) a bottleneck block and 2) pooling. The bottleneck block filters the information passing through the module using convolutional layers similar to (5) (the gate and branches need not have identical architectures). It is followed by a global average pooling layer (GAPL), which pools the filtered information into a compact representation used for branch selection
\begin{equation*} {\mathrm { GAPL}}\left ({{X_{c}}}\right)=\frac {1}{w\times h}\sum _{i=0}^{w-1} \sum _{j=0}^{h-1}X_{c}(i,j). \tag {6}\end{equation*}
Simply put, in (6), the GAPL takes an intermediate feature map $X_{c}$ of spatial size $w\times h$ and averages it to a single value per channel $c$; the resulting vector is then used by the gate to score the branches.
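A minimal PyTorch sketch of the gate is given below; the 1x1 bottleneck, the linear scoring head, and the sigmoid scores are our assumptions, as the paper fixes none of these details.

    import torch
    import torch.nn as nn

    class Gate(nn.Module):
        """Gate sketch: a convolutional bottleneck followed by the GAPL, eq. (6)."""
        def __init__(self, channels=256, num_branches=4):
            super().__init__()
            self.bottleneck = nn.Sequential(
                nn.Conv2d(channels, channels // 4, kernel_size=1),
                nn.BatchNorm2d(channels // 4),
                nn.ReLU(inplace=True),
            )
            self.score = nn.Linear(channels // 4, num_branches)

        def forward(self, x):
            z = self.bottleneck(x)           # filter the incoming features
            z = z.mean(dim=(2, 3))           # GAPL, eq. (6): average over w x h
            return torch.sigmoid(self.score(z))  # one score per branch in (0, 1)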
2) Combiner
The combiner module aggregates the feature maps selected by the gating threshold, as described in (7), to create a more robust and informative feature map. Denote the output of branch $b$ as $B(b)$ and the gate score of branch $b$ on input $X$ as $g_{b}(X)$; the combined output is
\begin{equation*} P^{\prime }(X)= \sum _{b:\, g_{b}(X)\gt TH} B(b) \tag {7}\end{equation*}
where $TH$ is the gating threshold.
In summary, the HMRCNN method integrates an ABN that dynamically selects relevant branches during feature extraction based on the characteristics of the input image, optimizing performance for different object sizes and types. The global average pooling layer generates a compact representation that informs branch activation, ensuring that the most relevant features are selected. Applying this adaptive feature enrichment to the outputs of the FPN significantly improves detection accuracy, particularly for small- and medium-sized objects, while allowing the model to adjust dynamically at inference without a notable increase in computational overhead.
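Putting the pieces together, the following sketch combines the Branch and Gate classes from the sketches above according to (7); the hard threshold value and the batch-size-1 handling are our simplifications.

    import torch
    import torch.nn as nn

    class ABN(nn.Module):
        """ABN sketch: branches whose gate score exceeds TH are summed, eq. (7).
        Batch size 1 is assumed for clarity."""
        def __init__(self, channels=256, num_branches=4, threshold=0.5):
            super().__init__()
            self.branches = nn.ModuleList([Branch(channels) for _ in range(num_branches)])
            self.gate = Gate(channels, num_branches)
            self.threshold = threshold

        def forward(self, x):
            scores = self.gate(x)[0]              # gate scores g_b(X) for the sample
            out = torch.zeros_like(x)
            for score, branch in zip(scores, self.branches):
                if score > self.threshold:        # branch selection, eq. (7)
                    out = out + branch(x)
            return out                            # enriched feature map P'(X)

In this reading, one ABN instance is applied to each of the four FPN outputs, yielding the enriched feature pyramid that is passed to the RPN.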
Prediction
The subsequent stages use the same architecture as the original MRCNN [12]. The enriched feature map is sent to the RPN to obtain region proposals. Each proposal in original pixel space is mapped onto the EFM through region-of-interest alignment (ROI-Align). The ROI is then sent to the prediction heads: classification with loss (8), bounding box regression with smooth-$L_{1}$ loss (9), and per-pixel mask prediction with binary cross-entropy loss (10), giving the total loss (11)
\begin{align*} L_{\mathrm { cls}} \left ({{ p_{i} }}\right) =& - \log \left ({{p_{i}}}\right) \tag {8}\\ L_{\mathrm { bbox}} \left ({{t_{i},t_{i}^{*} }}\right)=& \sum _{i\in \{x,y,w,h\}}\text {smooth}_{L_{1}} \left ({{t_{i}- t_{i}^{*} }}\right) \tag {9}\\ L_{\mathrm { mask}}=& -\frac {1}{m^{2}} \sum _{1\leq i, j \leq m} \left [{{ y_{ij} \log \left ({{\hat {y}_{ij}^{k}}}\right) + \left ({{1-y_{ij}}}\right) \log \left ({{ 1- \hat {y}_{ij}^{k}}}\right) }}\right ] \tag {10}\\ L_{t}=& L_{\mathrm { cls}} + L_{\mathrm { bbox}} + L_{\mathrm { mask}}. \tag {11}\end{align*}
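For reference, (8)-(11) map directly onto standard PyTorch loss primitives, as in this sketch; the mean reductions are our assumption.

    import torch.nn.functional as F

    def hmrcnn_loss(cls_logits, gt_classes, bbox_pred, bbox_targets,
                    mask_logits, gt_masks):
        """Sketch of the multitask loss (8)-(11)."""
        l_cls = F.cross_entropy(cls_logits, gt_classes)      # (8): -log p_i
        l_bbox = F.smooth_l1_loss(bbox_pred, bbox_targets)   # (9): smooth-L1 on (x, y, w, h)
        l_mask = F.binary_cross_entropy_with_logits(         # (10): per-pixel BCE on m x m masks
            mask_logits, gt_masks.float())
        return l_cls + l_bbox + l_mask                       # (11): L_t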
To clearly illustrate the procedure of the proposed method, we provide Algorithm 1. The HMRCNN algorithm enhances object detection accuracy, particularly for small- and medium-sized objects in AAV imagery, through a multistage process. It starts by inputting images and initializing the ResNet50 backbone, FPN, and ABN. The feature extraction phase generates multiscale feature maps from input images, which are then processed by the ABN to produce EFMs. The RPN identifies potential object locations, which are refined through ROI alignment and detection heads for bounding box regression, object classification, and optional mask prediction. Loss calculations include classification, bounding box regression, and optional mask loss, leading to the final output of predicted bounding boxes, class labels, and masks. This structured approach leverages adaptive branching and multitask learning, improving detection accuracy and efficiency for AAV-based environmental monitoring and surveillance.
Algorithm 1: HMRCNN
Input: Aerial images from AAV sensors
Initialize:
  Load the pretrained ResNet50 backbone
  Initialize the feature pyramid network (FPN)
  Initialize the adaptive branching network (ABN)
Feature Extraction:
  for each input image I do
    Extract backbone feature maps C_2, ..., C_5 per (1)
    Generate multiscale feature maps P_2, ..., P_5 per (2)
  end for
Adaptive Branching:
  for each feature map P_i do
    Apply ABN:
      Use GAPL (6) to reduce dimensions
      Select relevant branches via the gate
      Combine selected branches into enriched feature maps (EFMs) per (7)
  end for
Region Proposal:
  Apply RPN to the EFMs
  Generate region proposals for potential object locations
ROI Alignment and Detection:
  for each region proposal do
    Map proposals to the EFMs using ROI-Align
    Use detection heads for:
      Bounding box regression
      Object classification
      (Optional) Mask prediction for segmentation
  end for
Loss Calculation:
  Calculate the total loss L_t per (11):
    Classification loss (8)
    Bounding box regression loss (9)
    (Optional) Mask loss (10)
Output:
  Predicted bounding boxes, class labels, and (optional) masks for detected objects
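To make the control flow of Algorithm 1 concrete, the following Python-style sketch strings the stages together; all module interfaces (backbone, fpn, abns, rpn, roi_heads) are illustrative assumptions, not the exact Detectron2 API.

    def hmrcnn_forward(image, backbone, fpn, abns, rpn, roi_heads):
        """End-to-end sketch of Algorithm 1."""
        c_maps = backbone(image)                         # C2..C5, eq. (1)
        p_maps = fpn(c_maps)                             # P2..P5, eq. (2)
        efms = [abn(p) for abn, p in zip(abns, p_maps)]  # enriched feature maps
        proposals = rpn(efms)                            # first stage: region proposals
        boxes, classes, masks = roi_heads(efms, proposals)  # second stage: heads
        return boxes, classes, masks                     # losses (8)-(11) in training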
Experimental Results and Analysis
A. Datasets and Evaluation Metrics
To evaluate the proposed method, we use the MS-COCO 2017 dataset [22], the Aerial-Cars dataset [29], the VisDrone dataset [42], and the Plastic in River dataset. The average precision (AP) and average recall (AR) metrics are used to measure true positives and to quantify missed detections, respectively. AR results with 1000 proposals are reported for both the MS-COCO and Aerial-Cars datasets, and the AP metric is reported for HMRCNN and the competing methods.
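As a concrete reference for how such AP and AR numbers can be produced, the following sketch uses the pycocotools evaluation API; the annotation and detection file paths are placeholders, not files from this work.

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO("annotations/instances_val2017.json")  # placeholder path
    coco_dt = coco_gt.loadRes("hmrcnn_detections.json")   # placeholder path
    evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
    evaluator.params.maxDets = [1, 100, 1000]  # report AR@100 and AR@1000
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # prints AP and AR, incl. small/medium/large breakdowns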
B. Model Architecture Details
The proposed ABN architecture keeps the added computation to a minimum. Our empirical results on the ResNet backbone have shown that utilizing
C. Training Strategy
The models in this section are trained on MS-COCO 2017 following the guidelines provided for Detectron2 [43]; the learning rate (LR) is set to 0.0015 with a batch size of 4. AR computed for the RPN stage is the main metric used in this work, for both 1000 and 100 proposed regions per image.
The branching method is applied to each feature map. The model is trained on the MS-COCO train2017 dataset and fine-tuned on the AAV dataset with the same backbone architecture. To this end, we fine-tuned the model for 37 epochs.
In the following, we present the analysis of HMRCNN on four different datasets.
1) Results on MS-COCO Dataset
Table 3 displays the AR results of HMRCNN when operating on all scales of the feature maps. The proposed method improves the results on the MS-COCO dataset at almost every scale, the only exception being a difference of less than one percent in a single case.
2) Results on Aerial-Cars Dataset
Table 4 reports the proposed and base model results under the AR metric on the Aerial-Cars dataset. HMRCNN consistently improves the results at every scale. The gains are largest for small- and medium-sized objects, which were our primary challenge in this dataset, with improvements of 12.27% and 5.6%, respectively.
Based on Table 5, we improve the performance of the two-stage MRCNN detector with a ResNet50 backbone, trained only on the MS-COCO dataset, through a task-specific branching architecture. The AP metric shows that our method improves precision by more than 6%.
Sample detection results are shown in Fig. 3 for MRCNN and Fig. 4 for HMRCNN. As shown in Fig. 4, HMRCNN produces the best results, with higher predicted confidence scores. The final AP of the MRCNN network depends directly on the inference of the RPN: the results indicate that proposal boxes with higher AR lead to better AP. We therefore focus on enriching the feature map at each level to improve the AR at each scale.
Fig. 3. MRCNN results on an AAV image. There are many missed detections, and the confidence scores representing the probability of a car are low.
Fig. 4. HMRCNN results on an AAV image. There are no missed detections, and the confidence scores representing the probability of a car have improved.
3) Results on VisDrone Dataset
The VisDrone dataset, developed by the AISKYEYE team at Tianjin University, is a comprehensive resource for drone-based visual analysis. It includes 288 video clips and 10209 static images from 14 cities in China, featuring diverse environments and object densities. Annotated with over 2.6 million bounding boxes, it details scene visibility and object occlusion. The dataset supports tasks like object detection, tracking, and crowd counting, and is structured into five subsets for specific applications.
To demonstrate the model's efficiency at detecting small objects, we employ the VisDrone dataset. Following [42], we compare the proposed HMRCNN against four recently introduced methods, namely, SABL [44], Cascade ADPN [45], YOLOx-s [46], and CFANet-s [47], in Table 6. The table demonstrates the superior performance of HMRCNN across object sizes: it achieves the highest AP for small objects (34.5%) and medium objects (48.2%), significantly outperforming methods such as MRCNN and Cascade ADPN. While SABL shows strong results for large objects (61.7%), HMRCNN remains competitive with an AP of 57.8%, highlighting its robustness and effectiveness across all object scales. The visualizations in Fig. 5 further illustrate the model's ability to accurately detect even the smallest objects.
4) Results on Plastic in River Dataset
The "Plastic in River" dataset, developed for the Kili Community Challenge and hosted on Hugging Face, provides annotated images of rivers highlighting plastic waste items, such as plastic bags, plastic bottles, other plastic waste, and nonplastic waste. Each image includes bounding boxes marking these items, facilitating the training of object detection models. The dataset comprises over 3000 annotated images capturing plastic litter across different river scenes and is valuable for environmental monitoring, improving the accuracy of detecting and classifying plastic pollution in aquatic environments. The results on the test set are provided in Table 7. For comparison, the performance of MRCNN and YOLOv5 was measured using the Detectron2 library. HMRCNN outperforms YOLOv5 and MRCNN overall, especially in detecting small and medium objects; YOLOv5 is stronger at higher IoU thresholds and on large objects, while MRCNN generally lags behind the other two models.
To better demonstrate the performance of the proposed model, some of the model outputs are visualized in Fig. 6. As can be seen, the model is capable of detecting all objects with ease, although there are some errors in the classification part.
D. Analysis Using Partial Pyramid Features
We perform an ablation study to analyze the effect of using different feature scales in HMRCNN. Instead of utilizing all pyramid scales from the FPN, represented by the feature maps $P_{2}, \ldots, P_{5}$, we present subsets of the enriched feature maps to the RPN and evaluate each combination.
The performance on large objects with
For medium-sized objects, the combinations of
Lastly, small objects, which are relatively rare in the MS-COCO dataset,
We apply the FPN and the ABN partially to the feature maps and evaluate the RPN's performance on the presented feature maps. This way, we can observe how each feature map is enriched and how well it performs in the next stage.
In Table 9, the first two feature maps presented to the RPN network are the two coarsest scales of the pyramid.
The results above focus on large objects: these feature maps are heavily downsampled, so only large objects remain legible to the RPN. Hence, the critical metric for this combination is the AR on large objects.
We continue the analysis on the Aerial-Cars dataset, which contains roughly ten times more objects per input image than MS-COCO. We again apply the FPN and the ABN partially to the feature maps and evaluate the RPN's performance with the AR metric, observing how each feature map is enriched and how well it performs in the next stage. The number of large objects in this dataset is negligible compared to the medium and small ones. Therefore, following the previous results on the Aerial-Cars dataset, we compare the enriched feature-pair results against the base model in Table 10. Each of the listed feature map combinations is sent to the RPN.
The AR for the EFMs is better than that of the base model at every scale. As elaborated earlier, the important metric here is the AR on small objects, given their dominance in this dataset.
Additionally, Fig. 7 visualizes the feature maps for one of these combinations.
Figs. 7-9. Plotted feature maps at different pyramid scales.
E. Feature Enhancement for Different Stages
For MRCNN, one challenge is where to insert this module to achieve the best performance. Adding the MTL module to the second stage bottlenecks the information flow, confining it to the later stages of the model; it also makes the model more complex, with more parameters. The second stage is designed for task-specific predictions, whereas the first stage estimates "object-ness." Adding the MTL module to the backbone, on the other hand, lets the model learn object detection and instance segmentation simultaneously from features in the shared backbone, improving precision because the shared features learned through MTL are more informative and useful for both tasks. The results in Table 11 report MTL applied at different stages of MRCNN and justify first-stage MTL. For an easier comparison, we kept the backbone small and sent out only two feature maps in all cases. With second-stage MTL, performance drops by roughly 10%-15% relative to the first-stage HMRCNN architecture. These results confirm that enhancing the first-stage feature maps provides more value when detecting object proposals for further processing.
Conclusion
In this article, we proposed the HMRCNN model, a novel MTL module based on the low-cost dynamic multitask architecture HydraNet, designed to enhance object detection, particularly for small- and medium-sized objects in AAV imagery. Our model introduces an adaptive process into the multilabel classification step of the backbone, leading to improvements in detection precision, as evidenced by our experimental evaluation on the MS-COCO, Aerial-Cars, VisDrone, and Plastic in River datasets, where HMRCNN outperforms traditional methods, including MRCNN. While HMRCNN demonstrates superior accuracy and efficient detection, particularly through its enriched feature maps, the complexity of its architecture, especially the inclusion of the ABN, incurs considerable computational overhead and longer training times, and the design also poses interpretability challenges. To address these issues, future work should integrate explainable AI (XAI) techniques, such as attention and saliency maps, to improve transparency, and optimize the architecture through model pruning, quantization, and lightweight backbones to reduce computational demands, further broadening the model's applicability across AAV and aerial imagery applications.