
Hydra-Mask-RCNN: An Adaptive HydraNet Architecture for Autonomous Aerial Vehicle Object Detection



Abstract:

Environmental monitoring is essential for understanding and mitigating the impact of human activities on the planet, as well as for developing effective strategies for sustainable development and conservation. Accurate object detection in aerial images is crucial for environmental monitoring and surveillance using autonomous aerial vehicles (AAVs). However, existing methods, including Mask R-CNN (MRCNN) and you-only-look-once (YOLO), struggle to detect small- and medium-sized objects from AAV sensors, limiting their usability for AAV surveillance. We propose Hydra-MRCNN (HMRCNN), a multitask learning network that enhances detection precision for small- and medium-sized objects in aerial images. By integrating an adaptive branching network (ABN) with HydraNet, HMRCNN improves feature extraction and object detection capabilities. Evaluations on Microsoft Common Objects in Context (MS-COCO), Aerial-Cars, VisDrone, and Plastic in River datasets show significant improvements in average recall (AR) compared to baseline models, including MRCNN and YOLO. Our approach has important implications for environmental monitoring, enabling more accurate detection of objects relevant to transportation, security, traffic, pollution, and infrastructure management. With the growing use of AAVs in environmental surveillance, HMRCNN offers a valuable tool for enhancing environmental measurement and assessment capabilities. Our method improves detection performance by over 6% on AAV datasets, making it a valuable contribution to the field as the commercial AAV market is expected to grow from 25 billion to 50 billion in the next decade.
Article Sequence Number: 5000112
Date of Publication: 20 November 2024
Electronic ISSN: 2768-7236

SECTION I.

Introduction

Exponential growth of the global population, urbanization, and industrialization have precipitated environmental concerns like plastic pollution, traffic congestion, pedestrian safety hazards, harmful emissions, and infrastructure degradation. Remote sensing technology offers a cost-effective and efficient solution, combining image and digital detection to support sustainable development [1].

Object detection in remote sensing imagery stands as a crucial segment within remote sensing technology. It involves employing specialized algorithms to identify objects of interest within these images, ranging from waste to vehicles, aircraft, and pedestrians. The advancement of wireless networks has facilitated the application of such object detection techniques in environmental monitoring, fostering sustainable environmental development [2].

With the rapid advancement of wireless networks, object detection in remote sensing images has become more effectively applicable to environmental monitoring, thereby promoting sustainable environmental growth. For instance:

  1. Disaster Monitoring: For sudden natural disasters, such as floods [2], snowstorms [3], and forest fires [4], remote sensing image object detection technology enables quicker responses and more accurate assessments of disaster situations.

  2. Natural Resource Monitoring: This technology allows real-time detection of vegetation and other natural environments, facilitating timely responses to soil erosion and land desertification, thus supporting the sustainable use of resources [5].

  3. Environmental Pollution Monitoring: In cases of illegal discharges from sewage or chemical plants, remote sensing image object detection technology can promptly identify and address these violations, contributing to sustainable environmental development [6].

  4. Urban Monitoring: Remote sensing image object detection technology can identify and track vehicles, pedestrians, and other objects in urban areas, improving traffic management, enhancing pedestrian safety, and supporting infrastructure maintenance efforts [7], [8].

Remote sensing images, captured from an aerial perspective, present a unique challenge as objects within them are oriented in various directions. Generic object detection models, including those based on transformers [9], one-stage detectors [10], [11], and two-stage detectors [12], often struggle with angle generalization, rendering them ineffective at processing the angular information of these objects. When these models employ horizontal bounding boxes to identify objects oriented in multiple directions, they tend to capture excessive irrelevant background [13], [14], further complicating the detection process. Research has demonstrated that two-stage detectors generally outperform other approaches on autonomous aerial vehicle (AAV) datasets [15], underscoring the importance of suitable backbones for feature extraction. However, the existing backbones, which are primarily designed for natural images, may not be optimal for AAV images used in environmental monitoring [16], [17]. This misalignment leads to imprecise localization and an increased likelihood of false detections, particularly in dense scenes. To address these challenges, specialized oriented object detection models for remote sensing images have been developed. Nonetheless, many of these methods face significant limitations in effectively detecting objects, especially when dealing with numerous small- and medium-sized targets. The following sections will review some of the most prominent generic object detection methods as well as specific approaches tailored for AAV object detection, aiming to provide a comprehensive understanding of the current landscape and the advancements needed to enhance detection accuracy and efficiency in remote sensing applications.

SECTION II.

Background Review

In the early days of object detection on aerial images, the lack of learned image representations meant that detection algorithms relied on handcrafted techniques [18]. Computational advances, such as AlexNet [19], LSTM [20], and ResNets [21], together with vast new datasets, such as ImageNet, PASCAL Visual Object Classes (VOCs), and Microsoft Common Objects in Context (MS-COCO) [22], led to the growth of influential object detection models: transformers [9], [23], single-stage detectors, such as you-only-look-once (YOLO) and its subsequent versions [24], [25], and two-stage detectors, such as the region convolutional neural network (R-CNN) family of models [12], [15], [26]. The YOLOv5 architecture adopts a cross-stage partial backbone named CSP-Darknet53 and applies a multitask loss at each prediction layer [25]. Transformer architectures, which are effective at modeling long-range dependencies, can capture the relationships between objects in an image and the spatial context in which they appear [27], [28]. Detection transformers (DETR) [23] directly predict object bounding boxes and class probabilities without requiring a separate region proposal step.

While their performance on datasets with large objects is remarkably high, object detection methods, such as YOLOv5, Mask R-CNN (MRCNN), and DETR, do not perform well on AAV data [15], [29]. One of the primary reasons for this drop in performance is the uneven distribution of object sizes; in addition, the high capture altitude makes it difficult to extract the features used in the later stages of these models. One machine learning approach that can help in this case is multitask learning (MTL) [30]. MTL allows a model to perform multiple tasks simultaneously, sharing information between tasks to improve their joint performance [31]. The soft parameter sharing method, used in natural language processing (NLP) [32], employs a separate network for each task and regularizes the distance between the task networks' parameters. The hard parameter sharing method [33], [34] uses hidden layers shared among all tasks while keeping a few output layers task-specific [35]. HydraNet [33] is a hard parameter sharing architecture designed to reduce inference cost; it incorporates MTL through gating and branching, selecting branches trained on smaller subsets of classes and combining their important features for the prediction output.

For detecting small objects in AAV datasets, the shared representation provided by hard parameter sharing and the regularization inherent in MTL models are the two main advantages proposed to improve performance [36]. Learning multiple tasks simultaneously allows the model to learn more general and transferable features and to better handle small objects [37]. This is particularly beneficial when the tasks are related and share underlying similarities, as the model can use the shared features to improve its performance on each task [38]. Our main contributions in this article include the following.

  1. Adaptive Feature Extraction: The proposed method significantly enhances small object detection accuracy by integrating HydraNet into MRCNN. Through the adaptive branching network (ABN), the model learns to select the most suitable module for each image, focusing on the most relevant features. This results in more precise feature representations and improved detection performance, addressing the limitations of traditional methods that struggle with small objects.

  2. Dynamic Architecture for Adaptive Object Detection: The dynamic architecture introduced enables the model to adaptively switch between feature extraction modules based on each image’s characteristics. This flexibility improves detection accuracy and robustness, allowing the model to handle diverse datasets effectively. By dynamically adjusting its architecture, the model better copes with variations in object size, shape, and appearance.

  3. Scalable Application of HydraNet for Object Detection: This article presents a novel application of HydraNet [33] in object detection, significantly improving accuracy, especially for small objects. The method is scalable, generalizable to other frameworks, and easily integrated into existing pipelines. This success opens new research avenues, leveraging HydraNet's adaptive feature extraction for challenging detection tasks.

This article is organized as follows. A summary of AAV object detection comparisons and our motivation is given in Section III. The proposed architecture is described in Section IV and the prediction stage in Section V. The experimental results, comparisons, and analysis are presented in Section VI. Finally, conclusions are drawn in Section VII.

SECTION III.

AAV Object Detection Comparisons

To study the effect of object scale on detection performance for AAV datasets, we analyze performance over the three ranges of object sizes defined by MS-COCO [22]. Small objects have an area of less than $32^2$ pixels, medium objects between $32^2$ and $96^2$ pixels, and large objects more than $96^2$ pixels. The MS-COCO dataset contains 1.5M object instances in 80 categories with an approximately even distribution (about 33% each) of large, medium, and small objects. In contrast, in the Aerial-Cars dataset, small objects form the majority, comprising 80% of all objects [15]. This skewed distribution in AAV datasets makes detection difficult for methods trained on MS-COCO or similar datasets. Moreover, the total number of pixels accounted for by small objects constitutes less than 5% of the entire pixel count, further increasing the difficulty of extracting information.
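To make these size buckets concrete, the short sketch below assigns an object to an MS-COCO size category from its pixel area; the helper name and the example values are ours, not part of any of the datasets' tooling.

```python
# Minimal sketch: bucketing objects into the MS-COCO size categories used above.
# Thresholds follow the COCO convention (object area in pixels).
def coco_size_category(area_px: float) -> str:
    """Return 'small', 'medium', or 'large' for a given object area."""
    if area_px < 32 ** 2:        # < 1024 px^2
        return "small"
    if area_px < 96 ** 2:        # 1024 px^2 <= area < 9216 px^2
        return "medium"
    return "large"

# Example: a 20x30 px vehicle in an aerial frame falls into the 'small' bucket.
print(coco_size_category(20 * 30))  # -> small
```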

Two-stage object detection frameworks, such as MRCNN, have consistently demonstrated superior performance compared to single-stage detectors, particularly in terms of accuracy and precision. In a two-stage approach, the first stage generates region proposals, which are then refined and classified in the second stage. This separation of tasks allows the model to focus more precisely on the most promising regions, improving the detection of small and difficult-to-classify objects. The iterative refinement of object proposals in two-stage detectors results in fewer false positives and more accurate bounding box predictions, making them the preferred choice in applications where accuracy is critical [39].

MRCNN, in particular, excels at detecting and segmenting small objects within images. Its ability to refine object proposals through its two-stage process is especially advantageous for small object detection, as it allows for more precise localization by generating accurate bounding boxes. Additionally, the incorporation of a segmentation mask further enhances MRCNN's capability to distinguish small objects from their backgrounds, solidifying its effectiveness in scenarios where small object detection is crucial [12]. For these reasons, we adapt this object detection architecture and introduce improvements that better detect small objects in AAV-based datasets, addressing its limitations in detecting such objects.

SECTION IV.

Hydra-Mask-RCNN

In this section, we describe the proposed Hydra-MRCNN (HMRCNN) architecture. Following the conventional MRCNN [12], we employ the feature pyramid network (FPN) as the stem of the HydraNet and propose an ABN module that incorporates the branch, gate, and combiner units of HydraNet [33], shown in Fig. 1.

FIGURE 1. HydraNet template [33], which consists of five parts: a Stem trained on all inputs; Branches, each trained on a subset of the classes so as to specialize for different inputs; the Gate (right), which chooses which branches' outputs are considered at inference; the Combiner, which aggregates the features through a linear projection and weighted summation; and the Prediction output.

The resulting HMRCNN minimizes the computational resources required and enhances the object detection capabilities of this model on AAV images at all scales by enriching feature maps. Fig. 2(a) shows the complete architecture of the HMRCNN.

FIGURE 2. (a) HMRCNN architecture: the ABN is applied to every output feature map of the FPN. Each feature map is then sent to the RPN one by one, and the RPN predicts the bounding box and class of the object. In the second stage, ROI proposals and the corresponding enriched features are used to predict the class and bounding boxes. (b) Block diagram of the ABN, where branches are selected and combined to form the enriched $P_{i}^{\prime}$, which we refer to as the EFM. (c) Branch block $b(\cdot)$ used inside the ABN, obtained by stacking convolutional blocks (CBs) and adding a shortcut connection from the input $x$.

A. Feature Pyramid Network

MRCNN adopts an FPN [40], [41] with ResNet [21] as its backbone, which enables the model to detect objects of different sizes from a single image. An input image passed to the network is converted into the multichannel feature maps $P_2, P_3, P_4, P_5, P_6$ at the output of the FPN, which assumes the role of the stem module. Given an input image $X$, the ResNet provides convolutional feature maps $C_1, C_2, C_3, C_4, C_5$ given by
\begin{equation*} C_{i}=F\left({x_{i},W_{i}}\right)+x_{i} \tag{1}\end{equation*}
where $F(\cdot)$ is an arbitrary nonlinear function composed of residual blocks with weights $W_i$ and layer inputs $x_i$. In the conventional ResNet50 backbone, the feature maps have strides $s=\{2,4,8,16,32\}$ with respect to the input height $H$ and width $W$, so each map is of size $(H/s, W/s)$. The FPN is produced by upsampling by a factor of 2, starting from the coarsest map, and adding the result to the finer (previous layer) map. Following the original MRCNN structure [12], $P_6=C_6$ is not used in the FPN and is only applied as a feature map at the region proposal network (RPN) stage. Disregarding the first convolutional map $C_1$ to reduce computation, the pyramid features output by the FPN are $\{P_2,P_3,P_4,P_5\}$, where each $P_i$ is calculated by
\begin{align*} P_{i-1}=\mathrm{conv}\left({W_{i-1},C_{i-1}}\right)+\mathrm{upsample}\left({2,P_{i}}\right), \quad i=2,3,4,5. \tag{2}\end{align*}

The map is built from coarse to fine starting with $P_5$; therefore, only $P_4$ and the lower levels are upsampled, and the convolutional feature maps are multiplied by $W_{i-1}$, a set of $1\times 1\times d$ convolutional filters. The feature map dimension is fixed ($d=256$) to enable per-layer addition from previous layers. Finally, $P_6$ is produced by a larger $3\times 3$ convolution of $P_5$, which counteracts the aliasing effects of upsampling. The first two feature maps $\{P_2,P_3\}$ are responsible for very small objects, with sizes $(H/4, W/4)$ and $(H/8, W/8)$, respectively. Likewise, the combination $\{P_3,P_4\}$ is mostly responsible for small objects, $\{P_4,P_5\}$ for medium objects, and $\{P_5,P_6\}$ for large objects, with feature map sizes $(H/16,W/16)$ and $(H/32, W/32)$, respectively.
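As a reference for (2), the following PyTorch sketch implements the top-down pathway with 1x1 lateral convolutions and 2x nearest-neighbor upsampling. It is a simplified illustration under our own assumptions (ResNet50 channel counts, a stride-2 3x3 convolution for $P_6$), not the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Sketch of the top-down pathway in (2): 1x1 lateral convolutions project
    each backbone map C_i to d=256 channels, and coarser maps are upsampled by
    a factor of 2 and added to the next finer level."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), d=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, d, kernel_size=1) for c in in_channels)
        # One common way to derive P6 from P5; the exact operator (convolution
        # versus pooling) is an assumption here.
        self.p6_conv = nn.Conv2d(d, d, kernel_size=3, stride=2, padding=1)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p6 = self.p6_conv(p5)
        return p2, p3, p4, p5, p6
```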

B. Adaptive Branching Network

We propose an ABN, which incorporates a novel branch selection method depicted in Fig. 2(b). The ABN takes in the FPN outputs and produces an enriched feature pyramid map with the same number of features (channels/dimensions) as the FPN. The ABN includes branching, gating, and combiner modules based on HydraNet to select the branches relevant to a given input. The ABN is applied to the set of four feature maps $\{P_2,P_3,P_4,P_5\}$ output by the FPN, which serve as inputs to the $m$ branches. Each $P_i$ is of shape $H_c\times W_c\times N_c$, representing the maximum height, width, and number of channels of the feature map. These inputs are processed and passed to the subsequent stages of the MRCNN architecture, namely, the RPN and region-of-interest (ROIAlign) pooling, for further processing. Each branch is task-specific, processing a subgroup of classes from the MS-COCO dataset. To generate a rich feature map, the branches use the same architecture as the bottleneck blocks of ResNet50, detailed in the following equations:
\begin{align*} BN(x)&=\frac{x-E[x]}{\sqrt{\mathrm{var}(x)}} \tag{3}\\ CB(x)&=\mathrm{ReLU}\left(BN\left(\mathrm{conv}(x)\right)\right) \tag{4}\\ b(x)&=\mathrm{ReLU}\left(x+CB\left(CB\left(CB(x)\right)\right)\right) \tag{5}\end{align*}

where $b(\cdot)$ is the bottleneck block of ResNet50 and $BN(\cdot)$ is the batch normalization function, shown in Fig. 2(c). In (4), $CB(\cdot)$ represents a convolutional block with batch normalization applied to the input $x$ and the ReLU activation applied to the output. After processing through the task-specific branch blocks, the enriched output feature maps are passed as inputs to the combiner.
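A minimal PyTorch sketch of the branch block in (3)-(5) is given below; the intermediate channel width is our assumption, while the structure (three CBs plus a shortcut) follows the equations.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel_size=1, stride=1, padding=0):
    """CB(x) in (4): convolution -> batch normalization -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class BranchBlock(nn.Module):
    """b(x) in (5): three stacked CBs (a ResNet50-style bottleneck) with a
    shortcut connection from the input x, as in Fig. 2(c)."""
    def __init__(self, channels=256, mid_channels=64):
        super().__init__()
        self.body = nn.Sequential(
            conv_block(channels, mid_channels, kernel_size=1),
            conv_block(mid_channels, mid_channels, kernel_size=3, padding=1),
            conv_block(mid_channels, channels, kernel_size=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))
```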

1) Gating Mechanism

Our proposed gate module comprises two key components: 1) a bottleneck block and 2) pooling. The bottleneck block filters the information passing through the module using convolutional layers in a manner similar to (5) (the gate and branches do not necessarily have identical architectures), followed by a global average pooling layer (GAPL). The GAPL performs a global pooling operation on the filtered information, reducing it to a compact representation that can be used to make the branch selection
\begin{equation*} \mathrm{GAPL}\left({X_{c}(w,h)}\right)=\frac{\sum_{i=0}^{w-1}\sum_{j=0}^{h-1}X_{c}(i,j)}{w\times h}. \tag{6}\end{equation*}

Simply put, in (6), the GAPL takes an intermediate feature map $X_c$ of dimensions $(W_c\times H_c\times N_c)$ and computes an average value per input channel to produce a $(1\times 1\times N_c)$ output. The advantage over fully connected layers is that the GAPL has no learnable parameters. The input to the gate module is the multichannel feature map from the branches, and $N_c$ is set equal to the number of classes in the dataset. The number of branches $m$ is typically chosen so that $N_c$ is divisible by $m$, and each branch learns a subset of $N_c/m$ classes.
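The following sketch shows how such a gate can be realized in PyTorch: a small bottleneck filters the shared features, the GAPL of (6) collapses each of the $N_c$ channels to one scalar, and a sigmoid turns the result into per-class scores. The exact layer sizes and the sigmoid are our assumptions; only the overall structure follows the description above.

```python
import torch
import torch.nn as nn

class GateModule(nn.Module):
    """Gate sketch: bottleneck block -> global average pooling (eq. (6)) ->
    per-class probabilities g(X) used for branch selection."""
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),  # last layer has N_c channels
        )
        self.gap = nn.AdaptiveAvgPool2d(1)  # GAPL: (H_c, W_c, N_c) -> (1, 1, N_c)

    def forward(self, x):
        g = torch.sigmoid(self.gap(self.bottleneck(x)))  # (B, N_c, 1, 1)
        return g.flatten(1)                              # (B, N_c) class scores g(X)
```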

2) Combiner

The combiner module aggregates all the feature maps selected by the gating threshold, as described in (7), to create a more robust and informative feature map. Denote the output of branch $b$ as $B(b)$, and let $g(X)\in[0,1]$ be the probability vector calculated for input image $X$
\begin{equation*} P^{\prime}(X)=\sum_{b\,\in\, g(X)>TH} B(b). \tag{7}\end{equation*}

$TH$ is the defined threshold, and the branches selected at inference are those covering at least one class whose $g(X)$ entry exceeds the given $TH$. The number of selected branches may be capped in larger architectures [33]. Note that, when training the model with the labels, the gating function selects a few branches and ignores the inactive ones, acting as an adaptive dropout regularization. A separate ABN is used for each of the FPN inputs $P_i$ to obtain an enriched feature map (EFM) $\{P_2^{\prime},P_3^{\prime},P_4^{\prime},P_5^{\prime}\}$. This EFM is then sent to the RPN module of the MRCNN, further refining the information and preparing it for the final detection and segmentation tasks.
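A compact sketch of the combiner in (7) follows; the per-image handling, tensor shapes, and the fallback when no branch passes $TH$ are our assumptions, while the thresholded summation itself mirrors the equation.

```python
import torch

def combine_branches(branch_outputs, gate_probs, class_groups, threshold=0.1):
    """Combiner sketch for one image: sum the outputs of every branch whose
    class group contains at least one gate probability above TH (eq. (7)).

    branch_outputs: list of m feature maps (C, H, W) produced by the branches
    gate_probs:     tensor (N_c,) of per-class probabilities g(X)
    class_groups:   list of m index lists mapping classes to branches
    """
    selected = [
        out for out, classes in zip(branch_outputs, class_groups)
        if gate_probs[classes].max() > threshold
    ]
    if not selected:                 # assumption: fall back to all branches
        selected = branch_outputs
    return torch.stack(selected, dim=0).sum(dim=0)  # enriched feature map P'_i
```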

In summary, the HMRCNN method integrates an ABN to dynamically select relevant branches during feature extraction based on the input image's characteristics, optimizing performance for different object sizes and types. The global average pooling layer generates a compact representation to inform branch activation, ensuring the selection of the most relevant features. By applying this adaptive feature enrichment to the outputs of the FPN, detection accuracy is significantly improved, particularly for small- and medium-sized objects. This process allows the model to adjust dynamically during inference, maintaining high detection accuracy without a notable increase in computational overhead.

SECTION V.

Prediction

The subsequent stages use the same architecture as the original MRCNN [12]. The feature map is sent to the RPN to obtain region proposals. The corresponding proposal in the original pixel space is mapped onto the EFM through region-of-interest alignment (ROI-Align). The ROI is then sent to the prediction heads: a bounding box regression head with loss $L_{\mathrm{bbox}}$, a classification head with loss $L_{\mathrm{cls}}$, and a per-pixel mask head with loss $L_{\mathrm{mask}}$. The loss functions employed in this work are
\begin{align*} L_{\mathrm{cls}}\left({p_{i}}\right)&=-\log\left({p_{i}}\right) \tag{8}\\ L_{\mathrm{bbox}}\left({t_{i},t_{i}^{*}}\right)&=\sum_{i\in\{x,y,w,h\}}\mathrm{smooth}\left({t_{i}-t_{i}^{*}}\right) \tag{9}\\ L_{\mathrm{mask}}&=-\frac{1}{m^{2}}\sum_{1\leq i,j\leq m}\left[{y_{ij}\log\left({\hat{y}_{ij}^{k}}\right)+\left({1-y_{ij}}\right)\log\left({1-\hat{y}_{ij}^{k}}\right)}\right] \tag{10}\\ L_{t}&=L_{\mathrm{cls}}+L_{\mathrm{bbox}}+L_{\mathrm{mask}} \tag{11}\end{align*}

where $L_{\mathrm{cls}}$ is a cross-entropy loss, $L_{\mathrm{bbox}}$ is the bounding box regression loss, and $L_{\mathrm{mask}}$ is an average per-pixel cross-entropy loss over the $m\times m$ mask. $L_t$ is the total loss calculated at the end of each training iteration. The mask head may be dropped if mask labels are unavailable. The architecture displayed in Fig. 2(a), with the ABN applied to every scale of the output feature maps $P_i$ from the FPN, is the proposed HMRCNN. In the following, we analyze this network's results on the MS-COCO and Aerial-Cars datasets for both MRCNN and the proposed HMRCNN.
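To make the composition in (8)-(11) concrete, the sketch below combines the three terms in PyTorch; the reduction choices and the argument shapes are our assumptions, since the text specifies only the loss types.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, box_deltas, box_targets,
               mask_logits=None, mask_targets=None):
    """Multitask loss sketch: cross-entropy (8) + smooth-L1 box regression (9)
    + optional per-pixel binary cross-entropy for the mask head (10),
    summed as in (11)."""
    l_cls = F.cross_entropy(cls_logits, cls_targets)
    l_bbox = F.smooth_l1_loss(box_deltas, box_targets, reduction="sum") / max(len(cls_targets), 1)
    l_mask = torch.tensor(0.0, device=cls_logits.device)
    if mask_logits is not None:  # the mask head may be dropped if labels are unavailable
        l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    return l_cls + l_bbox + l_mask
```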

To clearly illustrate the procedure of the proposed method, we provide Algorithm 1. The HMRCNN algorithm enhances object detection accuracy, particularly for small- and medium-sized objects in AAV imagery, through a multistage process. It starts by inputting images and initializing the ResNet50 backbone, FPN, and ABN. The feature extraction phase generates multiscale feature maps from input images, which are then processed by the ABN to produce EFMs. The RPN identifies potential object locations, which are refined through ROI alignment and detection heads for bounding box regression, object classification, and optional mask prediction. Loss calculations include classification, bounding box regression, and optional mask loss, leading to the final output of predicted bounding boxes, class labels, and masks. This structured approach leverages adaptive branching and multitask learning, improving detection accuracy and efficiency for AAV-based environmental monitoring and surveillance.

SECTION Algorithm 1

HMRCNN

1: Input: Aerial images from AAV sensors
2: Initialize:
3:   Load the pretrained ResNet50 backbone
4:   Initialize the feature pyramid network (FPN)
5:   Initialize the adaptive branching network (ABN)
6: Feature Extraction:
7: for each input image I do
8:   Extract feature maps $C_1$ to $C_5$ using ResNet50
9:   Generate multiscale feature maps $P_2$ to $P_6$ using the FPN
10: end for
11: Adaptive Branching:
12: for each feature map $P_i$ from the FPN do
13:   Apply the ABN:
14:     Use the GAPL to reduce dimensions
15:     Select the relevant branches
16:     Extract the enriched feature map (EFM) $P^{\prime}_i$
17: end for
18: Region Proposal:
19:   Apply the RPN to the EFMs $P^{\prime}_i$
20:   Generate region proposals for potential object locations
21: ROI Alignment and Detection:
22: for each region proposal do
23:   Map the proposal to the EFMs using ROI-Align
24:   Use the detection heads for:
25:     Bounding box regression
26:     Object classification
27:     (Optional) Mask prediction for segmentation
28: end for
29: Loss Calculation:
30:   Calculate the total loss $L_t$:
31:     Classification loss $L_{\mathrm{cls}}$
32:     Bounding box regression loss $L_{\mathrm{bbox}}$
33:     (Optional) Mask loss $L_{\mathrm{mask}}$
34: Output:
35:   Predicted bounding boxes, class labels, and (optional) masks for detected objects
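The control flow of Algorithm 1 can be summarized in a few lines of Python. Every argument below is a stand-in callable with a hypothetical name, not the authors' code; the sketch only mirrors the ordering of the stages.

```python
def hmrcnn_forward(image, backbone, fpn, abn_per_level, rpn, roi_heads):
    """High-level sketch of Algorithm 1: backbone -> FPN -> per-level ABN ->
    RPN -> ROI-Align and detection heads."""
    c2, c3, c4, c5 = backbone(image)               # feature extraction (steps 7-10)
    p2, p3, p4, p5, p6 = fpn(c2, c3, c4, c5)       # multiscale feature maps
    # Adaptive branching (steps 11-17): enrich P2..P5; P6 goes to the RPN unchanged.
    enriched = [abn(p) for abn, p in zip(abn_per_level, (p2, p3, p4, p5))]
    enriched.append(p6)
    proposals = rpn(enriched)                      # region proposal (steps 18-20)
    boxes, classes, masks = roi_heads(enriched, proposals)  # steps 21-28
    return boxes, classes, masks
```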

SECTION VI.

Experimental Results and Analysis

A. Datasets and Evaluation Metrics

To evaluate the proposed method, we use the MS-COCO 2017 dataset [22], the Aerial-Cars dataset [29], the VisDrone dataset [42], and the Plastic in River dataset. The average precision (AP) and average recall (AR) are used to quantify detection precision and the ability to avoid missed detections (false negatives), respectively. AR results with 1000 output proposals are reported for both the MS-COCO and Aerial-Cars datasets. Lastly, AP results are reported for HMRCNN and the competing methods.

B. Model Architecture Details

The proposed ABN architecture keeps the added computation to a minimum. Our empirical results on the ResNet backbone show that using $m=4$ branches gives the best performance on the MS-COCO dataset, although the difference between four and five branches was only 0.2%. With $m=4$ branches, the 80 class probabilities produced by the gate are divided into four groups of 20, corresponding to the branches established in the architecture. Our combination strategy combines (adds) any branch that provides a probability greater than $TH$ (starting with $TH=0.1$). For task-specific training, our ABN architecture passes the features through two bottleneck blocks (with channel size 256), the last layer of which has output channels equal to the number of classes, followed by the GAPL, which reduces the feature to $(1\times 1\times N_c)$. For the MS-COCO dataset, $N_c=80$, so the GAPL output has dimensions $(1\times 1\times 80)$. Table 1 reports the block types of each module and the number of times each block is repeated, and Table 2 gives the parameters of each block, including the number of output channels; $N$ is the number of classes in the training dataset. The convolution stride is 1 for all layers in block Type-a and Type-b.

TABLE 1 Block Types and Repetitions in ABN
TABLE 2 Parameters of Blocks in ABN
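As a small illustration of the grouping described above, the snippet below partitions the $N_c=80$ MS-COCO class indices into $m=4$ groups of 20. The contiguous split is our assumption; the text does not state how classes are assigned to branches.

```python
# Partition 80 class indices into 4 branch groups of 20 classes each.
num_classes, num_branches = 80, 4
group_size = num_classes // num_branches
class_groups = [list(range(b * group_size, (b + 1) * group_size))
                for b in range(num_branches)]
# class_groups[0] -> classes 0..19 handled by branch 0, and so on.
print(len(class_groups), len(class_groups[0]))  # -> 4 20
```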

C. Training Strategy

The models in this section are trained on MS-COCO 2017 following the guidelines provided for Detectron2 [43]; the learning rate (LR) is set to 0.0015 with a batch size of 4. AR, calculated for the RPN part of the model, is the main metric used in this work, for both 1000 and 100 proposed regions per image, hence the AR@1000 and AR@100 evaluation metrics. All images are resized to 800 by 800 pixels. For the final AP calculations, 512 ROIs per image are generated. All models use the PyTorch and Detectron2 libraries.
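For readers reproducing the setup, the hyperparameters above can be expressed with standard Detectron2 config keys as in the sketch below. The base config file and dataset names are placeholders; only the learning rate, batch size, input size, and ROI count come from the text, and the exact resizing behavior depends on the augmentation actually used.

```python
from detectron2.config import get_cfg
from detectron2 import model_zoo

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))   # placeholder base config
cfg.DATASETS.TRAIN = ("coco_2017_train",)
cfg.DATASETS.TEST = ("coco_2017_val",)
cfg.SOLVER.BASE_LR = 0.0015                      # learning rate from the text
cfg.SOLVER.IMS_PER_BATCH = 4                     # batch size of 4
cfg.INPUT.MIN_SIZE_TRAIN = (800,)                # images resized to 800 pixels
cfg.INPUT.MAX_SIZE_TRAIN = 800
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512   # 512 ROIs per image
```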

The branching method is applied to each feature map. The model is trained on the MS-COCO train2017 dataset and then fine-tuned on the AAV dataset with the same backbone architecture. Fine-tuning uses 37 epochs of 10,000 iterations each and takes approximately 32 h.

In the following, the analysis with HMRCNN on four different datasets is demonstrated.

1) Results on MS-COCO Dataset

Table 3 displays the AR results of HMRCNN when operating on all scales of the feature maps. The proposed method improves the results on the MS-COCO dataset at almost every scale; the only exception is the $\mathrm{AR}_m$@1000 metric, where the original method is better by less than one percent. Medium- and large-sized objects in this dataset already perform well; for small objects, however, there is a gap of around 15% relative to large and medium objects, and our proposed method improves small-object performance by 3.64% over the original method.

TABLE 3 AR Results on Every Scale for the MS-COCO val2017 Dataset

2) Results on Aerial-Cars Dataset

Table 4 reports the results of the proposed and base models on the Aerial-Cars dataset using the AR metric. HMRCNN consistently improves the results on the Aerial-Cars dataset at every scale. The enhancement is greatest for small objects, where our method shows 12.27% higher results. In this dataset, small- and medium-sized objects were our primary challenge, and we improved them by 12.27% and 5.6%, respectively.

TABLE 4 AR Results on Every Scale for Aerial-Cars

Based on Table 5, the task-specific branching architecture improves the performance of the two-stage MRCNN detector with a ResNet50 backbone, even when trained only on the MS-COCO dataset. The AP metric shows that our method improves precision by more than 6%.

TABLE 5 AP Results on Every Scale for Aerial-Cars

Sample detection results demonstrating the approach are shown in Fig. 3 for MRCNN and Fig. 4 for HMRCNN. As shown in Fig. 4, HMRCNN gives the best results, with higher confidence scores for its predictions. The final AP of the MRCNN network depends directly on the output of the RPN; based on the results, proposal sets with higher AR lead to better AP. Thus, we focus on improving the feature map at each level to enhance the AR results at each scale.

FIGURE 3. MRCNN results on an AAV image. As demonstrated, there are many misdetections, and the confidence scores representing the probability of cars are low.

FIGURE 4. HMRCNN results on an AAV image. There are no misdetections, and the confidence scores representing the probability of cars have improved.

3) Results on VisDrone Dataset

The VisDrone dataset, developed by the AISKYEYE team at Tianjin University, is a comprehensive resource for drone-based visual analysis. It includes 288 video clips and 10209 static images from 14 cities in China, featuring diverse environments and object densities. Annotated with over 2.6 million bounding boxes, it details scene visibility and object occlusion. The dataset supports tasks like object detection, tracking, and crowd counting, and is structured into five subsets for specific applications.

To show the efficiency of the model in detecting small objects, the VisDrone dataset is employed. Following [42], we compare the proposed HMRCNN with four recently introduced methods, namely, SABL [44], Cascade ADPN [45], YOLOx-s [46], and CFANet-s [47], in Table 6. Table 6 demonstrates the superior performance of the proposed HMRCNN method across object sizes in the VisDrone dataset. HMRCNN achieves the highest AP for small objects (34.5%) and medium objects (48.2%), significantly outperforming methods such as MRCNN and Cascade ADPN. While SABL shows the strongest result for large objects (61.7%), HMRCNN remains competitive with an AP of 57.8%, highlighting its robustness and effectiveness across all object scales. Visualizations in Fig. 5 further illustrate the model's ability to accurately detect even the smallest objects, showcasing its precision in object detection tasks.

TABLE 6 AP Results on the VisDrone Dataset
FIGURE 5. Output visualization for the VisDrone dataset.

4) Results on Plastic in River Dataset

The “Plastic in River” dataset, developed for the Kili Community Challenge and hosted on Hugging Face, provides annotated images of rivers highlighting plastic waste items, such as plastic bags, plastic bottles, other plastic waste, and nonplastic waste. Each image includes bounding boxes marking these items, facilitating the training of object detection models. This dataset is crucial for environmental monitoring and helps enhance the accuracy of detecting and classifying plastic pollution in aquatic environments. It comprises over 3000 annotated images capturing various instances of plastic litter across different river scenes. For this part of the evaluation, the results on the test set are provided in Table 7. To provide a comparison, the performance of MRCNN and YOLOv5 has also been measured using the Detectron2 library. HMRCNN outperforms YOLOv5 and MRCNN overall, especially in detecting small and medium objects. YOLOv5 shows strength at higher IoU thresholds and in large object detection, while MRCNN generally lags behind the other two models.

TABLE 7 AP Results on Plastic in River Dataset

To better demonstrate the performance of the proposed model, some of the model outputs are visualized in Fig. 6. As can be seen, the model is capable of detecting all objects with ease, although there are some errors in the classification part.

FIGURE 6. Output visualization for the Plastic in River dataset.

D. Analysis Using Partial Pyramid Features

We perform an ablation study to analyze the effect of using different scales of features in the HMRCNN. Instead of utilizing all pyramid scales from the FPN, represented by the feature maps $P_2$ to $P_6$, as inputs to the RPN, we use a partial set of two feature maps at a time, representing different scales: $\{P_2,P_3\}$, $\{P_3,P_4\}$, $\{P_4,P_5\}$, and $\{P_5,P_6\}$. The results in Table 8 show the RPN performance with a partial FPN at every scale on the MS-COCO val2017 dataset; each column corresponds to an FPN that outputs only two feature maps. $\mathrm{AR}_l$ is the AR for large objects, $\mathrm{AR}_m$ for medium objects, and $\mathrm{AR}_s$ for small objects; @100 and @1000 denote the number of output box proposals. The significant differences in performance across object scales for different input feature maps confirm that each feature map scale is responsible for one or, at most, two object sizes.

TABLE 8 AR Results of RPN With Partial FPNs on MS-COCO
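For reference, restricting a Detectron2-style model to two pyramid levels at a time, as done in this ablation, can be expressed as below. The config keys are standard Detectron2 options; wiring the ABN enrichment into the selected levels is outside stock Detectron2 and is assumed to follow the same pattern.

```python
from detectron2.config import get_cfg
from detectron2 import model_zoo

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))   # placeholder base config
# Use only {P4, P5}; swap the lists for {P2, P3}, {P3, P4}, or {P5, P6}.
cfg.MODEL.RPN.IN_FEATURES = ["p4", "p5"]
cfg.MODEL.ROI_HEADS.IN_FEATURES = ["p4", "p5"]
```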

The performance on large objects with $\{P_5,P_6\}$, in which $P_6$ is the sixth feature map output of the FPN and $P_5$ the fifth, is substantially better than the rest of the results; however, this combination performs poorly on small objects. This illustrates how using features at different scales improves performance and is one of the reasons MRCNN incorporates the FPN.

For medium-sized objects, the combinations $\{P_4,P_5\}$ and $\{P_3,P_4\}$ lead the $\mathrm{AR}_m$ results, outperforming $\{P_5,P_6\}$ at every scale except for large objects.

Lastly, for small objects, which are relatively rare in the MS-COCO dataset, $\{P_2,P_3\}$ yields negligible results, as if these maps were less relevant to the FPN's inference.

We apply the FPN and the ABN partially on the feature maps and evaluate the RPN's performance on the presented feature maps. This way, we can observe how each feature map is enriched and how well it performs in the next stage.

In Table 9, the two feature maps presented to the RPN are $\{P_5,P_6\}$; following the MRCNN base architecture, the FPN and the ABN are applied to $P_5$ only, and $P_6$ is sent directly to the RPN.

TABLE 9 HMRCNN AR Results on ($P_5$, $P_6$) for MS-COCO

The results above focus on large objects; these feature maps are heavily downsized, so only large objects remain readable by the RPN. Hence, the critical metric for this combination is $\mathrm{AR}_l$, on which the proposed method shows 3.48% higher results.

We continue the analysis on the Aerial-Cars dataset, which contains roughly ten times more objects per input image than the MS-COCO dataset. We again apply the FPN and the ABN partially on the feature maps and evaluate the RPN's performance with the AR metric, observing how each feature map is enriched and how well it performs in the next stage. The number of large objects in this dataset is negligible compared to the medium and small ones. Therefore, following the previous results on the Aerial-Cars dataset, we compare the results of the enriched pairs of features to the base model in Table 10. Each of the mentioned feature map combinations is sent to the RPN.

TABLE 10 HMRCNN Results on ($P_5$, $P_6$) and ($P_4$, $P_5$) for Aerial-Cars

The AR metric for the EFMs is better than that of the base model at every scale. As elaborated earlier, the important metric here is $\mathrm{AR}_l$: enriched $\{P_5,P_6\}$ with HMRCNN is 10.12% higher than MRCNN with $\{P_5,P_6\}$, and enriched $\{P_4,P_5\}$ performs better by 15.96%. We also found higher performance on $\{P_3,P_4\}$, on average about 13.66% higher AR compared to the base method with the partial feature map input $\{P_3,P_4\}$. The results for $P_2$ show 14.2% higher results compared to the base model.

Additionally, Fig. 7 visualizes the feature maps for $P_5$ and $P_4$, alongside the actual input image and the corresponding EFMs. Figs. 8 and 9 plot the feature maps and EFMs of $P_3$ and $P_2$, respectively. For such detailed images, the last feature map produced by the FPN and ABN is heavily downsized and needs to carry more information for further processing. In contrast, $P_4$ retains slightly higher resolution, resulting in better AR@1000 in both cases.

FIGURE 7. Plotted feature maps ($P_5$, Enriched) and ($P_4$, Enriched). Input image from the Aerial-Cars dataset.

FIGURE 8. Plotted feature maps ($P_3$, Enriched). Input image from the Aerial-Cars dataset.

FIGURE 9. Plotted feature maps ($P_2$, Enriched). Input image from the Aerial-Cars dataset.

E. Feature Enhancement for Different Stages

In the case of MRCNN, one challenge is deciding where to apply this module to achieve the best performance. Adding the MTL module to the second stage of the model bottlenecks the information flow and limits it to the later stages, compared to the earlier stage, while also making the model more complex and adding parameters. The second stage is designed to perform task-specific predictions, whereas the first stage estimates "object-ness." On the other hand, adding an MTL module to the backbone allows the model to learn both object detection and instance segmentation simultaneously, using features learned from the shared backbone. This improves precision because the shared features learned through the MTL module are more informative and useful for both tasks. The results in Table 11 show the effect of applying MTL at different stages of MRCNN, which justifies the use of first-stage MTL. For easier comparison, we kept the backbone small and sent out only two feature maps in all cases. With the second-stage variant, performance drops by about 10%–15% compared to the first-stage HMRCNN architecture. These results confirm that enhancing the first-stage feature maps provides more value in detecting object proposals for further processing.

TABLE 11 AR Results on Effect of Stages on MS-COCO

SECTION VII.

Conclusion

In this article, we proposed the HMRCNN model, a novel MTL module based on the low-cost dynamic multitask architecture HydraNet, designed to enhance object detection, particularly for small- and medium-sized objects in AAV imagery. Our model introduces an adaptive process to the multilabel classification step of the backbone architecture, leading to improvements in detection precision, as evidenced by our experimental evaluation on the MS-COCO, Aerial-Cars, VisDrone, and Plastic in River datasets, where HMRCNN outperforms traditional methods, including MRCNN. While the HMRCNN model demonstrates superior accuracy and efficient detection capabilities, particularly with its enriched feature maps, the complexity of its architecture, especially the inclusion of the ABN, introduces considerable computational overhead and longer training times. The model's design also presents challenges in interpretability. Future work should therefore focus on integrating explainable AI (XAI) techniques, such as attention and saliency maps, to improve transparency, and on optimizing the architecture through model pruning, quantization, and lightweight backbones to reduce computational demands, further enhancing the model's applicability in various AAV and aerial imagery applications.
