Introduction
With the explosive progress of computing hardware and deep learning techniques, a variety of methods, e.g., DTNet [1] and CRSIOD [2], have been developed for object detection in autonomous aerial vehicle (AAV) images, achieving significant performance improvements. As a result, AAV-based object detection plays a critical role in many fields, such as urban planning [3], disaster rescue [4], and environmental monitoring [5]. However, dynamic flight altitudes, combined with variability in scale and shooting perspective [6], mean that small object detection in AAV images remains a challenging task.
Traditional object detection methods usually rely on handcrafted features to separate objects from complex scenes, such as image filtering [7], low-rank approximation [8], and transform-domain methods [9], and have made significant progress. However, this manual design strategy depends heavily on expert prior knowledge and is typically time-consuming and difficult to generalize. To overcome this challenge, numerous works have applied deep learning to object detection tasks [10], [11], [12], achieving substantial improvements in detection accuracy. Deep learning-based methods typically adopt a feature learning paradigm that automatically discovers the features necessary for detection and classification. Among these works, convolutional neural network (CNN)-based methods [13], [14], [15] have shown tremendous potential. They can automatically learn more abstract, higher dimensional feature representations from large datasets [16], [17], [18], demonstrating superior representational power. In particular, CNN-based methods can effectively handle the challenges posed by complex backgrounds, making them more robust and adaptable to real-world conditions.
In recent years, more and more studies have focused on adapting these methods to AAV object detection [19], [20]. Compared to natural scenes, objects in aerial images exhibit tremendous scale variations. To address this characteristic, S2A-Net [21] increases the focus on high-resolution feature information and employs multihead attention mechanisms to improve both classification scores and localization accuracy. To compensate for the lack of information in few-texture objects, AF-SSD [22] devised a cascaded structure with spatial and channel attention modules to leverage contextual information. Although these optical-image-based methods have made progress, the inherent limitations of optical sensors tend to cause serious performance degradation when detecting small objects or under insufficient lighting [23]. As shown in Fig. 1, in low-light conditions the objects marked in red in the optical images are almost invisible and those marked in green are only faintly visible, whereas the infrared images still show the objects clearly. As illustrated in Fig. 2, multimodal fusion-based methods typically achieve higher accuracy than methods relying on unimodal images. Thus, several works utilize infrared images as an additional modality [24], [25], [26], [27]. For example, SuperYOLO [28] employs a pixel-level multimodal fusion strategy that involves mask extraction and channel interaction operations on visible-infrared images. YOLOFIV [29] introduced a dual-stream backbone network along with a C3ECA module to adequately extract the features of both modalities.
Examples of visible-infrared aerial image pairs. (a) and (b) show vehicles captured at night by the roadside. (c) shows vehicles on the road captured during the daytime. The red boxes mark objects that are difficult to detect in visible images at night, the green boxes mark objects that are barely visible under weak lighting, and the yellow boxes mark areas where the modalities are misaligned.
Comparison of detection accuracy between multimodal fusion-based and unimodal methods.
Although the aforementioned multimodal detection methods have greatly improved detection performance, their modal interaction typically recalibrates feature responses based on first-order statistics (the global mean). This basic form of interaction operates within a single dimension and does not delve into the complex interdependencies between different channels. Yuan and Wei [23] leveraged the intrinsic matrix multiplication of query, key, and value in the Transformer to learn cross-spatial relationships between the optical and infrared modalities, enabling second-order modal interaction. Compared with first-order interaction, it is more expressive and involves the mutual influence between two or more elements. However, this interaction is still limited to the direct relationship between the two modalities and does not consider the potential of high-order spatial correlation and channel interaction. In this work, we extend the interaction to higher orders, achieving complex interactions between the local details and global information of different modalities.
We propose a dynamic cascade cross-modal coassisted network (DCCCNet), a novel network based on high-order cross-spatial and channel attention for multimodal fusion AAV object detection. The method uses a dual-stream progressive downsampling CNN as the backbone to extract multiscale features from optical and infrared images separately. A multimodal high-order interaction (MHI) module then fully exploits the deep correlation between the heterogeneous images. It alternates spatial and channel interaction so that information in both the spatial and channel domains maintains a favorable distribution in the fused feature space. High-order iteration gradually enhances salient features while suppressing irrelevant information, achieving comprehensive integration of multimodal features from shallow to deep levels. During multimodal fusion and forward propagation through the CNN, small-object information and sparse features tend to be lost [30]. Considering that reconstructing the input images from features during training is a strategy for implicitly preserving discriminative details, we devise a scale-adaptive dynamic feature prompt (SADFP) module. SADFP is a fine-grained information exploration branch composed of dynamic convolution, dilated convolution, and standard convolution, constrained by an L1 loss that reduces the discrepancy between the reconstructed image and the original input. Finally, to integrate the fused multilevel features and enlarge the global view of the model, we propose a global collaborative enhancement (GCE) module. It utilizes a region-based self-attention mechanism to model the global dependencies of the integrated modalities, further improving the fusion quality of complementary information between modalities.
In summary, the main contributions of this work are as follows.
We present the DCCCNet, a novel multimodal object detection framework that leverages high-order multimodal interactions and dynamic supervision to effectively improve the precision of all-weather object detection in AAV imagery.
We devise an efficient MHI module, which is a cascaded structure incorporating spatial and channel attention mechanisms. It can focus on important object-related information and iteratively aggregate features to integrate high-order spatial detail relationships between two modalities.
We design an innovative supervised prompt branch called SADFP, which captures feature degradation cues in the CNN through a coassisted mechanism and enables the backbone to preserve more detailed object features.
A GCE module is introduced to model long-range dependencies in the cross-modal fusion features. It computes correlations within and between local regions to preserve more global information during fusion, further enhancing the fusion quality of visible and infrared images and improving detection accuracy.
Related Works
A. RGB-Based Object Detection
As a fundamental task in AAV applications, object detection plays a significant role in many fields, such as urban planning [3], disaster rescue [4], and military reconnaissance [31]. Recently, numerous efforts have been devoted to exploring deep learning-based object detection methods, achieving remarkable successes [12], [32], [33]. For example, R-CNN [34] introduced CNNs into object detection: it first generated region proposals from the original image and then used a CNN to extract features from these regions for classification. Fast R-CNN [14] shared convolutional features across proposals and performed classification and bounding-box regression jointly, significantly improving computational speed. Faster R-CNN [35] introduced the region proposal network, which enables rapid candidate region generation, enhancing both detection speed and accuracy; in addition, its anchor mechanism allows the algorithm to handle objects of varying sizes and shapes. Redmon et al. [13] divided the input image into an S×S grid and directly predicted object existence and class probabilities, achieving extremely fast computation. Researchers have since continuously optimized YOLO. YOLOv3 [36] leverages the concept of feature pyramid networks (FPN) to detect objects of varying sizes from multiscale feature maps. YOLOv4 [37] introduces cross-stage partial connections into the Darknet53 backbone, reducing computational complexity while improving feature representation, and enhances multiscale feature fusion with a path aggregation network. YOLOv5 [38] further refines the architecture by replacing the spatial pyramid pooling module with the more efficient spatial pyramid pooling fast structure, allowing better integration of multiscale features. YOLOv6 [39] introduces a new backbone called EfficientRep, which incorporates more efficient convolution operations. YOLOv7 [40] merges multiple convolution layers into a single equivalent layer through re-parameterization, reducing computational complexity and increasing inference speed. YOLOv8 [41] adopts a refined CSP-based backbone with C2f modules and an anchor-free detection head, maintaining high performance while remaining computationally efficient.
Despite the success achieved by the YOLO series in various fields, there are still challenges in AAV object detection, such as large variations in object scale, complex backgrounds, and inadequate lighting. To address these issues, many researchers have improved the YOLO series algorithms. SCINet [42] leverages attention mechanisms and super-resolution reconstruction to highlight the spatial structural details of objects. FFAGRNet [43] reconstructs deep feature maps and incorporates spatial context information to effectively improve the detection of blurry objects. YoloOW [44] proposes the OaohRep convolutional block to improve scale adaptability for feature extraction and enhancement, effectively reducing both false negatives and false positives.
B. Multimodal Fusion Object Detection
Despite significant progress in AAV object detection in recent years, existing methods still face several limitations that hinder their effectiveness in complex real-world scenarios. A major limitation is that they rely exclusively on a single data modality, such as the visual information in optical images, which may not capture comprehensive information about objects. This can reduce detection accuracy, especially under occlusion, low-light conditions, or variations in object appearance, where the object features in unimodal data are insufficient. Integrating an additional modality is an effective way to overcome these problems. By combining modalities, such as optical and infrared, models can leverage their complementary strengths to obtain a more comprehensive understanding of objects and their surroundings [45], [18], [46]. For example, Liu et al. [47] employed a meta-feature embedding model to extract meta-features from the object detection task, guiding the fusion network to learn semantic information enriched with textural details. Zhang et al. [28] proposed SuperYOLO, which applies a fusion module based on channel attention and a super-resolution branch to enhance small object detection in remote sensing images, but it overlooks the unique characteristics of each modality. DDFN [48] obtains a fused representation by aggregating local information from infrared features and long-term dependency information from visible-light features. LF-MDet [49] leverages low-rank enhancement and dynamic lighting-aware masking to extract multimodal features in an unbiased and compatible way, achieving superior performance over state-of-the-art multimodal detectors. These methods outperform unimodal approaches. Recently, several studies have explored deep learning and attention mechanisms to further improve multimodal object detection. For example, Shen et al. [50] proposed an iterative cross-attention interaction method that leverages complementary information from the auxiliary modality to enhance unimodal feature representations. Similarly, Jian et al. [51] introduced a discrepancy information injection module on top of the cross-attention mechanism, enabling the unique features of each source image to be explored separately.
While these attention-based methods demonstrate the potential of multimodal integration, their ability to model complex interactions between modalities is still insufficient. In this article, we develop an efficient multimodal interaction method that exploits multilevel cascaded spatial and channel attention layers to iteratively aggregate local and global feature information. It can extract deep complementary features and reduce redundant background information.
C. Cross-Modal Interaction
Early multimodal fusion methods for object detection primarily focused on generating composite images that emphasize objects more prominently for human vision. Researchers have introduced various image processing techniques, such as sparse representation [47], principal component analysis [52], the Laplacian pyramid [53], and the discrete wavelet transform [54]. While achieving good visual fusion effects, these approaches face limitations in effectively capturing the intrinsic features of the data and in scaling to large datasets. Recently, deep learning-based methods have achieved superior visual fusion. For example, SuperYOLO [28] utilized a squeeze-and-excitation block to facilitate first-order interaction across modalities in the channel domain. Guided attentive feature fusion [55] dynamically weights and fuses multispectral features through cross-modal and intramodal attention modules. These methods can capture important complementary information, but they primarily focus on feature interactions in either the spatial or the channel dimension, without conducting synergistic high-order interactions across both.
Proposed Method
A. Overall DCCCNet Architecture
The overall architecture of the proposed DCCCNet is shown in Fig. 3; it consists of three main components: the SADFP, MHI, and GCE modules. First, the backbone network is extended with a parallel branch to extract multiscale individual features from the optical and infrared images. The SADFP is attached to the side of the backbone, takes these extracted features as input, and adaptively captures global and local information to reconstruct the original input images. Its loss functions are all L1 losses, which measure the pixel value differences between the reconstructed images and the inputs. By reducing this loss, the backbone is encouraged to preserve object detail information during convolutional downsampling. Subsequently, the MHI integrates the multiscale features through a colearning mechanism, which allows each modality to learn advantageous information from the other. The fused multiscale features therefore possess both the texture details of the optical images and the spatial semantic information of the infrared images. They are then unified to the same size, concatenated, and fed into the GCE to establish global relationships between objects and their context. Finally, the processed feature maps are passed to the detection head to obtain the predicted categories and locations of objects. Detailed descriptions of these components are provided in Sections III-B–III-E.
Overall framework of the DCCCNet, which comprises three main components: SADFP, MHI, and GCE. The SADFP dynamically captures both local and global information from the features, aiding the reconstruction of details and guiding the backbone toward learning super-resolution feature information. The MHI utilizes high-order spatial and channel interactions, effectively integrating fine-grained complementary information from the optical and infrared imagery to generate rich, discriminative features and thereby enhance detection and recognition performance. Finally, the GCE models long-range dependencies between point features across the entire image, providing a more thorough understanding of the object context and enhancing the robustness of the model in challenging detection scenarios.
B. Scale-Adaptive Dynamic Feature Prompt Module
Current object detection methods commonly employ backbone networks for feature extraction, which efficiently capture the general characteristics of images. However, the inherent downsampling in these backbones inevitably discards much of the subtle object information, making it difficult for the network to produce accurate predictions from the remaining features. We argue that if the image reconstructed from the feature maps can recover the original details, then the network is implicitly preserving fine-grained information that might otherwise be lost in the deeper layers of the CNN.
Based on the analysis above, we devise an SADFP module to address the issue of detail loss. During the training stage, the SADFP works in parallel with the backbone network and dynamically captures both local and global information through the following mechanisms.
Dynamic Convolutional Kernel Adjustment: The SADFP dynamically adjusts the parameters of convolutional kernels based on different inputs. This is achieved through an adaptive learning mechanism that utilizes the spatial structural diversity of features to optimize the convolutional kernels, enabling it to flexibly sample input features at different spatial positions.
Feature Reconstruction: The adjusted convolutional kernels are used to convolve the feature maps, generating a reconstructed image. This step aims to extract as much high-frequency detail information as possible to restore fine structures and textures in the image from intermediate features.
L1 Loss: We calculate the pixelwise discrepancy between the reconstructed image and the original image through L1 loss. By minimizing this difference, the backbone can be trained to more effectively preserve detailed structural information of the image during the feature extraction process. The L1 loss measures the absolute difference at each pixel position between the reconstructed image and the original image, making it particularly sensitive to capturing edges and fine details within the image.
The architecture of the SADFP is shown in Fig. 3. It comprises two branches, which are designed to comprehensively capture diverse types of information within images. Given an input feature map $F$, the inputs of the two branches are obtained as
\begin{equation*}
F_{\text{us}}=\text{Conv}_{1}(F),\quad F_{\text{ls}}=\text{Conv}_{2}(F) \tag{1}
\end{equation*}
\begin{equation*}
F_{\text{SAB}}=\text{DynamicConv}(\text{DilatedConv}(\text{Conv}(F_{\text{us}}))). \tag{2}
\end{equation*}
\begin{equation*}
\mathrm{W}=\alpha _{1} \mathrm{W}_{1}+\cdots +\alpha _{n} \mathrm{W}_{n} \tag{3}
\end{equation*}
\begin{equation*}
F_{\text{upper}}=\text{SAB}({\ldots }(\text{SAB}(F_{\text{us}})+F_{\text{us}})+\cdots) +F_{\text{SAB}}^{n-1} \tag{4}
\end{equation*}
\begin{equation*}
F_{\text{lower}}=\text{CR}({\ldots }(\text{CR}(O_{n}+\text{CR}(O_{n}))+\cdots) +O_{1}. \tag{5}
\end{equation*}
\begin{equation*}
F_{\text{out}}=f_{\text{interpolate}}(F_{\text{upper}}\otimes F_{\text{lower}}) \tag{6}
\end{equation*}
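The paper does not release source code, so the following PyTorch sketch is only one plausible realization of the SADFP branch; the class and parameter names (`DynamicConv`, `SADFPBranch`, `n_kernels`) are assumptions. It illustrates the three mechanisms above: candidate kernels mixed with input-dependent coefficients as in (3), a cascade of standard, dilated, and dynamic convolutions as in (2), and an L1 penalty between the reconstructed image and the original input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConv(nn.Module):
    """Mixes n candidate kernels with input-dependent weights: W = a1*W1 + ... + an*Wn (Eq. 3)."""
    def __init__(self, channels, n_kernels=4, kernel_size=3):
        super().__init__()
        self.n = n_kernels
        self.weight = nn.Parameter(
            torch.randn(n_kernels, channels, channels, kernel_size, kernel_size) * 0.02)
        # Lightweight router: global pooling + linear layer produces the mixing coefficients.
        self.router = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(channels, n_kernels))
        self.pad = kernel_size // 2

    def forward(self, x):
        alpha = torch.softmax(self.router(x), dim=1)          # (B, n)
        outs = []
        for b in range(x.size(0)):                            # per-sample mixed kernel
            w = (alpha[b].view(self.n, 1, 1, 1, 1) * self.weight).sum(dim=0)
            outs.append(F.conv2d(x[b:b + 1], w, padding=self.pad))
        return torch.cat(outs, dim=0)


class SADFPBranch(nn.Module):
    """Reconstructs the input image from an intermediate feature map; trained with an L1 loss."""
    def __init__(self, feat_channels, img_channels=3):
        super().__init__()
        # Standard conv -> dilated conv mirrors the Conv/DilatedConv cascade of Eq. (2).
        self.refine = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.dynamic = DynamicConv(feat_channels)
        self.to_img = nn.Conv2d(feat_channels, img_channels, 1)

    def forward(self, feat, image):
        rec = self.dynamic(self.refine(feat))
        rec = F.interpolate(self.to_img(rec), size=image.shape[-2:],
                            mode="bilinear", align_corners=False)
        return F.l1_loss(rec, image)   # pixelwise reconstruction loss fed back to the backbone
```

During training, this loss would be added to the detection objective so that the gradient flows back into the backbone; the branch can be discarded at inference time.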
C. Multimodal High-Order Interaction Module
Cross-modal collaborative interaction between global spatial locations and channels is crucial for efficient multimodal fusion. To achieve such high-order synergistic interaction and generate discriminative fused features, we propose an MHI, a cascaded structure incorporating spatial and channel attention modules. The spatial attention module is used to add corresponding object feature information in additional modalities and the channel attention module enables the network to focus on important channels related to object features. The MHI leverages cross-spatial and channel attention mechanisms to establish spatial detail connections and global statistic synergies between the two modalities. In addition, through iterative aggregation, this interaction is extended to high-order representations, facilitating the discrimination of intramodal interdependencies and thereby effectively utilizing the complementary information from infrared images.
By analyzing the spatial relationships between pixels and weighting each pixel in a feature map, the spatial attention module can add contextual information for few-texture objects. The channel attention module weights each channel of a feature map by learned weights, which can guide the network to pay attention to important features for few-texture and low-contrast objects.
The architecture of the MHI is illustrated in Fig. 4. Given a pair of visible and infrared images
\begin{equation*}
F_{\text{spatial}}^{1}=\mathrm{\text{softmax}}\left(\frac{\mathbf {Q}\otimes \mathbf {K}^{T}}{\sqrt{d_{k}}}\right)\otimes \mathbf {V} \tag{7}
\end{equation*}
\begin{align*}
\mathbf{X} &= \text{LN}\left(\mathcal{F}^{-1}\left(\mathcal{F}\left(\text{Conv}(F_{\text{vis}})\right)\odot \overline{\mathcal{F}\left(\text{Conv}(F_{\text{ir}})\right)}\right)\right) \tag{8}
\\
F_{\text{spatial}}^{1} &= \mathcal{F}^{-1}\left(\mathcal{F}\left(\mathbf{X}\right)\odot \overline{\mathcal{F}\left(\text{Conv}(F_{\text{ir}})\right)}\right) \tag{9}
\end{align*}
\begin{align*}
\mathbf{X} &= \mathcal{F}^{-1}\left(\mathcal{F}\left(\text{Conv}(F_{\text{spatial}}^{N-1})\right)\odot \overline{\mathcal{F}\left(\text{Conv}(F_{\text{ir}})\right)}\right) \tag{10}
\\
F_{\text{spatial}}^{N} &= \mathcal{F}^{-1}\left(\mathcal{F}\left(\text{LN}(\mathbf{X})\right)\odot \overline{\mathcal{F}\left(\text{Conv}(F_{\text{ir}})\right)}\right). \tag{11}
\end{align*}
In the subsequent step, to perform the channel interaction operation, we concatenate the output of the N-order spatial interaction with the infrared feature $F_{\text{ir}}$
\begin{equation*}
F_{\text{concat}}=\text{concat}\left[F_{\text{spatial}}^{N},F_{\text{ir}}\right]. \tag{12}
\end{equation*}
\begin{equation*}
G_{1}^{i}=\frac{1}{H\times W}\sum _{x=1}^{H}\sum _{y=1}^{W}F_{\text{concat}}^{i}(x,y) \tag{13}
\end{equation*}
\begin{equation*}
\mathbf {W}_{1} = \sigma (\text{Conv}(G_{1}^{i})). \tag{14}
\end{equation*}
\begin{equation*}
F_{\text{channel}}^{1}=\mathbf {W}_{1}\odot F_{\text{concat}} \tag{15}
\end{equation*}
\begin{align*}
\mathbf {W}_{2} &= \sigma \left(\text{Conv}\left(\mathbf {W}_{1}\right)\right) \tag{16}
\\
F_{\text{channel}}^{2}&=\mathbf {W}_{2}\odot \text{Conv}\left(F_{\text{channel}}^{1}\right). \tag{17}
\end{align*}
\begin{align*}
\mathbf {W}_{N} &= \sigma \left(\text{Conv}\left(\mathbf {W}_{N-1}\right)\right) \tag{18}
\\
F_{\text{channel}}^{N}&=\mathbf {W}_{N}\odot F_{\text{channel}}^{N-1}. \tag{19}
\end{align*}
Proposed MHI. We utilize the left half part to generate a high-order spatial interaction feature map from optical and infrared images, integrating spatially fine-grained complementary information. Then, the output is concatenated with infrared features along the channel dimension. Subsequently, through high-order channel interaction in the right half part, the internal dependencies within the channel domain are extracted.
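As a reading aid, the sketch below shows one way the MHI equations could be implemented in PyTorch; it is not the authors' code. The class name `MHISketch`, the `GroupNorm` stand-in for LN, and the omission of the learnable scalings and of some intermediate convolutions in (17) are simplifying assumptions. The left half realizes the frequency-domain cross-spatial interaction of (8)-(11), and the right half realizes the cascaded channel reweighting of (13)-(19).

```python
import torch
import torch.nn as nn
import torch.fft


class MHISketch(nn.Module):
    """High-order spatial interaction in the Fourier domain followed by cascaded channel reweighting."""
    def __init__(self, channels, spatial_order=2, channel_order=2):
        super().__init__()
        self.spatial_order = spatial_order
        self.conv_vis = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_ir = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm = nn.GroupNorm(1, channels)   # stands in for LN on a 2-D feature map
        self.channel_convs = nn.ModuleList(
            [nn.Conv2d(2 * channels, 2 * channels, 1) for _ in range(channel_order)])

    def spatial_interaction(self, f_vis, f_ir):
        # Conjugate IR spectrum; the elementwise product correlates the two modalities (Eqs. 8-11).
        ir_spec = torch.conj(torch.fft.rfft2(self.conv_ir(f_ir)))
        x = self.conv_vis(f_vis)
        for _ in range(self.spatial_order):
            x = torch.fft.irfft2(torch.fft.rfft2(x) * ir_spec, s=f_ir.shape[-2:])
            x = self.norm(x)
        return x

    def channel_interaction(self, f_concat):
        # First-order weights from global channel statistics (Eqs. 13-15).
        w = torch.sigmoid(self.channel_convs[0](f_concat.mean(dim=(2, 3), keepdim=True)))
        out = w * f_concat
        for conv in self.channel_convs[1:]:
            w = torch.sigmoid(conv(w))          # higher-order weights derived from the previous ones
            out = w * out
        return out

    def forward(self, f_vis, f_ir):
        f_spatial = self.spatial_interaction(f_vis, f_ir)
        return self.channel_interaction(torch.cat([f_spatial, f_ir], dim=1))
```

Setting `spatial_order` and `channel_order` to 2 corresponds to the second-order configuration that the ablation study later identifies as the most effective.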
D. Global Collaborative Enhancement
With the assistance of multimodal fusion and the dynamic feature prompt, the features extracted by the backbone network are embedded with rich semantic information. To further enhance the representational capacity of object features and the discrimination between objects and backgrounds, we design a GCE module.
Before being fed into the detection head, the feature maps are processed by the GCE module, which incorporates an efficient window-based self-attention mechanism and models the global relationships within the feature maps to enable a more comprehensive contextual understanding. Notably, this self-attention mechanism significantly reduces computational cost; compared with the Swin Transformer, it demonstrates superior efficiency and performance when handling high-resolution images, making it more suitable for applications that require processing intricate details and complex scenes.
As shown in Fig. 5, the GCE divides the feature maps into multiple rectangular subregions. For each subregion, a carry token (ct) is computed to represent the overall information within that region. We leverage ct to facilitate information exchange both within and across regions, thereby modeling long-range dependencies. Specifically, the feature map
\begin{equation*}
\text{ct}_{i}=\text{AvgPool}(\text{Conv}(f)). \tag{20}
\end{equation*}
Then, a total of
\begin{equation*}
\text{MHSA}(\alpha _{i})={\sum _{j=1}^{n}}\left(\frac{\exp \left({q_{i}k_{j}^{T}}\right)}{\sum _{l=1}^{n}\exp \left({q_{i}k_{l}^{T}}\right)}v_{j}\right) \tag{21}
\end{equation*}
\begin{align*}
\text{CT}&=\text{CT}+\epsilon \cdot \text{MHSA}(\text{LN}(\text{CT})) \tag{22}
\\
\text{CT}&=\text{CT}+\eta \cdot \text{MLP}(\text{LN}(\text{CT})) \tag{23}
\end{align*}
\begin{equation*}
m=\text{concat}(\text{ct}_{i},X_{i}). \tag{24}
\end{equation*}
\begin{align*}
m&=m+\epsilon \cdot \text{MHSA}(\text{LN}(m)) \tag{25}
\\
m&=m+\eta \cdot \text{MLP}(\text{LN}(m)). \tag{26}
\end{align*}
\begin{equation*}
f_{\text{out}}=\text{Upsample}(\text{CT})+m. \tag{27}
\end{equation*}
As a result, we obtain a feature map of a subregion
Structure of the GCE. Large-, medium-, and small-scale feature maps from the FPN are divided into several square patches (subregions). For each region, a ct is computed, and all of these ct are concatenated for global self-attention interaction. Subsequently, each ct is concatenated with the flattened vector within the corresponding window. Through self-attention computation, long-range information is transmitted to the local region, enabling the modeling of long-distance dependencies.
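To make the carry-token mechanism concrete, the following PyTorch sketch follows (20)-(27) under several assumptions the paper does not specify: the class name `GCESketch`, the window size, a shared `LayerNorm`/MLP for the global and local stages, and the omission of the learnable scales ε and η and of the final upsampling step in (27). Each FPN scale would be processed by its own instance.

```python
import torch
import torch.nn as nn


class GCESketch(nn.Module):
    """One pooled carry token per window exchanges information globally,
    then is attended jointly with the tokens of its own window."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window = window
        self.to_ct = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                   nn.AdaptiveAvgPool2d(1))           # ct_i = AvgPool(Conv(f_i)), Eq. (20)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (B, C, H, W); H and W divisible by the window size
        B, C, H, W = x.shape
        w = self.window
        # Split into non-overlapping windows: (B*nWin, C, w, w).
        wins = (x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 1, 3, 5)
                 .reshape(-1, C, w, w))
        ct = self.to_ct(wins).flatten(1).view(B, -1, C)               # one carry token per window
        ct = ct + self.global_attn(self.norm1(ct), self.norm1(ct), self.norm1(ct))[0]   # Eq. (22)
        ct = ct + self.mlp(self.norm2(ct))                            # Eq. (23): global exchange among ct
        tokens = wins.flatten(2).transpose(1, 2)                      # (B*nWin, w*w, C)
        m = torch.cat([ct.reshape(-1, 1, C), tokens], dim=1)          # Eq. (24): prepend ct to its window
        m = m + self.local_attn(self.norm1(m), self.norm1(m), self.norm1(m))[0]         # Eq. (25)
        m = m + self.mlp(self.norm2(m))                               # Eq. (26)
        return m[:, 1:, :]                                            # enriched window tokens
```

The returned tokens would be folded back into a feature map of the original resolution before being passed to the detection head.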
E. Loss Function
During the training stage, the loss function of the DCCCNet is composed of a detection loss and a reconstruction loss as follows:
\begin{equation*}
L_{\text{total}}=L_{\text{det}}+\lambda L_{\text{rec}} \tag{28}
\end{equation*}
\begin{equation*}
L_{\text{det}}=\alpha L_{\text{box}}+ \beta L_{\text{obj}} + \gamma L_{\text{cls}} \tag{29}
\end{equation*}
\begin{equation*}
L_{\text{rec}}=\frac{1}{N} \sum _{i=1}^{N} \left|I_{\text{rec}}(i)- I_{\text{orig}}(i)\right| \tag{30}
\end{equation*}
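A compact sketch of how the objective in (28)-(30) can be assembled is given below; the weighting values are placeholders rather than the settings used in the paper.

```python
import torch


def total_loss(det_losses, rec_image, orig_image, lam=0.1, alpha=0.05, beta=1.0, gamma=0.5):
    """Combines detection and reconstruction terms as in (28)-(30).
    lam/alpha/beta/gamma are illustrative placeholders, not the paper's values."""
    l_box, l_obj, l_cls = det_losses                        # components of the YOLO-style detection loss
    l_det = alpha * l_box + beta * l_obj + gamma * l_cls    # Eq. (29)
    l_rec = torch.mean(torch.abs(rec_image - orig_image))   # Eq. (30): pixelwise L1 over N pixels
    return l_det + lam * l_rec                              # Eq. (28)
```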
Experiments
In this section, we provide a comprehensive overview of the datasets for the experiments, followed by an introduction of the selected comparison methods and evaluation metrics. Subsequently, we conduct an in-depth ablation analysis to prove the effectiveness of the key components in the DCCCNet. Following this, we perform both qualitative and quantitative comparisons with state-of-the-art methods to demonstrate the superiority of the DCCCNet in comparison to all competitive approaches.
We evaluate the performance of our model on two challenging remote sensing datasets: vehicle detection in aerial imagery (VEDAI) [56] and DroneVehicle [57]. YOLOv5 is selected as our baseline for its adaptable network architecture, high accuracy, and robustness, and because it has been widely adopted in industrial real-time object detection scenarios. In the ablation analysis, we conduct comprehensive ablation experiments on the SADFP, MHI, and GCE modules and discuss the results in detail. In the quantitative experiments, we compare the performance of different interaction orders within the MHI module and carry out sufficient ablation studies to demonstrate the effectiveness of our proposed method. Furthermore, we perform feature visualization to intuitively showcase the effects of our model.
A. Experimental Dataset and Evaluation Metrics
1) VEDAI Dataset
VEDAI is a multimodal remote sensing image dataset dedicated to vehicle detection. Derived from the Utah Automated Geographic Reference Center, the images in VEDAI were originally captured at 16 000 × 16 000 pixels and subsequently cropped into 1024 × 1024 pixel tiles, which were further downsampled to produce a 512 × 512 pixel version. The infrared and optical images in this dataset share identical perspectives and were captured at a uniform altitude. VEDAI offers high-resolution aerial imagery with precise bounding box and category annotations, encompassing nine categories: “airplane,” “boat,” “car,” “truck,” “tractor,” “camper,” “van,” “pickup,” and “other.” Moreover, the images exhibit various challenges, including lighting and shadow variations, specular reflections, and occlusions; as shown in Fig. 6, the VEDAI dataset consists mainly of small objects, which effectively differentiates the performance of vehicle detection algorithms.
Associated statistics of VEDAI dataset. Count of objects per type within the training subset. Dimensions and quantity of actual bounding boxes for various categories in the training data. Information on the centroid position of each object relative to the entire image frame. The aspect ratios (long-to-short edge) of the objects within an image, normalized against the dimensions of the entire image.
2) DroneVehicle Dataset
The DroneVehicle dataset is a large-scale RGB-infrared vehicle detection dataset captured by AAVs, comprising 56 878 images, i.e., 28 439 paired infrared and optical images, primarily captured in urban settings from daytime to nighttime. It provides annotations for five vehicle categories: “car,” “truck,” “bus,” “van,” and “freight car.” We selected 17 990 image pairs for the training set and 1469 image pairs for the test set. This split ensures a robust model evaluation while maintaining a sufficient number of training samples. As shown in Fig. 7, the categories of the DroneVehicle dataset are highly imbalanced and the scale differences are large, which poses a challenge for detection. In addition, each image has a resolution of 840 × 712 pixels and is surrounded by a white border 100 pixels wide. In our experiments, the white border was removed, resulting in an input resolution of 640 × 512 pixels; these images were then resized to 512 × 512 within the model.
Associated statistics of DroneVehicle dataset. Count of objects per type within the training subset. Dimensions and quantity of actual bounding boxes for various categories in the training data. Information on the centroid position of each object relative to the entire image frame. The aspect ratios (long-to-short edge) of the objects within an image, normalized against the dimensions of the entire image.
3) Evaluation Metrics
In our experiments, we employ several metrics, including precision (P), recall (R), and mean average precision (mAP), to evaluate the performance of the model in remote sensing object detection tasks. Precision refers to the proportion of actual positive samples among all predicted positive samples, focusing on the accuracy of the results. Recall represents the proportion of actual positive samples that are correctly predicted as positive, calculated as the ratio of true positives to the sum of true positives and false negatives. These metrics are computed as follows:
\begin{align*}
P&=\frac{\text{TP}}{\text{TP}+\text{FP}} \tag{31}
\\
R&=\frac{\text{TP}}{\text{TP}+\text{FN}} \tag{32}
\end{align*}
\begin{align*}
\text{AP}&=\int _{0}^{1}P(R)dR \tag{33}
\\
\text{mAP}&=\frac{1}{N}\sum _{i=1}^{N}\int _{0}^{1}P_{i}(R_{i})dR_{i} \tag{34}
\end{align*}
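For reference, a standard all-point interpolation of the precision-recall curve yields the AP and mAP of (33) and (34). The exact interpolation used in the authors' evaluation code is not stated, so the NumPy sketch below is illustrative only.

```python
import numpy as np


def average_precision(recall, precision):
    """All-point interpolation of the PR curve, AP = ∫ P(R) dR as in (33).
    Assumes recall is sorted in ascending order."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # monotonically decreasing precision envelope
    idx = np.where(r[1:] != r[:-1])[0]                # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


def mean_average_precision(per_class_pr):
    """mAP = mean of per-class APs as in (34); per_class_pr maps class -> (recall, precision) arrays."""
    return float(np.mean([average_precision(r, p) for r, p in per_class_pr.values()]))
```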
B. Implementation Details
All experiments are conducted on an NVIDIA GeForce RTX 2080 Ti GPU with 20 GB of memory, running Ubuntu 18.04 with CUDA 12.4. We select an improved YOLOv5s as our baseline framework, setting the learning rate to 0.01, the batch size to 2, and the weight decay to 0.0005 for gradient descent. The training process spans 300 epochs. During training, we apply various data augmentation strategies, including color space variations, color saturation enhancement, translation, scaling, horizontal flipping, and mosaic augmentation, to enrich the feature learning of the model and enhance its generalization ability. To ensure category balance, the category with fewer than 50 instances (airplane) is excluded from our experiments.
C. Ablation Studies
1) Ablation on Each Component
In our ablation studies on the VEDAI dataset, we incorporate the MHI, SADFP, and GCE into the baseline separately to validate the effectiveness of each proposed module; the results are given in Table I. The accuracy improves with the addition of each module: the GCE has the most significant effect, improving the baseline by 6.57%, whereas the MHI and the SADFP yield improvements of 0.63% and 1.74%, respectively. More impressively, integrating MHI and SADFP simultaneously into the baseline brings a 5.67% improvement, indicating that SADFP effectively enhances the quality of the fused multimodal features. To fully validate the effectiveness of our modules, we also conduct ablation experiments on the DroneVehicle dataset, with the results in Table II. Compared with its gain on the VEDAI dataset (0.63%), the MHI achieves a higher accuracy enhancement (5.25%) on the DroneVehicle dataset. It is worth noting that this AAV dataset contains a significant number of images with inadequate lighting, which are difficult to detect accurately from optical images alone. This demonstrates that multimodal methods are more effective in handling such specific challenges (insufficient lighting) and enable the detector to be more robust under complex or adverse conditions.
Specifically, the MHI module effectively integrates complementary information from multiple modalities, helping the model better capture and exploit correlations between modalities and thereby enhancing detection accuracy. SADFP guides the backbone to learn finer grained information from the inputs through image reconstruction and the L1 loss, reinforcing the fused features. This allows the model to identify objects accurately based on more discriminative features. The GCE module significantly boosts detection accuracy on the VEDAI dataset because most objects in this dataset are small, and GCE is capable of modeling the spatial relationships between objects and their context. This enables the model to leverage the environment surrounding an object, or its relationships with other more easily detected objects, to assist small object detection, addressing the limited discriminative features inherent to such objects.
2) Ablation of the MHI
We conduct research on the effectiveness of the spatial interaction orders
Comparison of the impact of different interaction order combinations on the detection performance of the VEDAI dataset in MHI, where
To more intuitively demonstrate the impact of the interaction order, we provide feature visualizations on the VEDAI dataset. As shown in Fig. 9, the first row contains optical images and the second row contains their corresponding infrared images. The third to sixth rows present the features obtained from the single-modal models and the fused features from the first- and second-order interactions, respectively. Compared with the blue areas, the highlighted red areas indicate higher confidence of containing objects. We observe that models trained on optical images are more sensitive to objects with prominent textures, whereas infrared features primarily focus on regions with geometric contours that approximate the object. As illustrated in the fifth row, after the first-order spatial and channel interaction of MHI, the confidence of fused features unrelated to objects decreases, whereas the confidence in object regions increases. In the sixth row, after the second-order interaction, the fused features capture most of the object features and focus further on the ground truth. This indicates that, with second-order interaction, MHI enables accurate object detection by balancing the advantageous features of both modalities, such as textural details and distinct geometric contours.
Feature visualization of unimodal object detection and multimodal object detection method under different interaction orders of the MHI. They are both based on the DCCCNet framework with brighter colors indicating higher confidence. These visualizations demonstrate the advantages of different modalities and the effectiveness of utilizing multimodal inputs to improve detection performance.
Fig. 10 illustrates the effectiveness of our module through the visualization of feature distributions. In Fig. 10(b) and (c), compared with optical object detection, the object feature distribution boundaries of the infrared detection model are not distinctly separated, indicating that the lack of texture details in infrared images makes object classification difficult. In Fig. 10(d), direct concatenation of optical and infrared images provides some improvement, yet the intraclass similarity remains inadequate; its wide distribution can cause a large amount of background information to be recognized as objects. Fig. 10(e)–(h) presents the first-order spatial and channel interaction, the first-order spatial and second-order channel interaction, the second-order spatial and first-order channel interaction, and the second-order spatial and channel interaction, respectively. As the spatial interaction order increases, the discrimination between categories becomes more apparent; similarly, as the channel interaction order increases, the feature distributions of objects in the same category become more compact. This demonstrates that second-order spatial and channel interactions significantly enhance the interclass discriminability and intraclass similarity of the object feature distribution.
Feature distribution visualization comparison on VEDAI. Different colors represent different categories, and the corresponding 3D density plots are shown at the bottom. (a)–(h) represent the feature distributions of objects in different detection models: (a) the original feature distribution, (b) optical, (c) infrared, (d) direct concatenation of optical and infrared, (e) first-order spatial and channel interaction, (f) first-order spatial and second-order channel interaction, (g) second-order spatial and first-order channel interaction, and (h) second-order spatial and channel interaction. The visualizations demonstrate the differences between optical and infrared images in terms of object feature consistency and refined boundaries. The MHI not only further enhances feature consistency but also contributes to the precise delineation of object edge contours.
D. Evaluation on VEDAI Dataset
To evaluate the performance of our proposed method, we conduct experiments on the VEDAI dataset, comparing our DCCCNet with 12 previous state-of-the-art object detection approaches, encompassing classic single-modal algorithms, such as YOLOv3 [36], YOLOv4 [37], YOLOv5s [38], YOLOv6s [39], YOLOv7 [40], and YOLOv8s [41], along with YOLO-based frameworks tailored for high-density occluded objects in remote sensing images (TPH-YOLO [58]) and small objects (FFCA-YOLO [59]), and multimodal object detection algorithms (SuperYOLO [28], YOLOFIV [29], and YOLOFusion [60]). Table III presents the detection accuracy of these algorithms. DCCCNet achieves the highest detection accuracy (81.96%), improving the mAP by 7.16% over the best single visible-modality detection algorithm. The multimodal object detection algorithms outperform the single-modal methods on this dataset, with SuperYOLO at 75.09%, YOLOFusion at 76.0%, and YOLOFIV at 80.16%. This shows the benefit of multisensor synergy in providing abundant identification information and enhancing detection robustness. Our method improves the mAP by 1.8% over the best state-of-the-art multimodal object detection algorithm, exhibiting the highest performance among these methods. This is attributed to the fact that DCCCNet can effectively extract salient information from different modalities and facilitates the learning of cross-modal complementary information through high-order interactions.
E. Evaluation on DroneVehicle Dataset
To evaluate the robustness of DCCCNet in low-light conditions, we conduct experiments on the DroneVehicle dataset, which comprises a substantial number of optical images captured in dim or even unlit environments. The proposed DCCCNet is compared with several detection algorithms, including nine unimodal methods: RetinaNet [61], Faster R-CNN [35], S2ANet [21], Mask R-CNN [62], HTC [63], ReDet [64], RoITransformer [65], Hybrid Task Cascade [66], and Gliding Vertex [67], and three multimodal methods: UA-CMDet [57], GLFNet [68], and TarDAL [46]. Table IV presents a comparison of detection accuracy on the DroneVehicle dataset, showing that our method achieves the best detection performance. Compared with the S2ANet algorithm, our method achieves a 16.6% higher mAP (78.8% versus 62.2%). Compared with the TarDAL algorithm, DCCCNet exhibits a 6.2% higher mAP (78.8% versus 72.6%), demonstrating the best detection performance among the multimodal detection networks.
F. Qualitative Analysis
Figs. 11 and 12 present several examples to qualitatively compare the detection results of DCCCNet with ground-truth annotations under different complex scenes. Rows (a) and (b) of each figure display optical and infrared image pairs with ground-truth annotations, whereas rows (c)–(e) show the detection results on optical images, infrared images, and multimodal inputs, respectively. Fig. 11 shows detection results on the VEDAI dataset, where our DCCCNet rarely exhibits false detections (marked by red triangles) or misses (marked by blue triangles) on multimodal images. The car in the seventh column is missed in the optical-only detection due to complex textures, and in the eighth column, severe background interference leads to misclassification. With multimodal input, both objects are correctly detected. Overall, in the third and fourth rows of Fig. 11, our method fails to achieve satisfactory performance with a single image modality; however, through multimodal fusion, objects that are missed or incorrectly predicted by the single-modal models are detected accurately. Furthermore, Fig. 12 highlights the detection capability of DCCCNet under varying lighting conditions. For example, in the first column, two false detections arise in the nighttime infrared image due to the lack of rich texture information. In the second column, poor lighting in the optical image distorts the structural contours of the objects, resulting in four missed detections. In such cases, DCCCNet detects these objects accurately with the effective guidance of the multimodal high-order fusion features.
Some detection results on the VEDAI dataset. (a) and (b) First and second rows refer to optical and infrared real labeled images. Proposed method's (c) optical detection result images, (d) infrared detection result images, and (e) multimodal detection result images. The area where the red or blue triangle sign is located represents false detection and missed detection, respectively.
Some detection results on the DroneVehicle dataset. (a) and (b) First and second rows refer to optical and infrared real labeled images. Proposed method's (c) optical detection result images, (d) infrared detection result images, and (e) multimodal detection result images. The area where the red or blue triangle sign is located represents false detection and missed detection, respectively.
Conclusion
In this article, we propose a multimodal object detection network for optical and infrared images called DCCCNet. First, an MHI module is proposed to achieve high-order interactions between optical and infrared images, integrating complementary information to provide more discriminative features for detection. Second, we design an SADFP module to learn super-resolution features of objects, preserving more fine-grained object information in deep networks. In addition, we introduce a GCE module, which establishes semantic correlations between objects and their backgrounds through the interplay of local spatial characteristics with comprehensive global information. This aids the model in distinguishing contextual differences between objects and backgrounds. Evaluations of DCCCNet on two public datasets demonstrate superior performance.