Introduction
Due to the swift advancement of aerial remote sensing technology [1], [2], optical remote sensing imagery has been extensively utilized for feature classification and urban planning. Optical images enable accurate object detection and recognition due to their high resolution and rich detail information [3], [4], [5]. However, under poor lighting conditions such as bad weather and nighttime scenarios, the imaging quality and detection performance provided by optical images can be significantly degraded, thus limiting their applicability.
In contrast, infrared images can capture thermal radiation information from objects, remaining unaffected by lighting conditions. As a result, they hold significant application value in scenarios such as military reconnaissance and disaster rescue [6], [7], [8], [9], [10]. Due to the availability of numerous optical remote sensing datasets [11], [12], [13], current object detection research primarily focuses on the visible-light domain. However, research on infrared target detection remains limited, constrained by the difficulty of acquiring high-value infrared target data and the high cost of annotation.
In recent years, deep learning-based algorithms, backed by formidable computational power and intricate network architectures, have achieved substantial advances in object detection accuracy [14], [15], [16], [17]. These data-driven detectors [18], [19], [20] are developed within supervised end-to-end frameworks and therefore rely heavily on extensive labeled datasets. To alleviate the need for labeled data, researchers have explored few-shot learning techniques [21], [22], [23], [24], [25] for airborne remote sensing object detection and have achieved considerable success.
However, the spectral properties of visible and infrared images exhibit significant differences, leading to inconsistent feature representations for the same scene in different domains. As a result, it is difficult for traditional deep learning-based detection models to effectively learn visible image features that correspond to infrared images.
A prevalent approach is to train a model on the source domain and then fine-tune it on the target domain to enhance its generalization performance there. However, this approach presupposes that manual annotations are available for the target-domain images. Unsupervised domain adaptation (UDA) seeks to address this challenge: cross-domain object detection is achieved by using labeled source-domain data (visible images) and unlabeled target-domain data (infrared images). In the remote sensing field, most UDA work focuses on semantic segmentation [26], [27], [28], [29], [30], [31] and image classification [32], [33], [34], mitigating discrepancies by aligning the semantic features of the source and target domains. However, few studies address domain-adaptive object detection for remote sensing data.
In recent years, domain-adaptive object detection has attracted increasing academic attention. Many advanced detectors [17], [18], [19], [20] have been proposed for computer vision and can be applied to remote sensing tasks. However, two challenges remain in adapting detectors trained on visible imagery to the infrared remote sensing domain.
Low detection accuracy for small targets in infrared remote sensing imagery: Small targets in infrared images occupy few pixels and provide weak feature representations, making it difficult for a model to capture their detailed information. In practical applications, large amounts of labeled visible remote sensing data and unlabeled infrared remote sensing data exist; however, annotating infrared datasets is difficult, and the available labeled volume is relatively small. Models trained only on labeled visible data struggle to adapt to infrared imagery.
Significant distributional differences between the source and target domains: A detection model trained for a specific environment often degrades when transferred to a target scene because the sample distribution changes. This distribution difference, known as domain bias, reduces detection accuracy. Mapping data from different scenarios into a unified feature space and analyzing their distributional properties reveals the differences between datasets from different domains. Fig. 1 shows the significant variability between the data distributions, revealing the gaps between pairs of domains. This visual result not only confirms the domain bias problem but also provides an intuitive basis for strategies that reduce interdomain differences and improve the generalization performance of the model.
Comparison between the distributions of data derived from different domains. (a) Comparison between the distributions of the VEDAI visible dataset and the VEDAI infrared dataset. (b) Comparison between the distributions of the DroneVehicle visible dataset and the DroneVehicle infrared dataset. The distributions are calculated and displayed via 2-D t-SNE [35]. Significant distributional differences are observed between the datasets, which makes it difficult to obtain satisfactory results from the detection model.
Currently, many studies address cross-domain object detection with UDA methods, which primarily include feature-level alignment, image-level alignment, and pseudo-label-based self-training. However, these methods still exhibit the following limitations when transferring from visible-light to infrared images: aligning only the feature distributions between the source and target domains neglects differences in visual style, making it difficult to completely bridge the domain gap, and pseudo-label-based training strategies are susceptible to low-quality labels, leading to unstable detection results.
To address the above-mentioned problems, we propose the two-phase distillation training framework (TPDTNet), a novel unsupervised domain-adaptive framework for object detection in optical-to-infrared remote sensing imagery. Unlike methods that align the source and target domains directly through loss computation, we build a deep student–teacher framework on the basis of the advanced YOLOv8 detection model. In addition, domain alignment is achieved through an enhanced-domain construction technique and a two-phase teacher network training process.
Owing to substantial domain disparities, models trained in the source domain struggle to achieve optimal performance in the target domain. Therefore, we first train an unpaired image-to-image generator for initial domain generalization. To simplify the domain alignment task, we establish an enhanced domain, which includes both the generated domain and the source domain. Next, we input the enhanced domain into the first-level teacher network for training to generate pseudolabels for the target domain. To eliminate the impact of the source domain on the detection model, a second training session is performed with the teacher network. This enhances the consistency of the target domain and improves the robustness of the detection network. Finally, a channelwise knowledge distillation strategy is utilized to transfer the training weights, further improving the accuracy of the detection network.
In summary, the main contributions of this study are as follows.
We propose a novel TPDTNet framework for unsupervised domain-adaptive object detection tasks involving visible-to-infrared remote sensing images.
An image-level domain alignment method based on the source and generated domains is proposed. This approach narrows the interdomain gap by converting visible images into fake infrared images. Mixing the source domain and the generated domain to construct an enhanced domain provides the model with more comprehensive representation information and improves the domain adaptation ability of the detector.
We design a multidimensional progressive fusion detection (MPFD) framework to exploit the relationships between nonadjacent layers, improving the feature extraction ability of the model through progressive cross fusion and achieving deep semantic information capture for small targets.
We design a knowledge distillation training strategy on the basis of the teacher–student structure. Pseudodetection labels are used to calculate the target loss, and distillation learning is implemented by softly aligning the corresponding channel activation functions.
The rest of this article is organized as follows. Section II offers a succinct overview of the relevant literature. The main content of Section III includes a detailed description of the proposed method. The experimental results are presented in Section IV. Section V presents the conclusion of the article.
Related Works
This section succinctly discusses the relevant approaches from four perspectives: object detection, UDA, UDA object detection, and motivations.
A. Object Detection
The swift advancement of deep learning in computer vision has led to exceptional performance in object identification tasks, which are extensively utilized on intelligent platforms (e.g., drones, unmanned vehicles, and robots). Currently, the available object detection methods are categorized into convolutional neural network (CNN) based methods and transformer-based methods [36].
CNN-based target detectors, such as the You Only Look Once (YOLO) family [15], [17], [37] and the region-based CNN (R-CNN) family [14], [16], have made impressive achievements. Deep convolutional networks use “end-to-end” feature learning to automatically learn global features from many training sets, thereby producing accurate object detection results. CNN-based object detection methods can be classified into two types. The first includes two-stage object detection methods, which obtain the target candidate region first and then pinpoint the target location. The typical algorithms of this type are faster R-CNN [14] and cascade R-CNN [38]. The other approaches are one-stage object detection methods, which directly regress the location and category of the target frame; examples include YOLOv8 [17], Retina-Net [39], and Center-Net [40].
Transformer-based target detectors perform comparably to or even better than CNNs in object detection cases, especially in tasks requiring global contextual information. The vision transformer [41] was the first approach to migrate the original transformer to an image classification task, and the DETR [36] algorithm applied a generalized transformer backbone network to an object detection task with good results.
Many advanced detection models are available in the field of infrared target detection, e.g., UIU-Net [70], the ORSIm detector [71], and PEDNet [72]. UIU-Net embeds a tiny U-Net in a large U-Net backbone network to achieve multilevel representation learning for infrared small targets. The ORSIm detector effectively solves the image rotation and scaling problems in object detection scenarios by jointly considering rotation-invariant channel features and spatial channel features. PEDNet fully integrates the feature information of prominent regions, rotating targets, and strong semantic information through a multiscale feature-based cross-fusion structure. This method yields improved detection accuracy while performing detection in real time. A subpixel-level decoupled and coupled framework was proposed for image-level and feature-level fusion [73].
The above-mentioned methods have demonstrated good detection performance in their respective fields. However, the background temperature changes occurring in infrared remote sensing images may lead to an increase in background noise. Complex scenes may have similar temperature characteristics to those of the target, which can interfere with the judgment of the utilized detection algorithm.
Currently, the most commonly used domain-adaptive detector is still Faster R-CNN. However, even with a state-of-the-art ResNet101 [42] backbone, Faster R-CNN still exhibits significant accuracy and real-time performance gaps relative to YOLOv8.
YOLOv8 adopts a modularized design and performs better in small target and complex scenarios through an improved network structure and a better training method. We choose YOLOv8 as the base detection model to construct a domain-adaptive training framework for two-phase distillation learning to improve its detection performance in cross-domain scenarios.
B. Unsupervised Domain Adaptation
UDA seeks to leverage annotated data from a source domain to develop a model that generalizes effectively to an unlabeled target domain while preserving its high performance. The fundamental tactics of this method encompass feature-level alignment techniques and image-level alignment techniques.
Feature alignment approaches strive to make the feature distributions in the source and target domains as consistent as possible via techniques such as adversarial training and metric learning. In a domain-adaptive classification task, the maximum mean discrepancy [43] and correlation alignment [44] accomplish domain adaptation by reducing the distributional divergence between the source and target domains within a high-dimensional feature space. The domain-adversarial neural network [45] employs an adversarial training mechanism to achieve domain alignment via a domain classifier that makes the feature distributions of the source and target domains indistinguishable in the feature space.
Image-level alignment methods focus on aligning global characteristics and emphasize the alignment of target objects in the source and target domains. A source-domain image is converted into the style of a target-domain image to reduce the distribution difference between the two domains, which can be achieved by a generative adversarial network [46]. Mean Teacher [47] guides its model to produce stable and consistent predictions over the target domain through consistent regularization and teacher networks. Our work achieves dual image-level and feature-level alignment through a two-phase framework, which further improves the domain adaptive detection ability of the model.
C. Unsupervised Domain Adaptive Object Detection
Although UDA has achieved strong results in classification and segmentation tasks, UDA for object detection remains comparatively underexplored. Chen et al. [48] pioneered DA Faster R-CNN for effective domain adaptation in object detection tasks. This approach employs a dual adversarial training strategy that achieves domain alignment at both the image and instance levels. The existing unsupervised domain-adaptive object detection methods can be broadly categorized as follows.
1) Domain Invariant Feature Learning
These methods perform adversarial training on their detection models to align the feature distributions of the two domains through an adversarial loss. Ganin et al. [49] used a feature learning method based on gradient reversal layers for adversarial training to generate domain-invariant features. SWDA [50], Every Pixel Matters [51], PDA [52], MeGA-CDA [53], and RFA-Net [54] are all adversarial domain-invariant feature learning methods that follow DA Faster R-CNN.
2) Image-to-Image Translation
These strategies aim to train an unpaired image translator that transforms the given source-domain data into target-domain-style data, visually reducing the distributional bias. Compared with feature alignment methods, image-to-image translation-based methods can reduce the domain gap at the input level. The current leading image translation methodologies include the cyclical generative adversarial network [55] and CUT [56].
3) Pseudo-Label-Based Self-Training
This approach employs a self-training strategy that leverages the predictions produced by a model on the target domain as pseudolabels to enhance the performance achieved in that domain. Khodabandeh et al. [57] suggested a pseudolabel denoising strategy for increasing the generalizability of a model throughout the target domain. Kim et al. [58] proposed a pseudolabeling approach for single-stage detection models, which led to better results.
4) Mean Teacher Training
The Mean Teacher training method is guided by consistency regularization and teacher networks so that the model produces stable and consistent predictions over the target domain. The teacher model produces stable pseudolabels, and its parameters are updated as an exponential moving average (EMA) of the student model's parameters. The student model is then trained on the target domain with these pseudolabels to adapt to its data distribution. UMT [59], HT [60], SCL [61], and AT [62] all utilize mean teacher cotraining to increase their object detection efficacy over the target domain. However, the above methods have not effectively solved the domain bias problem, and their effectiveness has not yet been proven for remote sensing image detection. Therefore, we develop a two-phase distillation learning-based training framework. It achieves image-level domain adaptation through generative modeling and adds dual-domain teacher pseudolabel learning to the distillation process to achieve feature-level domain alignment, which improves the ability of the network to extract features from the target domain and improves the resulting detection accuracy.
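For concreteness, a minimal PyTorch sketch of the EMA teacher update used in Mean Teacher-style methods is given below; the decay value and the direct copying of buffers are illustrative assumptions rather than the settings of any particular cited method.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999) -> None:
    """Exponential-moving-average update: teacher <- decay * teacher + (1 - decay) * student.

    The decay value is illustrative; Mean Teacher-style methods typically use 0.99-0.9999.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param.detach(), alpha=1.0 - decay)
    # Buffers (e.g., BatchNorm running statistics) are usually copied directly from the student.
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)
```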
D. Motivation
Unlike single-modal UDA, visible-to-infrared UDA faces multiple challenges. We need to simultaneously address the problems related to data distribution differences, the scarcity of target-domain labeling data, and the low detection accuracy attained for small targets derived from infrared remote sensing images. A comparison between single-modal UDA and cross-modal UDA is shown in Fig. 2.
With the above-mentioned motivations, this article proposes TPDTNet to solve visible-to-infrared unsupervised domain-adaptive object detection tasks. In the first phase, we migrate visible images to the infrared domain via a generative network, which achieves image-level domain alignment and initially reduces the interdomain gap. Second, an MPFD module is embedded in the detection network to enhance the interlayer feature fusion process, and the lack of IR-domain labels is addressed through teacher network-generated pseudolabels. Finally, channel distillation is integrated into the teacher–student network to achieve feature-level domain alignment and enhance the performance of the model in the IR domain. The proposed method attains better results than the existing mainstream methods on both the remote sensing dataset VEDAI and the generalization datasets DroneVehicle and LLVIP. Extensive ablation experiments further validate the effectiveness of the various modules used in the proposed method.
Method
The comprehensive framework of the proposed TPDTNet model is illustrated in Fig. 3. Our TPDTNet approach consists of two phases: image-level domain alignment and domain-adaptive distillation training. In this work, the source domain is the visible domain, and the target domain is the infrared domain.
In the first phase, the source-domain images are transformed into target-domain images via a domain generation model. Consequently, the images possess identical visual contents but belong to distinct domains. The transformed fake target domain is called the generative domain. We mix the source-domain and generated-domain data to construct an enhanced domain. The specific process is shown in the blue dashed box in Fig. 3.
Overall network framework of TPDTNet. As shown in the blue dashed box, in the first phase the source-domain images are converted into target-domain images through the domain generation network, and these images are combined to form the enhanced domain. In the second phase, the initial teacher network is trained on the enhanced domain to generate pseudolabels, and the student network is subsequently trained via knowledge distillation, as shown in the yellow dashed box.
A. Image-Level Domain Alignment
An image-level domain alignment method is designed to compensate for the difference between the source domain and the target domain. We employ a style transfer network to transform the source domain into the target domain. The source domain is denoted as
It is evident that the enhanced domain consists of data from both the source domain and the generated domain. Our domain generator combines adversarial learning with contrastive learning. Through the adversarial interaction between the generator and the discriminator, the distribution of the generated domain is brought closer to that of the target domain. Simultaneously, contrastive learning narrows the feature representation gap between the generated domain and the target domain in the feature space while distinguishing the generated domain from other unrelated domains (e.g., the source domain). This process effectively achieves image-level domain alignment.
The domain generator is crucial for image-level domain alignment. Inspired by the work of Park et al. [56], we design a domain generation network based on a contrastive learning framework that learns a unidirectional modal transformation and mapping relationship. Our generator maximizes the mutual information between the source and target domains, which reduces the required training time to a certain extent. Fig. 4 illustrates the architecture of our domain generation network.
Domain generation network structure. A generated image block should be closer to the corresponding input image block than to the other negative-sample image blocks.
Specifically, images are first obtained from the source and target domains, from which content and domain information are subsequently extracted. Each image is 512 × 512 pixels with three input channels. The source-domain image is fed into the generator, the content information of the visible image is fused with the style information of the infrared image to generate a new feature map, and this map is remapped to obtain an infrared-style source-domain image.
The generator model uses a simple encoding and decoding architecture, which is separated into two parts: an encoder
\begin{equation*} \alpha _i^G = \varphi (z) = {{\varphi }_{\text{dec}}}({{\varphi }_{\text{enc}}}(\alpha _i^S)) \tag{1} \end{equation*}
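As an illustration of (1), a minimal encoder–decoder generator in PyTorch is sketched below; the layer depths, widths, and activations are assumptions for exposition and do not reflect the exact architecture of our generator.

```python
import torch
import torch.nn as nn

class DomainGenerator(nn.Module):
    """Minimal encoder-decoder matching Eq. (1); depths and widths are illustrative assumptions."""
    def __init__(self, channels: int = 3, width: int = 64):
        super().__init__()
        self.enc = nn.Sequential(                       # phi_enc: 512x512x3 -> latent features
            nn.Conv2d(channels, width, 7, stride=1, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width * 2, width * 4, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.dec = nn.Sequential(                       # phi_dec: latent -> infrared-style image
            nn.ConvTranspose2d(width * 4, width * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 7, stride=1, padding=3), nn.Tanh(),
        )

    def forward(self, x_source: torch.Tensor) -> torch.Tensor:
        return self.dec(self.enc(x_source))             # alpha^G = phi_dec(phi_enc(alpha^S))
```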
B. Detection Network
Based on the enhanced domain, we pretrain the detection model by feeding the data into the teacher network. The YOLOv8 model is widely used in remote sensing object detection and achieves excellent performance. However, this model cannot solve the problem of cross-modal domain adaptation detection for multimodal remote sensing data. To resolve this issue, we present an object detection framework, as shown in Fig. 5.
We design a multilayer cross-feature fusion detection network, embedding space-to-depth convolution (SPD-Conv) into the backbone network to extract image texture details. An MPFD framework is built to obtain high-level semantic information from the input images. The network consists of three basic parts: backbone (feature extraction), neck (feature fusion), and head (object detection).
To improve the feature extraction capability of the backbone of YOLOv8, SPD-Conv [63] is introduced in layers P2 to P5. Specifically, SPD-Conv consists of an SPD layer and a nonstrided convolutional layer (stride = 1) in series. The input feature maps are transformed through the SPD layer, and then the output results are subjected to a convolution operation through a nonstrided Conv layer. This operation can significantly reduce the spatial dimensionality of the feature maps while maintaining the channel information. Specifically, each pixel of the input feature map is mapped to a channel, in which the spatial dimensionality is reduced while the channel dimensionality is increased. The associated process is shown in Fig. 6.
The size of the input feature map
\begin{equation*} F = \left\{ \begin{array}{cccc} f_{0,0} & f_{1,0} & \cdots & f_{\text{scale}-1,0}\\ f_{0,1} & f_{1,1} & \cdots & f_{\text{scale}-1,1}\\ \vdots & \vdots & \ddots & \vdots \\ f_{0,\text{scale}-1} & f_{1,\text{scale}-1} & \cdots & f_{\text{scale}-1,\text{scale}-1} \end{array} \right\} \tag{2} \end{equation*}
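A minimal PyTorch sketch of the SPD-Conv operation described above follows; torch's pixel_unshuffle performs the space-to-depth rearrangement of (2), and the output channel count is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution (sketch of SPD-Conv [63])."""
    def __init__(self, in_channels: int, out_channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # After space-to-depth the channel dimension grows by scale**2.
        self.conv = nn.Conv2d(in_channels * scale ** 2, out_channels,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pixel_unshuffle maps (B, C, H, W) -> (B, C*scale^2, H/scale, W/scale):
        # each scale x scale spatial block becomes extra channels, as in Eq. (2).
        x = F.pixel_unshuffle(x, self.scale)
        return self.conv(x)

# Example: a 2x space-to-depth halves the spatial size and quadruples the channels.
y = SPDConv(64, 128)(torch.randn(1, 64, 256, 256))   # -> (1, 128, 128, 128)
```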
To enhance the feature extraction ability of the model for tiny object regions and capture deeper semantic information, we design an MPFD framework. In the feature extraction stage within the backbone network, different levels of features are formed from low to high. Unlike traditional feature pyramid fusion [64] frameworks, we utilize the relationships between nonadjacent layers and improve the feature extraction ability of the model through progressive cross fusion, as shown in the neck section of Fig. 5. We employ 1 × 1 convolution and bilinear interpolation techniques to upsample the features, thereby achieving dimensional alignment before conducting feature fusion. In addition, we perform downsampling with different convolution kernels and step sizes. During multilevel feature fusion, we utilize ASFF [65] to assign various spatial weights to the characteristics at different levels, attenuating the mutually exclusive information while increasing the relevance of the critical levels. Let
\begin{equation*} y_{ij}^m = \alpha _{ij}^m \cdot x_{ij}^{a \to m} + \beta _{ij}^m \cdot x_{ij}^{b \to m} + \gamma _{ij}^m \cdot x_{ij}^{c \to m} \tag{3} \end{equation*}
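A simplified sketch of the adaptively weighted fusion in (3) is given below; the 1 × 1 convolutions that produce the per-pixel weights, and the assumption that the three inputs have already been resampled to level m, are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusion(nn.Module):
    """Adaptively weighted fusion of three resized feature maps (sketch of Eq. (3))."""
    def __init__(self, channels: int):
        super().__init__()
        # One 1x1 conv per input level predicts a per-pixel fusion logit.
        self.weight_a = nn.Conv2d(channels, 1, kernel_size=1)
        self.weight_b = nn.Conv2d(channels, 1, kernel_size=1)
        self.weight_c = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x_a, x_b, x_c):
        logits = torch.cat([self.weight_a(x_a),
                            self.weight_b(x_b),
                            self.weight_c(x_c)], dim=1)              # (B, 3, H, W)
        alpha, beta, gamma = F.softmax(logits, dim=1).chunk(3, dim=1)
        # Spatially adaptive weighted sum: alpha + beta + gamma = 1 at every pixel.
        return alpha * x_a + beta * x_b + gamma * x_c
```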
Finally, we propose detection heads for weak, small, medium, and large feature maps, which results in our proposed model having a broader detection range. As shown in Fig. 5, the decoupled detection head we use is divided into two branches, classification and regression, which improves the feature extraction in multicategory training.
C. Teacher–Student Distillation Learning
To better transfer the knowledge in each channel of the teacher network, we implement soft alignment for the activations of the corresponding channels. First, we normalize the feature map in each channel to obtain a softened probability map. We then measure the differences via the Kullback–Leibler (KL) divergence or other probability distance measures. Finally, we convert the channel activations into probability distributions and minimize the KL divergence between the channel probability maps of the two networks, so that the distillation process emphasizes the most prominent portions of each channel.
As shown in Fig. 7, the significance of the scene categories corresponding to the activation areas of different channels varies. Therefore, we employ a new channel distillation paradigm to guide the student networks in learning knowledge from a well-trained teacher network.
D. Two-Phase Training Strategy
This section elucidates the relationship between the two phases, image-level domain alignment and domain-adaptive distillation training, which are performed sequentially. The specific training algorithm of the proposed TPDTNet is shown in Algorithm 1. In the first phase, the source- and target-domain images are fed into the generative network to train the generative model so that each source-domain image is converted into a target-like domain image. Subsequently, we combine the source and generated domains to construct the enhanced domain. Theoretically, constructing an enhanced domain can compensate, to a certain extent, for the interdomain differences that occur during model training and achieves preliminary domain alignment, which provides a basis for training cross-domain detection models.
Algorithm 1: Two-Phase Training Strategy of TPDTNet.
source domain dataset
enhanced domain
[The first phase of training]
while
Input the source domain images
Merge
end while
while
Input the enhanced domain
Input
Utilize the channel knowledge distillation strategy and target domain data for model distillation training;
Calculate the loss between the prediction result and the ground truth;
Update the weight parameters of the pretrained detection network;
end while
In the second phase, we divide the model training into three parts.
First, we feed the enhanced domain data constructed in the first phase into the detection network in a mixed manner to obtain the weights of the teacher network. These teacher weights are then used to detect target-domain images and generate corresponding pseudolabels.
Second, to eliminate the influence of the source domain on the detection model, we utilize the target-domain images equipped with pseudolabels and the generated-domain images for performing secondary training on the teacher network. These two styles of images are utilized to constrain the consistency of the target domain and improve the stability of the detection network.
Finally, the channelwise knowledge distillation strategy is utilized to migrate the training weights from the teacher network to the student network, and the model distillation training is performed using only the target-domain data.
As a result, the student network has more robust target-domain detection performance, which improves the resulting detection accuracy. Knowledge distillation is used to migrate complex teacher networks to lightweight student networks, and our motivation for using channel distillation is to achieve improved network detection performance while reducing the computational complexity of the model.
E. Loss Function
This section describes the loss function used to train TPDTNet. In the first phase, we utilize a domain generation network to migrate source domains to target domains. In this phase, we use an adversarial loss and a contrast loss.
The adversarial loss drives the generator to deceive the discriminator about the authenticity of the generated image, thereby enhancing the visual similarity between the generated image and the corresponding target-domain images. The specific formula is defined as follows:
\begin{align*} {{L}_{\text{adv}}} \left(\varphi,\lambda,S,T \right) =& {{{\mathbb{E}}}_{\alpha _i^T \sim T}}\log \lambda \left(\alpha _i^T\right)\\ &{ + {{{\mathbb{E}}}_{\alpha _i^S \sim S}}} \log \left(1 - \lambda \left(\varphi \left(\alpha _i^S\right)\right)\right) \tag{4} \end{align*}
The generative model uses a noise contrastive estimation framework [66] to maximize the mutual information between the inputs and outputs. The goal of the model is to match the relevant input and output image blocks at a specific location, whereas the other image blocks contained in the input can be utilized as negative samples. We select
\begin{align*}
&{{L}_c}\left( \varphi, H, S \right) = {{\mathbb{E}}_{\alpha _i^S \sim S}}\sum\limits_{l = 1}^L \sum\limits_{s = 1}^{{{S}_l}} \ell \left( \hat{z}_l^s, z_l^s, z_l^{S\backslash s} \right) \tag{5}\\
&\ell \left(\hat{z}_l^s, z_l^s, z_l^{S\backslash s}\right) = - \log \left[ \frac{\exp \left(\hat{z}_l^s \cdot z_l^s / \tau \right)}{\exp \left(\hat{z}_l^s \cdot z_l^s / \tau \right) + \sum\nolimits_{n = 1}^N \exp \left(\hat{z}_l^s \cdot z_l^{S\backslash s} / \tau \right)} \right] \tag{6}
\end{align*}
The comprehensive loss function for the first phase is expressed as follows:
\begin{equation*}{{L}_{\mathrm{stage1}}} = {{L}_{\text{adv}}} + {{L}_c}\left( {\varphi,H,S} \right) + {{L}_c}\left( {\varphi,H,T} \right). \tag{7} \end{equation*}
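A compact sketch of the two first-phase terms is given below, assuming the discriminator outputs logits and that the patch features have already been produced by the encoder and the projection head H of (5); the temperature value follows the default of [56] and is an assumption here.

```python
import torch
import torch.nn.functional as F

def adversarial_loss_d(disc_real: torch.Tensor, disc_fake: torch.Tensor) -> torch.Tensor:
    """Discriminator objective of Eq. (4), written in the equivalent
    binary cross-entropy (to-be-minimized) form; raw logits are assumed."""
    real = F.binary_cross_entropy_with_logits(disc_real, torch.ones_like(disc_real))
    fake = F.binary_cross_entropy_with_logits(disc_fake, torch.zeros_like(disc_fake))
    return real + fake

def patch_nce_loss(query: torch.Tensor, positive: torch.Tensor,
                   negatives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Patch-wise InfoNCE of Eq. (6).

    query / positive: (N, D) features of generated and input patches at the same
    locations; negatives: (N, K, D) features of the other patches of the input.
    tau = 0.07 is an assumed temperature.
    """
    query, positive = F.normalize(query, dim=-1), F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    l_pos = (query * positive).sum(dim=-1, keepdim=True)        # (N, 1)
    l_neg = torch.einsum('nd,nkd->nk', query, negatives)        # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The positive patch corresponds to class index 0 in the cross-entropy of Eq. (6).
    target = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, target)
```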
The training losses in the second phase mainly include classification loss, regression loss, and distillation loss. The training process of the teacher network only requires classification and regression losses, whereas the training process of the student network requires all three above types of losses.
We choose the binary cross-entropy (BCE) loss as the classification loss of the detection network, and it is defined as follows:
\begin{equation*}{{L}_{\text{cls}}} = \frac{1}{N}\sum\limits_i { - \left[{{\mu }_i} \cdot \log \left({{p}_i}\right) + \left(1 - {{\mu }_i}\right) \cdot \log \left(1 - {{p}_i}\right)\right]} \tag{8} \end{equation*}
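Assuming the classification branch outputs raw logits, (8) corresponds to the standard binary cross-entropy, e.g.:

```python
import torch
import torch.nn.functional as F

def classification_loss(pred_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """BCE classification loss of Eq. (8); raw logits as inputs are an assumption."""
    return F.binary_cross_entropy_with_logits(pred_logits, targets, reduction="mean")
```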
In the regression task, the degree of regression of a frame can be measured by the ratio of the ground truth (GT) box
\begin{equation*} \text{IoU} = \frac{{\left| {{{B}^{\text{pre}}} \cap {{B}^{\text{gt}}}} \right|}}{{\left| {{{B}^{\text{pre}}} \cup {{B}^{\text{gt}}}} \right|}}. \tag{9} \end{equation*}
Our IoU loss formula is as follows:
\begin{align*}&{{L}_{\text{IoU}}} = 1 - \text{IoU} + \text{distance} + 0.5 \times \Omega \tag{10}\\
&\text{distance} = hh \times \frac{{{{{(x_c^{\text{pre}} - x_c^{\text{gt}})}}^2}}}{{{{c}^2}}} + ww \times \frac{{{{{(y_c^{\text{pre}} - y_c^{\text{gt}})}}^2}}}{{{{c}^2}}} \tag{11}\\
&ww = \frac{{2 \times {{{({{w}^{\text{gt}}})}}^{\text{scale}}}}}{{{{{({{w}^{\text{gt}}})}}^{\text{scale}}} + {{{({{h}^{\text{gt}}})}}^{\text{scale}}}}} \tag{12}\\
&hh = \frac{{2 \times {{{({{h}^{\text{gt}}})}}^{\text{scale}}}}}{{{{{({{w}^{\text{gt}}})}}^{\text{scale}}} + {{{({{h}^{\text{gt}}})}}^{\text{scale}}}}} \tag{13}\\
&\Omega = \sum\limits_{t = w,h} {{{{\left(1 - {{e}^{ - \omega t}}\right)}}^\theta },\theta = 4} \tag{14}\\
&\left\{ {\begin{array}{c} {{{\omega }_w} = hh \times \frac{{\left| {{{w}^{\text{pre}}} - {{w}^{\text{gt}}}} \right|}}{{\max ({{w}^{\text{pre}}},{{w}^{\text{gt}}})}}}\\ {{{\omega }_h} = ww \times \frac{{\left| {{{h}^{\text{pre}}} - {{h}^{\text{gt}}}} \right|}}{{\max ({{h}^{\text{pre}}},{{h}^{\text{gt}}})}}} \end{array}} \right. \tag{15} \end{align*}
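A PyTorch sketch of (9)–(15) follows; the interpretation of c² as the squared diagonal of the smallest enclosing box and the default scale value are assumptions.

```python
import torch

def iou_regression_loss(pred, gt, scale: float = 1.0, theta: float = 4.0, eps: float = 1e-7):
    """Sketch of the regression loss in Eqs. (10)-(15); boxes are (N, 4) in (x1, y1, x2, y2)."""
    # Intersection over union, Eq. (9).
    x1 = torch.max(pred[:, 0], gt[:, 0]); y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2]); y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # Centers and widths/heights of the predicted and GT boxes.
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxg, cyg = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]

    # Shape weights ww, hh of Eqs. (12)-(13).
    ww = 2 * wg.pow(scale) / (wg.pow(scale) + hg.pow(scale) + eps)
    hh = 2 * hg.pow(scale) / (wg.pow(scale) + hg.pow(scale) + eps)

    # Squared diagonal of the smallest enclosing box (assumed meaning of c^2).
    ex1 = torch.min(pred[:, 0], gt[:, 0]); ey1 = torch.min(pred[:, 1], gt[:, 1])
    ex2 = torch.max(pred[:, 2], gt[:, 2]); ey2 = torch.max(pred[:, 3], gt[:, 3])
    c2 = (ex2 - ex1).pow(2) + (ey2 - ey1).pow(2) + eps

    distance = hh * (cxp - cxg).pow(2) / c2 + ww * (cyp - cyg).pow(2) / c2    # Eq. (11)
    omega_w = hh * (wp - wg).abs() / torch.max(wp, wg)                        # Eq. (15)
    omega_h = ww * (hp - hg).abs() / torch.max(hp, hg)
    omega = (1 - torch.exp(-omega_w)).pow(theta) + (1 - torch.exp(-omega_h)).pow(theta)  # Eq. (14)

    return (1 - iou + distance + 0.5 * omega).mean()                          # Eq. (10)
```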
The loss of the knowledge distillation learning in the second phase is described as
\begin{equation*}{{L}_{kd}}\left(\phi \left({{y}^{T^{\prime}}}\right),\phi \left({{y}^{S^{\prime}}}\right)\right) = \psi \left(\phi \left({y}_{c^{\prime}}^{T^{\prime}}\right),\phi \left({y}_{c^{\prime}}^{S^{\prime}}\right)\right) \tag{16} \end{equation*}
\begin{equation*} \phi ({{y}_{c^{\prime}}}) = \frac{{\exp (\frac{{{{y}_{c^{\prime},i}}}}{{T}})}}{{\sum\nolimits_{i = 1}^{W \cdot H} {\exp (\frac{{{{y}_{c^{\prime},i}}}}{{T}})} }} \tag{17} \end{equation*}
For the knowledge distillation loss in (16),
\begin{equation*} \psi \left({{y}^{T^{\prime}}},{{y}^{S^{\prime}}}\right) = \frac{{{{{T}}^2}}}{{C^{\prime}}}\sum\limits_{c^{\prime} = 1}^{C^{\prime}} {\sum\limits_{i = 1}^{W \cdot H} {\phi \left({y}_{c^{\prime},i}^{T^{\prime}}\right) \cdot \log \left[\frac{{\phi ({y}_{c^{\prime},i}^{T^{\prime}})}}{{\phi ({y}_{c^{\prime},i}^{S^{\prime}})}}\right]} } . \tag{18} \end{equation*}
Where the channel probability of the teacher network is large, the corresponding distribution of the student network should be as close as possible to it in order to minimize the KL divergence.
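A minimal sketch of the channelwise distillation loss in (16)–(18) follows; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def channel_wise_kd_loss(y_teacher: torch.Tensor, y_student: torch.Tensor,
                         temperature: float = 4.0) -> torch.Tensor:
    """Channel-wise distillation of Eqs. (16)-(18) for (B, C, H, W) feature maps."""
    b, c, h, w = y_teacher.shape
    # Eq. (17): softmax over the W*H spatial positions of each channel.
    p_t = F.softmax(y_teacher.view(b, c, -1) / temperature, dim=-1)
    log_p_s = F.log_softmax(y_student.view(b, c, -1) / temperature, dim=-1)
    # Eq. (18): KL divergence per channel, averaged over channels and batch, scaled by T^2.
    kl = (p_t * (p_t.clamp_min(1e-12).log() - log_p_s)).sum(dim=-1)   # (B, C)
    return (temperature ** 2) * kl.mean()
```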
Therefore, the total training loss for the second-phase distillation of the detection network is as follows:
\begin{equation*}{{L}_{\mathrm{stage2}}} = {{L}_{\text{cls}}} + {{L}_{\text{IoU}}} + {{L}_{kd}}. \tag{19} \end{equation*}
Experimental Results
In this section, we describe the dataset, the experimental details, and the evaluation metrics, and we discuss the experimental outcomes of our model with a thorough analysis. All the following experiments are carried out on a server with an NVIDIA RTX 4090 GPU (24 GB of memory) running Windows 11, with CUDA 11.6, PyTorch 1.12, and other widely used deep learning and image processing libraries installed; the development environment is PyCharm 2022.
A. Dataset
The VEDAI [68] dataset contains 1246 high-resolution RGB and infrared images, including 3640 objects distributed across 8 common categories. The images have a resolution of 1024 × 1024 pixels and cover different environments and terrains. The vehicles in each image were captured under different weather and lighting conditions, providing variety and challenge. This dataset encompasses complex backgrounds, such as urban, rural, mountainous, and industrial areas, where objects frequently experience occlusion and interference from the background. In most cases, the targeted vehicles occupy only a small fraction of the total pixels within an image, posing significant challenges for small object detection. Visible-light images provide clear details and well-defined edges. Although infrared images maintain the same spatial resolution as visible-light images, they exhibit fewer texture details and primarily rely on temperature differences between the objects and their surroundings to delineate contours. Consequently, variations in background temperature may lead to the emergence of spurious targets. We set the ratio of the training, validation, and test sets to 7:1.5:1.5. In our experiments, we employ infrared images as the target domain and RGB images as the source domain. Some typical remote sensing images and target object sizes contained in the dataset are shown in Fig. 8.
B. Implementation Details
In the first phase, we train a generator that converts images from the optical to the infrared domain. To construct the enhanced domain, we first train the domain generator. Specifically, we construct generated-domain images by using optical remote sensing images as the source-domain data and infrared remote sensing images as the target-domain data. The training objective is to minimize the adversarial loss and the contrastive loss and thereby achieve precise transfer from the optical images to the infrared target domain. The total number of training epochs is 200, with 50 epochs designated for linearly decaying the learning rate to zero. The training process is optimized via the adaptive moment estimation (Adam) optimizer with momentum parameters of 0.5 and 0.999. The initial learning rate of the network is set to 0.0002.
In the second phase, we train the detection network via the stochastic gradient descent optimizer with some hyperparameters set as follows. The initial learning rate is 0.01, the weight decay rate is 0.0005, the optimization momentum is 0.937, the number of iterations is 300, and the number of early stopping patience rounds is set to 100, i.e., the number of rounds to be waited for after the effect is no longer boosted. The remaining configurations are left as the default values of the original YOLOv8 model.
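For reference, the hyperparameters stated above can be collected into a configuration sketch; the key names follow common conventions and are assumptions, while the values are taken from this section.

```python
# Training hyperparameters stated in the text, collected as a configuration sketch.
PHASE1_GENERATOR = dict(
    epochs=200, lr=2e-4, lr_decay_epochs=50,      # learning rate decayed linearly to zero
    optimizer="Adam", betas=(0.5, 0.999),
)
PHASE2_DETECTOR = dict(
    optimizer="SGD", lr0=0.01, weight_decay=5e-4,
    momentum=0.937, epochs=300, patience=100,     # early stopping patience in epochs
)
```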
C. Evaluation Indices
We use precision, recall, F1 score, and mean average precision (mAP) as evaluation metrics to measure the performance of the detector, defined as follows:
\begin{align*}{\rm{Precision }}=& \frac{{\text{TP}}}{{{\rm{TP + FP}}}} \tag{20}\\{\rm{Recall }}=& \frac{{\text{TP}}}{{{\rm{TP + FN}}}} \tag{21}\\{\rm{F1 }}=& 2 \times \frac{{{\rm{Precision \times Recall}}}}{{{\rm{Precision + Recall}}}} \tag{22}\\ \text{AP} =& \sum\limits_n {\left({{R}_n} - {{R}_{n - 1}}\right){{P}_n}} \tag{23}\\ \text{mAP} =& \frac{1}{C}\sum\limits_{c = 1}^C {\mathrm{A}{{\mathrm{P}}_c}} \tag{24} \end{align*}
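A small NumPy sketch of (20)–(24) is given below; the function and variable names are illustrative.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Eqs. (20)-(22) from counts of true positives, false positives, and false negatives."""
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return precision, recall, f1

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Eq. (23): AP = sum_n (R_n - R_{n-1}) * P_n, with R_0 = 0 assumed."""
    prev_recalls = np.concatenate(([0.0], recalls[:-1]))
    return float(np.sum((recalls - prev_recalls) * precisions))

# mAP (Eq. (24)) is the mean of the per-class AP values, e.g.:
# map50 = np.mean([average_precision(r_c, p_c) for r_c, p_c in per_class_pr_curves])
```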
D. Experimental Results and Analyses
In this section, we discuss our findings and compare our method (denoted as TPDTNet) with state-of-the-art methods. We select multiple YOLO-based domain-adaptive object detection methods and conduct experimental validations on multimodal datasets. Source only indicates that the labeled source-domain images are used for training and testing is performed directly on the target-domain data without domain adaptation. Oracle indicates a model trained and tested on the labeled target domain, which serves as the best-performing benchmark. The specific experimental setup and results are as follows.
We use VEDAI-Vis as the source domain and the VEDAI-Ir dataset as the target domain, and we test the model by employing a validation set to assess the efficacy of the domain adaptation method in terms of detecting optical-to-infrared images.
The transformed enhanced domain images are shown in Fig. 9. The enhanced domain images consist of the source-domain image and the generated images. A comparison between the distributions of the generated domain images and the target domain images reveals that our method yields excellent image-level alignment results, as shown in Fig. 10.
Comparison between the distributions of data from the target domain and generated domain.
Table I summarizes the quantitative results of several advanced detection methods. Under these challenging conditions, TPDTNet improves the detection performance achieved on the validation set of small infrared remote sensing targets and produces the best results. Specifically, TPDTNet reaches 69.7% mAP@50 and 42.9% mAP@50-95 when YOLOv8 is used as the baseline within the YOLO paradigm. Compared with state-of-the-art methods such as YOLOv5 [37], YOLOv8 [17], DETR [36], SuperYOLO [80], CMF [81], M2FP [74], SSDA-YOLO [77], VT [75], ConfMix [76], YOLO-G [78], and SF-YOLO [79], we achieve a substantial increase in accuracy at a comparable parameter count. However, our method has relatively high giga floating-point operations (GFLOPs). This is mainly because GFLOPs depend not only on the number of parameters but also on the types of operations used and the computational complexity of each layer in the network. The progressive fusion module in our network structure employs many convolution operations, so our computational complexity is relatively high compared with that of the above-mentioned algorithms. Notably, the TPDTNet results are 17.3% and 0.91% higher than the Oracle results for mAP@50 and mAP@50-95, respectively. Our method also achieves superior precision and recall, and the high F1 score demonstrates that it strikes a good balance between the two, effectively reducing both false positives and false negatives.
SSDA-YOLO and VT were initially proposed for computer vision tasks, so they may not perform as well in remote sensing scenarios. SSDA-YOLO feeds the source- and target-domain data into the backbone network without performing separate feature extraction for the target domain. VT uses a teacher–student framework for domain-adaptive detection, but its detector is not tailored to small remote sensing targets. ConfMix mixes source- and target-domain samples on the basis of region-level detection confidence; however, the mixed samples may not adequately represent the characteristics of the target domain, which in turn affects the ability of the model to adapt to it. YOLO-G introduces an adversarial training branch that achieves feature alignment through a gradient reversal layer, but its adaptability to cross-domain scenarios between visible-light and infrared data remains insufficient. SF-YOLO is a practical UDA method that mitigates pseudolabel drift through smooth weight updates between the teacher and student networks; however, its performance declines noticeably when the feature distribution of the target domain deviates significantly from that of the source domain. The DETR detector introduces a transformer architecture, enabling end-to-end training and simplifying the detection pipeline, but although it performs well on large objects, it struggles with small object detection.
Fig. 11 visualizes the detection results. We can intuitively find that all the compared models have different degrees of leakage and misdetection. In contrast, the method presented in this work has more substantial advantages in terms of both the accuracy of the regression box and the confidence of the classification results.
Visual comparison among different detection approaches. (a) Source only (YOLOv5). (b) Source only (YOLOv8). (c) SSDA-YOLO. (d) VT. (e) ConfMix. (f) Oracle (YOLOv5). (g) Oracle (YOLOv8). (h) TPDTNet (Ours). (i) GT.
E. Ablation Study
This section first examines the baseline model of the detector within the domain-adaptive detection method to make the best choice. We then demonstrate the effectiveness of the proposed modules by verifying the influence of each module on the overall performance through ablation experiments, which are performed on the VEDAI-Vis to VEDAI-Ir transfer task.
1) Validation of the Baseline Framework
In Table II, the model sizes and inference capabilities of different baselines are evaluated based on their parameter counts (Params), GFLOPs, and frames per second (FPS). The YOLOv8 model provides five baseline variants, namely, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, which differ in their numbers of parameters and convolutional layers. YOLOv8n is the smallest model, with the fewest parameters and GFLOPs and the highest FPS, whereas YOLOv8x is the largest and slowest. We apply these five models to our proposed method to select the optimal baseline. The experimental results indicate that YOLOv8m attains an optimal balance among inference speed, detection accuracy, and model size. Thus, in subsequent experiments, we use YOLOv8m as our baseline model.
2) Effectiveness of Each Component
This ablation experiment evaluates the impacts of components such as the SPD convolution, MPFD, and the regression loss on the performance of the target detection model. The ablation experiments conducted for different components are given in Table III. Compared with the bare baseline, adding all three modules results in good performance.
YOLOv8+SPD improves mAP@50 by 4.1% over the original baseline, YOLOv8+MPFD improves it by 11.2%, and YOLOv8+L_IOU improves it by 2%. SPD helps the backbone network extract features from small remote sensing targets, MPFD improves the multilevel fusion ability of the network, and the L_IOU regression loss enables precise localization of detected targets. When we add SPD to the backbone network, use MPFD in place of the neck network, and adopt L_IOU as the loss function, we achieve excellent results in terms of both evaluation metrics (52.4% versus 69.7%, and 29.8% versus 42.9%).
3) Effectiveness of Improving Performance for Targets of Different Sizes
To evaluate the contributions of different modules in our method to detection performance at various scales, we conducted ablation experiments, as given in Table III. We classified object sizes using the COCO dataset format, categorizing objects based on their area (in pixels). Objects with an area smaller than 32 × 32 pixels are defined as small objects. Those with an area greater than 32 × 32 but smaller than 96 × 96 pixels are considered medium objects, while objects with an area larger than 96 × 96 pixels are classified as large objects. Analysis of the table reveals that when SPD, MPFD, and L_IOU components are added individually, the detection accuracy for targets of different sizes improves to varying degrees. Notably, these components exhibit a more significant performance improvement for small target detection. Specifically, the MPFD module achieves results that are 13.1% higher than the baseline in terms of mAP@50-95. When all components are incorporated into the baseline, it is evident that our method achieves optimal performance across multiple scales of target detection. Our method achieves mAP@50-95 of 38.4%, 46.4%, and 42.5% for small, medium, and large object detection, respectively.
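The size buckets used above follow the COCO convention; a small helper illustrating the pixel-area thresholds stated in the text (a sketch, with an assumed function name) is given below.

```python
def coco_size_category(width_px: float, height_px: float) -> str:
    """COCO-style size buckets used in the scale-wise ablation (areas in pixels)."""
    area = width_px * height_px
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```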
F. Generalization Experiment on Dataset DroneVehicle
To validate the generalization ability of the proposed TPDTNet, this section compares the detection accuracy of TPDTNet with that of other methods on the public DroneVehicle dataset [69]. DroneVehicle is a large-scale RGB-IR (visible-infrared) multimodal dataset for vehicle detection in drone imagery. The dataset contains a variety of shooting conditions and scenes covering different viewing angles, heights, and lighting conditions, totaling 20 000 images. The object scale varies greatly across images, and in some scenes the object density is high, resulting in occlusion between objects. In the visible modality, oblique viewing angles often result in object occlusion and deformation, compromising the integrity of target features. In contrast, in infrared imagery, the thermal characteristics of targets vary minimally across viewpoints, and although boundaries may appear blurred, the overall shape remains relatively consistent. We select 6000, 1000, and 868 image pairs from this dataset as the training, validation, and test sets, respectively. Five vehicle types are considered: cars, trucks, buses, vans, and freight cars.
Table IV illustrates the comparison results produced by the different algorithms in detail, and Fig. 12 provides some visualization results. The following can be concluded.
The mAP@50 attained by the proposed TPDTNet model on the public DroneVehicle dataset is 60.3%, which is 10.0%–12.4% higher than those of the other methods. Our algorithm also achieves excellent detection accuracy in the generalization experiment conducted on aerial drone data relative to the other methods.
The visualization results clearly reveal that the proposed method, TPDTNet, has significant advantages with respect to both missed and false detections and in terms of the localization accuracy of its regression and its categorical confidence relative to those of other methods, confirming its good performance.
We have also added comparative experiments with the latest RGB-T multimodal object detection methods M2FP [74] and CMF [81]. It can be observed that, although our overall mAP@50 still lags behind these advanced multimodal methods, we achieve an impressive accuracy of up to 98% in car object detection.
In summary, the proposed method has high detection accuracy on the DroneVehicle UAV aerial photography dataset, indicating its good detection accuracy and generalizability.
Visualization of the experimental comparisons on the DroneVehicle dataset. (a) Source only (YOLOv5). (b) Source only (YOLOv8). (c) SSDA-YOLO. (d) VT. (e) ConfMix. (f) Oracle (YOLOv5). (g) Oracle (YOLOv8). (h) TPDTNet (Ours). (i) GT.
G. Generalization Experiment on Dataset LLVIP
To further validate the adaptability of the proposed method, we conduct generalization experiments on the LLVIP dataset. This dataset is primarily intended for pedestrian detection under dim lighting conditions utilizing both infrared and visible-light imagery. In total, it comprises 30 000 pairs of registered visible-light and infrared images. LLVIP encompasses a variety of low-illumination scenarios, including nighttime, dimly lit environments, and foggy conditions. In infrared imagery, pedestrian targets often appear prominently due to pronounced temperature contrasts. In contrast, visible-light images present complex backgrounds that include noise, shadows, and other forms of interference. We select 1891 pairs of visible-infrared data as the training set and 350 images each for the validation and test sets. Table V lists the comparison results of the different algorithms in detail, and Fig. 13 provides some visualization results. The following can be concluded.
Visualization of the experimental comparisons on the LLVIP dataset. (a) Source only (YOLOv5). (b) Source only (YOLOv8). (c) SSDA-YOLO. (d) VT. (e) ConfMix. (f) Oracle (YOLOv5). (g) Oracle (YOLOv8). (h) TPDTNet (Ours). (i) GT.
Our proposed TPDTNet method achieves excellent detection accuracy on the LLVIP dataset and attains the best results on both evaluation metrics compared with other state-of-the-art methods, obtaining mAP@50 and mAP@50-95 values of 88.4% and 54.6%, respectively. Similarly, the visualization results show that our method can still accurately detect targets that are occluded by the background. The experimental results indicate that our method generalizes well to domain-adaptive visible-to-infrared detection tasks.
Discussion
The proposed TPDTNet framework effectively addresses the challenges associated with cross-domain object detection between visible and infrared modalities. By leveraging a two-phase distillation training process, TPDTNet bridges the significant domain gap between source (visible) and target (infrared) data, showcasing its efficacy in scenarios where labeled target domain data are scarce or unavailable.
While TPDTNet demonstrates significant advancements in addressing domain shifts between visible and infrared modalities, certain challenges persist. Although the use of generated images effectively mitigates domain discrepancies to some extent, the inherent differences between generated and real infrared images can still be pronounced, particularly in complex scenes. This discrepancy is especially evident when capturing small target features, where subtle texture and intensity variations in real infrared images may not be fully replicated by generated images.
Our method still has the following limitations. Realism of generated images: In complex environments, the synthetic infrared images generated during domain alignment may lack sufficient fidelity to represent the nuanced characteristics of real-world infrared scenes. This can lead to challenges in accurately detecting small objects, as these features often require precise texture and intensity representations that generated images may fail to capture. Small target detection: The difficulty in fully replicating small target features in generated images could result in reduced detection performance for these objects, particularly in cluttered or noisy scenes.
To address these challenges and enhance the applicability of TPDTNet, future work will focus on the following points. Improved generative techniques: incorporating advanced generative methods, such as diffusion models or enhanced GAN architectures, to better bridge the gap between generated and real infrared images. Small object enhancement: developing specialized modules or training strategies that prioritize the detection of small targets, especially in challenging environments.
Conclusion
In this article, we introduce the TPDTNet model, a novel two-phase framework developed to address the performance degradation exhibited by object detection models in cross-domain remote sensing scenarios. The main objective of TPDTNet is to mitigate the domain bias between visible and infrared images and to enable unsupervised domain-adaptive detection. In the first phase, we use a powerful domain generator to build an enhanced domain for initial domain alignment at a low cost. In the second phase, we propose a detector specifically designed for small targets in remote sensing imagery. SPD convolution is employed to capture detailed spatial information, while MPFD progressively and asymmetrically fuses features from high-level semantics down to low-level representations, enabling more efficient integration of multiscale information.
These modules are carefully crafted to ensure coherence and complementarity, leading to enhanced performance in domain-adaptive object detection. Moreover, to address the challenge of domain migration, we propose a teacher–student distillation training strategy, which trains an excellent detection model in unlabeled target domain scenarios. The experimental results show that our proposed method is robust and adaptable in terms of improving its cross-domain unsupervised detection capabilities for visible-to-infrared remote sensing images. In the future, we will focus more on effectively reducing the domain gap for adaptive target detection tasks in the multimodal remote sensing domain and realizing efficient unsupervised domain-adaptive target detection.