Introduction
Due to the swift advancement of aerial remote sensing technology [1], [2], optical remote sensing imagery has been extensively utilized for feature classification and urban planning. Optical images enable accurate object detection and recognition due to their high resolution and rich detail information [3], [4], [5]. However, under poor lighting conditions such as bad weather and nighttime scenarios, the imaging quality and detection performance provided by optical images can be significantly degraded, thus limiting their applicability.
In contrast, infrared images can capture thermal radiation information from objects, remaining unaffected by lighting conditions. As a result, they hold significant application value in scenarios such as military reconnaissance and disaster rescue [6], [7], [8], [9], [10]. Due to the availability of numerous optical remote sensing datasets [11], [12], [13], current object detection research primarily focuses on the visible-light domain. However, research on infrared target detection remains limited, constrained by the difficulty of acquiring high-value infrared target data and the high cost of annotation.
In recent years, deep learning-based algorithms, backed by formidable computational power and intricate network architectures, have achieved substantial advances in object detection accuracy [14], [15], [16], [17]. These data-driven detectors [18], [19], [20] are developed within supervised end-to-end frameworks and therefore rely heavily on extensive labeled datasets. To alleviate the need for labeled data, researchers have explored few-shot learning techniques [21], [22], [23], [24], [25] for airborne remote sensing object detection and have achieved considerable success.
However, the spectral properties of visible and infrared images exhibit significant differences, leading to inconsistent feature representations for the same scene in different domains. As a result, it is difficult for traditional deep learning-based detection models to effectively learn visible image features that correspond to infrared images.
A prevalent approach is to train a model on the source domain and then fine-tune it on the target domain to enhance its generalization performance there. However, this approach presupposes that manual annotations are available for the target-domain images. Unsupervised domain adaptation (UDA) seeks to address this challenge: cross-domain object detection is achieved by using labeled source-domain data (visible images) and unlabeled target-domain data (infrared images). In the remote sensing field, most UDA work focuses on semantic segmentation [26], [27], [28], [29], [30], [31] and image classification [32], [33], [34], mitigating discrepancies by aligning the semantic features of the source and target domains. However, few studies address domain-adaptive object detection for remote sensing data.
In recent years, domain-adaptive object detection has attracted increasing academic attention. Many advanced detectors [17], [18], [19], [20] have been proposed for computer vision and can be applied to remote sensing tasks. However, two challenges remain in adapting detectors trained on visible imagery to the infrared remote sensing domain.
Low detection accuracy for small targets in infrared remote sensing imagery: Small targets in infrared images occupy few pixels and provide weak feature representations, making it difficult for a model to capture their detailed information. In practical applications, large amounts of labeled visible remote sensing data and unlabeled infrared remote sensing data exist; however, annotating infrared datasets is difficult, and the available labeled volume is relatively small. Models trained only on labeled visible data struggle to adapt to infrared imagery.
Significant distributional differences between the source and target domains: A detection model trained for a specific environment often degrades when transferred to a target scene because the sample distribution changes. This distribution difference, known as domain bias, reduces detection accuracy. Mapping data from different scenarios into a unified feature space and analyzing their distributional properties reveals the differences between datasets from different domains. Fig. 1 shows the significant variability between the data distributions, revealing the gaps between pairs of domains. This visual result not only confirms the domain bias problem but also provides an intuitive basis for strategies that reduce interdomain differences and improve the generalization performance of the model.
Comparison between the distributions of data derived from different domains. (a) Comparison between the distributions of the VEDAI visible dataset and the VEDAI infrared dataset. (b) Comparison between the distributions of the DroneVehicle visible dataset and the DroneVehicle infrared dataset. The distributions are calculated and displayed via 2-D t-SNE [35]. Significant distributional differences are observed between the datasets, which makes it difficult to obtain satisfactory results from the detection model.
Currently, many studies address cross-domain object detection with UDA methods, which primarily include feature-level alignment, image-level alignment, and pseudo-label-based self-training. However, these methods still exhibit the following limitations when transferring from visible-light to infrared images: aligning only the feature distributions between the source and target domains neglects differences in visual style, making it difficult to completely bridge the domain gap, and pseudo-label-based training strategies are susceptible to low-quality labels, leading to unstable detection results.
To address the above-mentioned problems, we propose the two-phase distillation training framework (TPDTNet), a novel unsupervised domain-adaptive framework for object detection in optical-to-infrared remote sensing imagery. Unlike methods that align the source and target domains directly through loss computation, we build a deep student–teacher framework on the basis of the advanced YOLOv8 detection model. In addition, domain alignment is achieved through an enhanced-domain construction technique and a two-phase teacher network training process.
Owing to substantial domain disparities, models trained in the source domain struggle to achieve optimal performance in the target domain. Therefore, we first train an unpaired image-to-image generator for initial domain generalization. To simplify the domain alignment task, we establish an enhanced domain, which includes both the generated domain and the source domain. Next, we input the enhanced domain into the first-level teacher network for training to generate pseudolabels for the target domain. To eliminate the impact of the source domain on the detection model, a second training session is performed with the teacher network. This enhances the consistency of the target domain and improves the robustness of the detection network. Finally, a channelwise knowledge distillation strategy is utilized to transfer the training weights, further improving the accuracy of the detection network.
In summary, the main contributions of this study are as follows.
We propose a novel TPDTNet framework for unsupervised domain-adaptive object detection tasks involving visible-to-infrared remote sensing images.
An image-level domain alignment method based on the source and generated domains is proposed. This approach narrows the interdomain gap by converting visible images into fake infrared images. Mixing the source domain and the generated domain to construct an enhanced domain provides the model with more comprehensive representation information and improves the domain adaptation ability of the detector.
We design a multidimensional progressive fusion detection (MPFD) framework to exploit the relationships between nonadjacent layers, improving the feature extraction ability of the model through progressive cross fusion and achieving deep semantic information capture for small targets.
We design a knowledge distillation training strategy on the basis of the teacher–student structure. Pseudodetection labels are used to calculate the target loss, and distillation learning is implemented by softly aligning the corresponding channel activation functions.
The rest of this article is organized as follows. Section II offers a succinct overview of the relevant literature. The main content of Section III includes a detailed description of the proposed method. The experimental results are presented in Section IV. Section V presents the conclusion of the article.
Related Works
This section succinctly discusses the relevant approaches from four perspectives: object detection, UDA, UDA object detection, and motivations.
A. Object Detection
The swift advancement of deep learning in computer vision has led to exceptional performance in object identification tasks, which are extensively utilized on intelligent platforms (e.g., drones, unmanned vehicles, and robots). Currently, the available object detection methods are categorized into convolutional neural network (CNN) based methods and transformer-based methods [36].
CNN-based target detectors, such as the You Only Look Once (YOLO) family [15], [17], [37] and the region-based CNN (R-CNN) family [14], [16], have made impressive achievements. Deep convolutional networks use “end-to-end” feature learning to automatically learn global features from many training sets, thereby producing accurate object detection results. CNN-based object detection methods can be classified into two types. The first includes two-stage object detection methods, which obtain the target candidate region first and then pinpoint the target location. The typical algorithms of this type are faster R-CNN [14] and cascade R-CNN [38]. The other approaches are one-stage object detection methods, which directly regress the location and category of the target frame; examples include YOLOv8 [17], Retina-Net [39], and Center-Net [40].
Transformer-based target detectors perform comparably to or even better than CNNs in object detection cases, especially in tasks requiring global contextual information. The vision transformer [41] was the first approach to migrate the original transformer to an image classification task, and the DETR [36] algorithm applied a generalized transformer backbone network to an object detection task with good results.
Many advanced detection models are available in the field of infrared target detection, e.g., UIU-Net [70], the ORSIm detector [71], and PEDNet [72]. UIU-Net embeds a tiny U-Net in a large U-Net backbone network to achieve multilevel representation learning for infrared small targets. The ORSIm detector effectively solves the image rotation and scaling problems in object detection scenarios by jointly considering rotation-invariant channel features and spatial channel features. PEDNet fully integrates the feature information of prominent regions, rotating targets, and strong semantic information through a multiscale feature-based cross-fusion structure. This method yields improved detection accuracy while performing detection in real time. A subpixel-level decoupled and coupled framework was proposed for image-level and feature-level fusion [73].
The above-mentioned methods have demonstrated good detection performance in their respective fields. However, the background temperature changes occurring in infrared remote sensing images may lead to an increase in background noise. Complex scenes may have similar temperature characteristics to those of the target, which can interfere with the judgment of the utilized detection algorithm.
Currently, the most commonly used domain-adaptive detector is still Faster R-CNN. However, even with a state-of-the-art ResNet101 [42] backbone, Faster R-CNN still exhibits significant accuracy and real-time performance gaps relative to YOLOv8.
YOLOv8 adopts a modularized design and performs better in small target and complex scenarios through an improved network structure and a better training method. We choose YOLOv8 as the base detection model to construct a domain-adaptive training framework for two-phase distillation learning to improve its detection performance in cross-domain scenarios.
B. Unsupervised Domain Adaptation
UDA seeks to leverage annotated data from a source domain to develop a model that generalizes effectively to an unlabeled target domain while preserving its high performance. The fundamental tactics of this method encompass feature-level alignment techniques and image-level alignment techniques.
Feature alignment approaches strive to make the feature distributions in the source and target domains as consistent as possible via techniques such as adversarial training and metric learning. In a domain-adaptive classification task, the maximum mean discrepancy [43] and correlation alignment [44] accomplish domain adaptation by reducing the distributional divergence between the source and target domains within a high-dimensional feature space. The domain-adversarial neural network [45] employs an adversarial training mechanism to achieve domain alignment via a domain classifier that makes the feature distributions of the source and target domains indistinguishable in the feature space.
Image-level alignment methods focus on aligning global characteristics and emphasize the alignment of target objects in the source and target domains. A source-domain image is converted into the style of a target-domain image to reduce the distribution difference between the two domains, which can be achieved by a generative adversarial network [46]. Mean Teacher [47] guides its model to produce stable and consistent predictions over the target domain through consistent regularization and teacher networks. Our work achieves dual image-level and feature-level alignment through a two-phase framework, which further improves the domain adaptive detection ability of the model.
C. Unsupervised Domain Adaptive Object Detection
Although UDA has achieved strong results in classification and segmentation tasks, UDA for object detection remains comparatively underexplored. Chen et al. [48] pioneered DA Faster R-CNN for effective domain adaptation in object detection tasks. This approach employs a dual adversarial training strategy that achieves domain alignment at both the image and instance levels. The existing unsupervised domain-adaptive object detection methods can be broadly categorized as follows.
1) Domain Invariant Feature Learning
These methods perform adversarial training on their detection models to align the feature distributions of the two domains through an adversarial loss. Ganin et al. [49] used a feature learning method based on gradient reversal layers for adversarial training to generate domain-invariant features. SWDA [50], Every Pixel Matters [51], PDA [52], MeGA-CDA [53], and RFA-Net [54] are all adversarial domain-invariant feature learning methods that follow DA Faster R-CNN.
2) Image-to-Image Translation
These strategies aim to train an unpaired image translator that transforms the given source-domain data into target-domain-style data, visually reducing the distributional bias. Compared with feature alignment methods, image-to-image translation-based methods can reduce the domain gap at the input level. The current leading image translation methodologies include the cyclical generative adversarial network [55] and CUT [56].
3) Pseudo-Label-Based Self-Training
This approach employs a self-training strategy that leverages the predictions produced by a model on the target domain as pseudolabels to enhance the performance achieved in that domain. Khodabandeh et al. [57] suggested a pseudolabel denoising strategy for increasing the generalizability of a model throughout the target domain. Kim et al. [58] proposed a pseudolabeling approach for single-stage detection models, which led to better results.
4) Mean Teacher Training
The Mean Teacher training method is guided by consistency regularization and teacher networks so that the model produces stable and consistent predictions over the target domain. The teacher model produces stable pseudolabels, and its parameters are updated as an exponential moving average (EMA) of the student model's parameters. The student model is then trained on the target domain with these pseudolabels to adapt to its data distribution. UMT [59], HT [60], SCL [61], and AT [62] all utilize mean teacher cotraining to increase their object detection efficacy over the target domain. However, the above methods have not effectively solved the domain bias problem, and their effectiveness has not yet been proven for remote sensing image detection. Therefore, we develop a two-phase distillation learning-based training framework. It achieves image-level domain adaptation through generative modeling and adds dual-domain teacher pseudolabel learning to the distillation process to achieve feature-level domain alignment, which improves the ability of the network to extract features from the target domain and improves the resulting detection accuracy.
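For concreteness, a minimal PyTorch sketch of the EMA teacher update used in Mean Teacher-style methods is given below; the decay value and the direct copying of buffers are illustrative assumptions rather than the settings of any particular cited method.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999) -> None:
    """Exponential-moving-average update: teacher <- decay * teacher + (1 - decay) * student.

    The decay value is illustrative; Mean Teacher-style methods typically use 0.99-0.9999.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param.detach(), alpha=1.0 - decay)
    # Buffers (e.g., BatchNorm running statistics) are usually copied directly from the student.
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)
```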
D. Motivation
Unlike single-modal UDA, visible-to-infrared UDA faces multiple challenges. We need to simultaneously address the problems related to data distribution differences, the scarcity of target-domain labeling data, and the low detection accuracy attained for small targets derived from infrared remote sensing images. A comparison between single-modal UDA and cross-modal UDA is shown in Fig. 2.
With the above-mentioned motivations, this article proposes TPDTNet to solve visible-to-infrared unsupervised domain-adaptive object detection tasks. In the first phase, we migrate visible images to the infrared domain via a generative network, which achieves image-level domain alignment and initially reduces the interdomain gap. Second, an MPFD module is embedded in the detection network to enhance the interlayer feature fusion process, and the lack of IR-domain labels is addressed through teacher network-generated pseudolabels. Finally, channel distillation is integrated into the teacher–student network to achieve feature-level domain alignment and enhance the performance of the model in the IR domain. The proposed method attains better results than the existing mainstream methods on both the remote sensing dataset VEDAI and the generalization datasets DroneVehicle and LLVIP. Extensive ablation experiments further validate the effectiveness of the various modules used in the proposed method.
Method
The comprehensive framework of the proposed TPDTNet model is illustrated in Fig. 3. Our TPDTNet approach consists of two phases: image-level domain alignment and domain-adaptive distillation training. In this work, the source domain is the visible domain, and the target domain is the infrared domain.
In the first phase, the source-domain images are transformed into target-domain images via a domain generation model. Consequently, the images possess identical visual contents but belong to distinct domains. The transformed fake target domain is called the generative domain. We mix the source-domain and generated-domain data to construct an enhanced domain. The specific process is shown in the blue dashed box in Fig. 3.
Overall network framework of TPDTNet. As shown in the blue dashed box, in the first phase the source-domain images are converted into target-domain images through the domain generation network, and these images are combined to form the enhanced domain. In the second phase, the initial teacher network is trained on the enhanced domain to generate pseudolabels, and the student network is subsequently trained via knowledge distillation, as shown in the yellow dashed box.
A. Image-Level Domain Alignment
An image-level domain alignment method is designed to compensate for the difference between the source domain and the target domain. We employ a style transfer network to transform the source domain into the target domain. The source domain is denoted as
It is evident that the enhanced domain consists of data from both the source domain and the generated domain. Our domain generator combines adversarial learning with contrastive learning. Through the adversarial interaction between the generator and the discriminator, the distribution of the generated domain is brought closer to that of the target domain. Simultaneously, contrastive learning narrows the feature representation gap between the generated domain and the target domain in the feature space while distinguishing the generated domain from other unrelated domains (e.g., the source domain). This process effectively achieves image-level domain alignment.
The domain generator is crucial for image-level domain alignment. Inspired by the work of Park et al. [56], we design a domain generation network based on a contrastive learning framework that learns a unidirectional modal transformation and mapping relationship. Our generator maximizes the mutual information between the source and target domains, which reduces the required training time to a certain extent. Fig. 4 illustrates the architecture of our domain generation network.
Domain generation network structure. A generated image block should be closer to the corresponding input image block than to the other negative-sample image blocks.
Specifically, images are first obtained from the source and target domains, from which content and domain information are subsequently extracted. Each image is 512 × 512 pixels with three input channels. The source-domain image is fed into the generator, the content information of the visible image is fused with the style information of the infrared image to generate a new feature map, and this map is remapped to obtain an infrared-style source-domain image.
The generator model uses a simple encoding and decoding architecture, which is separated into two parts: an encoder
\begin{equation*} \alpha _i^G = \varphi (z) = {{\varphi }_{\text{dec}}}({{\varphi }_{\text{enc}}}(\alpha _i^S)) \tag{1} \end{equation*}
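As an illustration of (1), a minimal encoder–decoder generator in PyTorch is sketched below; the layer depths, widths, and activations are assumptions for exposition and do not reflect the exact architecture of our generator.

```python
import torch
import torch.nn as nn

class DomainGenerator(nn.Module):
    """Minimal encoder-decoder matching Eq. (1); depths and widths are illustrative assumptions."""
    def __init__(self, channels: int = 3, width: int = 64):
        super().__init__()
        self.enc = nn.Sequential(                       # phi_enc: 512x512x3 -> latent features
            nn.Conv2d(channels, width, 7, stride=1, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width * 2, width * 4, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.dec = nn.Sequential(                       # phi_dec: latent -> infrared-style image
            nn.ConvTranspose2d(width * 4, width * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 7, stride=1, padding=3), nn.Tanh(),
        )

    def forward(self, x_source: torch.Tensor) -> torch.Tensor:
        return self.dec(self.enc(x_source))             # alpha^G = phi_dec(phi_enc(alpha^S))
```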
B. Detection Network
Based on the enhanced domain, we pretrain the detection model by feeding the data into the teacher network. The YOLOv8 model is widely used in remote sensing object detection and achieves excellent performance. However, this model cannot solve the problem of cross-modal domain adaptation detection for multimodal remote sensing data. To resolve this issue, we present an object detection framework, as shown in Fig. 5.
We design a multilayer cross-feature fusion detection network, embedding space-to-depth convolution (SPD-Conv) into the backbone network to extract image texture details. An MPFD framework is built to obtain high-level semantic information from the input images. The network consists of three basic parts: backbone (feature extraction), neck (feature fusion), and head (object detection).
To improve the feature extraction capability of the backbone of YOLOv8, SPD-Conv [63] is introduced in layers P2 to P5. Specifically, SPD-Conv consists of an SPD layer and a nonstrided convolutional layer (stride = 1) in series. The input feature maps are transformed through the SPD layer, and then the output results are subjected to a convolution operation through a nonstrided Conv layer. This operation can significantly reduce the spatial dimensionality of the feature maps while maintaining the channel information. Specifically, each pixel of the input feature map is mapped to a channel, in which the spatial dimensionality is reduced while the channel dimensionality is increased. The associated process is shown in Fig. 6.
The size of the input feature map
\begin{equation*} F = \left\{ \begin{array}{cccc} f_{0,0} & f_{1,0} & \cdots & f_{\text{scale}-1,0}\\ f_{0,1} & f_{1,1} & \cdots & f_{\text{scale}-1,1}\\ \vdots & \vdots & \ddots & \vdots \\ f_{0,\text{scale}-1} & f_{1,\text{scale}-1} & \cdots & f_{\text{scale}-1,\text{scale}-1} \end{array} \right\} \tag{2} \end{equation*}
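A minimal PyTorch sketch of the SPD-Conv operation described above follows; torch's pixel_unshuffle performs the space-to-depth rearrangement of (2), and the output channel count is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution (sketch of SPD-Conv [63])."""
    def __init__(self, in_channels: int, out_channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # After space-to-depth the channel dimension grows by scale**2.
        self.conv = nn.Conv2d(in_channels * scale ** 2, out_channels,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pixel_unshuffle maps (B, C, H, W) -> (B, C*scale^2, H/scale, W/scale):
        # each scale x scale spatial block becomes extra channels, as in Eq. (2).
        x = F.pixel_unshuffle(x, self.scale)
        return self.conv(x)

# Example: a 2x space-to-depth halves the spatial size and quadruples the channels.
y = SPDConv(64, 128)(torch.randn(1, 64, 256, 256))   # -> (1, 128, 128, 128)
```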
To enhance the feature extraction ability of the model for tiny object regions and capture deeper semantic information, we design an MPFD framework. In the feature extraction stage within the backbone network, different levels of features are formed from low to high. Unlike traditional feature pyramid fusion [64] frameworks, we utilize the relationships between nonadjacent layers and improve the feature extraction ability of the model through progressive cross fusion, as shown in the neck section of Fig. 5. We employ 1 × 1 convolution and bilinear interpolation techniques to upsample the features, thereby achieving dimensional alignment before conducting feature fusion. In addition, we perform downsampling with different convolution kernels and step sizes. During multilevel feature fusion, we utilize ASFF [65] to assign various spatial weights to the characteristics at different levels, attenuating the mutually exclusive information while increasing the relevance of the critical levels. Let
\begin{equation*} y_{ij}^m = \alpha _{ij}^m \cdot x_{ij}^{a \to m} + \beta _{ij}^m \cdot x_{ij}^{b \to m} + \gamma _{ij}^m \cdot x_{ij}^{c \to m} \tag{3} \end{equation*}
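A simplified sketch of the adaptively weighted fusion in (3) is given below; the 1 × 1 convolutions that produce the per-pixel weights, and the assumption that the three inputs have already been resampled to level m, are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusion(nn.Module):
    """Adaptively weighted fusion of three resized feature maps (sketch of Eq. (3))."""
    def __init__(self, channels: int):
        super().__init__()
        # One 1x1 conv per input level predicts a per-pixel fusion logit.
        self.weight_a = nn.Conv2d(channels, 1, kernel_size=1)
        self.weight_b = nn.Conv2d(channels, 1, kernel_size=1)
        self.weight_c = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x_a, x_b, x_c):
        logits = torch.cat([self.weight_a(x_a),
                            self.weight_b(x_b),
                            self.weight_c(x_c)], dim=1)              # (B, 3, H, W)
        alpha, beta, gamma = F.softmax(logits, dim=1).chunk(3, dim=1)
        # Spatially adaptive weighted sum: alpha + beta + gamma = 1 at every pixel.
        return alpha * x_a + beta * x_b + gamma * x_c
```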
Finally, we propose detection heads for weak, small, medium, and large feature maps, which results in our proposed model having a broader detection range. As shown in Fig. 5, the decoupled detection head we use is divided into two branches, classification and regression, which improves the feature extraction in multicategory training.
C. Teacher–Student Distillation Learning
To better transfer the knowledge in each channel of the teacher network, we implement soft alignment for the activations of the corresponding channels. First, we normalize the feature map in each channel to obtain a softened probability map. We then measure the differences via the Kullback–Leibler (KL) divergence or other probability distance measures. Finally, we convert the channel activations into probability distributions and minimize the KL divergence between the channel probability maps of the two networks, so that the distillation process emphasizes the most prominent portions of each channel.
As shown in Fig. 7, the significance of the scene categories corresponding to the activation areas of different channels varies. Therefore, we employ a new channel distillation paradigm to guide the student networks in learning knowledge from a well-trained teacher network.
D. Two-Phase Training Strategy
This section elucidates the relationship between the two phases, image-level domain alignment and domain-adaptive distillation training, which are performed sequentially. The specific training algorithm of the proposed TPDTNet is shown in Algorithm 1. In the first phase, the source- and target-domain images are fed into the generative network to train the generative model so that each source-domain image is converted into a target-like domain image. Subsequently, we combine the source and generated domains to construct the enhanced domain. Theoretically, constructing an enhanced domain can compensate, to a certain extent, for the interdomain differences that occur during model training and achieves preliminary domain alignment, which provides a basis for training cross-domain detection models.
Algorithm 1: Two-Phase Training Strategy of TPDTNet.
source domain dataset
enhanced domain
[The first phase of training]
while
Input the source domain images
Merge
end while
while
Input the enhanced domain
Input
Utilize the channel knowledge distillation strategy and target domain data for model distillation training;
Calculate the loss between the prediction result and the ground truth;
Update the weight parameters of the pretrained detection network;
end while
In the second phase, we divide the model training into three parts.
First, we feed the enhanced domain data constructed in the first phase into the detection network in a mixed manner to obtain the weights of the teacher network. These teacher weights are then used to detect target-domain images and generate corresponding pseudolabels.
Second, to eliminate the influence of the source domain on the detection model, we utilize the target-domain images equipped with pseudolabels and the generated-domain images for performing secondary training on the teacher network. These two styles of images are utilized to constrain the consistency of the target domain and improve the stability of the detection network.
Finally, the channelwise knowledge distillation strategy is utilized to migrate the training weights from the teacher network to the student network, and the model distillation training is performed using only the target-domain data.
As a result, the student network has more robust target-domain detection performance, which improves the resulting detection accuracy. Knowledge distillation is used to migrate complex teacher networks to lightweight student networks, and our motivation for using channel distillation is to achieve improved network detection performance while reducing the computational complexity of the model.
E. Loss Function
This section describes the loss function used to train TPDTNet. In the first phase, we utilize a domain generation network to migrate source domains to target domains. In this phase, we use an adversarial loss and a contrast loss.
The adversarial loss drives the generator to deceive the discriminator about the authenticity of the generated image, thereby enhancing the visual similarity between the generated image and the corresponding target-domain images. The specific formula is defined as follows:
\begin{align*} {{L}_{\text{adv}}} \left(\varphi,\lambda,S,T \right) =& {{{\mathbb{E}}}_{\alpha _i^T \sim T}}\log \lambda \left(\alpha _i^T\right)\\ &{ + {{{\mathbb{E}}}_{\alpha _i^S \sim S}}} \log \left(1 - \lambda \left(\varphi \left(\alpha _i^S\right)\right)\right) \tag{4} \end{align*}
The generative model uses a noise contrastive estimation framework [66] to maximize the mutual information between the inputs and outputs. The goal of the model is to match the relevant input and output image blocks at a specific location, whereas the other image blocks contained in the input can be utilized as negative samples. We select
\begin{align*}
&{{L}_c}\left( \varphi, H, S \right) = {{\mathbb{E}}_{\alpha _i^S \sim S}}\sum\limits_{l = 1}^L \sum\limits_{s = 1}^{{{S}_l}} \ell \left( \hat{z}_l^s, z_l^s, z_l^{S\backslash s} \right) \tag{5}\\
&\ell \left(\hat{z}_l^s, z_l^s, z_l^{S\backslash s}\right) = - \log \left[ \frac{\exp \left(\hat{z}_l^s \cdot z_l^s / \tau \right)}{\exp \left(\hat{z}_l^s \cdot z_l^s / \tau \right) + \sum\nolimits_{n = 1}^N \exp \left(\hat{z}_l^s \cdot z_l^{S\backslash s} / \tau \right)} \right] \tag{6}
\end{align*}
The comprehensive loss function for the first phase is expressed as follows:
\begin{equation*}{{L}_{\mathrm{stage1}}} = {{L}_{\text{adv}}} + {{L}_c}\left( {\varphi,H,S} \right) + {{L}_c}\left( {\varphi,H,T} \right). \tag{7} \end{equation*}
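A compact sketch of the two first-phase terms is given below, assuming the discriminator outputs logits and that the patch features have already been produced by the encoder and the projection head H of (5); the temperature value follows the default of [56] and is an assumption here.

```python
import torch
import torch.nn.functional as F

def adversarial_loss_d(disc_real: torch.Tensor, disc_fake: torch.Tensor) -> torch.Tensor:
    """Discriminator objective of Eq. (4), written in the equivalent
    binary cross-entropy (to-be-minimized) form; raw logits are assumed."""
    real = F.binary_cross_entropy_with_logits(disc_real, torch.ones_like(disc_real))
    fake = F.binary_cross_entropy_with_logits(disc_fake, torch.zeros_like(disc_fake))
    return real + fake

def patch_nce_loss(query: torch.Tensor, positive: torch.Tensor,
                   negatives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Patch-wise InfoNCE of Eq. (6).

    query / positive: (N, D) features of generated and input patches at the same
    locations; negatives: (N, K, D) features of the other patches of the input.
    tau = 0.07 is an assumed temperature.
    """
    query, positive = F.normalize(query, dim=-1), F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    l_pos = (query * positive).sum(dim=-1, keepdim=True)        # (N, 1)
    l_neg = torch.einsum('nd,nkd->nk', query, negatives)        # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The positive patch corresponds to class index 0 in the cross-entropy of Eq. (6).
    target = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, target)
```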
The training losses in the second phase mainly include classification loss, regression loss, and distillation loss. The training process of the teacher network only requires classification and regression losses, whereas the training process of the student network requires all three above types of losses.
We choose the binary cross-entropy (BCE) loss as the classification loss of the detection network, and it is defined as follows:
\begin{equation*}{{L}_{\text{cls}}} = \frac{1}{N}\sum\limits_i { - \left[{{\mu }_i} \cdot \log \left({{p}_i}\right) + \left(1 - {{\mu }_i}\right) \cdot \log \left(1 - {{p}_i}\right)\right]} \tag{8} \end{equation*}
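Assuming the classification branch outputs raw logits, (8) corresponds to the standard binary cross-entropy, e.g.:

```python
import torch
import torch.nn.functional as F

def classification_loss(pred_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """BCE classification loss of Eq. (8); raw logits as inputs are an assumption."""
    return F.binary_cross_entropy_with_logits(pred_logits, targets, reduction="mean")
```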
In the regression task, the degree of regression of a frame can be measured by the ratio of the ground truth (GT) box
\begin{equation*} \text{IoU} = \frac{{\left| {{{B}^{\text{pre}}} \cap {{B}^{\text{gt}}}} \right|}}{{\left| {{{B}^{\text{pre}}} \cup {{B}^{\text{gt}}}} \right|}}. \tag{9} \end{equation*}
Our IoU loss formula is as follows:
\begin{align*}&{{L}_{\text{IoU}}} = 1 - \text{IoU} + \text{distance} + 0.5 \times \Omega \tag{10}\\
&\text{distance} = hh \times \frac{{{{{(x_c^{\text{pre}} - x_c^{\text{gt}})}}^2}}}{{{{c}^2}}} + ww \times \frac{{{{{(y_c^{\text{pre}} - y_c^{\text{gt}})}}^2}}}{{{{c}^2}}} \tag{11}\\
&ww = \frac{{2 \times {{{({{w}^{\text{gt}}})}}^{\text{scale}}}}}{{{{{({{w}^{\text{gt}}})}}^{\text{scale}}} + {{{({{h}^{\text{gt}}})}}^{\text{scale}}}}} \tag{12}\\
&hh = \frac{{2 \times {{{({{h}^{\text{gt}}})}}^{\text{scale}}}}}{{{{{({{w}^{\text{gt}}})}}^{\text{scale}}} + {{{({{h}^{\text{gt}}})}}^{\text{scale}}}}} \tag{13}\\
&\Omega = \sum\limits_{t = w,h} {{{{\left(1 - {{e}^{ - \omega t}}\right)}}^\theta },\theta = 4} \tag{14}\\
&\left\{ {\begin{array}{c} {{{\omega }_w} = hh \times \frac{{\left| {{{w}^{\text{pre}}} - {{w}^{\text{gt}}}} \right|}}{{\max ({{w}^{\text{pre}}},{{w}^{\text{gt}}})}}}\\ {{{\omega }_h} = ww \times \frac{{\left| {{{h}^{\text{pre}}} - {{h}^{\text{gt}}}} \right|}}{{\max ({{h}^{\text{pre}}},{{h}^{\text{gt}}})}}} \end{array}} \right. \tag{15} \end{align*}
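A PyTorch sketch of (9)–(15) follows; the interpretation of c² as the squared diagonal of the smallest enclosing box and the default scale value are assumptions.

```python
import torch

def iou_regression_loss(pred, gt, scale: float = 1.0, theta: float = 4.0, eps: float = 1e-7):
    """Sketch of the regression loss in Eqs. (10)-(15); boxes are (N, 4) in (x1, y1, x2, y2)."""
    # Intersection over union, Eq. (9).
    x1 = torch.max(pred[:, 0], gt[:, 0]); y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2]); y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # Centers and widths/heights of the predicted and GT boxes.
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxg, cyg = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]

    # Shape weights ww, hh of Eqs. (12)-(13).
    ww = 2 * wg.pow(scale) / (wg.pow(scale) + hg.pow(scale) + eps)
    hh = 2 * hg.pow(scale) / (wg.pow(scale) + hg.pow(scale) + eps)

    # Squared diagonal of the smallest enclosing box (assumed meaning of c^2).
    ex1 = torch.min(pred[:, 0], gt[:, 0]); ey1 = torch.min(pred[:, 1], gt[:, 1])
    ex2 = torch.max(pred[:, 2], gt[:, 2]); ey2 = torch.max(pred[:, 3], gt[:, 3])
    c2 = (ex2 - ex1).pow(2) + (ey2 - ey1).pow(2) + eps

    distance = hh * (cxp - cxg).pow(2) / c2 + ww * (cyp - cyg).pow(2) / c2    # Eq. (11)
    omega_w = hh * (wp - wg).abs() / torch.max(wp, wg)                        # Eq. (15)
    omega_h = ww * (hp - hg).abs() / torch.max(hp, hg)
    omega = (1 - torch.exp(-omega_w)).pow(theta) + (1 - torch.exp(-omega_h)).pow(theta)  # Eq. (14)

    return (1 - iou + distance + 0.5 * omega).mean()                          # Eq. (10)
```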
The loss of the knowledge distillation learning in the second phase is described as
\begin{equation*}{{L}_{kd}}\left(\phi \left({{y}^{T^{\prime}}}\right),\phi \left({{y}^{S^{\prime}}}\right)\right) = \psi \left(\phi \left({y}_{c^{\prime}}^{T^{\prime}}\right),\phi \left({y}_{c^{\prime}}^{S^{\prime}}\right)\right) \tag{16} \end{equation*}
\begin{equation*} \phi ({{y}_{c^{\prime}}}) = \frac{{\exp (\frac{{{{y}_{c^{\prime},i}}}}{{T}})}}{{\sum\nolimits_{i = 1}^{W \cdot H} {\exp (\frac{{{{y}_{c^{\prime},i}}}}{{T}})} }} \tag{17} \end{equation*}
For the knowledge distillation loss in (16),
\begin{equation*} \psi \left({{y}^{T^{\prime}}},{{y}^{S^{\prime}}}\right) = \frac{{{{{T}}^2}}}{{C^{\prime}}}\sum\limits_{c^{\prime} = 1}^{C^{\prime}} {\sum\limits_{i = 1}^{W \cdot H} {\phi \left({y}_{c^{\prime},i}^{T^{\prime}}\right) \cdot \log \left[\frac{{\phi ({y}_{c^{\prime},i}^{T^{\prime}})}}{{\phi ({y}_{c^{\prime},i}^{S^{\prime}})}}\right]} } . \tag{18} \end{equation*}
Where the channel probability of the teacher network is large, the corresponding distribution of the student network should be as close as possible to it in order to minimize the KL divergence.
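A minimal sketch of the channelwise distillation loss in (16)–(18) follows; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def channel_wise_kd_loss(y_teacher: torch.Tensor, y_student: torch.Tensor,
                         temperature: float = 4.0) -> torch.Tensor:
    """Channel-wise distillation of Eqs. (16)-(18) for (B, C, H, W) feature maps."""
    b, c, h, w = y_teacher.shape
    # Eq. (17): softmax over the W*H spatial positions of each channel.
    p_t = F.softmax(y_teacher.view(b, c, -1) / temperature, dim=-1)
    log_p_s = F.log_softmax(y_student.view(b, c, -1) / temperature, dim=-1)
    # Eq. (18): KL divergence per channel, averaged over channels and batch, scaled by T^2.
    kl = (p_t * (p_t.clamp_min(1e-12).log() - log_p_s)).sum(dim=-1)   # (B, C)
    return (temperature ** 2) * kl.mean()
```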
Therefore, the total training loss for the second-phase distillation of the detection network is as follows:
\begin{equation*}{{L}_{\mathrm{stage2}}} = {{L}_{\text{cls}}} + {{L}_{\text{IoU}}} + {{L}_{kd}}. \tag{19} \end{equation*}
Experimental Results
In this section, we describe the dataset, the experimental details, and the evaluation metrics, and we discuss the experimental outcomes of our model with a thorough analysis. All the following experiments are carried out on a server with an NVIDIA RTX 4090 GPU (24 GB of memory) running Windows 11, with CUDA 11.6, PyTorch 1.12, and other widely used deep learning and image processing libraries installed; the development environment is PyCharm 2022.
A. Dataset
The VEDAI [68] dataset contains 1246 high-resolution RGB and infrared images, including 3640 objects distributed across 8 common categories. The images have a resolution of 1024 × 1024 pixels and cover different environments and terrains. The vehicles in each image were captured under different weather and lighting conditions, providing variety and challenge. This dataset encompasses complex backgrounds, such as urban, rural, mountainous, and industrial areas, where objects frequently experience occlusion and interference from the background. In most cases, the targeted vehicles occupy only a small fraction of the total pixels within an image, posing significant challenges for small object detection. Visible-light images provide clear details and well-defined edges. Although infrared images maintain the same spatial resolution as visible-light images, they exhibit fewer texture details and primarily rely on temperature differences between the objects and their surroundings to delineate contours. Consequently, variations in background temperature may lead to the emergence of spurious targets. We set the ratio of the training, validation, and test sets to 7:1.5:1.5. In our experiments, we employ infrared images as the target domain and RGB images as the source domain. Some typical remote sensing images and target object sizes contained in the dataset are shown in Fig. 8.
B. Implementation Details
In the first phase, we train a generator that converts images from the optical to the infrared domain. To construct the enhanced domain, we first train the domain generator. Specifically, we construct generated-domain images by using optical remote sensing images as the source-domain data and infrared remote sensing images as the target-domain data. The training objective is to minimize the adversarial loss and the contrastive loss and thereby achieve precise transfer from the optical images to the infrared target domain. The total number of training epochs is 200, with 50 epochs designated for linearly decaying the learning rate to zero. The training process is optimized via the adaptive moment estimation (Adam) optimizer with momentum parameters of 0.5 and 0.999. The initial learning rate of the network is set to 0.0002.
In the second phase, we train the detection network via the stochastic gradient descent optimizer with some hyperparameters set as follows. The initial learning rate is 0.01, the weight decay rate is 0.0005, the optimization momentum is 0.937, the number of iterations is 300, and the number of early stopping patience rounds is set to 100, i.e., the number of rounds to be waited for after the effect is no longer boosted. The remaining configurations are left as the default values of the original YOLOv8 model.
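For reference, the hyperparameters stated above can be collected into a configuration sketch; the key names follow common conventions and are assumptions, while the values are taken from this section.

```python
# Training hyperparameters stated in the text, collected as a configuration sketch.
PHASE1_GENERATOR = dict(
    epochs=200, lr=2e-4, lr_decay_epochs=50,      # learning rate decayed linearly to zero
    optimizer="Adam", betas=(0.5, 0.999),
)
PHASE2_DETECTOR = dict(
    optimizer="SGD", lr0=0.01, weight_decay=5e-4,
    momentum=0.937, epochs=300, patience=100,     # early stopping patience in epochs
)
```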
C. Evaluation Indices
We use precision, recall, F1 score, and mean average precision (mAP) as evaluation metrics to measure the performance of the detector, defined as follows:
\begin{align*}{\rm{Precision }}=& \frac{{\text{TP}}}{{{\rm{TP + FP}}}} \tag{20}\\{\rm{Recall }}=& \frac{{\text{TP}}}{{{\rm{TP + FN}}}} \tag{21}\\{\rm{F1 }}=& 2 \times \frac{{{\rm{Precision \times Recall}}}}{{{\rm{Precision + Recall}}}} \tag{22}\\ \text{AP} =& \sum\limits_n {\left({{R}_n} - {{R}_{n - 1}}\right){{P}_n}} \tag{23}\\ \text{mAP} =& \frac{1}{C}\sum\limits_{c = 1}^C {\mathrm{A}{{\mathrm{P}}_c}} \tag{24} \end{align*}
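A small NumPy sketch of (20)–(24) is given below; the function and variable names are illustrative.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Eqs. (20)-(22) from counts of true positives, false positives, and false negatives."""
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return precision, recall, f1

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Eq. (23): AP = sum_n (R_n - R_{n-1}) * P_n, with R_0 = 0 assumed."""
    prev_recalls = np.concatenate(([0.0], recalls[:-1]))
    return float(np.sum((recalls - prev_recalls) * precisions))

# mAP (Eq. (24)) is the mean of the per-class AP values, e.g.:
# map50 = np.mean([average_precision(r_c, p_c) for r_c, p_c in per_class_pr_curves])
```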
D. Experimental Results and Analyses
In this section, we discuss our findings and compare our method (denoted as TPDTNet) with state-of-the-art methods. We select multiple YOLO-based domain-adaptive object detection methods and conduct experimental validations on multimodal datasets. Source only indicates that the labeled source-domain images are used for training and testing is performed directly on the target-domain data without domain adaptation. Oracle indicates a model trained and tested on the labeled target domain, which serves as the best-performing benchmark. The specific experimental setup and results are as follows.
We use VEDAI-Vis as the source domain and the VEDAI-Ir dataset as the target domain, and we test the model by employing a validation set to assess the efficacy of the domain adaptation method in terms of detecting optical-to-infrared images.
The transformed enhanced domain images are shown in Fig. 9. The enhanced domain images consist of the source-domain image and the generated images. A comparison between the distributions of the generated domain images and the target domain images reveals that our method yields excellent image-level alignment results, as shown in Fig. 10.
Comparison between the distributions of data from the target domain and generated domain.
Table I summarizes the quantitative results of several advanced detection methods. Under these challenging conditions, TPDTNet improves the detection performance achieved on the validation set of small infrared remote sensing targets and produces the best results. Specifically, TPDTNet reaches 69.7% mAP@50 and 42.9% mAP@50-95 when YOLOv8 is used as the baseline within the YOLO paradigm. Compared with state-of-the-art methods such as YOLOv5 [37], YOLOv8 [17], DETR [36], SuperYOLO [80], CMF [81], M2FP [74], SSDA-YOLO [77], VT [75], ConfMix [76], YOLO-G [78], and SF-YOLO [79], we achieve a substantial increase in accuracy at a comparable parameter count. However, our method has relatively high giga floating-point operations (GFLOPs). This is mainly because GFLOPs depend not only on the number of parameters but also on the types of operations used and the computational complexity of each layer in the network. The progressive fusion module in our network structure employs many convolution operations, so our computational complexity is relatively high compared with that of the above-mentioned algorithms. Notably, the TPDTNet results are 17.3% and 0.91% higher than the Oracle results for mAP@50 and mAP@50-95, respectively. Our method also achieves superior precision and recall, and the high F1 score demonstrates that it strikes a good balance between the two, effectively reducing both false positives and false negatives.
SSDA-YOLO and VT were initially proposed for computer vision tasks, so they may not perform as well in remote sensing scenarios. SSDA-YOLO feeds the source- and target-domain data into the backbone network without performing separate feature extraction for the target domain. VT uses a teacher–student framework for domain-adaptive detection, but its detector is not tailored to small remote sensing targets. ConfMix mixes source- and target-domain samples on the basis of region-level detection confidence; however, the mixed samples may not adequately represent the characteristics of the target domain, which in turn affects the ability of the model to adapt to it. YOLO-G introduces an adversarial training branch that achieves feature alignment through a gradient reversal layer, but its adaptability to cross-domain scenarios between visible-light and infrared data remains insufficient. SF-YOLO is a practical UDA method that mitigates pseudolabel drift through smooth weight updates between the teacher and student networks; however, its performance declines noticeably when the feature distribution of the target domain deviates significantly from that of the source domain. The DETR detector introduces a transformer architecture, enabling end-to-end training and simplifying the detection pipeline, but although it performs well on large objects, it struggles with small object detection.
Fig. 11 visualizes the detection results. We can intuitively find that all the compared models have different degrees of leakage and misdetection. In contrast, the method presented in this work has more substantial advantages in terms of both the accuracy of the regression box and the confidence of the classification results.
Visual comparison among different detection approaches. (a) Source only (YOLOv5). (b) Source only (YOLOv8). (c) SSDA-YOLO. (d) VT. (e) ConfMix. (f) Oracle (YOLOv5). (g) Oracle (YOLOv8). (h) TPDTNet (Ours). (i) GT.
E. Ablation Study
This section first examines the baseline model of the detector within the domain-adaptive detection method to make the best choice. We then demonstrate the effectiveness of the proposed modules by verifying the influence of each module on the overall performance through ablation experiments, which are performed on the VEDAI-Vis to VEDAI-Ir transfer task.
1) Validation of the Baseline Framework
In Table II, the model sizes and inference capabilities of different baselines are evaluated based on their parameter counts (Params), GFLOPs, and frames per second (FPS). The YOLOv8 model provides five baseline variants, namely, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, which differ in their numbers of parameters and convolutional layers. YOLOv8n is the smallest model, with the fewest parameters and GFLOPs and the highest FPS, whereas YOLOv8x is the largest and slowest. We apply these five models to our proposed method to select the optimal baseline. The experimental results indicate that YOLOv8m attains an optimal balance among inference speed, detection accuracy, and model size. Thus, in subsequent experiments, we use YOLOv8m as our baseline model.
2) Effectiveness of Each Component
This ablation experiment evaluates the impacts of components such as the SPD convolution, MPFD, and the regression loss on the performance of the target detection model. The ablation experiments conducted for different components are given in Table III. Compared with the bare baseline, adding all three modules results in good performance.
YOLOv8+SPD improves mAP@50 by 4.1% over the original baseline, YOLOv8+MPFD improves it by 11.2%, and YOLOv8+L_IOU improves it by 2%. SPD helps the backbone network extract features from small remote sensing targets, MPFD improves the multilevel fusion ability of the network, and the L_IOU regression loss enables precise localization of detected targets. When we add SPD to the backbone network, use MPFD in place of the neck network, and adopt L_IOU as the loss function, we achieve excellent results in terms of both evaluation metrics (52.4% versus 69.7%, and 29.8% versus 42.9%).
3) Effectiveness of Improving Performance for Targets of Different Sizes
To evaluate the contributions of different modules in our method to detection performance at various scales, we conducted ablation experiments, as given in Table III. We classified object sizes using the COCO dataset format, categorizing objects based on their area (in pixels). Objects with an area smaller than 32 × 32 pixels are defined as small objects. Those with an area greater than 32 × 32 but smaller than 96 × 96 pixels are considered medium objects, while objects with an area larger than 96 × 96 pixels are classified as large objects. Analysis of the table reveals that when SPD, MPFD, and L_IOU components are added individually, the detection accuracy for targets of different sizes improves to varying degrees. Notably, these components exhibit a more significant performance improvement for small target detection. Specifically, the MPFD module achieves results that are 13.1% higher than the baseline in terms of mAP@50-95. When all components are incorporated into the baseline, it is evident that our method achieves optimal performance across multiple scales of target detection. Our method achieves mAP@50-95 of 38.4%, 46.4%, and 42.5% for small, medium, and large object detection, respectively.
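The size buckets used above follow the COCO convention; a small helper illustrating the pixel-area thresholds stated in the text (a sketch, with an assumed function name) is given below.

```python
def coco_size_category(width_px: float, height_px: float) -> str:
    """COCO-style size buckets used in the scale-wise ablation (areas in pixels)."""
    area = width_px * height_px
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```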
F. Generalization Experiment on Dataset DroneVehicle
To validate the generalization ability of the proposed TPDTNet, this section compares the detection accuracy of TPDTNet with that of other methods on the public DroneVehicle dataset [69]. DroneVehicle is a large-scale RGB-IR (visible-infrared) multimodal dataset for vehicle detection in drone imagery. The dataset contains a variety of shooting conditions and scenes covering different viewing angles, heights, and lighting conditions, totaling 20 000 images. The object scale varies greatly across images, and in some scenes the object density is high, resulting in occlusion between objects. In the visible modality, oblique viewing angles often result in object occlusion and deformation, compromising the integrity of target features. In contrast, in infrared imagery, the thermal characteristics of targets vary minimally across viewpoints, and although boundaries may appear blurred, the overall shape remains relatively consistent. We select 6000, 1000, and 868 image pairs from this dataset as the training, validation, and test sets, respectively. Five vehicle types are considered: cars, trucks, buses, vans, and freight cars.
Table IV illustrates the comparison results produced by the different algorithms in detail, and Fig. 12 provides some visualization results. The following can be concluded.
The mAP@50 attained by the proposed TPDTNet model on the public DroneVehicle dataset is 60.3%, which is 10.0%–12.4% higher than those of the other methods. Our algorithm also achieves excellent detection accuracy in the generalization experiment conducted on aerial drone data relative to the other methods.
The visualization results clearly reveal that the proposed method, TPDTNet, has significant advantages with respect to both missed and false detections and in terms of the localization accuracy of its regression and its categorical confidence relative to those of other methods, confirming its good performance.
We have also added comparative experiments with the latest RGB-T multimodal object detection methods M2FP [74] and CMF [81]. It can be observed that, although our overall mAP@50 still lags behind these advanced multimodal methods, we achieve an impressive accuracy of up to 98% in car object detection.
In summary, the proposed method has high detection accuracy on the DroneVehicle UAV aerial photography dataset, indicating its good detection accuracy and generalizability.
Visualization of the experimental comparisons on the DroneVehicle dataset. (a) Source only (YOLOv5). (b) Source only (YOLOv8). (c) SSDA-YOLO. (d) VT. (e) ConfMix. (f) Oracle (YOLOv5). (g) Oracle (YOLOv8). (h) TPDTNet (Ours). (i) GT.
G. Generalization Experiment on Dataset LLVIP
To further validate the adaptability of the proposed method, we conduct generalization experiments on the LLVIP dataset. This dataset is primarily intended for pedestrian detection under dim lighting conditions utilizing both infrared and visible-light imagery. In total, it comprises 30 000 pairs of registered visible-light and infrared images. LLVIP encompasses a variety of low-illumination scenarios, including nighttime, dimly lit environments, and foggy conditions. In infrared imagery, pedestrian targets often appear prominently due to pronounced temperature contrasts. In contrast, visible-light images present complex backgrounds that include noise, shadows, and other forms of interference. We select 1891 pairs of visible-infrared data as the training set and 350 images each for the validation and test sets. Table V lists the comparison results of the different algorithms in detail, and Fig. 13 provides some visualization results. The following can be concluded.
Visualization of the experimental comparisons on the LLVIP dataset. (a) Source only (YOLOv5). (b) Source only (YOLOv8). (c) SSDA-YOLO. (d) VT. (e) ConfMix. (f) Oracle (YOLOv5). (g) Oracle (YOLOv8). (h) TPDTNet (Ours). (i) GT.
Our proposed TPDTNet method achieves excellent detection accuracy on the LLVIP dataset and attains the best results on both evaluation metrics compared with other state-of-the-art methods, obtaining mAP@50 and mAP@50-95 values of 88.4% and 54.6%, respectively. Similarly, the visualization results show that our method can still accurately detect targets that are occluded by the background. The experimental results indicate that our method generalizes well to domain-adaptive visible-to-infrared detection tasks.
Discussion
The proposed TPDTNet framework effectively addresses the challenges associated with cross-domain object detection between visible and infrared modalities. By leveraging a two-phase distillation training process, TPDTNet bridges the significant domain gap between source (visible) and target (infrared) data, showcasing its efficacy in scenarios where labeled target domain data are scarce or unavailable.
While TPDTNet demonstrates significant advancements in addressing domain shifts between visible and infrared modalities, certain challenges persist. Although the use of generated images effectively mitigates domain discrepancies to some extent, the inherent differences between generated and real infrared images can still be pronounced, particularly in complex scenes. This discrepancy is especially evident when capturing small target features, where subtle texture and intensity variations in real infrared images may not be fully replicated by generated images.
Our method still has the following limitations. Realism of generated images: In complex environments, the synthetic infrared images generated during domain alignment may lack sufficient fidelity to represent the nuanced characteristics of real-world infrared scenes. This can lead to challenges in accurately detecting small objects, as these features often require precise texture and intensity representations that generated images may fail to capture. Small target detection: The difficulty in fully replicating small target features in generated images could result in reduced detection performance for these objects, particularly in cluttered or noisy scenes.
To address these challenges and enhance the applicability of TPDTNet, future work will focus on the following points. Improved generative techniques: incorporating advanced generative methods, such as diffusion models or enhanced GAN architectures, to better bridge the gap between generated and real infrared images. Small object enhancement: developing specialized modules or training strategies that prioritize the detection of small targets, especially in challenging environments.
Conclusion
In this article, we introduce the TPDTNet model, a novel two-phase framework developed to address the performance degradation exhibited by object detection models in cross-domain remote sensing scenarios. The main objective of TPDTNet is to mitigate the domain bias between visible and infrared images and to enable unsupervised domain-adaptive detection. In the first phase, we use a powerful domain generator to build an enhanced domain for initial domain alignment at a low cost. In the second phase, we propose a detector specifically designed for small targets in remote sensing imagery. SPD convolution is employed to capture detailed spatial information, while MPFD progressively and asymmetrically fuses features from high-level semantics down to low-level representations, enabling more efficient integration of multiscale information.
These modules are carefully crafted to ensure coherence and complementarity, leading to enhanced performance in domain-adaptive object detection. Moreover, to address the challenge of domain migration, we propose a teacher–student distillation training strategy, which trains an excellent detection model in unlabeled target domain scenarios. The experimental results show that our proposed method is robust and adaptable in terms of improving its cross-domain unsupervised detection capabilities for visible-to-infrared remote sensing images. In the future, we will focus more on effectively reducing the domain gap for adaptive target detection tasks in the multimodal remote sensing domain and realizing efficient unsupervised domain-adaptive target detection.