Introduction
In computer vision, object detection is a fundamental task that involves localizing and classifying objects in an image. Advances in deep learning have led to a wide variety of object detectors [1]–[6]. In most cases, they require supervised learning on large amounts of annotated data [7], [8]. Furthermore, the training and test data must belong to the same domain to achieve the expected performance.
However, domain shifts caused by changes in weather, painting style, and other factors often occur in practical applications, resulting in a loss of accuracy. In object detection, annotation requires bounding boxes for every object in an image, which makes creating a new training dataset in the shifted domain prohibitively expensive. An effective solution to this issue is domain adaptation, which transfers knowledge from a label-rich source domain to a label-poor or unlabeled target domain [9]. Specifically, unsupervised domain adaptation assumes that the target domain has no labels [10]. Several approaches have recently been proposed for unsupervised domain adaptation in object detection [11]. The most common approach is adversarial feature learning, which aligns the source and target features by training a feature extractor to compete with a domain discriminator [12]–[15]. Other approaches, such as pseudo-labeling in the target domain [16], [17] and image-to-image translation [18]–[20], have also been proposed.
In this study, we explore the application of unsupervised domain adaptation to object detection. We demonstrate that learning robust features in the source domain through adversarial training enhances object detection in a target domain with a large domain shift. Recent studies on adversarial training have revealed the existence of non-robust and robust features [21]. The former are sensitive to perturbations yet still necessary for attaining high accuracy; the latter are highly stable and close to human perception [22]. Robust and non-robust features are visualized in Fig. 1. We hypothesize that, for domain adaptation, non-robust features are highly domain-specific and thus susceptible to domain shifts, whereas robust features are informative in both the source and target domains. This idea is inspired by recent studies showing that adversarially trained models transfer better than standard-trained models [23], [24]. These studies focus on transfer learning in cases where the target domain has a small number of labels. In contrast, we focus on unsupervised domain adaptation, where the target domain has no labels. In addition to learning robust features through adversarial training, we propose a novel approach that combines adversarial training with adversarial feature learning [13] to align such features more closely with the target domain.
Visualization of robust and non-robust features. The standard-trained model in the source domain depends heavily on non-robust features, which are not informative for a largely shifted target domain. In contrast, the robust features acquired by the adversarially trained model are informative even in the largely shifted target domain.
In our experiments on benchmark datasets for real-to-artistic image adaptation, the adversarially trained detector improves the mean average precision by up to 7.7% compared to the standard-trained detector. When combined with adversarial feature learning, the improvement in mean average precision reaches 11.8%. Although our method degrades performance for small domain shifts, such as those between different weather conditions, quantifying the domain shift using the Fréchet distance allows us to predict the domain adaptation performance of adversarial training in advance. In addition, we analyze various adversarial training methods for object detection and demonstrate that several techniques proposed to improve robustness against adversarial examples do not differ substantially from the simplest adversarial training method when applied to unsupervised domain adaptation.
The contributions of this study are as follows:
To the best of our knowledge, this is the first study on the effectiveness of adversarial training in unsupervised domain adaptation. We establish that, for large domain shifts, adversarially trained detectors achieve higher accuracy in the target domain than standard-trained detectors.
We propose a method that combines adversarial training with adversarial feature learning to better align the source and target features. Experimental results show that the proposed method achieves better domain adaptation performance than adversarial training alone.
We introduce a quantification of the domain shift using the Fréchet distance, which allows us to predict the domain adaptation performance with adversarial training.
We show that several adversarial training methods proposed to improve robustness against adversarial examples do not differ substantially in performance when applied to unsupervised domain adaptation.
Related Work
This section reviews the literature on object detection, domain adaptation, and adversarial training studies.
A. Object Detection
Like image classification, object detection is a fundamental task in computer vision. Many object detectors have achieved high accuracy owing to advances in deep neural networks [1]–[6]. Most of them rely on supervised learning using large annotated datasets, such as PASCAL Visual Object Classes (VOC) [7] and Microsoft Common Objects in Context (MSCOCO) [8]. Generally, creating a new dataset for object detection is more time-consuming than creating one for image classification because it requires instance-level annotations. In this study, we use You Only Look Once v3 (YOLOv3), a well-known object detector with excellent inference speed and accuracy [4].
B. Domain Adaptation
Domain adaptation is a technique for adapting a model trained using a label-rich domain to a label-poor domain. Recently, unsupervised domain adaptation has attracted significant attention in computer vision tasks, such as image classification and semantic segmentation [10].
Many domain adaptation approaches have also been proposed for object detection [11]. Typical approaches include adversarial feature learning [12]–[15], pseudo-label-based self-training [16], [17], and image-to-image translation [18]–[20]. Adversarial feature learning employs an adversarial objective between the domain discriminator and feature extractor [9]. The domain discriminator attempts to accurately classify the source and target images, whereas the feature extractor attempts to fool the domain discriminator. As a result, the model can extract similar features from the source and target domains. The pseudo-label-based self-training approach trains the model by assigning pseudo-labels to the target images based on the knowledge obtained from the source domain. Image-to-image translation converts the source images into target-like images using CycleGAN [25] or similar methods. The model is then trained using the converted images and the original labels obtained from the source domain.
We propose a new method based on adversarial training for unsupervised domain adaptation in object detection. Note that adversarial training differs from adversarial feature learning. As mentioned above, adversarial feature learning is a domain adaptation method in which the feature extractor competes with the domain discriminator to extract similar features from the source and target domains. In contrast, adversarial training was originally proposed to increase the robustness of deep neural networks against adversarial examples. To the best of our knowledge, this study is the first to apply adversarial training to unsupervised domain adaptation in object detection.
C. Adversarial Training
One of the vulnerabilities of deep neural network-based models is the existence of adversarial examples: inputs perturbed so as to cause such models to make mistakes [26]. In adversarial training, adversarial perturbations are added to the training data so that the model becomes robust against input perturbations. The most typical methods for creating adversarial perturbations are the fast gradient sign method (FGSM) [27] and projected gradient descent (PGD) [28]. Let $x$ denote an input, $y$ its label, $f$ the model, and $\mathcal{L}$ the loss function. FGSM generates a perturbation $\delta$ in a single gradient step:\begin{equation*} \delta = \epsilon \cdot \mathop {\mathrm {sign}}\nolimits (\nabla _{x} \mathcal {L} (f (x), y)),\tag{1}\end{equation*} where $\epsilon$ is the perturbation magnitude. PGD strengthens the perturbation by iterating such gradient steps:\begin{equation*} \delta ^{(t+1)} = \mathcal {P}[\delta ^{(t)} + \alpha \cdot \mathop {\mathrm {sign}}\nolimits (\nabla _{x} \mathcal {L} (f (x + \delta ^{(t)}), y))],\tag{2}\end{equation*} where $\alpha$ is the step size and $\mathcal{P}$ projects the perturbation onto the $\epsilon$-ball.
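To make (1) and (2) concrete, the following is a minimal PyTorch sketch of both perturbation generators (not the paper's exact implementation); `model` and `loss_fn` stand in for $f$ and $\mathcal{L}$, and the projection $\mathcal{P}$ onto the $\ell_{\infty}$ $\epsilon$-ball reduces to clamping.

```python
import torch

def fgsm_delta(model, loss_fn, x, y, eps):
    # Eq. (1): single-step FGSM perturbation.
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn(model(x + delta), y).backward()
    return (eps * delta.grad.sign()).detach()

def pgd_delta(model, loss_fn, x, y, eps, alpha, steps):
    # Eq. (2): iterative PGD with l_inf projection via clamping.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss_fn(model(x + delta), y).backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)  # projection P onto the eps-ball
        delta.grad.zero_()
    return delta.detach()
```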
A recent study demonstrated that adversarial examples arise from the presence of non-robust features that are highly predictive but imperceptible to humans [21]. Standard-trained models rely on such non-robust features, whereas adversarially trained models extract robust features that align with human perception [22]. An unexpected but useful consequence of this property is that adversarially trained models transfer knowledge to new domains more effectively than standard-trained models [23], [24].
Inspired by these observations, we propose adversarial training in the source domain as a means of unsupervised domain adaptation in object detection. The robust features acquired from the source domain remain informative in a dissimilar target domain. Furthermore, combining adversarial training with adversarial feature learning aligns these robust features more closely with the target domain.
Proposed Method
In this section, we first formulate the problem and describe the adversarial training in the source domain for YOLOv3 [4]. We then introduce an approach for combining adversarial training and adversarial feature learning to ensure robust and target-aligned feature acquisition. The framework of our proposed method is illustrated in Fig. 2.
Framework of the proposed method.
A. Problem Setting
To implement unsupervised domain adaptation in object detection, we are given labeled data $(\boldsymbol{x}_{s}, \{y_{s}, \boldsymbol{b}_{s}\}) \sim \mathcal{D}_{s}$ from the source domain and unlabeled data $\boldsymbol{x}_{t} \sim \mathcal{D}_{t}$ from the target domain, where $\boldsymbol{x}$ denotes an image, $y_{s}$ the class labels, and $\boldsymbol{b}_{s}$ the bounding boxes. The goal is to train a detector that performs well in the target domain without using target labels.
In this study, we use YOLOv3, a well-known object detector. The objective of standard training in the source domain for YOLOv3 can be expressed as follows:\begin{equation*} \min _{F} \mathbb {E}_{ (\boldsymbol {x}_{s}, \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})\sim \mathcal {D}_{s}} [\mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})],\tag{3}\end{equation*} where $F$ denotes the detector. The detection loss $\mathcal{L}_{\mathrm{det}}$ is the sum of the classification loss $\mathcal{L}_{\mathrm{cls}}$, localization loss $\mathcal{L}_{\mathrm{loc}}$, and objectness loss $\mathcal{L}_{\mathrm{obj}}$:\begin{align*}&\hspace {-2pc}\mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \}) \\=&\mathcal {L}_{\mathrm {cls}} (F (\boldsymbol {x}_{s}), y_{s}) + \mathcal {L}_{\mathrm {loc}} (F (\boldsymbol {x}_{s}), \boldsymbol {b}_{s}) + \mathcal {L}_{\mathrm {obj}} (F (\boldsymbol {x}_{s})).\tag{4}\end{align*}
B. Adversarial Training in the Source Domain
Our main objective is to demonstrate that adversarial training in the source domain can be employed to achieve unsupervised domain adaptation. The robust features acquired through adversarially trained detectors are expected to be useful in dissimilar target domains and to improve detection accuracy there. The objective of adversarial training can be expressed as follows:\begin{equation*} \min _{F} \mathbb {E}_{ (\boldsymbol {x}_{s}, \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})\sim \mathcal {D}_{s}} [\mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})],\tag{5}\end{equation*} where $\boldsymbol{\delta}^{*}$ denotes the adversarial perturbation, generated as described below.
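As an illustration, one adversarial training step under (5) might look as follows; this is a schematic PyTorch sketch in which `detector` and `det_loss` (computing $\mathcal{L}_{\mathrm{det}}$) are hypothetical stand-ins and the perturbation is generated with the zero-initialized FGSM described next.

```python
import torch

def adv_train_step(detector, det_loss, optimizer, x_s, targets, eps):
    # Generate the adversarial perturbation delta* on the source batch.
    delta = torch.zeros_like(x_s, requires_grad=True)
    det_loss(detector(x_s + delta), targets).backward()
    delta = (eps * delta.grad.sign()).detach()

    # Update the detector on the perturbed source images (Eq. (5)).
    optimizer.zero_grad()
    loss = det_loss(detector(x_s + delta), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```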
1) FGSM
The FGSM creates an adversarial perturbation in a single gradient step. A straightforward approach is to use the gradient of the total detection loss $\mathcal{L}_{\mathrm{det}}$:\begin{align*} \tilde { \boldsymbol {\delta }}_{\mathrm {det}}=&\mathop {\mathrm {sign}}\nolimits (\nabla _{ \boldsymbol {\delta }_{0}} \mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }_{0}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})),\tag{6}\\ \boldsymbol {\delta }_{\mathrm {det}}=&\mathcal {P} [\boldsymbol {\delta }_{0} + \epsilon \cdot \tilde { \boldsymbol {\delta }}_{\mathrm {det}}],\tag{7}\end{align*} where $\boldsymbol{\delta}_{0}$ is the initial perturbation (zero- or random-initialized) and $\mathcal{P}$ projects the perturbation onto the $\epsilon$-ball.
Alternatively, one can generate adversarial perturbations from each task loss individually:\begin{align*} \tilde { \boldsymbol {\delta }}_{\mathrm {cls}}=&\mathop {\mathrm {sign}}\nolimits (\nabla _{ \boldsymbol {\delta }_{0}} \mathcal {L}_{\mathrm {cls}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }_{0}), y_{s})), \tag{8}\\ \tilde { \boldsymbol {\delta }}_{\mathrm {loc}}=&\mathop {\mathrm {sign}}\nolimits (\nabla _{ \boldsymbol {\delta }_{0}} \mathcal {L}_{\mathrm {loc}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }_{0}), \boldsymbol {b}_{s})), \tag{9}\\ \tilde { \boldsymbol {\delta }}_{\mathrm {obj}}=&\mathop {\mathrm {sign}}\nolimits (\nabla _{ \boldsymbol {\delta }_{0}} \mathcal {L}_{\mathrm {obj}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }_{0}))).\tag{10}\end{align*}
The task-specific perturbations $\boldsymbol{\delta}_{\mathrm{cls}}$, $\boldsymbol{\delta}_{\mathrm{loc}}$, and $\boldsymbol{\delta}_{\mathrm{obj}}$ are obtained by projecting (8)–(10) as in (7), and, following the multi-task approach [29], the one that maximizes the total detection loss is selected:\begin{equation*} \boldsymbol {\delta }_{\mathrm {mtl}} = \mathop {\mathrm {arg\,max}} _{ \boldsymbol {\delta } \in \left \{{ \boldsymbol {\delta }_{\mathrm {cls}}, \boldsymbol {\delta }_{\mathrm {loc}}, \boldsymbol {\delta }_{\mathrm {obj}}}\right \}} \mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \}).\tag{11}\end{equation*}
Generally, computing $\boldsymbol{\delta}_{\mathrm{mtl}}$ is more expensive than computing $\boldsymbol{\delta}_{\mathrm{det}}$ because a separate gradient computation is required for each task loss.
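As a sketch, the selection in (11) can be implemented as follows, assuming the candidate perturbations have already been generated and projected as in (7)–(10); `detector` and `det_loss` are the same hypothetical stand-ins as above.

```python
import torch

def select_mtl_delta(detector, det_loss, x_s, targets, candidates):
    # Eq. (11): choose the task-specific perturbation that
    # maximizes the total detection loss.
    best_delta, best_loss = None, float("-inf")
    with torch.no_grad():
        for delta in candidates:  # [delta_cls, delta_loc, delta_obj]
            loss = det_loss(detector(x_s + delta), targets).item()
            if loss > best_loss:
                best_delta, best_loss = delta, loss
    return best_delta
```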
2) PGD
PGD generates stronger perturbations than the FGSM by iterating gradient steps, and adversarial training using PGD is known to be effective in enhancing adversarial robustness. However, this approach is computationally expensive. With a step size parameter $\alpha$, the perturbation at iteration $t$ is updated as\begin{equation*} \boldsymbol {\delta }^{ (t+1)} = \mathcal {P} [\boldsymbol {\delta }^{ (t)} + \alpha \cdot \tilde { \boldsymbol {\delta }}^{ (t)}],\tag{12}\end{equation*} where $\tilde{\boldsymbol{\delta}}^{(t)}$ is the signed gradient in (6) evaluated at $\boldsymbol{\delta}^{(t)}$.
3) Default Perturbation $\boldsymbol{\delta}^{*}$ in Our Experiments
We employ the zero-initialized FGSM with the gradient of the total detection loss, i.e., $\boldsymbol{\delta}_{\mathrm{det}}$ in (7) with $\boldsymbol{\delta}_{0} = \boldsymbol{0}$, as the default perturbation $\boldsymbol{\delta}^{*}$; as shown in our analysis, more elaborate generation methods do not yield substantially different domain adaptation performance.
C. Robust and Target-Aligned Feature Learning
Through adversarial training in the source domain, as described above, the model is expected to learn robust features that are also informative for the target domain. However, because the model is never trained on the target domain, the robust features it acquires may still deviate from the robust features of the target domain. Therefore, we aim to enhance domain adaptation performance by aligning the robust features with the target domain.
For this purpose, we incorporate adversarial feature learning, a typical approach for implementing domain adaptation. Specifically, we employ a local feature alignment approach that matches features, such as texture and color, between the source and target domains [13]. The detector $F$ is decomposed as $F = F_{2} \circ F_{1}$, where $F_{1}$ is the feature extractor that outputs local feature maps and $F_{2}$ is the remaining detection network. A domain discriminator $D$ classifies each location of the feature map produced by $F_{1}$ as source or target, using the following least-squares losses:\begin{align*} \mathcal {L}_{\mathrm {afl}_{s}} (D (F_{1} (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}})))=&\frac {1}{WH}\sum _{w,h} D (F_{1} (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}}))_{wh}^{2},\tag{13}\\ \mathcal {L}_{\mathrm {afl}_{t}} (D (F_{1} (\boldsymbol {x}_{t})))=&\frac {1}{WH}\sum _{w,h} (1 - D (F_{1} (\boldsymbol {x}_{t}))_{wh})^{2},\tag{14}\end{align*} where $W$ and $H$ denote the width and height of the feature map and $D(\cdot)_{wh}$ is the discriminator output at location $(w, h)$.
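Assuming $D$ outputs a $W \times H$ map of per-location domain predictions, the least-squares losses in (13) and (14) reduce to simple means over the map; a minimal sketch:

```python
def afl_losses(D, F1, x_s_adv, x_t):
    # Eqs. (13)-(14): least-squares domain losses averaged over
    # every spatial location of the local feature map.
    d_s = D(F1(x_s_adv))  # predictions for perturbed source images
    d_t = D(F1(x_t))      # predictions for target images
    loss_s = (d_s ** 2).mean()          # source should be predicted as 0
    loss_t = ((1.0 - d_t) ** 2).mean()  # target should be predicted as 1
    return loss_s, loss_t
```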
Combined with the objective of adversarial training in the source domain in (5), the overall objective is expressed as follows:\begin{align*}&\hspace {-2pc}\max _{F_{1}}\min _{F,D} \mathbb {E}_{\substack { (\boldsymbol {x}_{s}, \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})\sim \mathcal {D}_{s}\\ \boldsymbol {x}_{t}\sim \mathcal {D}_{t}}} [\mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \}) \\&+\, \lambda (\mathcal {L}_{\mathrm {afl}_{s}} (D (F_{1} (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}}))) + \mathcal {L}_{\mathrm {afl}_{t}} (D (F_{1} (\boldsymbol {x}_{t}))))],\tag{15}\end{align*} where $\lambda$ balances the detection loss and the adversarial feature learning losses.
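The min-max structure of (15) is commonly realized with a gradient reversal layer (GRL), which acts as the identity in the forward pass and flips the gradient sign in the backward pass so that $F_{1}$ maximizes the discriminator losses that $D$ minimizes. A sketch under this assumption (the paper does not prescribe this particular implementation):

```python
import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated gradient in the backward
    # pass, realizing the max over F1 against the min over D in Eq. (15).
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def overall_loss(detector, F1, D, det_loss, x_s_adv, targets, x_t, lam):
    # First term of Eq. (15): detection loss on perturbed source images.
    l_det = det_loss(detector(x_s_adv), targets)
    # AFL terms with the GRL inserted between F1 and D
    # (F1 is recomputed here purely for clarity of the sketch).
    d_s = D(GradReverse.apply(F1(x_s_adv)))
    d_t = D(GradReverse.apply(F1(x_t)))
    l_afl = (d_s ** 2).mean() + ((1.0 - d_t) ** 2).mean()
    return l_det + lam * l_afl
```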
Experiments
In this section, we demonstrate the effectiveness of our approach through domain adaptation experiments conducted on benchmark datasets. In addition, we compare various methods and parameters for generating adversarial perturbations in adversarial training, and we observe the effect of the choice on the domain adaptation performance.
A. Datasets
For large domain shifts, we use PASCAL VOC [7] as the source dataset and Clipart1k, Watercolor2k, and Comic2k [32] as the target datasets. PASCAL VOC is a dataset comprising real-world images with 20 object classes. The training sets (VOC2007-trainval and VOC2012-trainval) comprise 16,551 images, and the test set (VOC2007-test) comprises 4,952 images. Clipart1k is a dataset comprising graphical images and has the same object classes as PASCAL VOC. The training set comprises 500 images, and the test set comprises 500 images. Watercolor2k and Comic2k are datasets comprising watercolor and comic images, respectively. Both datasets have six object classes, which are defined in PASCAL VOC, and they comprise 1,000 training and 1,000 test images. The appearances of objects significantly differ between the real images in the PASCAL VOC dataset and the artistic images in the Clipart1k, Watercolor2k, and Comic2k datasets.
For small domain shifts, we use Cityscapes [33] as the source dataset and FoggyCityscapes [34] as the target dataset. Cityscapes is a dataset comprising urban street scenes with eight object classes. The training set comprises 2,975 images, and the test set comprises 500 images. FoggyCityscapes is a dataset rendered from Cityscapes with fog simulation; it comprises the same number of images as the Cityscapes dataset. The weather conditions are different in the two datasets, but the appearances of the objects are similar. Examples of the datasets are shown in Fig. 3.
B. Implementation Details
In this study, we use YOLOv3 [4], a well-known object detector. The first 26 convolutional layers of Darknet-53 in YOLOv3 are used as the feature extractor $F_{1}$, and the remaining layers constitute $F_{2}$.
The test images are resized during evaluation so that the longer side is 416 pixels. We evaluate the average precision (AP) and mean AP (mAP) on the test data using an IoU threshold of 0.5. The reported results are averaged over three runs of the same training procedure. All experiments are implemented using the PyTorch framework on Ubuntu, running on a computer with an NVIDIA TITAN RTX GPU.
1) Standard Training
Only the source dataset is used, and the batch size is set to 16. We add a zero tensor to the source images instead of the adversarial perturbation $\boldsymbol{\delta}^{*}$ so that the training conditions match those of adversarial training.
2) Adversarial Training
Only the source dataset is used, and the batch size is the same as that for standard training. By default, the perturbation $\boldsymbol{\delta}^{*}$ is generated using the zero-initialized FGSM with the gradient of the total detection loss $\mathcal{L}_{\mathrm{det}}$.
3) Adversarial Feature Learning
When combined with adversarial feature learning, both the source and target datasets are used. The batch size is set to 32: 16 images from the source dataset and 16 from the target dataset. We set the weighting parameter $\lambda$ in (15); its impact on performance is analyzed below.
C. Results
1) Large Domain Shift
We first conduct experiments on adaptations in large domain shifts from real to artistic images. Specifically, adaptations from PASCAL VOC to Clipart1k, Watercolor2k, and Comic2k are evaluated.
First, we list the results on the Clipart1k dataset in Table 1. Adversarial training (AT) outperforms standard training (ST) by 7.7% in terms of mAP. In addition, AT outperforms ST combined with adversarial feature learning (ST+AFL) by 1.6%, even though the target dataset is not used for AT. AT combined with AFL (AT+AFL) outperforms the other methods for 14 classes in terms of AP and improves the mAP by 11.8% compared to that of ST. Next, we list the results on the Watercolor2k dataset in Table 2. AT and AT+AFL improve the mAP over that of ST by 6.1% and 6.4%, respectively. AT+AFL outperforms the other methods for four classes in terms of AP, although the improvement achieved through AFL is limited compared to that on other datasets. Finally, we list the results on the Comic2k dataset in Table 3. AT and AT+AFL improve the mAP over that of ST by 2.8% and 6.7%, respectively. AT+AFL outperforms the other methods for four classes in terms of AP.
In summary, adversarially trained models outperform standard-trained models, and incorporating AFL yields further improvements. Notably, the finding that AT using only the source dataset improves performance in the target domain is interesting because general domain adaptation methods utilize images from the target domain. These results can be explained as follows. In the adaptation from real to artistic images, the non-robust features acquired through ST in the source domain are not informative in the target domain because of the large domain shift; as a result, ST degrades performance in the target domain. In contrast, the robust features acquired through AT remain informative in the target domain and can thus maintain performance there. Combined with AFL, the robust features are aligned with the target domain, resulting in further performance improvement.
2) Small Domain Shift
We also evaluate the effectiveness of our proposed method on a small domain shift. Specifically, an adaptation between different weather conditions, from Cityscapes to FoggyCityscapes, is performed. The results are listed in Table 4. Contrary to the results for large domain shifts, AT and AT+AFL decrease the mAP by 17.7% and 13.7% compared to ST, respectively. In this adaptation scenario, the ST+AFL approach demonstrates the best performance in all the classes in the target domain.
For adaptation between similar domains, the non-robust features acquired through ST are also informative in the target domain. Therefore, the detection performance of standard-trained models in the target domain is highly dependent on non-robust features. In contrast, AT makes the detector rely on robust features instead of non-robust features. Because robust features are less informative than non-robust features, AT is known to result in a reduction in accuracy in the source domain [22]. Correspondingly, for small domain shifts, the application of AT results in decreased performance in the target domain.
D. Analysis
1) Quantifying the Domain Shift Magnitude
AT in the source domain improves performance when the target domain is largely shifted but degrades it when the shift is small. Therefore, quantifying the magnitude of the domain shift is necessary to determine whether our approach should be applied. The Fréchet inception distance (FID) [36], which computes the Fréchet distance (FD) between two distributions in the feature space of the Inception-v3 model [37], is a well-known measure of the difference between two image sets. We compute the FD using the feature space of the YOLOv3 model instead of the Inception-v3 model because this allows feature extraction consistent with the object detection task. In the experiments, we use feature maps extracted from the YOLOv3 backbone network pre-trained on the MSCOCO dataset.
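For reference, given two matrices of pooled backbone features (one row per image), the FD between their Gaussian fits can be computed as below; `feats_a` and `feats_b` are NumPy arrays obtained, for example, by global-average-pooling the YOLOv3 backbone feature maps (the pooling choice is our assumption, not specified above).

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    # FD between Gaussian fits N(mu_a, S_a) and N(mu_b, S_b):
    # ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)).
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    s_a = np.cov(feats_a, rowvar=False)
    s_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(s_a @ s_b, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard small numerical imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(s_a + s_b - 2.0 * covmean))
```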
First, we perform experiments on the PASCAL VOC dataset to verify the effectiveness of the FD in measuring the domain shift. To control the magnitude of the domain shift, we apply style transfer to the PASCAL VOC test set via adaptive instance normalization (AdaIN) [38], varying the content-style trade-off parameter. We then compute the FD between the PASCAL VOC training set and each stylized test set and compare it with the ratio of the mAP of AT to that of ST.
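AdaIN itself is straightforward: it re-normalizes content features with the channel-wise statistics of the style features, and the trade-off parameter interpolates between the original and stylized features. A sketch, where `content_feat` and `style_feat` are encoder feature maps of shape (N, C, H, W) and `alpha` is our name for the trade-off parameter:

```python
def adain(content_feat, style_feat, eps=1e-5):
    # Replace the channel-wise mean/std of the content features
    # with those of the style features.
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean

def stylize_features(content_feat, style_feat, alpha):
    # Content-style trade-off: alpha = 0 keeps the content features,
    # alpha = 1 applies the full style.
    t = adain(content_feat, style_feat)
    return alpha * t + (1.0 - alpha) * content_feat
```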
Examples of style transfer via AdaIN for the PASCAL VOC test set. The test set is stylized with three style images while varying the content-style trade-off parameter.
Fréchet distance (FD) between the PASCAL VOC training set and the stylized PASCAL VOC test sets, and the ratio of the mAP of AT to that of ST on the stylized test sets. The larger the FD, the better AT performs relative to ST on the stylized test sets.
Next, we compute the FD between the training set of the source domain and the test set of the target domain used in our main experiments. The results are listed in Table 5. The FDs are larger for PASCAL VOC to Clipart1k, Watercolor2k, and Comic2k, where AT improves the detection performance in the target domain, and smaller for Cityscapes to FoggyCityscapes, where AT degrades the performance. Thus, by measuring the magnitude of the domain shift in terms of the FD, it is possible to determine whether adversarial training in the source domain should be conducted.
2) Methods and Parameters for Adversarial Training
AT using the random-initialized FGSM or PGD is known to make the model substantially more robust to adversarial examples than AT using the zero-initialized FGSM [28], [31]. In addition, the perturbation magnitude $\epsilon$ controls the trade-off between robustness and accuracy. We analyze how these choices affect domain adaptation performance.
We conduct AT using the zero-initialized FGSM, random-initialized FGSM, and PGD on the PASCAL VOC dataset using the gradient of the total detection loss $\mathcal{L}_{\mathrm{det}}$, varying the value of $\epsilon$.
Results of AT using the zero-initialized FGSM, random-initialized FGSM, and PGD on the PASCAL VOC dataset with different values of $\epsilon$.
3) Loss for Generating Adversarial Perturbations
The total loss of object detection comprises several task losses, as shown in (4) for YOLOv3. Therefore, which loss is used to generate adversarial perturbations is a crucial choice. To prevent gradient misalignment between tasks, selecting the single task loss that maximizes the total loss has also been proposed [29], as shown in (11). We analyze the impact of these loss choices on domain adaptation performance.
We conduct AT using the zero-initialized FGSM on the PASCAL VOC dataset by varying the loss used to generate the perturbation, i.e., $\boldsymbol{\delta}_{\mathrm{det}}$, $\boldsymbol{\delta}_{\mathrm{cls}}$, $\boldsymbol{\delta}_{\mathrm{loc}}$, $\boldsymbol{\delta}_{\mathrm{obj}}$, and $\boldsymbol{\delta}_{\mathrm{mtl}}$.
4) Balance Between Adversarial Training and Adversarial Feature Learning
We analyze the impact of the parameter $\lambda$ in (15), which balances adversarial training and adversarial feature learning, on the domain adaptation performance.
5) Comparison With the State-of-the-Art Domain Adaptation Methods
To demonstrate the advantages of our method, we compare AT+AFL with state-of-the-art domain adaptation methods: the Implicit-Instance-Invariant Network (I3NET) [14], the Unbiased Mean Teacher (UMT) [20], and the Scale-Aware Domain Adaptive Faster RCNN (SA-DA-Faster) [15]. Table 8 shows the comparisons for PASCAL VOC to Clipart1k, Watercolor2k, and Comic2k adaptation. The results of I3NET, UMT, and SA-DA-Faster are cited from the original papers. Note that this comparison is not strictly fair because the detection architectures and backbone networks differ across methods. Our method outperforms the state-of-the-art results on Clipart1k and Comic2k by 5.3% and 3.3%, respectively. On the other hand, UMT outperforms our method by 2.5% on Watercolor2k, where AFL is less effective. As shown in Table 5, the FD for Clipart1k and Comic2k is larger than that for Watercolor2k, indicating that our method is likely to outperform existing methods when the domain shift is larger. Note also that domain adaptation through AT is a new approach and can be further combined with these state-of-the-art methods to improve performance.
Conclusion
In this study, we explored the implementation of unsupervised domain adaptation in object detection. Further, we proposed a method based on adversarial training in the source domain. To the best of our knowledge, this is the first application of adversarial training in unsupervised domain adaptation. The robust features acquired using adversarially trained detectors are informative in a largely shifted target domain, thereby resulting in improved detection performance. In contrast, for small domain shifts where the non-robust features acquired through standard training are informative in both domains, adversarially trained detectors degrade performance in the target domain. We also proposed a method for aligning the robust features with the target domain through adversarial feature learning. Using this approach, we demonstrated further improved performance for large domain shifts.