Adversarially Trained Object Detector for Unsupervised Domain Adaptation

Abstract:

Unsupervised domain adaptation, which involves transferring knowledge from a label-rich source domain to an unlabeled target domain, can be used to reduce annotation costs in the field of object detection substantially. This study demonstrates that adversarial training in the source domain can be employed as a new approach for unsupervised domain adaptation. Specifically, we establish that adversarially trained detectors achieve improved detection performance in target domains that are significantly shifted from source domains. This phenomenon is attributed to the fact that adversarially trained detectors can be used to extract robust features that align with human perception and are worth transferring across domains while discarding domain-specific non-robust features. In addition, we propose a method that combines adversarial training and feature alignment to ensure the improved alignment of robust features with the target domain. We conduct experiments on four benchmark datasets and confirm the effectiveness of our proposed approach on large domain shifts from real to artistic images. Compared to the baseline models, the adversarially trained detectors improve the mean average precision by up to 7.7%, and further by up to 11.8% when feature alignments are incorporated. Although our method degrades the performance for small domain shifts, quantifying the domain shift based on the Fréchet distance allows us to determine whether adversarial training should be conducted.

Published in: IEEE Access (Volume: 10)

Page(s): 59534 - 59543

Date of Publication: 06 June 2022

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2022.3180344

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

In computer vision, object detection is a fundamental task, which involves localizing and classifying objects in an image. Advancements in deep learning have resulted in various types of object detectors being proposed [1]–[6]. In most cases, they require supervised learning on a large amount of annotated data [7], [8]. Furthermore, the training and test data must belong to the same domain to achieve the expected performance.

However, domain shifts resulting from changes in weather, painting style, and other factors often occur in practical applications, thereby resulting in a loss of accuracy. In object detection tasks, bounding boxes are required for all the objects in the images during annotation. Therefore, creating a new training dataset in the shifted domain is impractical. An effective solution to this issue is domain adaptation, which involves transferring knowledge from a label-rich source domain to a label-poor or unlabeled target domain [9]. Specifically, unsupervised domain adaptation assumes that the target domain has no labels [10]. In recent studies, several approaches have been proposed for implementing unsupervised domain adaptation in object detection tasks [11]. The most common approach is adversarial feature learning, which involves aligning the source and target features using a feature extractor competing with a domain discriminator [12]–[15]. Other approaches, such as pseudo-labeling in the target domain [16], [17] and image-to-image translation [18]–[20], have also been proposed.

In this study, we explore the application of unsupervised domain adaptation in object detection. Further, we demonstrate that learning robust features in the source domain through adversarial training enhances object detection in the target domain with a large domain shift. Recent studies on adversarial training have revealed the existence of non-robust and robust features [21]. The former are sensitive to perturbations but are still necessary for attaining high accuracy. The latter are highly stable and close to human perception [22]. Robust and non-robust features are visualized in Fig. 1. We hypothesize that, for domain adaptation, non-robust features are highly domain-specific and thus susceptible to domain shifts, whereas robust features are informative in both the source and target domains. This idea is inspired by recent studies showing that adversarially trained models demonstrate improved transfer performance compared to standard-trained models [23], [24]. These studies focus on transfer learning in cases where the target domain has a small number of labels. In contrast, we focus on unsupervised domain adaptation, where the target domain has no labels. In addition to learning robust features through adversarial training, to ensure the increased alignment of such features with the target domain, we propose a novel approach that combines adversarial training and adversarial feature learning [13].

FIGURE 1.

Visualization of robust and non-robust features. The standard trained model in the source domain is highly dependent on non-robust features, which are not informative for the largely shifted target domain. In contrast, the robust features acquired by the adversarially trained model are informative even in the largely shifted target domain.

In our experiments on the benchmark datasets of real to artistic image adaptation, the adversarially trained detector improves the mean average precision by up to 7.7% compared to that of the standard-trained detector. When combined with adversarial feature learning, the improvement in mean average precision reaches 11.8%. Although our method degrades the performance for small domain shifts such as different weather conditions, quantifying the domain shift using the Fréchet distance allows us to predict domain adaptation performance with adversarial training in advance. In addition, we analyze various adversarial training methods for object detection. We demonstrate that several techniques proposed to improve robustness against adversarial examples do not differ substantially from the simplest adversarial training method when applied to unsupervised domain adaptation.

The contributions of this study are as follows:

To the best of our knowledge, this is the first study on the effectiveness of adversarial training in unsupervised domain adaptation. We establish that, for large domain shifts, adversarially trained detectors achieve higher accuracy in the target domain than standard-trained detectors.
We propose a method that combines adversarial training with adversarial feature learning to ensure the enhanced alignment of the source and target features. Experimental results show that our proposed method achieves improved domain adaptation performance compared to approaches that solely rely on adversarial training.
We introduce a quantification of the domain shift using the Fréchet distance, which allows us to predict the domain adaptation performance with adversarial training.
We show that several adversarial training methods that have been proposed to improve robustness against adversarial examples do not differ substantially in terms of performance with respect to unsupervised domain adaptation.

SECTION II.

Related Work

This section reviews the literature on object detection, domain adaptation, and adversarial training studies.

A. Object Detection

Object detection, like image classification, is a fundamental task in computer vision. Many object detectors have achieved high accuracy due to advancements in deep neural networks [1]–[6]. Most of them rely on supervised learning using large annotated datasets, such as PASCAL Visual Object Classes (VOC) [7] and Microsoft Common Objects in Context (MSCOCO) [8]. Generally, creating a new dataset for object detection is more time-consuming than creating one for image classification because it requires instance-level annotations. In this study, we use You Only Look Once v3 (YOLOv3), which is a well-known object detector with excellent inference speed and accuracy [4].

B. Domain Adaptation

Domain adaptation is a technique for adapting a model trained using a label-rich domain to a label-poor domain. Recently, unsupervised domain adaptation has attracted significant attention in computer vision tasks, such as image classification and semantic segmentation [10].

Many domain adaptation approaches have also been proposed for object detection [11]. Typical approaches include adversarial feature learning [12]–[15], pseudo-label-based self-training [16], [17], and image-to-image translation [18]–[20]. Adversarial feature learning employs an adversarial objective between the domain discriminator and feature extractor [9]. The domain discriminator attempts to accurately classify the source and target images, whereas the feature extractor attempts to fool the domain discriminator. As a result, the model can extract similar features from the source and target domains. The pseudo-label-based self-training approach trains the model by assigning pseudo-labels to the target images based on the knowledge obtained from the source domain. Image-to-image translation converts the source images into target-like images using CycleGAN [25] or similar methods. The model is then trained using the converted images and the original labels obtained from the source domain.

We propose a new method based on adversarial training for unsupervised domain adaptation in object detection. Note that adversarial training is different from adversarial feature learning. As mentioned above, adversarial feature learning is a domain adaptation method in which the feature extractor competes with the domain discriminator to obtain similar features from the source and target domains. On the other hand, adversarial training was originally proposed to increase the robustness of deep neural networks against adversarial examples. To the best of our knowledge, our study is the first one to apply adversarial training to unsupervised domain adaptation in object detection.

C. Adversarial Training

One of the vulnerabilities of deep neural network-based models is the existence of adversarial examples, i.e., inputs that are perturbed so as to cause such models to make mistakes [26]. During adversarial training, adversarial perturbations are added to the training data to ensure that the model is robust against input perturbations. The most typical methods for creating adversarial perturbations are the fast gradient sign method (FGSM) [27] and projected gradient descent (PGD) [28]. Let $x$ be the input, $y$ the label associated with $x$, $f$ the model, and $\mathcal {L}$ the loss. FGSM generates $L_{\infty }$-bounded perturbations that approximately maximize the loss in a single step as follows:

$\begin{equation*} \delta = \epsilon \cdot \mathop {\mathrm {sign}}\nolimits (\nabla _{x} \mathcal {L} (f (x), y)),\tag{1}\end{equation*}$

where $\epsilon$ is the budget of the adversarial manipulation. In contrast, PGD creates stronger perturbations by making better approximations through iterative updates with a small step size $\alpha$ as follows:

$\begin{equation*} \delta ^{ (t+1)} = \mathcal {P}[\delta ^{ (t)} + \alpha \cdot \mathop {\mathrm {sign}}\nolimits (\nabla _{x} \mathcal {L} (f (x), y))],\tag{2}\end{equation*}$

where $\mathcal {P}$ denotes the projection onto the $L_{\infty }$-norm $\epsilon$-ball $\left \{{\delta \mid \left \|{\delta }\right \|_\infty \leq \epsilon }\right \}$. Such methods mainly focus on image classifiers. However, studies on the adversarial training of object detectors have also been conducted from the perspective of multi-task learning in object detection [29], [30].
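As a rough illustration of (1) and (2), the following Python sketch applies FGSM and PGD to a toy problem. The linear scorer, the logistic loss, and all numerical values are assumptions chosen only so that the input gradient can be written analytically; they are not the models or settings used in this study. In the PGD loop, the gradient is evaluated at the currently perturbed input, as in the standard PGD formulation [28].

```python
import math

def loss(w, x, y):
    """Logistic loss of a linear scorer (toy stand-in for L(f(x), y)); y is +1 or -1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))

def grad_x(w, x, y):
    """Analytic gradient of the toy loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]

def sign(v):
    """Sign of a scalar: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def project(delta, eps):
    """Projection onto the L-infinity ball of radius eps (element-wise clipping)."""
    return [max(-eps, min(eps, d)) for d in delta]

def fgsm(w, x, y, eps):
    """Single-step perturbation in the spirit of Eq. (1)."""
    return [eps * sign(g) for g in grad_x(w, x, y)]

def pgd(w, x, y, eps, alpha, steps):
    """Iterative perturbation in the spirit of Eq. (2), starting from zero."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        g = grad_x(w, x_adv, y)  # gradient at the perturbed input
        stepped = [di + alpha * sign(gi) for di, gi in zip(delta, g)]
        delta = project(stepped, eps)
    return delta

# Hypothetical toy values, for illustration only.
w, x, y = [1.0, -2.0, 0.5], [0.3, 0.1, -0.4], 1
eps = 8 / 255
d_fgsm = fgsm(w, x, y, eps)
d_pgd = pgd(w, x, y, eps, alpha=eps / 4, steps=10)
```

Both perturbations stay inside the $\epsilon$-ball by construction, and adding either one to the input increases the toy loss.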

In a recent study, the researchers demonstrated that adversarial examples result from the presence of non-robust features that are highly predictive but imperceptible to humans [21]. Standard-trained models rely on such non-robust features, whereas adversarially trained models extract robust features that are aligned with human perception [22]. This property has an unintended but useful consequence: adversarially trained models transfer knowledge to new domains more effectively than standard-trained models [23], [24].

Inspired by the observations presented above, we propose adversarial training in the source domain to implement unsupervised domain adaptation in object detection. The robust features acquired from the source domain are informative in the dissimilar target domain. Furthermore, the enhanced alignment of robust features with the target domain can be achieved through adversarial training combined with adversarial feature learning.

SECTION III.

Proposed Method

In this section, we first formulate the problem and describe the adversarial training in the source domain for YOLOv3 [4]. We then introduce an approach for combining adversarial training and adversarial feature learning to ensure robust and target-aligned feature acquisition. The framework of our proposed method is illustrated in Fig. 2.

FIGURE 2.

Framework of the proposed method. $F_{1}$ and $F_{2}$ are object detection networks, $D$ is a domain discriminator, and GRL is a gradient reversal layer. First, we propagate the source images with the initial perturbations $\boldsymbol {\delta }_{0}$ and then compute the adversarial perturbations $\boldsymbol {\delta ^{*}}$ using the gradients of the losses. Then, adversarial training on the source images perturbed by $\boldsymbol {\delta ^{*}}$ and adversarial feature learning on the source and target images are performed.

A. Problem Setting

To implement unsupervised domain adaptation in object detection, we obtain labeled data $(\boldsymbol {x}_{s}, \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})$ from the source domain $\mathcal {D}_{s}$ and unlabeled data $\boldsymbol {x}_{t}$ from the target domain $\mathcal {D}_{t}$ . Here, $\boldsymbol {x}_{s}$ and $\boldsymbol {x}_{t}$ represent the input images, $y_{s}$ represents the class label, and $\boldsymbol {b}_{s}$ represents the bounding box. Generally, two domains, $\mathcal {D}_{s}$ and $\mathcal {D}_{t}$ , have different data distributions. The goal of domain adaptation is to improve the detection performance in the target domain $\mathcal {D}_{t}$ using the labeled data in the source domain and the unlabeled data in the target domain. To avoid notational clutter, we use $\mathcal {D}_{s}$ and $\mathcal {D}_{t}$ to denote the data distributions of the source and target domains, respectively.

In this study, we use YOLOv3, a well-known object detector. The objective of standard training in the source domain for YOLOv3 can be expressed as follows:

$\begin{equation*} \min _{F} \mathbb {E}_{ (\boldsymbol {x}_{s}, \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})\sim \mathcal {D}_{s}} [\mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})],\tag{3}\end{equation*}$

where $\mathcal {L}_{\mathrm {det}}$ denotes the detection loss, and $F$ denotes the YOLOv3 network. Because $F$ outputs the class prediction, bounding box prediction, and objectness score, $\mathcal {L}_{\mathrm {det}}$ can be decomposed into the classification loss, localization loss, and objectness loss as follows:

$\begin{align*}&\hspace {-2pc}\mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \}) \\=&\mathcal {L}_{\mathrm {cls}} (F (\boldsymbol {x}_{s}), y_{s}) + \mathcal {L}_{\mathrm {loc}} (F (\boldsymbol {x}_{s}), \boldsymbol {b}_{s}) + \mathcal {L}_{\mathrm {obj}} (F (\boldsymbol {x}_{s})).\tag{4}\end{align*}$

Here, $\mathcal {L}_{\mathrm {cls}}$ is used to measure the difference between the predicted and ground-truth classes, $\mathcal {L}_{\mathrm {loc}}$ is used to measure the misalignment between the predicted and ground-truth boxes, and $\mathcal {L}_{\mathrm {obj}}$ is used to verify the existence of the predicted objects.

B. Adversarial Training in the Source Domain

Our main objective is to demonstrate that adversarial training in the source domain can be employed to achieve unsupervised domain adaptation. The robust features acquired through adversarially trained detectors are expected to be useful for dissimilar target domains and improve detection accuracy in the target domain. The objective of adversarial training can be expressed as follows:

$\begin{equation*} \min _{F} \mathbb {E}_{ (\boldsymbol {x}_{s}, \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})\sim \mathcal {D}_{s}} [\mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})],\tag{5}\end{equation*}$

where $\boldsymbol {\delta ^{*}}$ represents the adversarial perturbation. $\boldsymbol {\delta ^{*}}$ is designed to cause the detector to make mistakes, and it is usually too small to be perceived by humans. Therefore, by training with the objective in (5), the detector is encouraged to depend on the robust features that are aligned with human perception. We now introduce several designs of the perturbation $\boldsymbol {\delta ^{*}}$ for YOLOv3 based on the FGSM [27] and PGD [28]. We then describe the $\boldsymbol {\delta ^{*}}$ used in our experiments.

1) FGSM

The FGSM creates an adversarial perturbation in a single gradient step. A straightforward approach for generating an adversarial perturbation involves using the gradient of $\mathcal {L}_{\mathrm {det}}$ as follows:

$\begin{align*} \tilde { \boldsymbol {\delta }}_{\mathrm {det}}=&\mathop {\mathrm {sign}}\nolimits (\nabla _{ \boldsymbol {\delta }_{0}} \mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }_{0}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})),\tag{6}\\ \boldsymbol {\delta }_{\mathrm {det}}=&\mathcal {P} [\boldsymbol {\delta }_{0} + \epsilon \cdot \tilde { \boldsymbol {\delta }}_{\mathrm {det}}],\tag{7}\end{align*}$

where $\mathcal {P}$ denotes the projection onto the $L_{\infty }$-norm $\epsilon$-ball $\left \{{ \boldsymbol {\delta }\mid \left \|{ \boldsymbol {\delta }}\right \|_\infty \leq \epsilon }\right \}$ for some $\epsilon > 0$, and $\boldsymbol {\delta }_{0}$ represents the initial value of the perturbation. As shown in (6), $\tilde { \boldsymbol {\delta }}_{\textrm {det}}$ is calculated as a signed gradient of $\mathcal {L}_{\mathrm {det}}$ with respect to $\boldsymbol {\delta }_{0}$. The adversarial perturbation $\boldsymbol {\delta }_{\textrm {det}}$ is then obtained using (7).

Alternatively, one can generate adversarial perturbations $\boldsymbol {\delta }_{\mathrm {cls}}$ , $\boldsymbol {\delta }_{\mathrm {loc}}$ , and $\boldsymbol {\delta }_{\mathrm {obj}}$ based on the three task losses presented in (4) in a similar manner. First, $\tilde { \boldsymbol {\delta }}_{\mathrm {cls}}$ , $\tilde { \boldsymbol {\delta }}_{\mathrm {loc}}$ , and $\tilde { \boldsymbol {\delta }}_{\mathrm {obj}}$ are generated as follows:

$\begin{align*} \tilde { \boldsymbol {\delta }}_{\mathrm {cls}}=&\mathop {\mathrm {sign}}\nolimits (\nabla _{ \boldsymbol {\delta }_{0}} \mathcal {L}_{\mathrm {cls}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }_{0}), y_{s})), \tag{8}\\ \tilde { \boldsymbol {\delta }}_{\mathrm {loc}}=&\mathop {\mathrm {sign}}\nolimits (\nabla _{ \boldsymbol {\delta }_{0}} \mathcal {L}_{\mathrm {loc}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }_{0}), \boldsymbol {b}_{s})), \tag{9}\\ \tilde { \boldsymbol {\delta }}_{\mathrm {obj}}=&\mathop {\mathrm {sign}}\nolimits (\nabla _{ \boldsymbol {\delta }_{0}} \mathcal {L}_{\mathrm {obj}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }_{0}))).\tag{10}\end{align*}$

The final perturbations are then obtained as shown in (7). From the perspective of multi-task learning in object detection, [29] showed that the direct use of $\mathcal {L}_{\mathrm {det}}$, as shown in (6), results in gradient misalignment between tasks, thereby causing decreased robustness against adversarial examples. To avoid this problem, they proposed an adversarial training method, which selects a single task perturbation that maximizes $\mathcal {L}_{\mathrm {det}}$. Hereinafter, we denote this perturbation as $\boldsymbol {\delta }_{\mathrm {mtl}}$, which is generated for YOLOv3 as follows:

$\begin{equation*} \boldsymbol {\delta }_{\mathrm {mtl}} = \mathop {\mathrm {arg\,max}} _{ \boldsymbol {\delta } \in \left \{{ \boldsymbol {\delta }_{\mathrm {cls}}, \boldsymbol {\delta }_{\mathrm {loc}}, \boldsymbol {\delta }_{\mathrm {obj}}}\right \}} \mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta }), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \}).\tag{11}\end{equation*}$

Generally, $\boldsymbol {\delta }_{0}$ is set to zero for the FGSM, which is referred to as the zero-initialized FGSM in this study. However, a recent study showed that initializing $\boldsymbol {\delta }_{0}$ using a random value uniformly sampled from $[-\epsilon, \epsilon]$ results in enhanced robustness against adversarial examples [31]. We refer to this as the random-initialized FGSM.
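The FGSM variants described in this subsection can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the three task losses below are hypothetical toy functions of a two-dimensional input with hand-written gradients, standing in for $\mathcal {L}_{\mathrm {cls}}$, $\mathcal {L}_{\mathrm {loc}}$, and $\mathcal {L}_{\mathrm {obj}}$ of an actual detector, and the helper names (`fgsm_step`, `init_delta`, `select_mtl`) are introduced here for illustration only.

```python
import random

def sign(v):
    """Sign of a scalar: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def project(delta, eps):
    """Projection onto the L-infinity ball of radius eps."""
    return [max(-eps, min(eps, d)) for d in delta]

def init_delta(n, eps, randomized, rng=None):
    """Zero initialization, or uniform sampling from [-eps, eps]."""
    if not randomized:
        return [0.0] * n
    source = rng if rng is not None else random
    return [source.uniform(-eps, eps) for _ in range(n)]

def fgsm_step(grad_fn, x, delta0, eps):
    """Signed gradient at x + delta0, one step of size eps, then projection (Eqs. (6)-(7))."""
    x0 = [xi + di for xi, di in zip(x, delta0)]
    step = [sign(g) for g in grad_fn(x0)]
    return project([d0 + eps * s for d0, s in zip(delta0, step)], eps)

def select_mtl(task_grads, det_loss, x, delta0, eps):
    """Among the task-wise perturbations, keep the one that maximizes L_det (Eq. (11))."""
    candidates = {name: fgsm_step(g, x, delta0, eps) for name, g in task_grads.items()}

    def perturbed_loss(name):
        return det_loss([xi + di for xi, di in zip(x, candidates[name])])

    best = max(candidates, key=perturbed_loss)
    return best, candidates[best]

# Hypothetical toy task losses and their gradients (not detector losses).
def det_loss(x):
    return (x[0] - 1.0) ** 2 + 3.0 * x[1] + 0.5 * (x[0] + x[1])

task_grads = {
    "cls": lambda x: [2.0 * (x[0] - 1.0), 0.0],
    "loc": lambda x: [0.0, 3.0],
    "obj": lambda x: [0.5, 0.5],
}

x, eps = [0.0, 0.0], 0.1
best, delta_mtl = select_mtl(task_grads, det_loss, x, init_delta(2, eps, False), eps)
delta_rand = fgsm_step(task_grads["cls"], x, init_delta(2, eps, True, random.Random(0)), eps)
```

Passing `randomized=True` to `init_delta` corresponds to the random-initialized FGSM, and the projection keeps the result inside the $\epsilon$-ball regardless of the starting point.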

2) PGD

PGD generates stronger perturbations than those generated using the FGSM by iterating the gradient steps. Adversarial training using PGD is known to be effective in enhancing adversarial robustness. However, this approach is computationally expensive. With a step size parameter $\alpha$ , the generation of adversarial perturbation using PGD can be expressed as follows:

$\begin{equation*} \boldsymbol {\delta }^{ (t+1)} = \mathcal {P} [\boldsymbol {\delta }^{ (t)} + \alpha \cdot \tilde { \boldsymbol {\delta }}^{ (t)}].\tag{12}\end{equation*}$

Here, $\tilde { \boldsymbol {\delta }}^{ (t)}$ can be computed using arbitrary losses in object detection, as described for the FGSM.

3) Default Perturbation $\boldsymbol{\delta^{*}}$ in Our Experiments

We employ $\boldsymbol {\delta }_{\mathrm {det}}$ generated using the zero-initialized FGSM as the default $\boldsymbol {\delta ^{*}}$ in our main experiments, because this is the simplest strategy for adversarial training. In Section IV-D, comparisons among the zero-initialized FGSM, random-initialized FGSM, and PGD are conducted, as well as a comparison of the losses used to generate perturbations. Although PGD, the random-initialized FGSM, and the use of $\boldsymbol {\delta }_{\mathrm {mtl}}$ are known to enhance robustness to adversarial examples [28], [29], [31], we establish that the simplest adversarial training method, i.e., the zero-initialized FGSM with $\boldsymbol {\delta }_{\mathrm {det}}$ , is sufficient in terms of performance with respect to domain adaptation.

C. Robust and Target-Aligned Feature Learning

Through adversarial training in the source domain, as described above, the model is expected to learn robust features that are also informative to the target domain. However, since the model is not trained in the target domain, the robust features acquired through the model are discrepant from the robust features in the target domain. Therefore, we aim to enhance domain adaptation performance by aligning the robust features to the target domain.

For this purpose, we incorporate adversarial feature learning, which is a typical approach for implementing domain adaptation. Specifically, we employ a local feature alignment approach that matches features, such as texture and color, between the source and target domains [13]. In this study, the detector $F$ is decomposed as $F_{2} \circ F_{1}$, where $F_{1}$ represents the first several dozen layers of the network, and $F_{2}$ represents the rest of the layers in the network. The output of $F_{1}$ is fed to the domain discriminator $D$ through a gradient reversal layer [9]. The feature extractor $F_{1}$ outputs a feature map of width $W$ and height $H$, and the domain discriminator $D$ outputs a domain prediction map whose width and height are the same as those of the input from $F_{1}$. In our setting, the domain discriminator $D$ aims to ensure that the domain predictions for the source images are equal to zero and that those for the target images are equal to one. In contrast, the feature extractor $F_{1}$ is trained so that the domain predictions are opposite to those that $D$ aims for. Owing to the gradient reversal layer, the losses of adversarial feature learning can be summarized as follows:

$\begin{align*} \mathcal {L}_{\mathrm {afl}_{s}} (D (F_{1} (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}})))=&\frac {1}{WH}\sum _{w,h} D (F_{1} (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}}))_{wh}^{2},\tag{13}\\ \mathcal {L}_{\mathrm {afl}_{t}} (D (F_{1} (\boldsymbol {x}_{t})))=&\frac {1}{WH}\sum _{w,h} (1 - D (F_{1} (\boldsymbol {x}_{t}))_{wh})^{2},\tag{14}\end{align*}$

where $D (\cdot)_{wh}$ denotes the $(w, h)$-th entry of the outputs of $D$. Note that we add an adversarial perturbation $\boldsymbol {\delta ^{*}}$ to the source image $\boldsymbol {x}_{s}$, as shown in (13).

Combined with the objective of adversarial training in the source domain, (5), the overall objective is expressed as follows:

$\begin{align*}&\hspace {-2pc}\max _{F_{1}}\min _{F,D} \mathbb {E}_{\substack { (\boldsymbol {x}_{s}, \left \{{y_{s}, \boldsymbol {b}_{s}}\right \})\sim \mathcal {D}_{s}\\ \boldsymbol {x}_{t}\sim \mathcal {D}_{t}}} [\mathcal {L}_{\mathrm {det}} (F (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}}), \left \{{y_{s}, \boldsymbol {b}_{s}}\right \}) \\&+\, \lambda (\mathcal {L}_{\mathrm {afl}_{s}} (D (F_{1} (\boldsymbol {x}_{s} + \boldsymbol {\delta ^{*}}))) + \mathcal {L}_{\mathrm {afl}_{t}} (D (F_{1} (\boldsymbol {x}_{t}))))],\tag{15}\end{align*}$

where $\lambda$ represents the weight required to ensure the balance between adversarial training and adversarial feature learning. The signs of the gradients back-propagated from $D$ to $F_{1}$ are reversed through the gradient reversal layer.
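To make the loss terms in (13)-(15) concrete, the sketch below evaluates them on small hand-written domain prediction maps. It is illustrative only: the maps are plain nested lists standing in for the outputs of $D$, the function names are hypothetical, and `reverse_gradient` merely mimics the backward behavior commonly attributed to a gradient reversal layer (identity in the forward pass, sign flip in the backward pass) rather than an actual autograd implementation.

```python
def afl_source_loss(pred_map):
    """Mean squared domain prediction for a source image, pushing D toward 0 (Eq. (13))."""
    height, width = len(pred_map), len(pred_map[0])
    return sum(p ** 2 for row in pred_map for p in row) / (width * height)

def afl_target_loss(pred_map):
    """Mean squared distance of the domain prediction from 1 for a target image (Eq. (14))."""
    height, width = len(pred_map), len(pred_map[0])
    return sum((1.0 - p) ** 2 for row in pred_map for p in row) / (width * height)

def combined_loss(det_loss, source_map, target_map, lam=1.0):
    """Bracketed term of Eq. (15): detection loss plus weighted alignment losses."""
    return det_loss + lam * (afl_source_loss(source_map) + afl_target_loss(target_map))

def reverse_gradient(grad):
    """Backward behavior of a gradient reversal layer: flip the sign of each entry."""
    return [-g for g in grad]

# Hypothetical 2x2 prediction maps from an undecided discriminator.
undecided = [[0.5, 0.5], [0.5, 0.5]]
total = combined_loss(det_loss=1.0, source_map=undecided, target_map=undecided, lam=1.0)
```

A discriminator that outputs zero everywhere on source images and one everywhere on target images drives both alignment losses to zero, whereas the reversed gradients push $F_{1}$ toward features for which $D$ cannot reach that state.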

SECTION IV.

Experiments

In this section, we demonstrate the effectiveness of our approach through domain adaptation experiments conducted on benchmark datasets. In addition, we compare various methods and parameters for generating adversarial perturbations in adversarial training, and we observe the effect of the choice on the domain adaptation performance.

A. Datasets

For large domain shifts, we use PASCAL VOC [7] as the source dataset and Clipart1k, Watercolor2k, and Comic2k [32] as the target datasets. PASCAL VOC is a dataset comprising real-world images with 20 object classes. The training sets (VOC2007-trainval and VOC2012-trainval) comprise 16,551 images, and the test set (VOC2007-test) comprises 4,952 images. Clipart1k is a dataset comprising graphical images and has the same object classes as PASCAL VOC. The training set comprises 500 images, and the test set comprises 500 images. Watercolor2k and Comic2k are datasets comprising watercolor and comic images, respectively. Both datasets have six object classes, which are defined in PASCAL VOC, and they comprise 1,000 training and 1,000 test images. The appearances of objects significantly differ between the real images in the PASCAL VOC dataset and the artistic images in the Clipart1k, Watercolor2k, and Comic2k datasets.

For small domain shifts, we use Cityscapes [33] as the source dataset and FoggyCityscapes [34] as the target dataset. Cityscapes is a dataset comprising urban street scenes with eight object classes. The training set comprises 2,975 images, and the test set comprises 500 images. FoggyCityscapes is a dataset rendered from Cityscapes with fog simulation; it comprises the same number of images as the Cityscapes dataset. The weather conditions are different in the two datasets, but the appearances of the objects are similar. Examples of the datasets are shown in Fig. 3.

FIGURE 3.

Examples of the datasets used in the experiments.

B. Implementation Details

In this study, we use YOLOv3 [4], which is a well-known object detector. The first 26 convolutional layers of Darknet-53 in YOLOv3 are used as the feature extractor $F_{1}$, which is introduced in Section III-C, and the rest of the network is used as $F_{2}$. The domain discriminator $D$ is designed following the original local domain classifier [13]. Mosaic data augmentation [35] is applied to the training images, which are resized to $416\times 416$ pixels. In all the experiments, the weights of a model pre-trained on the MSCOCO [8] dataset are used for initialization. We train the models for 50 epochs, where the length of an epoch is determined by the size of the source dataset. The optimizer is stochastic gradient descent with a momentum of 0.937 and a weight decay of $5.0 \times 10^{-4}$. The learning rate decreases from $1.0 \times 10^{-2}$ to $2.0 \times 10^{-3}$ through a cosine annealing schedule, and linear warmup is used for the first three epochs.
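One possible reading of the learning rate schedule is sketched below in Python. The text fixes only the endpoints ($1.0 \times 10^{-2}$ and $2.0 \times 10^{-3}$), the cosine shape, the 50 epochs, and the three warmup epochs; the per-epoch granularity, the starting value of the warmup ramp, and the choice to begin the cosine decay right after warmup are assumptions made for this illustration, and `learning_rate` is a hypothetical helper.

```python
import math

def learning_rate(epoch, total_epochs=50, lr_max=1.0e-2, lr_min=2.0e-3, warmup_epochs=3):
    """Per-epoch learning rate: linear warmup followed by cosine annealing."""
    if epoch < warmup_epochs:
        # Assumption: ramp linearly so that lr_max is reached in the last warmup epoch.
        return lr_max * (epoch + 1) / warmup_epochs
    # Fraction of the decay phase completed, from 0 (start) to 1 (last epoch).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - 1 - warmup_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(50)]
```

Under these assumptions, the schedule rises during the first three epochs, peaks at the maximum rate, and decays smoothly to the minimum rate in the final epoch.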

The test images are resized during the evaluation so that the longer side is 416 pixels. We evaluate the average precision (AP) and mean AP (mAP) on the test data using an IoU threshold of 0.5. The reported results are averages over three training runs. All the experiments are implemented using the PyTorch framework installed on the Ubuntu operating system running on a computer with an NVIDIA TITAN RTX GPU.
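For reference, the IoU criterion used when matching predictions to ground-truth boxes can be sketched as follows. The corner-based box format `(x_min, y_min, x_max, y_max)` and the helper names are assumptions for illustration, and whether the comparison with the threshold is strict depends on the evaluation code, which is not specified here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def matches(pred_box, gt_box, threshold=0.5):
    """A prediction can count as a true positive only if its IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold
```

For example, `matches((0, 0, 10, 10), (0, 0, 10, 6))` holds because the two boxes overlap with an IoU of 0.6, whereas two disjoint boxes never match.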

1) Standard Training

Only the source dataset is used, and the batch size is set to 16. We add a zero tensor to the source image instead of $\boldsymbol {\delta ^{*}}$ , which is presented in (5).

2) Adversarial Training

Only the source dataset is used, and the batch size is the same as that for standard training. By default, $\boldsymbol {\delta }_{\mathrm {det}}$ generated using the zero-initialized FGSM with $\epsilon =1/255$ is used as $\boldsymbol {\delta ^{*}}$. For detailed analysis, the method and parameters for generating $\boldsymbol {\delta ^{*}}$ are modified as required.

3) Adversarial Feature Learning

When combined with adversarial feature learning, both the source and target datasets are used. The batch size is set to 32: 16 from the source dataset and 16 from the target dataset. We set $\lambda =1.0$ in (15).

C. Results

1) Large Domain Shift

We first conduct experiments on adaptations in large domain shifts from real to artistic images. Specifically, adaptations from PASCAL VOC to Clipart1k, Watercolor2k, and Comic2k are evaluated.

First, we list the results on the Clipart1k dataset in Table 1. Adversarial training (AT) outperforms standard training (ST) by 7.7% in terms of mAP. In addition, AT outperforms ST combined with adversarial feature learning (ST+AFL) by 1.6%, even though the target dataset is not used for AT. AT combined with AFL (AT+AFL) outperforms the other methods for 14 classes in terms of AP and improves the mAP by 11.8% compared to that of ST. Next, we list the results on the Watercolor2k dataset in Table 2. AT and AT+AFL improve the mAP over that of ST by 6.1% and 6.4%, respectively. AT+AFL outperforms the other methods for four classes in terms of AP, although the improvement achieved through AFL is limited compared to that on other datasets. Finally, we list the results on the Comic2k dataset in Table 3. AT and AT+AFL improve the mAP over that of ST by 2.8% and 6.7%, respectively. AT+AFL outperforms the other methods for four classes in terms of AP.

TABLE 1 Results of Adaptation From PASCAL VOC to Clipart1k. AP (%) is Reported on the Clipart1k Test Set. ST, AT, and AFL Indicate Standard Training, Adversarial Training, and Adversarial Feature Learning, Respectively

TABLE 2 Results of Adaptation From PASCAL VOC to Watercolor2k. AP (%) is Reported on the Watercolor2k Test Set

TABLE 3 Results of Adaptation From PASCAL VOC to Comic2k. AP (%) is Reported on the Comic2k Test Set

In summary, adversarially trained models outperform standard-trained models, and further improvements in their performance can be achieved by incorporating AFL. In particular, the finding that AT using only the source dataset improves performance in the target domain is notable because general domain adaptation methods utilize images in the target domain. These results can be explained as follows. In adapting from real to artistic images, the non-robust features acquired through ST in the source domain are not informative in the target domain due to the large domain shift. As a result, ST degrades performance in the target domain. In contrast, the robust features acquired through AT remain informative in the target domain, and thus, they can maintain performance there. Combined with AFL, the robust features are aligned with the target domain, thereby resulting in further performance improvement.

2) Small Domain Shift

We also evaluate the effectiveness of our proposed method on a small domain shift. Specifically, an adaptation between different weather conditions, from Cityscapes to FoggyCityscapes, is performed. The results are listed in Table 4. Contrary to the results for large domain shifts, AT and AT+AFL decrease the mAP by 17.7% and 13.7% compared to ST, respectively. In this adaptation scenario, the ST+AFL approach demonstrates the best performance in all the classes in the target domain.

TABLE 4 Results of Adaptation From Cityscapes to FoggyCityscapes. AP (%) is Reported on the FoggyCityscapes Test Set

For adaptation between similar domains, the non-robust features acquired through ST are also informative in the target domain. Therefore, the detection performance of standard-trained models in the target domain is highly dependent on non-robust features. In contrast, AT makes the detector rely on robust features instead of non-robust features. Because robust features are less informative than non-robust features, AT is known to result in a reduction in accuracy in the source domain [22]. Correspondingly, for small domain shifts, the application of AT results in decreased performance in the target domain.

D. Analysis

1) Quantifying the Domain Shift Magnitude

AT in the source domain improves performance for a largely shifted target domain but degrades performance for a slightly shifted target domain. Therefore, quantifying the magnitude of the domain shift is necessary to determine whether our approach should be applied. The Fréchet inception distance (FID) [36], which computes the Fréchet distance (FD) between two distributions using the feature space of the Inception-v3 model [37], is a widely used measure of the difference between image sets. We compute the FD using the feature space of the YOLOv3 model instead of the Inception-v3 model because it allows feature extraction along with the object detection task. In the experiments, we use feature maps extracted from the YOLOv3 backbone network, which is pre-trained using the MS COCO dataset.
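For reference, the Fréchet distance between two Gaussians with means $\mu_1, \mu_2$ and covariances $\Sigma_1, \Sigma_2$ is $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2})$. The sketch below evaluates it under the simplifying assumption of diagonal covariances, in which case the trace term reduces to $\sum_d (\sigma_{1,d}-\sigma_{2,d})^2$. The function names, the population-variance estimate, and the toy feature vectors are assumptions made for illustration; the actual experiments use YOLOv3 backbone features, and the covariance treatment used there is not specified in this section.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var


def frechet_distance_diag(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (diagonal covariance assumed)."""
    mu_a, var_a = mean_and_var(feats_a)
    mu_b, var_b = mean_and_var(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    # With diagonal covariances, Tr(S1 + S2 - 2 (S1 S2)^(1/2)) = sum (sqrt(s1) - sqrt(s2))^2.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term


# Hypothetical 2-D features standing in for pooled backbone feature maps.
source_feats = [[0.0, 1.0], [2.0, 3.0]]
near_feats = [[0.0, 1.0], [2.0, 3.0]]   # identical statistics: no shift
far_feats = [[5.0, 1.0], [9.0, 3.0]]    # shifted mean and larger spread

fd_near = frechet_distance_diag(source_feats, near_feats)
fd_far = frechet_distance_diag(source_feats, far_feats)
```

A full implementation would replace the diagonal shortcut with a matrix square root of $\Sigma_1\Sigma_2$; the ordering `fd_near < fd_far` illustrates how a larger FD indicates a larger shift between the two sets.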

First, we perform experiments on the PASCAL VOC dataset to verify the effectiveness of the FD in measuring the domain shift. To control the domain shift magnitude, we apply style transfer to the PASCAL VOC test set via adaptive instance normalization (AdaIN) [38]. The content–style trade-off $\beta$ , which manipulates the balance between content and style images, is varied from 0.0 to 1.0 in increments of 0.1. Three style images used in our experiments and examples of the stylized images are shown in Fig. 4. The FD is computed between the original PASCAL VOC training set and each stylized PASCAL VOC test set. We then conduct ST and AT on the training set and evaluate how much AT improves the mAP for each stylized test set compared to ST. The results are shown in Fig. 5. As $\beta$ increases, the FD between the training and stylized test sets becomes larger. Correspondingly, the ratio of the mAP of AT to that of ST also increases. These results suggest that FD can be used to quantify the magnitude of the domain shift and help predict the effect of AT on domain adaptation.
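The role of the content–style trade-off can be sketched for a single feature channel as follows. The snippet assumes the interpolation form of the original AdaIN formulation, in which the output feature is $(1-\beta)\,f(c) + \beta\,\mathrm{AdaIN}(f(c), f(s))$ and AdaIN matches the channel-wise mean and standard deviation of the content feature to those of the style feature. The function names and toy values are hypothetical, and the encoder and decoder networks are omitted.

```python
import math


def channel_stats(values, eps=1e-5):
    """Mean and standard deviation of one feature channel (eps avoids division by zero)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain_channel(content, style):
    """Re-normalize the content channel so that it carries the style channel's statistics."""
    c_mean, c_std = channel_stats(content)
    s_mean, s_std = channel_stats(style)
    return [s_std * (v - c_mean) / c_std + s_mean for v in content]


def stylize_channel(content, style, beta):
    """Blend the content feature with its AdaIN output; beta=0 keeps content, beta=1 is full style."""
    target = adain_channel(content, style)
    return [(1.0 - beta) * c + beta * t for c, t in zip(content, target)]


content = [0.0, 1.0, 2.0, 3.0]       # toy content feature channel (hypothetical)
style = [10.0, 10.0, 14.0, 14.0]     # toy style feature channel (hypothetical)

# Sweep beta from 0.0 to 1.0 in increments of 0.1, as in the experiment described above.
betas = [step / 10.0 for step in range(11)]
channel_means = [
    sum(stylize_channel(content, style, b)) / len(content) for b in betas
]
```

In this toy setting, the channel mean moves linearly from the content mean at $\beta = 0$ to the style mean at $\beta = 1$, which mirrors how increasing $\beta$ pushes the stylized test set further from the training distribution.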

FIGURE 4. Examples of style transfer via AdaIN for the PASCAL VOC test set. The test set is stylized into three style images by varying the content–style trade-off $\beta$.

FIGURE 5. Fréchet distance (FD) between the PASCAL VOC training set and the stylized PASCAL VOC test set, and the ratio of the mAP of AT to the mAP of ST on the stylized test set. The larger the FD, the better the detection performance of AT on the stylized test sets.

Next, we compute the FD between the source domain’s training set and the target domain’s test set used in our main experiments. The results are listed in Table 5. The FDs are larger for PASCAL VOC to Clipart1k, Watercolor2k, and Comic2k, where AT improves the detection performance in the target domain, and smaller for Cityscapes to FoggyCityscapes, where AT degrades the performance. Thus, by measuring the magnitude of the domain shift in terms of FD, it is possible to determine whether adversarial training in the source domain should be conducted.

TABLE 5 FD Between the Training Set of the Source Domain and the Test Set of the Target Domain, and Ratio of the mAP of AT to That of ST in the Target Domain

2) Methods and Parameters for Adversarial Training

AT using the random-initialized FGSM or PGD is known to make the model significantly more robust to adversarial examples than AT using the zero-initialized FGSM [28], [31]. In addition, the value of $\epsilon$ is a crucial factor in AT. In this study, we analyze the impact of the methods and parameters for AT on the performance of unsupervised domain adaptation.

We conduct AT using the zero-initialized FGSM, random-initialized FGSM, and PGD on the PASCAL VOC dataset using the gradient of $\mathcal {L}_{\mathrm {det}}$ and varying $\epsilon$. When $\epsilon =0$, ST is performed instead of AT. PGD is performed in 10 steps, and the step size is set to $\alpha =1.5 \epsilon / 10$. Fig. 6 shows the mAP values for the source (PASCAL VOC) and target (Clipart1k, Watercolor2k, and Comic2k) test sets in each setting. In the source domain, all the methods show a decrease in mAP as the value of $\epsilon$ increases. This occurs because AT prevents the model from acquiring predictive but non-robust features. In the target domain, all the methods improve mAP compared to ST ($\epsilon =0$). Interestingly, we find that the three methods, which are known to differ in robustness against adversarial examples, are not substantially different in their performance in the target domain. This result indicates the intriguing phenomenon that the domain adaptation performance of adversarially trained models does not depend on their robustness against adversarial examples. Considering the computational cost, the zero-initialized FGSM and random-initialized FGSM are better choices for domain adaptation. On the other hand, the best value of $\epsilon$ depends on the target dataset and method; thus, $\epsilon$ must be adjusted according to the setting.
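The PGD configuration above (10 steps, $\alpha =1.5 \epsilon / 10$) can be sketched on a toy loss as follows. The quadratic loss, the uniform random start inside the $\epsilon$-ball, the value $\epsilon = 4/255$, and all function names are assumptions made for illustration; this section does not state how PGD is initialized, and a real detector loss would replace `toy_loss`.

```python
import random


def sign(v):
    """Return -1.0, 0.0, or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def clip(v, lo, hi):
    """Clamp v to the interval [lo, hi]."""
    return max(lo, min(hi, v))


def toy_loss(x, w, y):
    """Toy loss 0.5 * (w . x - y)^2, an assumed stand-in for L_det."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return 0.5 * residual * residual


def toy_grad(x, w, y):
    """Analytic gradient of toy_loss with respect to x."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * wi for wi in w]


def pgd(x, w, y, eps, steps=10, seed=0):
    """L-infinity PGD with step size alpha = 1.5 * eps / steps, as stated in the text."""
    alpha = 1.5 * eps / steps
    rng = random.Random(seed)
    delta = [rng.uniform(-eps, eps) for _ in x]  # random start (assumption)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        grad = toy_grad(x_adv, w, y)
        # Signed ascent step, then projection back onto the eps-ball.
        delta = [clip(d + alpha * sign(g), -eps, eps) for d, g in zip(delta, grad)]
    return delta


x = [0.2, 0.5, 0.9]    # toy "image" (hypothetical values)
w = [1.0, -2.0, 0.5]   # toy model weights (hypothetical values)
y = 1.0                # toy target (hypothetical value)
eps = 4.0 / 255.0      # one hypothetical point on the epsilon sweep

delta_pgd = pgd(x, w, y, eps)
loss_clean = toy_loss(x, w, y)
loss_adv = toy_loss([xi + di for xi, di in zip(x, delta_pgd)], w, y)
delta_zero = pgd(x, w, y, 0.0)   # eps = 0 gives a zero perturbation, i.e. ST
```

Each PGD perturbation costs ten gradient evaluations here, compared with one for either FGSM variant, which is the computational-cost argument made above.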

FIGURE 6. Results of AT using the zero-initialized FGSM, random-initialized FGSM, and PGD on the PASCAL VOC dataset with different values of $\epsilon$. We report the mAP values on the PASCAL VOC, Clipart1k, Watercolor2k, and Comic2k test sets.

3) Loss for Generating Adversarial Perturbations

The total loss of object detection comprises several task losses, as shown in (4), for YOLOv3. Therefore, determining the loss to be used to generate adversarial perturbations is a crucial factor. To prevent gradient misalignment between tasks, the technique of selecting a single task loss that maximizes the total loss has also been proposed [29], as shown in (11). We analyze the impact of these loss choices on domain adaptation performance.

We conduct AT using the zero-initialized FGSM on the PASCAL VOC dataset by varying the loss used to generate $\boldsymbol {\delta ^{*}}$. Table 6 shows the mAP values in the target domain for detectors trained using each adversarial perturbation. AT using $\boldsymbol {\delta }_{\mathrm {det}}$ demonstrates the best performance for the Clipart1k and Watercolor2k datasets and is only 0.2% below the best performance for the Comic2k dataset. AT using $\boldsymbol {\delta }_{\mathrm {mtl}}$ is within 0.6% of the best performance on all datasets. With $\boldsymbol {\delta }_{\mathrm {cls}}$, $\boldsymbol {\delta }_{\mathrm {loc}}$, and $\boldsymbol {\delta }_{\mathrm {obj}}$, which use a single task loss, the mAP values for the Watercolor2k dataset are 1.3% to 2.1% lower than the best performance. These results suggest that domain adaptation performance is highly stable when all the task losses are considered during AT, as in the case of $\boldsymbol {\delta }_{\mathrm {det}}$ and $\boldsymbol {\delta }_{\mathrm {mtl}}$. As mentioned in Section III-B, AT using $\boldsymbol {\delta }_{\mathrm {mtl}}$ is known to yield greater robustness against adversarial examples than AT using $\boldsymbol {\delta }_{\mathrm {det}}$ because $\boldsymbol {\delta }_{\mathrm {det}}$ results in gradient misalignment between tasks, whereas $\boldsymbol {\delta }_{\mathrm {mtl}}$ does not. However, AT using $\boldsymbol {\delta }_{\mathrm {det}}$ shows a higher mAP value than AT using $\boldsymbol {\delta }_{\mathrm {mtl}}$. This result indicates that the acquisition of robust features for domain adaptation must be considered separately from robustness against adversarial examples.
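The difference between $\boldsymbol {\delta }_{\mathrm {det}}$ and $\boldsymbol {\delta }_{\mathrm {mtl}}$ can be sketched as follows: $\boldsymbol {\delta }_{\mathrm {det}}$ takes the FGSM step along the gradient of the total loss, whereas $\boldsymbol {\delta }_{\mathrm {mtl}}$ generates one candidate per task loss and keeps the candidate that maximizes the total loss. The three quadratic task losses, their weights, and the function names are hypothetical stand-ins for the classification, localization, and objectness losses of YOLOv3; the snippet is a reading of the description above rather than a transcription of (11).

```python
EPSILON = 1.0 / 255.0


def sign(v):
    """Return -1.0, 0.0, or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def make_task(w, y):
    """Build a toy task loss 0.5 * (w . x - y)^2 and its gradient with respect to x."""
    def loss(x):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        return 0.5 * residual * residual

    def grad(x):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        return [residual * wi for wi in w]

    return loss, grad


# Hypothetical task losses standing in for L_cls, L_loc, and L_obj.
TASKS = {
    "cls": make_task([1.0, -2.0, 0.5], 1.0),
    "loc": make_task([0.5, 0.5, -1.0], 0.0),
    "obj": make_task([-1.0, 1.0, 1.0], 2.0),
}


def total_loss(x):
    """Total detection loss as the sum of the task losses."""
    return sum(loss(x) for loss, _ in TASKS.values())


def fgsm(grad_vec, eps):
    """Zero-initialized FGSM step from a gradient evaluated at the clean input."""
    return [eps * sign(g) for g in grad_vec]


def delta_det(x, eps=EPSILON):
    """Perturbation from the gradient of the total loss."""
    grads = [grad(x) for _, grad in TASKS.values()]
    summed = [sum(g[i] for g in grads) for i in range(len(x))]
    return fgsm(summed, eps)


def delta_mtl(x, eps=EPSILON):
    """Pick the single-task perturbation that yields the largest total loss."""
    best_name, best_delta, best_value = None, None, float("-inf")
    for name, (_, grad) in TASKS.items():
        candidate = fgsm(grad(x), eps)
        value = total_loss([xi + di for xi, di in zip(x, candidate)])
        if value > best_value:
            best_name, best_delta, best_value = name, candidate, value
    return best_name, best_delta


x = [0.2, 0.5, 0.9]   # toy "image" (hypothetical values)
d_det = delta_det(x)
chosen_task, d_mtl = delta_mtl(x)
```

In this toy case the two perturbations may coincide or differ depending on how the task gradients align; the point of the sketch is only the selection rule, not the relative mAP reported in Table 6.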

TABLE 6 Comparison of Loss Used for Generating $\boldsymbol{\delta^{*}}$ Using the Zero-Initialized FGSM. We Report the mAP Values (%) for Each Target Domain

4) Balance Between Adversarial Training and Adversarial Feature Learning

We analyze the impact of the parameter $\lambda$ in Eq. (15) on domain adaptation performance. $\lambda$ controls the balance between the AT and AFL losses. We evaluate the mAP of our proposed method (AT+AFL) with $\lambda \in [{0.3,10.0}]$ for the Clipart1k, Watercolor2k, and Comic2k datasets. The results are listed in Table 7. The best mAP is obtained at $\lambda =5.0$ for Clipart1k, $\lambda =0.5$ for Watercolor2k, and $\lambda =0.3$ for Comic2k. However, as shown in Table 7, the mAPs are almost the same for each $\lambda$, which indicates that the parameter $\lambda$ does not have a serious impact on domain adaptation performance.

TABLE 7 Impact of Parameter $\lambda$ on Domain Adaptation Performance for AT+AFL. We Report the mAP Values (%) for Each Target Domain

5) Comparison With the State-of-the-Art Domain Adaptation Methods

To demonstrate the advantages of our method, we compare AT+AFL with the state-of-the-art domain adaptation methods: Implicit-Instance-Invariant Network (I3NET) [14], Unbiased Mean Teacher (UMT) [20], and Scale-Aware Domain Adaptive Faster RCNN (SA-DA-Faster) [15]. Table 8 shows the comparisons with state-of-the-art methods for PASCAL VOC to Clipart1k, Watercolor2k, and Comic2k adaptation. The results of the I3NET, UMT, and SA-DA-Faster methods are cited from the original papers. We note that this is not a fair comparison because the detection architectures and backbone networks are different for each method. Our method outperforms the state-of-the-art results for Clipart1k and Comic2k by 5.3% and 3.3%, respectively. On the other hand, UMT outperforms our method by 2.5% for Watercolor2k, where AFL is less effective. As shown in Table 5, the FD for Clipart1k and Comic2k is larger than that for Watercolor2k, indicating that our method is likely to outperform the existing methods when the domain shift is larger. Note that domain adaptation through AT is a new approach and can be further combined with these state-of-the-art methods to improve performance.

TABLE 8 Comparison With the State-of-the-Art Methods in the Adaptation from PASCAL VOC to Clipart1k, Watercolor2k, and Comic2k. We Report the mAP Values (%) for Each Target Domain

SECTION V.

Conclusion

In this study, we explored the implementation of unsupervised domain adaptation in object detection. Further, we proposed a method based on adversarial training in the source domain. To the best of our knowledge, this is the first application of adversarial training in unsupervised domain adaptation. The robust features acquired using adversarially trained detectors are informative in a largely shifted target domain, thereby resulting in improved detection performance. In contrast, for small domain shifts where the non-robust features acquired through standard training are informative in both domains, adversarially trained detectors degrade performance in the target domain. We also proposed a method for aligning the robust features with the target domain through adversarial feature learning. Using this approach, we demonstrated further improved performance for large domain shifts.
