
An Adaptive Control Algorithm for Stable Training of Generative Adversarial Networks




Abstract:

Generative adversarial networks (GANs) have shown significant progress in generating high-quality visual samples; however, they are still well known both for being unstable to train and for the problem of mode collapse, particularly when trained on data collections containing a diverse set of visual objects. In this paper, we propose an Adaptive k-step Generative Adversarial Network (Ak-GAN), which is designed to mitigate the impact of instability and saturation in the original GAN by dynamically adjusting the ratio of the training steps of the generator and discriminator. To accomplish this, we track and analyze stable training curves from relatively narrow datasets and use them as target fitting lines when training more diverse data collections. Furthermore, we conduct experiments on the proposed procedure using several optimization techniques (e.g., supervised guiding from previous stable learning curves with and without momentum) and compare their performance with that of state-of-the-art models on the task of image synthesis from datasets consisting of diverse images. Empirical results demonstrate that Ak-GAN works well in practice and exhibits more stable behavior than regular GANs during training. A quantitative evaluation has been conducted on the Inception Score (IS) and the relative inverse Inception Score (RIS); compared with regular GANs, the former has been improved by 61% and 83%, and the latter by 21% and 60%, on the CelebA and the Anime datasets, respectively.
The adaptive control procedure of the Adaptive k-step Generative Adversarial Network (Ak-GAN).
Published in: IEEE Access (Volume 7)
Page(s): 184103 - 184114
Date of Publication: 18 December 2019
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Generative adversarial networks (GANs) [2], [4], [5], [9], [12] are well known for being effective at synthesizing samples for various applications such as image generation [5], [12], [26], [28], [29], industrial design [20], speech generation [11], and natural language processing [2], [9]. The objective of GANs is to train a generator model G and a discriminator model D in parallel; both are typically multi-layer perceptrons (MLPs) or convolutional neural networks (CNNs). The generator synthesizes the data distribution from a noise prior, whereas the discriminator is trained to extract discriminative features of real data. Specifically, the discriminator distinguishes between real data and data synthesized by the generator, and outputs the posterior probability that a sample comes from the real data distribution; this probability serves as a guiding signal to update the parameters of the discriminator and generator networks. Ideally, this process continues until real and synthesized data are indistinguishable.

In practice, however, GANs are well known both for being unstable to train and for the problem of mode collapse [1], [19]. Mode collapse is a situation in which G generates samples that mostly share some common properties, especially when trained on data collections containing a diverse set of visual objects. Two major issues compound these problems: 1) the constantly unbalanced model capacities of the generator and discriminator during the training process, which prevent the generator from learning effectively; and 2) the lack of a computable convergence criterion.

Recently, several studies have focused on improving the stability of GANs, mostly by adopting heuristic approaches [11], [12], which are extremely sensitive to the training data and hard to apply to new domains [19]. Unlike previous work, this study approaches the unstable training problem of GANs by investigating an adaptive hyper-parameter learning method. Specifically, we propose to dynamically adjust the number of training steps of G and D for each set of training iterations based on the learning curves from relatively simple and narrow datasets such as MNIST [7], on which GANs have been shown to produce very impressive results. Extensive experimental results show that the proposed procedure can significantly improve the stability of GAN training and generate much more recognizable objects on the Anime and CelebA datasets [8].

This study is an extension of our previous work [25] and makes the following further improvements.

  • In this extended work, the proposed Ak-GAN model dynamically adjusts the ratio (denoted as k) of the training steps of D and G every c iterations instead of every iteration as in our previous work; c is a newly added hyperparameter that improves the generalization of the adaptive model.

  • We introduce variants in which the adjustment of k depends on more sophisticated criteria, such as loss values, and proceeds in either a progressive or an immediate manner to control the degree to which GAN training is stabilized. We also investigated blur strategies, a set of filters applied to the input of D, to examine whether blurring affects the training balance of D and G.

  • We conducted extensive experiments to compare Ak-GAN models with state-of-the-art models, using the Inception Score (IS) [13], [16] and the relative inverse Inception Score (RIS) [24]. Evaluation results demonstrate the effectiveness of the proposed training procedure in terms of convergence rate and image quality.

SECTION II.

Background

A. Preliminary

To avoid confusion, the parameters used throughout this paper are defined as follows.

Definition 1:

k: the training ratio of D to G. When k < 1, G is trained 1/k steps for each single step of D; when k > 1, D is optimized k steps for each single step of G.

The conventional GAN [4] alternates between k steps of optimizing D and one step of optimizing G, generally with k equal to one. In our algorithm, the value of k is adaptively controlled using the deviation in probabilities or loss values between the running dataset and the criterion dataset.
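For concreteness, a small helper (our naming, not from the paper) that maps k to per-iteration step counts under Definition 1 might look as follows in Python:

def steps_from_k(k):
    """Translate the training ratio k into per-iteration step counts.
    k > 1: D is optimized round(k) steps for one step of G;
    k < 1: G is trained round(1/k) steps for one step of D."""
    if k >= 1:
        return round(k), 1          # (D steps, G steps)
    return 1, round(1.0 / k)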

Table 1 shows the notations and their descriptions.

TABLE 1 Notations Used in This Paper

B. Generative Adversarial Networks

GANs estimate generative models by means of an adversarial process in which a generator G is pitted against a discriminator D. The inputs of D come from two data distributions, real and synthesized, where the latter is generated by G. D is trained to maximize its ability to classify a sample correctly as real or fake, whereas G is trained to minimize the ability of D to identify a fake sample from the synthesized data. In the original framework, the training procedure is defined as a two-player minimax game:
\begin{align*} \min_{G} \max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] \tag{1} \end{align*}
where G is a function that maps input noise variables p_z(z) to a generated data distribution p_g, whereas D is a function that maps a data space to scalar values, each of which represents the probability that a sample came from the real data distribution p_data(x) rather than from p_g. In GANs, G and D are implemented as neural networks and trained concurrently on their objective functions.

In practice, we use the same loss function for D as in the conventional GAN [4], and to optimize G we adopt the more effective objective function of [19]. They are expressed as:
\begin{align*} L_{D} &= -\frac{1}{m}\sum_{i=1}^{m}\left[\log P(x^{(i)};\theta_{D}) + \log\left(1 - P(G(z^{(i)};\theta_{G});\theta_{D})\right)\right] \tag{2}\\ L_{G} &= -\frac{1}{m}\sum_{i=1}^{m} \log P(G(z^{(i)};\theta_{G});\theta_{D}) \tag{3} \end{align*}
where L_D and L_G are the loss functions, and θ_D and θ_G are the parameters of D and G, respectively. x^(i) represents real data, whereas G(z^(i)) is the data generated by G from the arbitrary noise z^(i). P(x^(i); θ_D) is the probability that D classifies the real data x^(i) as real, and P(G(z^(i); θ_G); θ_D) is the probability that D classifies the generated data G(z^(i)) as real; hereinafter these are referred to as P_r and P_g, respectively.
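As a minimal sketch, Eqs. (2) and (3) can be computed directly from the discriminator's outputs on a minibatch; p_real and p_fake below stand for the probabilities P(x^(i); θ_D) and P(G(z^(i); θ_G); θ_D), respectively.

import numpy as np

def discriminator_loss(p_real, p_fake):
    # Eq. (2): L_D = -1/m * sum[log P(x) + log(1 - P(G(z)))]
    return -np.mean(np.log(p_real) + np.log(1.0 - p_fake))

def generator_loss(p_fake):
    # Eq. (3): the non-saturating loss, L_G = -1/m * sum log P(G(z))
    return -np.mean(np.log(p_fake))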

C. Challenges and Limitations of GANs

Recently, GANs have gained tremendous attention because of their strength at producing compelling image synthesis, particularly when trained on image collections from relatively narrow domains, such as MNIST handwritten digits, as shown in Fig. 1(a). However, for diverse image collections, GANs generally yield less impressive results. For example, models trained on the Anime dataset produce few recognizable objects, as shown in Fig. 1(b).

FIGURE 1. Effects of the original GAN from data collections with different levels of diversity.

It is a considerable challenge to balance the convergence of D and G: training either of them too well results in instability. In practice, D tends to be overtrained during the training process, because it wins too easily in the beginning.

We estimated and tracked the training curves on MNIST and on the diverse dataset, which are shown in Fig. 2. Specifically, Fig. 2 plots the probabilities of real and synthesized data (P_r and P_g, respectively) estimated by D during the training process. Fig. 2(a) shows that GANs achieve good stability when trained on the narrow MNIST dataset. Note that in the beginning a large gap exists between P_r and P_g; this gap gradually narrows during training and finally converges at approximately 0.5. The good stability on MNIST is also reflected in the considerable quality of the synthesized images, as shown in Fig. 1(a). Likewise, the poor visual results on the diverse dataset are best explained by its constantly unstable training curves, as shown in Fig. 2(b). In other words, practical implementations show that the original GAN achieves a stable balance between G and D iterations, and synthesizes higher-quality images, on relatively simple datasets such as MNIST rather than on the diverse dataset.

FIGURE 2. Learning curves for the conventional GAN model with multi-layer perceptrons (MLPs). The learning curves are measured by the probabilities of real and synthesized data (p_real and p_generated, respectively).

D. Related Work

To overcome the problem of mode collapse, several researchers have proposed penalizing the appearance of near-duplicate images within each training batch using heuristic-based approaches [13], [21]. Salimans et al. [13] sought to address this problem by training the discriminator in a semi-supervised fashion, granting the discriminator's internal representations knowledge of the class structure of the training data. Although this method increases synthesis quality, it is less appealing as a tool for unsupervised learning. Energy-based GAN [21] proposed modeling D as an energy function, which enables the use of a wide variety of architectures and loss functions in addition to the binary classifier, making the model easier to train and more stable.

E-GAN [26] combines GAN training with an evolutionary algorithm, which evolves a set of generators to adapt to the discriminator and preserves the best generator for further training. PG-GAN [27] achieves high-resolution images by training the GAN model in a progressively growing way to increase the resolution. SAGAN [28] imports the self-attention block into a deeper GAN architecture to capture the global structure of images. Based on SAGAN, BigGAN [29] also uses different residual blocks for G and D, but trains the models with larger batches, more parameters, and the truncation trick, achieving a new level of performance on ImageNet. These studies have shown the effect of increasing the depth of networks on some diverse datasets. However, our work focuses on adaptive control of the training process, which is orthogonal to increasing depth. For future work, it would be interesting to explore how to combine adaptive control, very deep nets, and other techniques on complex datasets.

Consistent with our work, some authors have tried to use an adaptive approach [15], [18] to improve GANs. AdaGAN [18] addresses the problem of mode collapse by incrementally adding new generators, which are trained using reweighted data at every step; the data generated by the mixture model can thus progressively cover all the modes of the dataset. However, the generator of the mixture model is a mixture of single generator networks, and thus the latent representation becomes more complex with training. Of particular interest to us is the work of ABC-GAN [15]. Its adaptive idea is also realized by adjusting the G and D steps based on a function of the losses of D. A major downside of ABC-GAN is that it uses a manually fixed probability value, from which the ratio is sampled each time.

SECTION III.

An Adaptive Training Procedure for GANs

To address the problem that G and D do not balance well on diverse datasets, we explore an adaptive training procedure to encourage the convergence of GANs. Our method is similar in spirit to ABC-GAN [15] in the sense that it balances the generation capacities of GAN models by adjusting the ratio of D to G steps. Different from the adaptive controller used in ABC-GAN, which is based on a manually fixed probability value, our method automatically adjusts k by matching a set of well-trained learning curves (with different metrics) derived from relatively narrow datasets (e.g., MNIST). This section first explains the core idea and then describes the adaptive method in detail.

A. Basic Idea

Our main goal is to achieve a convincing performance on diverse datasets by matching the well-trained learning curves. We first describe how to obtain well-trained learning curves, and then present an algorithm that constrains the difference between the current value of the control variables (e.g., the posterior probabilities P_r and P_g, or the loss values L_G and L_D) and the criterion (the probabilities or loss values of the well-trained learning curve).

Because the original GAN objective is defined as a two-player game, determining the convergence of G and D is difficult, as the loss of G increases when that of D decreases. Learning curves are typically an efficient means of indicating how training is progressing. Therefore, we use the aforementioned learning curves as prototypes [22], which are considered to be more typical and central than others in the set of curves that led to convergence. As shown in Fig. 3, the criterion learning curves consist of four measures: the posterior probabilities P_r and P_g when trained on the MNIST dataset (denoted P_mr and P_mg, respectively) and the loss values L_G and L_D when trained on MNIST (denoted L_mG and L_mD, respectively). We use the ratio k of the training steps of D to G as the control variable. Instead of setting k to 1 as in the conventional GAN, we dynamically adjust the value of k to satisfy the constraints during training on diverse datasets. k is controlled using the following inequality constraint:
\begin{equation*} \left|\frac{V_{c}-V_{m}}{V_{m}}\right| < \alpha ~(\text{or}~\beta) \tag{4} \end{equation*}
where V_c is P_r, P_g, L_G, or L_D obtained from the current training data, and V_m is, correspondingly, P_mr, P_mg, L_mG, or L_mD obtained from MNIST. Different from our previous work, we propose additional algorithms that are controlled by P_r or L_D. α and β are thresholds that prescribe a limit on the range in which current values may deviate from the criterion values. Since the criterion learning curves for P_g and L_D show the same trend during the training process (as shown in Fig. 3), we use the same threshold α to limit their deviation. Similarly, we use another threshold β for updates based on P_r or L_G. We experimented with different α and β values and found that α = 0.2 and β = 0.05 achieve the best results.
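A minimal sketch of the constraint check in Eq. (4), with the threshold values reported above:

ALPHA, BETA = 0.2, 0.05  # best-performing thresholds found in our experiments

def within_constraint(v_current, v_criterion, threshold):
    """Eq. (4): is the relative deviation of the running value V_c
    from the criterion value V_m within the allowed range?"""
    return abs((v_current - v_criterion) / v_criterion) < threshold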

FIGURE 3. Learning curves for stable training.

We next provide an intuitive explanation of our adaptive methods, illustrated in Fig. 4, before proceeding to the implementation details. First, the Ak-GAN model is trained with the initial state of k being 1. Then, every c iterations (c is set to 1 by default; we explore different c values in the experiments), we pick the criterion value (for instance, P_mg) from the well-trained learning curves, compare it with the current value (in this case, P_g), and examine whether the deviation exceeds the constraint condition (in this case, |(P_g - P_mg)/P_mg| < α). If it does, we update k in such a manner as to bias the next training steps toward the "weaker" of G and D. Our experimental results indicate that the proposed procedure yields more stable training and improves sample generation. We designed two kinds of adaptive methods to adjust the k value: with and without momentum. The adaptive method with momentum adjusts the k value progressively, whereas the method without momentum changes the k value immediately, thus reducing the fluctuation of the learning curves. The Appendix provides a more detailed description of the two methods, specifically how the k value is updated for the next iteration and under what conditions.
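The overall control loop can be sketched as follows; this is an illustrative outline, not the paper's exact implementation. The helpers train_d_step, train_g_step, estimate_p_g, criterion_curve, and update_k are hypothetical stand-ins for the model updates, the current P_g estimate, the fitted MNIST curve, and the Appendix algorithms.

def adaptive_training(num_iters, c=1, k=1.0):
    for it in range(num_iters):
        # Map the ratio k to concrete step counts for this iteration.
        d_steps = max(1, round(k)) if k >= 1 else 1
        g_steps = 1 if k >= 1 else max(1, round(1.0 / k))
        for _ in range(d_steps):
            train_d_step()
        for _ in range(g_steps):
            train_g_step()
        # Re-examine the constraint of Eq. (4) every c iterations.
        if it % c == 0:
            k = update_k(k, estimate_p_g(), criterion_curve(it))
    return k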

FIGURE 4. The adaptive control procedure of Ak-GAN.

B. Adaptive Control Using Fitted Curves

Note that the criterion learning curves shown in Fig. 5 fluctuate wildly during training. Instead of matching the curves directly, we draw regression lines to serve as guides. In this approach, we employ two exponential functions: Eq. (5) is used to fit the curves of P_mr and L_mG, whereas Eq. (6) is used to fit the curves of P_mg and L_mD. The coefficients of these functions are shown in Table 2.
\begin{align*} y &= a e^{-bx} + d \tag{5}\\ y &= -a e^{-bx} + d \tag{6} \end{align*}
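A short sketch of how such regression lines could be obtained with standard least-squares fitting; the initial guess p0 and the variable names iters and p_mg are illustrative assumptions.

import numpy as np
from scipy.optimize import curve_fit

def eq5(x, a, b, d):
    # Eq. (5): decays toward the asymptote d; fits P_mr and L_mG.
    return a * np.exp(-b * x) + d

def eq6(x, a, b, d):
    # Eq. (6): rises toward the asymptote d; fits P_mg and L_mD.
    return -a * np.exp(-b * x) + d

# Hypothetical usage, where iters and p_mg hold a recorded MNIST learning curve:
# (a, b, d), _ = curve_fit(eq6, iters, p_mg, p0=(0.5, 0.01, 0.5))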

TABLE 2 Coefficients of the Exponential Functions for Fitting Criterion Curves of P_mr, P_mg, L_mD, and L_mG

FIGURE 5. Regression curves (power exponential functions) of stable training.

SECTION IV.

Experiments

A. Datasets

We performed our experiments on two datasets: CelebA [8] and the Anime face dataset. Details of each dataset are given below; samples from the two datasets are shown in Fig. 6.

FIGURE 6. Samples of two datasets.

1) CelebA

The large-scale CelebFaces Attributes (CelebA) dataset contains 202,599 facial images of 10,177 celebrities. The images are RGB color images containing diverse pose variations and cluttered backgrounds. In our experiments, each image was resized to 64×64.

2) Anime Faces

We derived the anime faces from an image-board site for anime wallpapers and cropped the images to contain only faces. These images were sized to 96×96. The dataset consists of 51,223 color images.

B. Implementation

Our model was coded in Python, and all experiments were conducted in the TensorFlow framework. We tested the model with two architectures: GANs with G and D comprising fully connected layers (MLP), and GANs with convolutional architectures (DCGAN [12]), both of which are described in the following subsections.

1) MLP-Based GAN

The first GAN architecture we experimented with consisted of two MLP nets. The parameters of the networks were updated to minimize the loss functions, with Adam [6] used as the optimizer. A 100-dimensional vector sampled from a Gaussian distribution is transformed to image size using a series of fully connected layers with non-linear activation functions (ReLU and tanh). The architecture of G is given in Table 3. D consists of three fully connected layers, the first two with dropout; the penultimate layer is connected to a single sigmoid layer, as shown in Table 4.

TABLE 3 Architecture of an MLP-Based Generator

TABLE 4 Architecture of an MLP-Based Discriminator
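For illustration, such an MLP generator could be expressed in Keras roughly as below; the hidden widths (256, 512, 1024) are assumptions for the sketch, since the exact values are those of Table 3.

import tensorflow as tf
from tensorflow.keras import layers

generator = tf.keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(100,)),  # 100-d Gaussian noise
    layers.Dense(512, activation="relu"),
    layers.Dense(1024, activation="relu"),
    layers.Dense(64 * 64 * 3, activation="tanh"),              # flattened 64x64 RGB image
    layers.Reshape((64, 64, 3)),
])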

2) DCGAN-Based GAN

We also applied our adaptive methods to the DCGAN architecture [12]. The generator network (Table 5) has five layers: one linear connection layer and four fractional-strided convolutional layers with a filter size of 5×5 and a stride of 2. All layers except the final one are followed by batch normalization and the ReLU activation function; the final layer consists of a fractional-strided convolutional layer and the tanh function. The input is again a 100-dimensional vector sampled from a Gaussian distribution. The discriminator (Table 6) also has five layers: four convolutional layers, the first with Leaky ReLU and the other three followed by batch normalization and Leaky ReLU, and a final hidden layer that uses a fully connected layer and a sigmoid to transform the features into a scalar, which is the probability that an input image belongs to the set of real data.

TABLE 5 Architecture of a DCGAN-Based Generator

TABLE 6 Architecture of a DCGAN-Based Discriminator
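A corresponding sketch of the DCGAN-style generator described above (one linear layer plus four fractional-strided convolutions); the filter counts are illustrative assumptions, with Table 5 holding the exact configuration.

import tensorflow as tf
from tensorflow.keras import layers

dcgan_generator = tf.keras.Sequential([
    layers.Dense(4 * 4 * 512, input_shape=(100,)),   # linear projection of the noise
    layers.Reshape((4, 4, 512)),
    layers.Conv2DTranspose(256, 5, strides=2, padding="same"),
    layers.BatchNormalization(), layers.ReLU(),
    layers.Conv2DTranspose(128, 5, strides=2, padding="same"),
    layers.BatchNormalization(), layers.ReLU(),
    layers.Conv2DTranspose(64, 5, strides=2, padding="same"),
    layers.BatchNormalization(), layers.ReLU(),
    layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="tanh"),  # 64x64x3
])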

SECTION V.

Results

A. Adaptive Control on Diverse Datasets Using MLP and DCGAN

In Fig. 7, we show the results of the proposed Ak-GAN using MLP on the diverse datasets. Compared with the samples generated by the conventional GAN models, the adaptive control techniques enabled GANs to capture recognizable features of faces, such as facial contours and eyes. However, these samples also exhibited frequent noise, and the image acuity still left room for improvement. We applied the proposed adaptive control procedure to the DCGAN architecture to further improve visual quality and compared the results with those of the conventional DCGAN. In Fig. 8 (D01-D03 on the CelebA dataset, D07-D09 on the Anime dataset), we show that the Ak-GAN model generates images of higher quality and alleviates mode collapse compared with the conventional DCGAN.

FIGURE 7. The corresponding samples for the Inception Scores presented in Table 7.

FIGURE 8. The corresponding samples for the Inception Scores presented in Table 8.

B. Performance Measure

The Inception Score [13], [16] is an approach for evaluating the performance of GANs quantitatively that correlates well with human judgment. The Inception Score uses an Inception v3 network [17], pre-trained for classification on ImageNet [3], to calculate statistics of the network's outputs when generated samples are fed into it:
\begin{equation*} IS(G) = \exp\left(\frac{1}{N}\sum_{i=1}^{N} D_{KL}\left(p(y \mid x^{(i)}) \parallel p(y)\right)\right) \tag{7} \end{equation*}
where x is an image sampled from the generative distribution P_g, p(y|x) is the conditional class distribution predicted by the Inception v3 network given the sample x, and p(y) is the marginal distribution over classes y. D_KL(p(y|x) || p(y)) is the KL-divergence between the distributions p(y|x) and p(y). For the exponential operation, we use base e in our experiments to make the values easier to compare. The Inception Score reflects two desirable qualities of generative models: the generated images should contain clear (visually recognizable) objects, meaning p(y|x) should have low entropy; and the models should generate diverse samples, meaning p(y) should have high entropy. Therefore, we want a large Inception Score, which means the KL-divergence between p(y|x) and p(y) is large.
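A minimal sketch of Eq. (7), assuming p_yx holds the Inception v3 softmax outputs for the generated samples; the epsilon guard is our addition for numerical safety.

import numpy as np

def inception_score(p_yx, eps=1e-12):
    """Eq. (7) with natural base e. p_yx is an (N, num_classes) array of
    softmax outputs p(y|x) from Inception v3 on N generated samples."""
    p_y = p_yx.mean(axis=0, keepdims=True)                           # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)  # per-sample KL
    return float(np.exp(kl.mean()))                                  # exp of the mean KL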

Table 7 shows the Inception Scores obtained from 50k generated samples of the Ak-GAN models and the conventional GAN model [4] on the Anime and CelebA datasets; both use the MLP architecture, and the corresponding samples are shown in Fig. 7. The highest scores on both datasets are achieved by Ak-GAN models, with the best-performing models based on adaptive control over P_r without momentum. We also tested different c values with the other parameters fixed. The results indicate that a proper value of c increases the Inception Score. Moreover, the best choice of c depends on the dataset: the best score on the Anime dataset is achieved with c equal to 1, whereas for CelebA it is achieved with c equal to 3.

TABLE 7 Inception Scores for Samples Generated by the Conventional GAN Models and Ak-GAN Models With the MLP Architecture

Table 8 presents the Inception Scores for samples generated by Ak-GAN models with the DCGAN architecture, the conventional DCGAN model [12], and some state-of-the-art GAN models: WGAN [23], ABC-GAN [15], and SAGAN [28]. WGAN [23] uses a more effective loss function; ABC-GAN [15] uses an adaptive control method to stabilize the training of D and G; and SAGAN [28] investigates a much deeper network and applies the self-attention mechanism to capture long-range, multi-level dependencies. In general, DCGAN-based models produce more vivid and recognizable samples than MLP-based models, as shown in Fig. 8. However, on the CelebA dataset, the MLP-based models achieved higher scores. This interesting result may suggest that the Inception Score depends not only on the dataset used but also on the underlying models; that is, the absolute value of the Inception Score is not directly comparable between MLP- and DCGAN-based GANs on the same dataset. It would be better to derive a more general scoring mechanism; we therefore leave further investigation of this matter for future work. As shown in Table 8, WGAN slightly outperforms Ak-GAN on the CelebA dataset, whereas the best IS on the Anime dataset is achieved by Ak-GAN compared with the other state-of-the-art GAN models. Relatively higher scores on most test cases suggest that the proposed Ak-GAN models effectively improve the generation quality of GAN models.

TABLE 8 Inception Scores for Samples Generated by Ak-GAN Models With the DCGAN Architecture and Some State-of-the-Art GAN Models

To evaluate the performance of GAN models more accurately, we use the relative inverse Inception Score (RIS) [24]:
\begin{align*} RIS &= 1 - \frac{IS_{g_{avg}}}{IS_{r_{avg}}} \tag{8}\\ IS_{g_{avg}} &= \frac{1}{n} \sum_{i=1}^{n} IS_{g_{i}} \tag{9}\\ IS_{r_{avg}} &= \frac{1}{n} \sum_{i=1}^{n} IS_{r_{i}} \tag{10} \end{align*}
where IS_{g_i} and IS_{r_i} are the Inception Scores for generated and real images, respectively, and IS_{g_avg} and IS_{r_avg} are their averages over n = 10 repetitions. In practice, out of 500k samples, 5k real or generated samples are randomly chosen to obtain IS_{r_i} and IS_{g_i}, respectively; this process is repeated 10 times in total to compute IS_{g_avg} and IS_{r_avg} via equations (9) and (10), from which RIS is obtained using equation (8). RIS is inversely related to IS: the lower the RIS value, the better the model.
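A minimal sketch of Eqs. (8)-(10), assuming each argument is the list of n = 10 Inception Scores from the repeated 5k-sample draws described above:

def relative_inverse_is(is_generated, is_real):
    """Lower RIS indicates a better model."""
    is_g_avg = sum(is_generated) / len(is_generated)   # Eq. (9)
    is_r_avg = sum(is_real) / len(is_real)             # Eq. (10)
    return 1.0 - is_g_avg / is_r_avg                   # Eq. (8)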

In Fig. 9, we compare the RISs of Ak-GAN with those of other models using boxplots. The Ak-GAN based on the MLP architecture obtains smaller scores on both datasets than the original GAN. We also compare Ak-GAN with four state-of-the-art models: DCGAN [12], WGAN [23], ABC-GAN [15], and SAGAN [28]. Ak-GAN shows the best performance on the Anime dataset. On the CelebA dataset, however, the RISs of all models are essentially the same except for ABC-GAN and SAGAN. Despite its deeper network architecture, SAGAN fails to show superiority on the task at hand. For future work, we will explore the performance of models that combine adaptive control and attention mechanisms on diverse datasets.

FIGURE 9. Range of the relative inverse Inception Scores (RISs) of models with each architecture, on each dataset. The cross mark in each range is the average of 10 RISs. The whiskers indicate the minimum and maximum of all of the RISs. The boxes display the variation of the RISs.

SECTION VI.

Conclusion and Future Work

In this paper, a family of adaptive control-based GAN models called Ak-GANs is proposed to stabilize the GAN model when training on diverse datasets, and thereby improve the quality of synthesized images to a certain degree. The proposed method directs the training process using well-trained learning curves (prototypes) obtained on relatively narrow datasets. Compared with our previous work [25], the proposed models cover several Ak-GAN variants in which k depends on more sophisticated criteria and is adjusted with or without momentum. We also investigated Ak-GAN models with both MLP and DCGAN architectures. Compared with MLP-based regular GAN models, the Ak-GAN model improved the Inception Score by 61% and 83% on Anime and CelebA, respectively, whereas the DCGAN-based Ak-GAN model improved by 32% on Anime and showed a slight decrease (9%) on CelebA compared with WGAN [23]. In addition, the Ak-GAN model significantly outperformed the previous state-of-the-art models on the relative inverse Inception Score (RIS), reducing RIS by around 21% and 60% on Anime and CelebA, respectively. Quantitative and qualitative results show that the proposed procedure can indeed improve the stability of GAN training and generate compelling images under various models and diverse datasets, compared with state-of-the-art models such as WGAN [23], ABC-GAN [15], and SAGAN [28].

For future work, we plan to investigate more effective criteria to direct the training, which may encourage the convergence of GANs. We also plan to experiment with more objective functions, such as f-divergence [10], to measure the divergence between the distributions of real and generated data.

ACKNOWLEDGMENT

Xiaohan Ma and Rize Jin contributed equally to this work.

Appendix

SECTION A.

Adaptive Method With Momentum

We next introduce the adaptive methods with momentum, based on P_g or L_D and on P_r or L_G.

1) Adaptive Control Based on P_g or L_D

Because the criterion learning curves for P_g and L_D show the same trend during the training process, we use the same method to control them. In Algorithm 1, when k is greater than one, D is trained k steps for each single step of G. If the current P_g or L_D exceeds the allowable deviation defined by the threshold α, we increase or decrease the number of D steps (represented by k) by 1 according to which of the following cases occurs: positive deviation ((P_g - P_mg)/P_mg > α or (L_D - L_mD)/L_mD > α) or negative deviation ((P_mg - P_g)/P_mg > α or (L_mD - L_D)/L_mD > α). When k is one, G and D are trained alternately, one step each. On a positive deviation, we increase the D steps by making k greater than one (k' = k + 1); on a negative deviation, we increase the G steps by making k less than one (k' = 1/(k + 1)). When k is less than one, G is trained 1/k steps for each single step of D. On a positive deviation, we decrease the G steps by 1; otherwise, we increase the G steps by 1.

Algorithm 1 Adaptive Control Over k Values Based on P_g or L_D (With Momentum)

Input: k, P_g or L_D
Output: k'

1: if k > 1 then
2:   if (P_g - P_mg)/P_mg > α (or (L_D - L_mD)/L_mD > α) then
3:     k' = k + 1
4:   else if (P_mg - P_g)/P_mg > α (or (L_mD - L_D)/L_mD > α) then
5:     k' = k - 1
6:   else
7:     k' = k
8:   end if
9: else if k = 1 then
10:   if (P_g - P_mg)/P_mg > α (or (L_D - L_mD)/L_mD > α) then
11:     k' = k + 1
12:   else if (P_mg - P_g)/P_mg > α (or (L_mD - L_D)/L_mD > α) then
13:     k' = 1/(k + 1)
14:   else
15:     k' = k
16:   end if
17: else
18:   if (P_g - P_mg)/P_mg > α (or (L_D - L_mD)/L_mD > α) then
19:     k' = 1/(1/k - 1)
20:   else if (P_mg - P_g)/P_mg > α (or (L_mD - L_D)/L_mD > α) then
21:     k' = 1/(1/k + 1)
22:   else
23:     k' = k
24:   end if
25: end if
26: return k'
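For reference, a minimal Python rendering of Algorithm 1; v is assumed to hold the current P_g (or L_D) and v_m the criterion value P_mg (or L_mD) from the fitted MNIST curve.

def update_k_momentum(k, v, v_m, alpha=0.2):
    """Algorithm 1 (with momentum): progressively shift k by one step."""
    positive = (v - v_m) / v_m > alpha   # current value above the criterion
    negative = (v_m - v) / v_m > alpha   # current value below the criterion
    if k > 1:
        if positive: return k + 1
        if negative: return k - 1
    elif k == 1:
        if positive: return k + 1
        if negative: return 1.0 / (k + 1)
    else:
        if positive: return 1.0 / (1.0 / k - 1)   # decrease G steps by 1
        if negative: return 1.0 / (1.0 / k + 1)   # increase G steps by 1
    return k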

2) Adaptive Control Based on P_r or L_G

Different from the aforementioned method, Algorithm 2 is based on P_r or L_G. Since the curves of P_r and L_G grow in the opposite direction to the learning curves for P_g and L_D, the adjustment of k is the exact opposite of the previous one. That is, on a positive deviation ((P_r - P_mr)/P_mr > β or (L_G - L_mG)/L_mG > β), the G steps increase or the D steps decrease; otherwise, the G steps decrease or the D steps increase.

Algorithm 2 Adaptive Control Over k Values Based on P_r or L_G (With Momentum)

Input: k, P_r or L_G
Output: k'

1: if k > 1 then
2:   if (P_r - P_mr)/P_mr > β (or (L_G - L_mG)/L_mG > β) then
3:     k' = k - 1
4:   else if (P_mr - P_r)/P_mr > β (or (L_mG - L_G)/L_mG > β) then
5:     k' = k + 1
6:   else
7:     k' = k
8:   end if
9: else if k = 1 then
10:   if (P_r - P_mr)/P_mr > β (or (L_G - L_mG)/L_mG > β) then
11:     k' = 1/(k + 1)
12:   else if (P_mr - P_r)/P_mr > β (or (L_mG - L_G)/L_mG > β) then
13:     k' = k + 1
14:   else
15:     k' = k
16:   end if
17: else
18:   if (P_r - P_mr)/P_mr > β (or (L_G - L_mG)/L_mG > β) then
19:     k' = 1/(1/k + 1)
20:   else if (P_mr - P_r)/P_mr > β (or (L_mG - L_G)/L_mG > β) then
21:     k' = 1/(1/k - 1)
22:   else
23:     k' = k
24:   end if
25: end if
26: return k'

SECTION B.

Adaptive Method Without Momentum

In practical implementations, the progressive adjustment methods improve the stability of the models; however, training is slow. Therefore, we explore a variant that removes the momentum. This means that the k value is immediately switched to the opposite regime (k = 1/2 when k ≥ 1, k = 2 when k < 1) when the control variables fall outside the constraint range.

When k is greater than or equal to one, D is trained k steps for each single step of G. In the case of a positive deviation ((P_g - P_mg)/P_mg > α or (L_D - L_mD)/L_mD > α), where P_g or L_D is greater than the criterion value, indicating that G is overtrained, we increase k by 1. In the case of a negative deviation ((P_mg - P_g)/P_mg > α or (L_mD - L_D)/L_mD > α), in which D is overtrained, we immediately set k to 1/2, which means G is trained 2 steps for each single step of D, to bias training toward G. When k is less than one, where G is trained 1/k steps for each single step of D, on a positive deviation we immediately set k to 2 to bias training toward D; on a negative deviation, we increase the G steps by 1 (k' = 1/(1/k + 1)). We modify k for the algorithm based on P_r or L_G in the opposite manner to the algorithm based on P_g or L_D. These two algorithms are shown in Algorithm 3 and Algorithm 4, respectively.

Algorithm 3 Adaptive Control Over k Values Based on P_g or L_D (Without Momentum)

Input: k, P_g or L_D
Output: k'

1: if k ≥ 1 then
2:   if (P_g - P_mg)/P_mg > α (or (L_D - L_mD)/L_mD > α) then
3:     k' = k + 1
4:   else if (P_mg - P_g)/P_mg > α (or (L_mD - L_D)/L_mD > α) then
5:     k' = 1/2
6:   else
7:     k' = k
8:   end if
9: else
10:   if (P_g - P_mg)/P_mg > α (or (L_D - L_mD)/L_mD > α) then
11:     k' = 2
12:   else if (P_mg - P_g)/P_mg > α (or (L_mD - L_D)/L_mD > α) then
13:     k' = 1/(1/k + 1)
14:   else
15:     k' = k
16:   end if
17: end if
18: return k'

Algorithm 4 Adaptive Control Over k Values Based on P_r or L_G (Without Momentum)

Input: k, P_r or L_G
Output: k'

if k > 1 then
  if (P_r - P_mr)/P_mr > β (or (L_G - L_mG)/L_mG > β) then
    k' = 1/2
  else if (P_mr - P_r)/P_mr > β (or (L_mG - L_G)/L_mG > β) then
    k' = k + 1
  else
    k' = k
  end if
else
  if (P_r - P_mr)/P_mr > β (or (L_G - L_mG)/L_mG > β) then
    k' = 1/(1/k + 1)
  else if (P_mr - P_r)/P_mr > β (or (L_mG - L_G)/L_mG > β) then
    k' = 2
  else
    k' = k
  end if
end if
return k'
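A minimal Python rendering of Algorithm 3, under the same assumptions as the momentum version above (v holds the current P_g or L_D, v_m the criterion value); the P_r / L_G variant in Algorithm 4 flips these update rules.

def update_k_no_momentum(k, v, v_m, alpha=0.2):
    """Algorithm 3 (without momentum): jump straight to the opposite regime."""
    positive = (v - v_m) / v_m > alpha
    negative = (v_m - v) / v_m > alpha
    if k >= 1:
        if positive: return k + 1               # G overtrained: give D more steps
        if negative: return 0.5                 # D overtrained: switch to 2 G steps
    else:
        if positive: return 2.0                 # switch to 2 D steps
        if negative: return 1.0 / (1.0 / k + 1) # increase G steps by 1
    return k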
