Introduction
Due to the success of the Generative Adversarial Network (GAN) in modeling distributions of real-world data, it has been widely used for image generation. Since its introduction by Goodfellow and colleagues [1], many researchers have improved its stability and accuracy by adopting new loss functions [2], [3], designing new network architectures [4], [5], improving the training process and regularization [5], [6], imposing conditions [7]–[13], and inventing progressive methods [14]. Among these, imposing explicit conditions is one of the easiest ways to improve the quality of image generation when well-defined labels exist. In modern conditional GAN frameworks, both the generator and the discriminator are formulated to model the conditional distribution of images given labels.
In this article, we propose an alternative formulation of GAN that models the joint distribution of images and labels. We will show that this joint formulation has two advantages over conditional approaches. The first is robustness to label noise. Typical labels used in image synthesis are annotated by human workers or produced by other machine learning methods, so it is generally difficult to guarantee the completeness or correctness of labels for large-scale data. Since conditional image generation treats labels as a given constraint or strong hypothesis, label noise may degrade the quality of the generated images. Our joint formulation instead treats labels as additional information for modeling the joint distribution; it can be more robust to label noise because the joint distribution assumes no strong conditional dependence between images and labels. We will show that the joint formulation matches the image quality of the conditional formulation on defect-free labels and degrades more gracefully as labels become noisy. Second, and more importantly, we can use any kind of weak labels or additional information correlated with the original image data to enhance unconditional image generation, since our joint GAN does not require those labels when generating images but generates them along with the images. In a conventional conditional formulation, such additional data cannot be fed into the generator at generation time, because we do not know which values should be supplied. Our experiments show that better image generation is possible without explicitly feeding labels or other additional information. Our contributions are summarized as follows:
We propose a novel GAN formulation that models the joint distribution of images and labels, and show that this joint formulation increases robustness to noisy or weak labels.
We demonstrate that this joint formulation can increase the quality of unconditional image generation by incorporating weak labels or additional information correlated with the original image data into the training process. Since these labels are used only for training and our GAN generates both images and labels, no labels need to be fed at generation time.
A Joint Formulation of GAN for Modeling $p(I, L)$
The standard adversarial losses for the discriminator and the generator of a conditional GAN are
\begin{align*}
l(D) = {}&-E_{q(\mathbf{L})}[E_{q(\mathbf{I}|\mathbf{L})}[\log(D(\mathbf{I},\mathbf{L}))]] \\
&-E_{p(\mathbf{L})}[E_{p(G_{\mathbf{I}}(z)|\mathbf{L})}[\log(1-D(G_{\mathbf{I}}(z),\mathbf{L}))]], \tag{1}\\
l(G) = {}&-E_{p(\mathbf{L})}[E_{p(G_{\mathbf{I}}(z)|\mathbf{L})}[\log(D(G_{\mathbf{I}}(z),\mathbf{L}))]]. \tag{2}
\end{align*}
In our joint formulation, we rewrite the discriminator and generator losses with a new generator $G_{\mathbf{I},\mathbf{L}}$ that produces an image and a label jointly:
\begin{align*}
l(D) = {}&-E_{q(\mathbf{L})}[E_{q(\mathbf{I}|\mathbf{L})}[\log(D(\mathbf{I},\mathbf{L}))]] \\
&-E_{p(G_{\mathbf{I},\mathbf{L}}(z))}[\log(1-D(G_{\mathbf{I},\mathbf{L}}(z)))], \tag{3}\\
l(G) = {}&-E_{p(G_{\mathbf{I},\mathbf{L}}(z))}[\log(D(G_{\mathbf{I},\mathbf{L}}(z)))]. \tag{4}
\end{align*}
As the equations show, no modification is made to the discriminator, since it already has a joint formulation that takes an image and a label as input.
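To make the objective concrete, the following is a minimal PyTorch sketch of Eqs. (3) and (4), assuming a discriminator `D` that scores an (image, label) pair with a raw logit and a generator `G` that maps noise `z` to a generated (image, label) pair; the network architectures themselves are left unspecified.

```python
# Minimal sketch of the joint losses in Eqs. (3)-(4); D and G are assumed
# user-defined modules, and D outputs raw (pre-sigmoid) logits.
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, real_images, real_labels, z):
    # First term of Eq. (3): -E[log D(I, L)] over real (image, label) pairs.
    real_logits = D(real_images, real_labels)
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    # Second term of Eq. (3): -E[log(1 - D(G(z)))] over generated pairs.
    fake_images, fake_labels = G(z)
    fake_logits = D(fake_images.detach(), fake_labels.detach())
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(D, G, z):
    # Eq. (4): -E[log D(G(z))]; the generator must make its joint
    # (image, label) sample look real to the discriminator.
    fake_images, fake_labels = G(z)
    fake_logits = D(fake_images, fake_labels)
    return F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
```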
Three different GAN formulations: (left) unsupervised GAN modeling $p(\mathbf{I})$, (middle) conditional GAN modeling $p(\mathbf{I}|\mathbf{L})$, and (right) our joint GAN modeling $p(\mathbf{I},\mathbf{L})$.
The benefits of the joint formulation over the conditional formulation are limited when well-defined labels exist, i.e., labels made carefully by human workers or external oracles. It is well known that modeling a joint distribution is generally more difficult than modeling a conditional distribution because of its increased dimensionality. By the chain rule, however, the joint distribution always factors into lower-dimensional ones, $p(\mathbf{I},\mathbf{L}) = p(\mathbf{L}|\mathbf{I})\,p(\mathbf{I})$, so the discriminator can represent the joint distribution through lower-dimensional probability distributions.
A. Boosting Unsupervised Image Generation
With our joint formulation, we can add additional information that depends on the original data as a weak label for the generator. Figure 2 illustrates how the output of an external classification network can be added as a weak label for training the joint generator.
Enhancing unsupervised image generation by using an additional label predictor.
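As a hedged illustration of the idea in Figure 2, the sketch below pairs unlabeled real images with soft labels produced by a frozen, pretrained classifier. Torchvision's Inception v3 is used purely as a stand-in; the exact label predictor is not specified at this point in the text.

```python
# Sketch: weak labels for real images from an external, frozen classifier.
import torch
from torchvision.models import inception_v3

# "IMAGENET1K_V1" is torchvision's pretrained ImageNet weights identifier.
classifier = inception_v3(weights="IMAGENET1K_V1").eval()

@torch.no_grad()
def weak_labels(images):
    # images: (N, 3, 299, 299), preprocessed as the classifier expects.
    logits = classifier(images)           # (N, 1000) class logits
    return torch.softmax(logits, dim=1)   # soft label per image

# Real training pairs become (images, weak_labels(images)), so no human
# annotation is needed; the generator still emits its own (image, label).
```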
Experiment
We used CIFAR10, CIFAR100, and STL for our experiments, and resized STL images to match the input resolution of our networks.
We first show that our joint formulation is as good as the conditional formulation when modeling the conditional distribution of images given labels. To this end, we compare a conditional GAN and our joint GAN on CIFAR10 with clean labels and with 20% and 40% label noise.
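For reference, below is a minimal sketch of how symmetric label noise of this kind can be injected, assuming each corrupted label is replaced by a uniformly random class; the exact corruption scheme used in the experiment is an assumption.

```python
# Sketch: symmetric label noise, e.g. 20% or 40% as in the experiment above.
import torch

def corrupt_labels(labels, num_classes, noise_rate):
    # With probability noise_rate, replace a label by a uniform random class.
    flip = torch.rand(labels.shape) < noise_rate
    random_labels = torch.randint(0, num_classes, labels.shape)
    return torch.where(flip, random_labels, labels)

# CIFAR10 with 40% label noise:
# noisy = corrupt_labels(labels, num_classes=10, noise_rate=0.4)
```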
Comparison of images generated by (top row) conditional GAN and (bottom row) joint GAN on CIFAR10 with noisy labels. (From left to right) Generated images with clean labels (0% noise), 20% noise, and 40% noise. In each sub-figure, rows are class IDs and columns are random samples for each class ID. Our JGAN shows a better inception score than the conditional GAN (refer to Table 1(a) for the inception score of each case).
Comparison of (left) real unlabeled STL images, (middle) images generated by an unsupervised GAN, and (right) images generated by our joint GAN with weak labels from an ImageNet classification task; the joint GAN shows a better inception score than unsupervised image synthesis (refer to Table 2).
Our next experiment focuses on improving unconditional image generation by incorporating additional information. We used the class probabilities of an inception network as the starting point for this additional information, with the same inception network version used in [4]. Since the network outputs a probability distribution over 1000 classes, and it is difficult to find an optimal network architecture to capture such a high-dimensional distribution, we applied truncated singular value decomposition (SVD) to reduce the dimension to 64, and then applied a softmax to the output of the truncated SVD to make it a probability distribution in the lower-dimensional space. Table 2 summarizes the comparison between unsupervised and joint image generation. We used the same network architecture for both settings except for the additional label function approximation; the label generator differs slightly from the ones used in Table 5, and Table 6 describes the network for weak label generation. JGAN consistently generates images with better inception and FID [19] scores than the unsupervised baseline. We achieved the best unsupervised inception score on the STL dataset compared to [5] and [20], which reported 9.05 and 9.50, respectively. Note that our baseline implementation already achieves a better result due to a different network architecture and training process, but our joint formulation improves the inception and FID scores even further.
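As an illustration of this reduction step, here is a sketch using scikit-learn's TruncatedSVD and SciPy's softmax; the variable names and the dummy input are assumptions, and in practice `probs` would hold the inception class probabilities of the training images.

```python
# Sketch: compress 1000-class inception probabilities to a 64-dim
# distribution via truncated SVD followed by a softmax.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from scipy.special import softmax

# Stand-in for the real (num_images, 1000) inception probabilities.
probs = np.random.dirichlet(np.ones(1000), size=512)

svd = TruncatedSVD(n_components=64)
reduced = svd.fit_transform(probs)       # (512, 64) projected scores
weak_labels = softmax(reduced, axis=1)   # probability distribution per image
```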
Conclusion
In this article, we proposed a novel GAN framework that models the joint probability distribution of images and labels. We showed that this joint formulation generates images of the same quality as conventional conditional image generation with clean labels, and remains robust when labels are noisy. We also applied our method to improve the quality of unconditional image generation by incorporating additional information correlated with the original image data. We believe this joint formulation provides an easy way to feed many kinds of relevant information or weak labels into the GAN framework with a simple modification of the generator. Interesting directions for future work include finding optimal network architectures for the label generator and testing other methods for generating the additional information used with our joint formulation. Although we used images as our main target domain, we expect our formulation to work in other domains as well.