
ViT-R50 GAN: Vision Transformers Hybrid Model based Generative Adversarial Networks for Image Generation



Abstract:

In recent years, GANs have demonstrated tremendous potential in image generation. The Transformer, which originated in the NLP field, is gradually being applied to computer vision as well, and the Vision Transformer (ViT) performs well on image classification. In this paper, we design a ViT-based GAN architecture for image generation. We found that a Transformer-based generator performs poorly because every channel shares the same attention matrix. To overcome this problem, we increase the number of heads so that more attention matrices are generated; we name this component enhanced multi-head attention, and it replaces the standard multi-head attention in the Transformer. Second, our discriminator is a hybrid model of ResNet50 and ViT, in which ResNet50 performs feature extraction and thereby improves the discriminator. Experiments show that our architecture performs well on image generation tasks.
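The abstract does not reproduce the implementation, but the core idea of enhanced multi-head attention, raising the head count so that fewer channels share a single attention matrix (in the limit, one attention matrix per channel), can be sketched in a few lines of PyTorch. The module name, the channels_per_head parameter, and the one-head-per-channel default below are our illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class EnhancedMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, channels_per_head: int = 1):
        super().__init__()
        assert embed_dim % channels_per_head == 0
        # More heads -> more distinct attention matrices across channels.
        num_heads = embed_dim // channels_per_head
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, embed_dim); self-attention over the token axis.
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

tokens = torch.randn(2, 64, 128)                 # 2 samples, 64 tokens, 128 channels
ema = EnhancedMultiHeadAttention(embed_dim=128)  # 128 heads: one attention matrix per channel
print(ema(tokens).shape)                         # torch.Size([2, 64, 128])

With channels_per_head = 1 each of the 128 heads has dimension 1, so every channel attends with its own attention matrix; standard multi-head attention is recovered by setting channels_per_head to the usual head width.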
Date of Conference: 06-08 January 2023
Date Added to IEEE Xplore: 02 June 2023
Conference Location: Guangzhou, China

I. Introduction

Since the Generative Adversarial Network (GAN) [1] was proposed, GANs have raised image generation to a new level, because the continual game between generator and discriminator improves the model's expressive power. Traditional GANs typically use fully connected networks, which were difficult to train until the emergence of DC-GAN [2]. DC-GAN introduces convolutional neural networks (CNNs) [11] into both the generator and the discriminator, replacing the pooling layers of the discriminator with convolutions and using four fractionally-strided convolutions in the generator to turn random noise into an image. Compared with the original GAN, DC-GAN replaces almost all fully connected layers with convolutional layers, and its discriminator is nearly symmetric to its generator. With the stronger fitting and representational ability of CNNs, subsequent GANs produce vivid images and greatly improve the diversity of generated images.
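As a concrete reference point for the DC-GAN generator described above, the PyTorch sketch below maps a noise vector to a 64x64 RGB image via a projection layer followed by four fractionally-strided (transposed) convolutions; the channel widths and output resolution are illustrative choices, not values taken from this paper.

import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, z_dim: int = 100, base: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            # Project the noise vector to a 4x4 feature map.
            nn.ConvTranspose2d(z_dim, base * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base * 8), nn.ReLU(True),
            # Four fractionally-strided convolutions, each doubling resolution:
            # 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64.
            nn.ConvTranspose2d(base * 8, base * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 4), nn.ReLU(True),
            nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 2), nn.ReLU(True),
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base), nn.ReLU(True),
            nn.ConvTranspose2d(base, 3, 4, 2, 1),
            nn.Tanh(),  # scale output images to [-1, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, z_dim) noise -> (batch, 3, 64, 64) image.
        return self.net(z.view(z.size(0), -1, 1, 1))

g = DCGANGenerator()
print(g(torch.randn(2, 100)).shape)  # torch.Size([2, 3, 64, 64])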

References
1. Antonia Creswell et al., "Generative adversarial networks: An overview", IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 53-65, 2018.
2. Alec Radford, Luke Metz and Soumith Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks", arXiv preprint, 2015.
3. Jonas Gehring et al., "Convolutional sequence to sequence learning", International Conference on Machine Learning, PMLR, 2017.
4. Alexey Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale", arXiv preprint, 2020.
5. Kaiming He et al., "Deep residual learning for image recognition", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
6. Yifan Jiang, Shiyu Chang and Zhangyang Wang, "TransGAN: Two Transformers can make one strong GAN", arXiv preprint, 2021.
7. Tero Karras et al., "Analyzing and improving the image quality of StyleGAN", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
8. Matthew Tancik et al., "Fourier features let networks learn high frequency functions in low dimensional domains", Advances in Neural Information Processing Systems, vol. 33, pp. 7537-7547, 2020.
9. Tero Karras et al., "Progressive growing of GANs for improved quality, stability, and variation", arXiv preprint, 2017.
10. Jeeseung Park and Younggeun Kim, "Styleformer: Transformer based generative adversarial networks with style vector", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
11. Yann LeCun et al., "Backpropagation applied to handwritten zip code recognition", Neural Computation, vol. 1, no. 4, pp. 541-551, 1989.
12. Alex Krizhevsky and Geoffrey Hinton, "Learning multiple layers of features from tiny images", 2009.
13. Ziwei Liu et al., "Deep learning face attributes in the wild", Proceedings of the IEEE International Conference on Computer Vision, 2015.
14. Fisher Yu et al., "LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop", arXiv preprint, 2015.
15. Martin Heusel et al., "GANs trained by a two time-scale update rule converge to a local Nash equilibrium", Advances in Neural Information Processing Systems, vol. 30, 2017.
16. Tim Salimans et al., "Improved techniques for training GANs", Advances in Neural Information Processing Systems, vol. 29, 2016.
17. Xueqi Hu et al., "Style Transformer for image inversion and editing", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
18. Ashish Vaswani et al., "Attention is all you need", Advances in Neural Information Processing Systems, vol. 30, 2017.