
ViT-R50 GAN: Vision Transformers Hybrid Model based Generative Adversarial Networks for Image Generation


Abstract:

In recent years, GANs have demonstrated tremendous potential in image generation. The Transformer, which originated in the NLP field, is gradually being applied to computer vision, and the Vision Transformer (ViT) performs well on image classification problems. In this paper, we design a ViT-based GAN architecture for image generation. We found that a Transformer-based generator performs poorly because it uses the same attention matrix for every channel. To overcome this problem, we increase the number of heads to generate more attention matrices; we name this component enhanced multi-head attention, and it replaces the multi-head attention in the Transformer. Second, our discriminator uses a hybrid model of ResNet50 and ViT, where ResNet50 performs feature extraction, making the discriminator perform better. Experiments show that our architecture performs well on image generation tasks.
Date of Conference: 06-08 January 2023
Date Added to IEEE Xplore: 02 June 2023
Conference Location: Guangzhou, China

I. Introduction

Since the Generative Adversarial Network (GAN) [1] was proposed, it has raised the image generation task to a new level, because a GAN improves its modeling ability through a continuous game between generator and discriminator. Traditional GANs usually use fully connected neural networks and are difficult to train. This changed with DC-GAN [2], which introduced convolutional neural networks (CNNs) [11] into the generator and discriminator: the discriminator replaces pooling layers with strided convolutions, and the generator uses four fractionally-strided convolutions to complete the generation process from random noise to images. Compared with the original GAN, DC-GAN almost entirely replaces fully connected layers with convolutional layers, and the discriminator is nearly symmetric to the generator. With the stronger fitting and expressive ability of CNNs, subsequent GANs produce vivid images and greatly improve the diversity of generated images.
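As a rough illustration of the fractionally-strided (transposed) convolution pipeline described above, the sketch below computes the spatial-size progression of a DC-GAN-style generator. The hyperparameters (4×4 kernels, stride 2, padding 1, a 4×4 starting feature map grown to a 64×64 image) are assumptions matching the common DC-GAN configuration, not values taken from this paper.

```python
def deconv_out_size(size_in, kernel=4, stride=2, padding=1):
    """Output size of a fractionally-strided (transposed) convolution:
    out = (in - 1) * stride - 2 * padding + kernel."""
    return (size_in - 1) * stride - 2 * padding + kernel

# The noise vector is first projected to a small (here 4x4) feature map;
# four fractionally-strided convolutions then each double the spatial size,
# yielding the final image resolution.
size = 4
sizes = [size]
for _ in range(4):
    size = deconv_out_size(size)
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```

With these settings each layer exactly doubles the resolution, which is why four layers suffice to go from a 4×4 projection of the noise vector to a 64×64 image.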
