
Image Caption via Visual Attention Switch on DenseNet



Abstract:

We introduce a novel approach for converting images into corresponding language descriptions. The method follows the popular encoder-decoder architecture: the encoder uses the recently proposed densely connected convolutional network (DenseNet) to extract feature maps, while the decoder uses a long short-term memory (LSTM) network to translate the feature maps into descriptions. We predict the next word of the description by effectively combining the feature maps with the word embedding of the current input word through a “visual attention switch”. Finally, we compare the performance of the proposed model with other baseline models and achieve good results.
Date of Conference: 22-24 August 2018
Date Added to IEEE Xplore: 08 November 2018

Conference Location: Guiyang, China

1 Introduction

Automatic description generation for images is a very popular topic because it combines computer vision and natural language processing (NLP). Meanwhile, the cross-modality problem has also received extensive attention [1] [2]. In general, the image description process is similar to machine translation and follows the encoder-decoder structure [3]: the encoder maps the input to a fixed-length feature vector, and the decoder converts the feature vector into the final output. Unlike the translation models in NLP, the encoder in image captioning is a convolutional neural network (CNN), such as AlexNet [4], ResNet [5], or Inception [6], which extracts feature maps from the input image. In the decoder, both machine translation and image captioning apply a recurrent neural network (RNN) [7] or an LSTM [8] to convert the feature vector into the corresponding output. In this paper, we propose an approach that uses a DenseNet [9] as the encoder and an LSTM as the decoder, and introduces a novel structure named the “visual attention switch”, which combines the encoder and decoder effectively. The contributions of this paper are as follows:
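The following is a minimal sketch (not the authors' released code) of the encoder-decoder pipeline described above: a DenseNet backbone extracts spatial feature maps and an LSTM decoder generates the caption word by word. Class names, dimensions, and the teacher-forcing loop are illustrative assumptions.

```python
# Minimal encoder-decoder captioning sketch: DenseNet encoder + LSTM decoder.
import torch
import torch.nn as nn
from torchvision import models


class Encoder(nn.Module):
    """DenseNet-121 backbone; returns spatial feature maps as a (B, HW, C) tensor."""
    def __init__(self):
        super().__init__()
        self.backbone = models.densenet121(weights=None).features  # no pretrained weights needed here

    def forward(self, images):                      # images: (B, 3, 224, 224)
        fmap = self.backbone(images)                # (B, 1024, 7, 7)
        return fmap.flatten(2).transpose(1, 2)      # (B, 49, 1024)


class CaptionDecoder(nn.Module):
    """LSTM decoder: mean-pooled image features initialise the hidden state,
    then words are predicted one time step at a time (teacher forcing)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):          # captions: (B, T) word ids
        mean_feat = features.mean(dim=1)             # (B, feat_dim)
        h, c = self.init_h(mean_feat), self.init_c(mean_feat)
        logits = []
        for t in range(captions.size(1)):
            emb = self.embed(captions[:, t])         # ground-truth word at step t
            h, c = self.lstm(emb, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)            # (B, T, vocab_size)


if __name__ == "__main__":
    enc, dec = Encoder(), CaptionDecoder(vocab_size=10000)
    feats = enc(torch.randn(2, 3, 224, 224))
    scores = dec(feats, torch.randint(0, 10000, (2, 12)))
    print(scores.shape)                              # torch.Size([2, 12, 10000])
```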

Unlike previous methods that use a ResNet or a VGG network [10] to extract image feature maps, we are the first to use DenseNet, motivated by the large differences in visual significance among feature maps at different depths.

We introduce a novel structure named the “visual attention switch” to combine the encoder and decoder effectively. The structure computes the relevance between the word embedding of the input word and the hidden state of the LSTM, and applies attention to the corresponding feature maps. We predict the next word of the description by feeding the concatenation of the attended feature maps and the word embedding of the current input word into the LSTM, as sketched after this list.
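Below is a hedged sketch of how such a “visual attention switch” could be realized, based only on the description above: a gate computed from the current word embedding and the previous LSTM hidden state modulates spatial attention over the feature maps, and the attended visual vector is concatenated with the word embedding to form the LSTM input at each step. The exact formulation in the paper may differ; all names and dimensions here are assumptions.

```python
# Sketch of a "visual attention switch" module (assumed formulation, not the authors' code).
import torch
import torch.nn as nn


class VisualAttentionSwitch(nn.Module):
    """Gates spatial attention over feature maps using the current word embedding
    and the previous LSTM hidden state, then concatenates the attended visual
    vector with the word embedding to form the LSTM input."""
    def __init__(self, feat_dim=1024, embed_dim=256, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.attn_score = nn.Linear(attn_dim, 1)
        # the "switch": a scalar gate from word embedding + hidden state
        self.switch = nn.Linear(embed_dim + hidden_dim, 1)

    def forward(self, features, word_emb, h_prev):
        # features: (B, N, feat_dim), word_emb: (B, embed_dim), h_prev: (B, hidden_dim)
        scores = self.attn_score(torch.tanh(
            self.feat_proj(features) + self.hidden_proj(h_prev).unsqueeze(1)))   # (B, N, 1)
        alpha = torch.softmax(scores, dim=1)                  # spatial attention weights
        context = (alpha * features).sum(dim=1)               # attended visual vector, (B, feat_dim)
        gate = torch.sigmoid(self.switch(torch.cat([word_emb, h_prev], dim=1)))  # (B, 1)
        return torch.cat([word_emb, gate * context], dim=1)   # LSTM input at this step


if __name__ == "__main__":
    B, N = 2, 49
    switch = VisualAttentionSwitch()
    lstm = nn.LSTMCell(256 + 1024, 512)   # input = word embedding + attended features
    feats = torch.randn(B, N, 1024)
    emb_t = torch.randn(B, 256)
    h = c = torch.zeros(B, 512)
    x_t = switch(feats, emb_t, h)          # (B, 1280)
    h, c = lstm(x_t, (h, c))
    print(h.shape)                         # torch.Size([2, 512])
```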
