1 Introduction
Automatic description generation for images is a popular research topic because it combines computer vision and natural language processing (NLP). Meanwhile, cross-modality problems have also received extensive attention [1] [2]. In general, image description follows an encoder-decoder structure similar to that of machine translation [3]: the encoder maps the input to a fixed-length feature vector, and the decoder converts this feature vector into the final output. Unlike translation models in NLP, the encoder in image captioning is a convolutional neural network (CNN), for example AlexNet [4], ResNet [5], or Inception [6], which extracts feature maps from the input image. In the decoder, both machine translation and image captioning apply a recurrent neural network (RNN) [7] or an LSTM [8] to convert the feature vector into the corresponding output.

In this paper, we propose an approach that uses a DenseNet [9] as the encoder and an LSTM as the decoder, and we introduce a novel structure named the “visual attention switch”, which combines the encoder and decoder effectively. The contributions of this paper are as follows:
Unlike previous methods that use a ResNet or a VGG network [10] to extract image feature maps, we are the first to use a DenseNet, motivated by the large differences in visual significance among feature maps at different depths (a minimal feature-extraction sketch is given below).
We introduce a novel structure named the “visual attention switch” that effectively combines the encoder and decoder. The structure computes the tightness between the word embedding of the input word and the hidden state of the LSTM, and applies attention to the corresponding feature maps. To predict the next word of the description, we concatenate the attended feature maps with the word embedding of the current input word and feed the result to the LSTM (a minimal decoding-step sketch is given below).
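The following is a minimal sketch of collecting feature maps at several depths of a DenseNet encoder, assuming torchvision's densenet121. The tapped block names, input size, and the weights argument are illustrative assumptions rather than the exact configuration used in this paper.

```python
import torch
from torchvision import models

# DenseNet trunk only (no classifier); weights=None is a placeholder choice.
densenet = models.densenet121(weights=None).features.eval()

def multi_depth_features(images, taps=("denseblock2", "denseblock3", "denseblock4")):
    """Run the DenseNet trunk and keep the feature maps after selected blocks."""
    feats, x = {}, images
    for name, layer in densenet.named_children():
        x = layer(x)
        if name in taps:
            feats[name] = x  # (B, C, H, W) feature maps at this depth
    return feats

with torch.no_grad():
    out = multi_depth_features(torch.randn(1, 3, 224, 224))
    # Shallower blocks retain finer spatial detail; deeper blocks are more semantic.
```

Below is a minimal sketch of one decoding step with the “visual attention switch”, also assuming PyTorch. The layer names, dimensions, and the sigmoid gate used as the “switch” score are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class VisualAttentionSwitch(nn.Module):
    def __init__(self, embed_dim, hidden_dim, feat_dim):
        super().__init__()
        # "Switch": scores how tightly the current word relates to the LSTM hidden state.
        self.switch = nn.Linear(embed_dim + hidden_dim, 1)
        # Spatial attention over the flattened feature maps.
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, word_emb, hidden, feats):
        # word_emb: (B, E), hidden: (B, H), feats: (B, R, F) flattened feature maps
        gate = torch.sigmoid(self.switch(torch.cat([word_emb, hidden], dim=1)))  # (B, 1)
        h_exp = hidden.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([feats, h_exp], dim=2)).squeeze(2)          # (B, R)
        alpha = torch.softmax(scores, dim=1)
        context = (alpha.unsqueeze(2) * feats).sum(dim=1)                        # (B, F)
        return gate * context                                                    # gated visual context

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn_switch = VisualAttentionSwitch(embed_dim, hidden_dim, feat_dim)
        # LSTM input is the concatenation of word embedding and attended features.
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, state):
        h, c = state
        emb = self.embed(word_ids)            # (B, E)
        ctx = self.attn_switch(emb, h, feats) # (B, F)
        h, c = self.lstm(torch.cat([emb, ctx], dim=1), (h, c))
        return self.fc(h), (h, c)             # logits over the next word, new state
```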