I. Introduction
Image captioning, a vision-language task, is a challenging problem in computer vision and machine learning that has attracted increasing attention from researchers [1]–[6]. Its objective is to generate a natural language description of a given image, which essentially amounts to translating between two disparate modalities of information. Compared with conventional computer vision tasks, image captioning is more difficult because it requires not only capturing the information contained in an image, but also extracting the semantic correlation between the captured visual information and the relevant language expressions.