I. Introduction
Computer vision (CV) and natural language processing (NLP) have advanced rapidly, driven by unprecedented progress in deep learning. In addition to the prominent achievements in each field individually, there has been significant progress on the challenging task of combining vision and language by exploring their contextual relations. Over the last decade, such Vision-and-Language (VL) research has attracted considerable interest across many applications, such as visual question answering [1], [2], image captioning [3]–[5], referring expression comprehension [6], [7], and image-text retrieval [8]–[11].