1. Introduction
With the advent of the big data era, the amount of multi-modal data (e.g., images, text, audio, video, 3D models) on the Internet is growing explosively. This trend poses unprecedented challenges for accurate and efficient cross-modal retrieval [1]. An active research topic in multimedia, cross-modal retrieval aims to retrieve objects of other modalities given a query in one modality. This technology can be applied in many scenarios, such as multimedia search, recommender systems, and visual question answering (VQA).