I. Introduction
With the development of the Internet, the volume of multi-modal data online, such as images, text, audio, and video, has been increasing rapidly. Cross-modal retrieval, which integrates images, text, and other modalities, has become a research hotspot in the field of multimedia information retrieval. Unlike the traditional single-modal information retrieval task, the cross-modal retrieval task allows a user to submit a query in one modality and receive results that span multiple modalities. For example, a user visiting Buckingham Palace can submit a photo of it as an image query and receive related images, text, and video about Buckingham Palace.