I. Introduction
Information retrieval is one of the most common tools for finding what users need from a few key clues. With the advent of the big data era, users face the challenge of retrieving information of interest not only from massive homogeneous data, but also from multimedia data. In this context, cross-modal retrieval methods have emerged and become increasingly important in modern life. These methods search for items of interest across heterogeneous modalities, for example finding images that illustrate a textual scene or finding a picture from a vague textual description. To further improve the efficiency of cross-modal retrieval, learning to hash has been proposed in recent years; it condenses high-dimensional images and texts into compact binary codes. Owing to the high efficiency of binary codes in both storage and computation, cross-modal hashing methods have gained significant popularity in the information retrieval community [1], [2], [3], [4], [5].