I. Introduction
In recent years, multi-modal tasks involving videos and texts, such as video question answering [1], [2], video grounding [3], and video captioning [4], [5], have attracted tremendous interest due to the explosive growth of short videos and the rapid development of deep learning. These tasks aim to bridge the gap between human and machine intelligence by exploiting the complementary nature of videos and texts. A core problem underlying these tasks is how to establish the correspondence between videos and texts, which is commonly referred to as video-text retrieval. Video-text retrieval remains challenging due to the substantial heterogeneity of the two modalities, which leads to considerable differences between their representation spaces. Nevertheless, great progress has been made over the past decades [6], [7], [8], [9], [10].