I. Introduction
With the explosive development of the internet users, a massive amount of videos were created and uploaded to the internet. The ability to accurately retrieve videos from enormous video contents with a text query is thus essential to quickly find relevant information that we want. The task of text-video retrieval [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12] is an approach to solve this problem, and large-scale multimodal pre-training [13], [14], [15], [16], [17] has proved that it can further boost the retrieval performance. This paper studies the video-language pre-training for text-video retrieval.