1. Introduction
Video-Language Representation Learning [33], [42], [26], [48], [46] is a fundamental problem in multimodal intelligence and has demonstrated great practical value in real-world applications such as video captioning [29], [40], video question answering [46], [26], [57], and video retrieval [7], [17], [16]. Essentially, a video-language pair can be seen as two temporal sequences, each of which is coherent and changes smoothly, and the two sequences are aligned with each other over time. Therefore, unlike single-modality representation learning (e.g., of video or text alone), which requires capturing the temporal dynamics within a sequence [42], [26], [48], [13], multi-modality learning additionally requires modeling the temporal alignment between the two concurrent modalities. We refer to this requirement of modeling both temporal dynamics and temporal alignment in video-language learning as temporal concurrency.