I. Introduction
Video question answering (VideoQA) is a challenging task in multimedia intelligence [1], [2], [3] that aims to answer a question based on a thorough understanding of a given video. The task requires strong cognitive capability to build spatio-temporal visual representations guided by the compositional semantics of the question. In recent years, VideoQA has drawn increasing attention because of its wide range of applications, e.g., human-robot interaction and autonomous driving. Despite recent progress, VideoQA remains challenging because it requires effective reasoning about complex spatio-temporal relations across the vision and language modalities [4], [5].