1. Introduction
With the success of Transformers in computer vision, e.g., classification [14], [52], object detection [10], [32], [43], [61], [75], [77], and segmentation [59], [64], a line of works [16], [33], [51], [57], [60], [76] has proposed video Transformers to comprehend videos for various downstream tasks. The attention mechanism in Transformers exhibits desirable characteristics for video understanding, such as the ability to capture spatial and temporal dependencies simultaneously.
[Figure: Comparison of vid-TLDR (Ours) with UMT [33]. Without any additional training, vid-TLDR obtains comparable or even better performance than the base model UMT (left) while considerably reducing the computational cost (right). UMT-B (87M) is used.]
Consequently, these video Transformers have become the primary backbones for various downstream tasks in the video domain, including action recognition [65], [73], video-text retrieval [17], [38], and video question-answering [18], [63]. Meanwhile, the self-attention mechanism entails dot-product computations between every pair of tokens, which incurs a cost quadratic in the number of tokens. This poses a challenge for existing video Transformers such as UMT [33], which tokenize the entire video into a large number of tokens.
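To make the quadratic cost concrete, the following minimal PyTorch sketch (not taken from UMT; the frame count, patch grid, and embedding width are illustrative assumptions) shows how a ViT-style tokenization of a short clip already produces an N x N attention-score matrix, so doubling the number of frames quadruples the size of that matrix.

```python
import torch

# Illustrative sketch of single-head self-attention over video tokens.
# The tokenization parameters below are assumptions for illustration,
# not the exact configuration of UMT or vid-TLDR.
frames, patches_per_frame, dim = 8, 14 * 14, 768
num_tokens = frames * patches_per_frame          # 8 * 196 = 1568 tokens for one clip

x = torch.randn(1, num_tokens, dim)              # (batch, N, C) token embeddings
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))

queries, keys, values = x @ w_q, x @ w_k, x @ w_v
scores = queries @ keys.transpose(-2, -1) / dim ** 0.5   # (1, N, N): N^2 dot products
out = scores.softmax(dim=-1) @ values                    # overall cost grows as O(N^2 * C)

print(scores.shape)   # torch.Size([1, 1568, 1568]); 16 frames would give 3136 x 3136
```

Reducing the number of tokens N, rather than the embedding width C, is therefore the most direct lever on this cost, which is the setting token-merging approaches such as vid-TLDR target.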