1. Introduction
Recent vision-language models have shown remarkable progress, driven by large-scale pre-trained transformer models [10], [39], [9], [38], [17], [45], [44]. These models have been incorporated into video understanding methods, including video question answering (VideoQA), through multimodal fusion on large-scale multimodal datasets [41], [3], [60]. However, adapting pre-trained models to video-language tasks with limited data is challenging. This is due to the gap between the visual and language modalities and, more importantly, because fine-tuning the entire model on limited data can lead to overfitting and to forgetting previously acquired knowledge.