1. Introduction
Recent trends in machine learning have shown growing interest in multimodal learning, particularly in vision-language tasks such as visual question answering, image-text retrieval, and video grounding. Video moment retrieval, also known as video grounding, aims to semantically align a video segment with a given sentence query. Numerous approaches have been proposed for video grounding, but their results have been unsatisfactory due to limitations in capturing both spatial and temporal information [8]. Transformer-based methods have dominated the vision-language landscape in recent years and have also been applied effectively to video grounding [2], [25], [28], [29], [32]. One advantage of transformers over other neural network architectures is their ability to model long sequences without losing context [22], along with the reduced need for hand-engineered fusion mechanisms to achieve effective multimodal interaction.