Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval


Abstract:

Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities. Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences. We question whether the same problem also exists in the video-text domain, with the additional need to preserve both spatial and temporal information. Thus, we evaluate a recently proposed solution involving the addition of an asymmetric co-attention network for video grounding tasks. Additionally, we incorporate momentum contrastive loss for robust, discriminative representation learning in both modalities. We note that the integration of these supplementary modules yields better performance compared to state-of-the-art models on the TACoS dataset and comparable results on ActivityNet Captions, all while utilizing significantly fewer parameters with respect to the baseline.
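To make the momentum contrastive objective concrete, the following is a minimal sketch of a MoCo-style cross-modal InfoNCE loss with an EMA-updated momentum encoder. The function names, feature dimensions, and hyperparameters (temperature, momentum coefficient) are illustrative assumptions, not the exact formulation used in the paper.

```python
# Hedged sketch: cross-modal momentum contrastive (InfoNCE) loss.
# Assumes mean-pooled video and text features of equal dimension D;
# all names and hyperparameters are illustrative, not the paper's.
import torch
import torch.nn.functional as F


def momentum_update(encoder: torch.nn.Module,
                    momentum_encoder: torch.nn.Module,
                    m: float = 0.995) -> None:
    """EMA update of the momentum encoder's parameters (MoCo-style)."""
    with torch.no_grad():
        for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
            p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)


def cross_modal_nce(video_feats: torch.Tensor,
                    text_feats_momentum: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: matched (video_i, text_i) pairs are positives,
    all other texts in the batch serve as negatives."""
    v = F.normalize(video_feats, dim=-1)            # (B, D)
    t = F.normalize(text_feats_momentum, dim=-1)    # (B, D)
    logits = v @ t.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    B, D = 8, 256
    loss = cross_modal_nce(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

In practice such losses are usually computed symmetrically (video-to-text and text-to-video) and averaged, and a queue of momentum-encoded features can enlarge the effective set of negatives beyond the batch.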
Date of Conference: 01-06 January 2024
Date Added to IEEE Xplore: 16 April 2024
Conference Location: Waikoloa, HI, USA

1. Introduction

Recent trends in machine learning have shown a growing interest in multimodal learning, specifically in vision-language tasks such as visual question answering, image-text retrieval, and video grounding. Video moment retrieval, also known as video grounding, aims to align a video segment semantically with a given sentence query. Numerous approaches have been proposed to address video grounding, but their results were unsatisfactory due to limitations in capturing both spatial and temporal information [8]. Transformer-based methods have dominated the vision-language landscape in recent years and have also been used effectively for video grounding [2], [25], [28], [29], [32]. One advantage of transformers over other neural network architectures is their ability to model long sequences without losing context [22], together with the reduced need for hand-engineered fusion mechanisms to achieve effective multimodal interaction, as sketched below.
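As a rough illustration of the kind of cross-modal interaction a transformer enables, the sketch below implements a single co-attention block in which text tokens attend over video tokens via multi-head cross-attention; an asymmetric design could, for example, route the shorter text sequence through fewer such blocks than the longer video sequence. The class name, dimensions, and layer choices are assumptions for illustration and not the authors' exact architecture.

```python
# Hedged sketch: one co-attention block in which text queries attend over
# video keys/values. Dimensions and layer choices are illustrative only.
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # text: (B, L_t, D) queries; video: (B, L_v, D) keys/values
        attended, _ = self.cross_attn(self.norm1(text), video, video)
        text = text + attended                      # residual connection
        text = text + self.ffn(self.norm2(text))    # position-wise feed-forward
        return text


if __name__ == "__main__":
    block = CoAttentionBlock()
    out = block(torch.randn(2, 12, 256), torch.randn(2, 128, 256))
    print(out.shape)  # torch.Size([2, 12, 256])
```

Swapping the query and key/value roles gives the video-to-text direction; stacking an unequal number of blocks per direction is one plausible way to reflect the length asymmetry between the two modalities.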
