
Self-Supervised Learning for Semi-Supervised Temporal Language Grounding



Abstract:

Given a text description, Temporal Language Grounding (TLG) aims to localize the temporal boundaries of the segments that contain the specified semantics in an untrimmed video. TLG is inherently a challenging task, as it requires a comprehensive understanding of both sentence semantics and video content. Previous works either tackle this task in a fully-supervised setting that requires a large amount of temporal annotations, or in a weakly-supervised setting that usually cannot achieve satisfactory performance. Since manual annotations are expensive, to cope with limited annotations we tackle TLG in a semi-supervised way by incorporating self-supervised learning, and propose Self-Supervised Semi-Supervised Temporal Language Grounding (S⁴TLG). S⁴TLG consists of two parts: (1) a pseudo-label generation module that adaptively produces instant pseudo labels for unlabeled samples based on predictions from a teacher model; (2) a self-supervised feature learning module with inter-modal and intra-modal contrastive losses that learns video feature representations under the constraints of video content consistency and video-text alignment. We conduct extensive experiments on the ActivityNet-CD-OOD and Charades-CD-OOD datasets. The results demonstrate that our proposed S⁴TLG achieves competitive performance compared to fully-supervised state-of-the-art methods while requiring only a small portion of temporal annotations.
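The inter-modal contrastive objective mentioned in the abstract pairs each video representation with its matching text representation, treating the other samples in a batch as negatives. The paper's exact loss formulation is not reproduced here; the following is a minimal sketch of a standard symmetric InfoNCE-style contrastive loss over paired video/text embeddings, with `info_nce` and its arguments being illustrative names rather than the authors' implementation.

```python
import numpy as np

def info_nce(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    video_feats, text_feats: (B, D) arrays; row i of each is a matched
    video-text pair (positive). All other rows in the batch serve as
    negatives. Illustrative sketch, not the paper's exact loss.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # positives lie on the diagonal

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the diagonal similarities dominate and the loss is small; shuffling one modality against the other drives it up, which is the alignment pressure the abstract describes.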
Published in: IEEE Transactions on Multimedia ( Volume: 25)
Page(s): 7747 - 7757
Date of Publication: 09 December 2022


I. Introduction

Video understanding is a challenging task in computer vision. Many related tasks have been extensively studied due to their wide applications, such as video captioning [1], action localization [46], and video question answering [47]. Temporal language grounding (TLG) [2], [3], which aims at automatically locating the temporal boundaries of the segments indicated by natural language descriptions in an untrimmed video, has recently attracted increasing attention from both the computer vision and natural language processing communities because of its potential applications in video search engines and automated video editing.
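A grounding prediction is a time interval (start, end) in the video, and such predictions are commonly scored in the TLG literature by their temporal Intersection-over-Union (IoU) with the ground-truth interval. The helper below is a minimal sketch of that metric; `temporal_iou` is an illustrative name, not a function from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals, in seconds.

    Returns a value in [0, 1]; 0 if the intervals are disjoint.
    """
    inter_start = max(pred[0], gt[0])
    inter_end = min(pred[1], gt[1])
    inter = max(0.0, inter_end - inter_start)           # overlap length
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving an IoU of 1/3.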


