I. Introduction
Video understanding is a challenging task in computer vision. Owing to its wide range of applications, many related tasks have been extensively studied, such as video captioning [1], action localization [46], and video question answering [47]. Temporal language grounding (TLG) [2], [3], which aims to automatically locate the temporal boundaries of the segments described by a natural language query in an untrimmed video, has recently attracted increasing attention from both the computer vision and natural language processing communities because of its potential applications in video search engines and automated video editing.