
Self-Supervised Learning for Semi-Supervised Temporal Language Grounding



Abstract:

Given a text description, Temporal Language Grounding (TLG) aims to localize the temporal boundaries of the segments that contain the specified semantics in an untrimmed video. TLG is inherently a challenging task, as it requires a comprehensive understanding of both sentence semantics and video content. Previous works either tackle this task in a fully-supervised setting that requires a large amount of temporal annotations, or in a weakly-supervised setting that usually cannot achieve satisfactory performance. Since manual annotations are expensive, to cope with limited annotations we tackle TLG in a semi-supervised way by incorporating self-supervised learning, and propose Self-Supervised Semi-Supervised Temporal Language Grounding (S⁴TLG). S⁴TLG consists of two parts: (1) a pseudo-label generation module that adaptively produces instant pseudo labels for unlabeled samples based on predictions from a teacher model; (2) a self-supervised feature learning module with inter-modal and intra-modal contrastive losses that learns video feature representations under the constraints of video content consistency and video-text alignment. We conduct extensive experiments on the ActivityNet-CD-OOD and Charades-CD-OOD datasets. The results demonstrate that our proposed S⁴TLG achieves competitive performance compared to fully-supervised state-of-the-art methods while requiring only a small portion of temporal annotations.
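The inter-modal contrastive objective mentioned in the abstract pairs each video representation with its matching text representation, treating the other samples in a batch as negatives. The paper's exact loss formulation is not reproduced here; the following is a minimal sketch of a standard symmetric InfoNCE-style contrastive loss over paired video/text embeddings, with `info_nce` and its arguments being illustrative names rather than the authors' implementation.

```python
import numpy as np

def info_nce(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    video_feats, text_feats: (B, D) arrays; row i of each is a matched
    video-text pair (positive). All other rows in the batch serve as
    negatives. Illustrative sketch, not the paper's exact loss.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # positives lie on the diagonal

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the diagonal similarities dominate and the loss is small; shuffling one modality against the other drives it up, which is the alignment pressure the abstract describes.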
Published in: IEEE Transactions on Multimedia ( Volume: 25)
Page(s): 7747 - 7757
Date of Publication: 09 December 2022


I. Introduction

Video understanding is a challenging task in computer vision. Many related tasks have been extensively studied due to their wide applications, such as video captioning [1], action localization [46], and video question answering [47]. Temporal language grounding (TLG) [2], [3], which aims at automatically locating the temporal boundaries of the segments indicated by natural language descriptions in an untrimmed video, has recently attracted increasing attention from both the computer vision and natural language processing communities because of its potential applications in video search engines and automated video editing.
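A grounding prediction is a time interval (start, end) in the video, and such predictions are commonly scored in the TLG literature by their temporal Intersection-over-Union (IoU) with the ground-truth interval. The helper below is a minimal sketch of that metric; `temporal_iou` is an illustrative name, not a function from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals, in seconds.

    Returns a value in [0, 1]; 0 if the intervals are disjoint.
    """
    inter_start = max(pred[0], gt[0])
    inter_end = min(pred[1], gt[1])
    inter = max(0.0, inter_end - inter_start)           # overlap length
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving an IoU of 1/3.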


