
Set-Constrained Viterbi for Set-Supervised Action Segmentation



Abstract:

This paper is about weakly supervised action segmentation, where the ground truth specifies only a set of actions present in a training video, but not their true temporal ordering. Prior work typically uses a classifier that independently labels video frames for generating the pseudo ground truth, and multiple instance learning for training the classifier. We extend this framework by specifying an HMM, which accounts for co-occurrences of action classes and their temporal lengths, and by explicitly training the HMM on a Viterbi-based loss. Our first contribution is the formulation of a new set-constrained Viterbi algorithm (SCV). Given a video, the SCV generates the MAP action segmentation that satisfies the ground truth. This prediction is used as a framewise pseudo ground truth in our HMM training. Our second contribution in training is a new regularization of feature affinities between training videos that share the same action classes. Evaluation of action segmentation and alignment on the Breakfast, MPII Cooking2, and Hollywood Extended datasets demonstrates our significant performance improvement for the two tasks over prior work.
Date of Conference: 13-19 June 2020
Date Added to IEEE Xplore: 05 August 2020

Conference Location: Seattle, WA, USA

1. Introduction

This paper addresses action segmentation by labeling video frames with action classes under set-level weak supervision in training. Set-supervised training means that the ground truth specifies only a set of actions present in a training video. Their temporal ordering and the number of their occurrences remain unknown. This is an important problem arising from the proliferation of big video datasets where providing detailed annotations of a temporal ordering of actions is prohibitively expensive. One example application is action segmentation of videos that have been retrieved from a dataset based on word captions [6], [20], where the captions do not describe temporal relationships of actions.
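To make the set constraint concrete, the following is a minimal toy sketch (not the paper's efficient SCV) of a Viterbi-style decoding whose output labeling must use exactly a given set of action classes. It augments the standard Viterbi state with a bitmask of actions used so far, which is exponential in the set size and therefore only illustrative; the framewise log-probabilities and class indices are hypothetical inputs.

```python
import numpy as np

def set_constrained_viterbi(log_probs, action_set):
    """Toy set-constrained Viterbi: max-score framewise labeling that uses
    every action in action_set and no action outside it.

    log_probs: (T, C) framewise log-probabilities.
    action_set: class indices that must all appear in the output.

    DP state = (current action, bitmask of actions used so far), so the
    complexity is exponential in |action_set| -- an illustration of the
    constraint, not the paper's approximation.
    """
    T = log_probs.shape[0]
    actions = list(action_set)
    n = len(actions)
    full = (1 << n) - 1
    NEG = -np.inf

    # dp[a][m] = best score at the current frame, in action `actions[a]`,
    # with mask m of actions seen so far; back stores predecessor states.
    dp = np.full((n, 1 << n), NEG)
    back = np.zeros((T, n, 1 << n, 2), dtype=int)
    for a in range(n):
        dp[a][1 << a] = log_probs[0, actions[a]]

    for t in range(1, T):
        new = np.full((n, 1 << n), NEG)
        for a in range(n):
            for m in range(1 << n):
                if dp[a][m] == NEG:
                    continue
                # stay in the same action
                s = dp[a][m] + log_probs[t, actions[a]]
                if s > new[a][m]:
                    new[a][m] = s
                    back[t, a, m] = (a, m)
                # switch to a different action b, marking it as used
                for b in range(n):
                    if b == a:
                        continue
                    m2 = m | (1 << b)
                    s = dp[a][m] + log_probs[t, actions[b]]
                    if s > new[b][m2]:
                        new[b][m2] = s
                        back[t, b, m2] = (a, m)
        dp = new

    # the best final state must have used the full action set
    a = int(np.argmax(dp[:, full]))
    m = full
    labels = [actions[a]]
    for t in range(T - 1, 0, -1):
        a, m = back[t, a, m]
        labels.append(actions[a])
    return labels[::-1]
```

Requiring the final mask to be `full` is what forces the decoding to satisfy the set-level ground truth: even if one class dominates the framewise scores, every class in the set must be assigned to at least one frame.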

References
1.
Yunus Can Bilge, Dogukan Çagatay, Begüm Genç, Mecit Sari, Hüseyin Akcan and Cem Evrendilek, "All colors shortest path problem", CoRR abs/1507.06865, 2015.
2.
Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, et al., "Weakly supervised action labeling in videos under ordering constraints", European Conference on Computer Vision, pp. 628-643, 2014.
3.
Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei and Juan Carlos Niebles, "D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3546-3555, 2019.
4.
Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei and Juan Carlos Niebles, Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation, 2019.
5.
Li Ding and Chenliang Xu, "Weakly-supervised action segmentation with iterative soft boundary assignment", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6508-6516, 2018.
6.
Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, et al., "Dual encoding for zero-example video retrieval", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9346-9355, 2019.
7.
De-An Huang, Li Fei-Fei and Juan Carlos Niebles, "Connectionist temporal modeling for weakly supervised action labeling", European Conference on Computer Vision, pp. 137-153, 2016.
8.
Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio and Thomas Serre, "HMDB: A large video database for human motion recognition", Proc. of IEEE International Conference on Computer Vision, 2011.
9.
Oscar Koller, Hermann Ney and Richard Bowden, "Deep hand: How to train a cnn on 1 million hand images when your data is continuous and weakly labelled", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3793-3802, 2016.
10.
Oscar Koller, Sepehr Zargaran and Hermann Ney, "Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4297-4305, 2017.
11.
Hilde Kuehne, Ali Arslan and Thomas Serre, "The language of actions: Recovering the syntax and semantics of goal-directed human activities", Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 780-787, 2014.
12.
Hilde Kuehne, Alexander Richard and Juergen Gall, "Weakly supervised learning of actions from transcripts", Computer Vision and Image Understanding, vol. 163, pp. 78-89, 2017.
13.
Jun Li, Peng Lei and Sinisa Todorovic, "Weakly supervised energy-based learning for action segmentation", The IEEE International Conference on Computer Vision (ICCV), October 2019.
14.
Sujoy Paul, Sourya Roy and Amit K Roy-Chowdhury, "W-talc: Weakly-supervised temporal activity localization and classification", Proceedings of the European Conference on Computer Vision (ECCV), pp. 563-579, 2018.
15.
Alexander Richard, Hilde Kuehne and Juergen Gall, "Weakly supervised action learning with rnn based fine-to-coarse modeling", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 754-763, 2017.
16.
Alexander Richard, Hilde Kuehne and Juergen Gall, "Action sets: Weakly supervised action segmentation without ordering constraints", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5987-5996, 2018.
17.
Alexander Richard, Hilde Kuehne, Ahsan Iqbal and Juergen Gall, "NeuralNetwork-Viterbi: A framework for weakly supervised video learning", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7386-7395, 2018.
18.
Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka and Bernt Schiele, "A database for fine grained activity detection of cooking activities", 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1194-1201, 2012.
19.
Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, et al., "Recognizing fine-grained and composite activities using hand-centric features and script data", International Journal of Computer Vision, vol. 119, no. 3, pp. 346-373, 2016.
20.
Dian Shao, Yu Xiong, Yue Zhao, Qingqiu Huang, Yu Qiao and Dahua Lin, "Find and focus: Retrieve and localize video events with natural language queries", The European Conference on Computer Vision (ECCV), 2018.
21.
Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa and Shih-Fu Chang, "Autoloc: weakly-supervised temporal action localization in untrimmed videos", Proceedings of the European Conference on Computer Vision (ECCV), pp. 154-171, 2018.
22.
Krishna Kumar Singh and Yong Jae Lee, "Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization", 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3544-3553, 2017.
23.
Khurram Soomro, Amir Roshan Zamir and Mubarak Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, 2012.
24.
Chen Sun, Sanketh Shetty, Rahul Sukthankar and Ram Nevatia, "Temporal localization of fine-grained actions in videos by domain transfer from web images", Proceedings of the 23rd ACM international conference on Multimedia, pp. 371-380, 2015.
25.
Heng Wang and Cordelia Schmid, "Action recognition with improved trajectories", Proceedings of the IEEE international conference on computer vision, pp. 3551-3558, 2013.
26.
Limin Wang, Yuanjun Xiong, Dahua Lin and Luc Van Gool, "Untrimmednets for weakly supervised action recognition and detection", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4325-4334, 2017.
27.
Zhi-Hua Zhou and Min-Ling Zhang, "Neural networks for multi-instance learning", Proceedings of the International Conference on Intelligent Information Technology, pp. 455-459, 2002.