
Set-Constrained Viterbi for Set-Supervised Action Segmentation



Abstract:

This paper is about weakly supervised action segmentation, where the ground truth specifies only a set of actions present in a training video, but not their true temporal ordering. Prior work typically uses a classifier that independently labels video frames for generating the pseudo ground truth, and multiple instance learning for training the classifier. We extend this framework by specifying an HMM, which accounts for co-occurrences of action classes and their temporal lengths, and by explicitly training the HMM on a Viterbi-based loss. Our first contribution is the formulation of a new set-constrained Viterbi algorithm (SCV). Given a video, the SCV generates the MAP action segmentation that satisfies the ground truth. This prediction is used as a framewise pseudo ground truth in our HMM training. Our second contribution in training is a new regularization of feature affinities between training videos that share the same action classes. Evaluation of action segmentation and alignment on the Breakfast, MPII Cooking2, and Hollywood Extended datasets demonstrates our significant performance improvement for the two tasks over prior work.
Date of Conference: 13-19 June 2020
Date Added to IEEE Xplore: 05 August 2020

Conference Location: Seattle, WA, USA

1. Introduction

This paper addresses action segmentation by labeling video frames with action classes under set-level weak supervision in training. Set-supervised training means that the ground truth specifies only a set of actions present in a training video. Their temporal ordering and the number of their occurrences remain unknown. This is an important problem arising from the proliferation of big video datasets where providing detailed annotations of a temporal ordering of actions is prohibitively expensive. One example application is action segmentation of videos that have been retrieved from a dataset based on word captions [6], [20], where the captions do not describe temporal relationships of actions.
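To make the set constraint concrete, the following is a minimal toy sketch (not the paper's efficient SCV) of a Viterbi-style decoding whose output labeling must use exactly a given set of action classes. It augments the standard Viterbi state with a bitmask of actions used so far, which is exponential in the set size and therefore only illustrative; the framewise log-probabilities and class indices are hypothetical inputs.

```python
import numpy as np

def set_constrained_viterbi(log_probs, action_set):
    """Toy set-constrained Viterbi: max-score framewise labeling that uses
    every action in action_set and no action outside it.

    log_probs: (T, C) framewise log-probabilities.
    action_set: class indices that must all appear in the output.

    DP state = (current action, bitmask of actions used so far), so the
    complexity is exponential in |action_set| -- an illustration of the
    constraint, not the paper's approximation.
    """
    T = log_probs.shape[0]
    actions = list(action_set)
    n = len(actions)
    full = (1 << n) - 1
    NEG = -np.inf

    # dp[a][m] = best score at the current frame, in action `actions[a]`,
    # with mask m of actions seen so far; back stores predecessor states.
    dp = np.full((n, 1 << n), NEG)
    back = np.zeros((T, n, 1 << n, 2), dtype=int)
    for a in range(n):
        dp[a][1 << a] = log_probs[0, actions[a]]

    for t in range(1, T):
        new = np.full((n, 1 << n), NEG)
        for a in range(n):
            for m in range(1 << n):
                if dp[a][m] == NEG:
                    continue
                # stay in the same action
                s = dp[a][m] + log_probs[t, actions[a]]
                if s > new[a][m]:
                    new[a][m] = s
                    back[t, a, m] = (a, m)
                # switch to a different action b, marking it as used
                for b in range(n):
                    if b == a:
                        continue
                    m2 = m | (1 << b)
                    s = dp[a][m] + log_probs[t, actions[b]]
                    if s > new[b][m2]:
                        new[b][m2] = s
                        back[t, b, m2] = (a, m)
        dp = new

    # the best final state must have used the full action set
    a = int(np.argmax(dp[:, full]))
    m = full
    labels = [actions[a]]
    for t in range(T - 1, 0, -1):
        a, m = back[t, a, m]
        labels.append(actions[a])
    return labels[::-1]
```

Requiring the final mask to be `full` is what forces the decoding to satisfy the set-level ground truth: even if one class dominates the framewise scores, every class in the set must be assigned to at least one frame.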

References
1.
Yunus Can Bilge, Dogukan Çagatay, Begüm Genç, Mecit Sari, Hüseyin Akcan and Cem Evrendilek, "All colors shortest path problem", CoRR abs/1507.06865, 2015.
2.
Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, et al., "Weakly supervised action labeling in videos under ordering constraints", European Conference on Computer Vision, pp. 628-643, 2014.
3.
Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei and Juan Carlos Niebles, "D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3546-3555, 2019.
4.
Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei and Juan Carlos Niebles, Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation, 2019.
5.
Li Ding and Chenliang Xu, "Weakly-supervised action segmentation with iterative soft boundary assignment", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6508-6516, 2018.
6.
Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, et al., "Dual encoding for zero-example video retrieval", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9346-9355, 2019.
7.
De-An Huang, Li Fei-Fei and Juan Carlos Niebles, "Connectionist temporal modeling for weakly supervised action labeling", European Conference on Computer Vision, pp. 137-153, 2016.
8.
Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio and Thomas Serre, "HMDB: A large video database for human motion recognition", Proc. of IEEE International Conference on Computer Vision, 2011.
9.
Oscar Koller, Hermann Ney and Richard Bowden, "Deep hand: How to train a cnn on 1 million hand images when your data is continuous and weakly labelled", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3793-3802, 2016.
10.
Oscar Koller, Sepehr Zargaran and Hermann Ney, "Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4297-4305, 2017.
11.
Hilde Kuehne, Ali Arslan and Thomas Serre, "The language of actions: Recovering the syntax and semantics of goal-directed human activities", Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 780-787, 2014.
12.
Hilde Kuehne, Alexander Richard and Juergen Gall, "Weakly supervised learning of actions from transcripts", Computer Vision and Image Understanding, vol. 163, pp. 78-89, 2017.
13.
Jun Li, Peng Lei and Sinisa Todorovic, "Weakly supervised energy-based learning for action segmentation", The IEEE International Conference on Computer Vision (ICCV), October 2019.
14.
Sujoy Paul, Sourya Roy and Amit K Roy-Chowdhury, "W-talc: Weakly-supervised temporal activity localization and classification", Proceedings of the European Conference on Computer Vision (ECCV), pp. 563-579, 2018.
15.
Alexander Richard, Hilde Kuehne and Juergen Gall, "Weakly supervised action learning with rnn based fine-to-coarse modeling", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 754-763, 2017.
16.
Alexander Richard, Hilde Kuehne and Juergen Gall, "Action sets: Weakly supervised action segmentation without ordering constraints", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5987-5996, 2018.
17.
Alexander Richard, Hilde Kuehne, Ahsan Iqbal and Juergen Gall, "NeuralNetwork-Viterbi: A framework for weakly supervised video learning", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7386-7395, 2018.
18.
Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka and Bernt Schiele, "A database for fine grained activity detection of cooking activities", 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1194-1201, 2012.
19.
Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, et al., "Recognizing fine-grained and composite activities using hand-centric features and script data", International Journal of Computer Vision, vol. 119, no. 3, pp. 346-373, 2016.
20.
Dian Shao, Yu Xiong, Yue Zhao, Qingqiu Huang, Yu Qiao and Dahua Lin, "Find and focus: Retrieve and localize video events with natural language queries", The European Conference on Computer Vision (ECCV), 2018.
21.
Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa and Shih-Fu Chang, "Autoloc: weakly-supervised temporal action localization in untrimmed videos", Proceedings of the European Conference on Computer Vision (ECCV), pp. 154-171, 2018.
22.
Krishna Kumar Singh and Yong Jae Lee, "Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization", 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3544-3553, 2017.
23.
Khurram Soomro, Amir Roshan Zamir and Mubarak Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, 2012.
24.
Chen Sun, Sanketh Shetty, Rahul Sukthankar and Ram Nevatia, "Temporal localization of fine-grained actions in videos by domain transfer from web images", Proceedings of the 23rd ACM international conference on Multimedia, pp. 371-380, 2015.
25.
Heng Wang and Cordelia Schmid, "Action recognition with improved trajectories", Proceedings of the IEEE international conference on computer vision, pp. 3551-3558, 2013.
26.
Limin Wang, Yuanjun Xiong, Dahua Lin and Luc Van Gool, "Untrimmednets for weakly supervised action recognition and detection", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4325-4334, 2017.
27.
Zhi-Hua Zhou and Min-Ling Zhang, "Neural networks for multi-instance learning", Proceedings of the International Conference on Intelligent Information Technology, pp. 455-459, 2002.