Action Unit Memory Network for Weakly Supervised Temporal Action Localization


Abstract:

Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos using only video-level labels during training. Without frame-level annotations, however, it is challenging to localize actions completely and to suppress background interference. In this paper, we present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization, which mitigates these two challenges by learning an action unit memory bank. In the proposed AUMN, two attention modules are designed to update the memory bank adaptively and to learn action-unit-specific classifiers. Furthermore, three mechanisms (diversity, homogeneity and sparsity) are designed to guide the updating of the memory network. To the best of our knowledge, this is the first work to explicitly model action units with a memory network. Extensive experimental results on two standard benchmarks (THUMOS14 and ActivityNet) demonstrate that our AUMN performs favorably against state-of-the-art methods. Specifically, the average mAP over IoU thresholds from 0.1 to 0.5 on the THUMOS14 dataset is improved from 47.0% to 52.1%.
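
The reported metric rests on the temporal IoU between a predicted action segment and a ground-truth segment, with mAP then averaged across several IoU thresholds. The sketch below is not taken from the paper. It only illustrates those two pieces of arithmetic, and it assumes a threshold step of 0.1 (the abstract states only the range 0.1 to 0.5). The names `Segment`, `temporal_iou`, `average_over_thresholds` and the example segments are hypothetical.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def average_over_thresholds(map_per_threshold: Sequence[float]) -> float:
    """Arithmetic mean of per-threshold mAP values."""
    return sum(map_per_threshold) / len(map_per_threshold)


# Assumed step of 0.1 between thresholds; the source gives only the range.
IOU_THRESHOLDS = [0.1, 0.2, 0.3, 0.4, 0.5]

# Made-up segments, for illustration only.
pred, gt = (2.0, 6.0), (3.0, 7.0)
iou = temporal_iou(pred, gt)

# A prediction counts as a match at threshold t when its IoU reaches t.
hits = [iou >= t for t in IOU_THRESHOLDS]
```

A full evaluation would additionally rank predictions by confidence and compute average precision per class at each threshold before averaging; that part is omitted here.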
Date of Conference: 20-25 June 2021
Date Added to IEEE Xplore: 02 November 2021
Conference Location: Nashville, TN, USA

1. Introduction

Temporal action localization (TAL) is an important yet challenging task in video understanding. Its goal is to localize the temporal boundaries of actions of specific categories in untrimmed videos [13], [7]. Because of its broad applications in high-level tasks such as video surveillance [40], video summarization [17], and event detection [15], TAL has recently drawn increasing attention from the community. To date, deep-learning-based methods have made impressive progress in this area. However, most of them handle the task in a fully supervised way, requiring massive temporal boundary annotations for actions [24], [51], [5], [42], [36]. Such manual annotations are expensive to obtain, which limits the potential of fully supervised methods in real-world scenarios.

1. Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei and Juan Carlos Niebles, "End-to-end single-stream temporal action detection in untrimmed videos", BMVC, vol. 2, pp. 7, 2017.
2. Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem and Juan Carlos Niebles, "Sst: Single-stream temporal action proposals", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2911-2920, 2017.
3. Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem and Juan Carlos Niebles, "Activitynet: A large-scale video benchmark for human activity understanding", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961-970, 2015.
4. Joao Carreira and Andrew Zisserman, "Quo vadis action recognition? a new model and the kinetics dataset", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299-6308, 2017.
5. Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng and Rahul Sukthankar, "Rethinking the faster r-cnn architecture for temporal action localization", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130-1139, 2018.
6. Peng Chen, Zhongqian Sun, Lidong Bing and Wei Yang, "Recurrent attention network on memory for aspect sentiment analysis", Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 452-461, 2017.
7. Adrien Gaidon, Zaid Harchaoui and Cordelia Schmid, "Temporal localization of actions with actoms", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2782-2795, 2013.
8. Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014.
9. Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik, "Region-based convolutional networks for accurate object detection and segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142-158, 2015.
10. Guoqiang Gong, Xinghan Wang, Yadong Mu and Qi Tian, "Learning temporal co-attention models for unsupervised video action localization", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9819-9828, 2020.
11. Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
12. Linjiang Huang, Yan Huang, Wanli Ouyang, Liang Wang et al., "Relational prototypical network for weakly supervised temporal action localization", 2020.
13. Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, et al., "The thumos challenge on action recognition for videos “in the wild”", Computer Vision and Image Understanding, vol. 155, pp. 1-23, 2017.
14. Diederik P Kingma and Jimmy Ba, "Adam: A method for stochastic optimization", 2014.
15. Suha Kwak, Bohyung Han and Joon Hee Han, "Multi-agent event detection: Localization and role assignment", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
16. Pilhyeon Lee, Youngjung Uh and Hyeran Byun, "Background suppression network for weakly-supervised temporal action localization", AAAI, pp. 11320-11327, 2020.
17. Yong Jae Lee, Joydeep Ghosh and Kristen Grauman, "Discovering important people and objects for egocentric video summarization", 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1346-1353, 2012.
18. Tianwei Lin, Xiao Liu, Xin Li, Errui Ding and Shilei Wen, "Bmn: Boundary-matching network for temporal action proposal generation", Proceedings of the IEEE International Conference on Computer Vision, pp. 3889-3898, 2019.
19. Tianwei Lin, Xu Zhao and Zheng Shou, "Single shot temporal action detection", Proceedings of the 25th ACM International Conference on Multimedia, pp. 988-996, 2017.
20. Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang and Ming Yang, "Bsn: Boundary sensitive network for temporal action proposal generation", Proceedings of the European Conference on Computer Vision (ECCV), pp. 3-19, 2018.
21. Daochang Liu, Tingting Jiang and Yizhou Wang, "Completeness modeling and context separation for weakly supervised temporal action localization", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1298-1307, 2019.
22. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, et al., "Ssd: Single shot multibox detector", European Conference on Computer Vision, pp. 21-37, 2016.
23. Ziyi Liu, Le Wang, Qilin Zhang, Zhanning Gao, Zhenxing Niu, Nanning Zheng, et al., "Weakly supervised temporal action localization through contrast based evaluation networks", Proceedings of the IEEE International Conference on Computer Vision, pp. 3899-3908, 2019.
24. Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo and Tao Mei, "Gaussian temporal awareness networks for action localization", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 344-353, 2019.
25. Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, et al., "Weakly-supervised action localization with expectation-maximization multi-instance learning", 2020.
26. Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes and Jason Weston, "Key-value memory networks for directly reading documents", 2016.
27. Kyle Min and Jason J Corso, "Adversarial background-aware loss for weakly-supervised temporal activity localization", 2020.
28. Sanath Narayan, Hisham Cholakkal, Fahad Shabaz Khan and Ling Shao, "3c-net: Category count and center loss for weakly-supervised action localization", 2019.
29. Phuc Nguyen, Ting Liu, Gautam Prasad and Bohyung Han, "Weakly supervised action localization by sparse temporal pooling network", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6752-6761, 2018.
30. Phuc Xuan Nguyen, Deva Ramanan and Charless C Fowlkes, "Weakly-supervised action localization with background modeling", 2019.