
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model


Abstract:

We introduce an object-aware decoder for improving the performance of spatio-temporal representations on egocentric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model by: (i) evaluating them for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using the learned representations as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art, even compared to networks trained with far larger batch sizes. We also show that by using noisy image-level detections as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as to ground the words in the associated text descriptions. Overall, we show that the model can act as a drop-in replacement for an egocentric video model to improve performance through visual-text grounding.
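A minimal sketch of the kind of auxiliary training objective the abstract describes: a decoder attends over the video encoder's spatio-temporal tokens and is supervised to predict hand/object boxes and object labels from pseudo-labels and paired captions, while inference uses only the RGB-derived features. This is not the authors' implementation; the module names, query count, loss weights, and the fixed query-to-target assignment (in place of bipartite matching) are illustrative assumptions.

# Sketch only: names, shapes, and the fixed query-to-target assignment are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAwareDecoderSketch(nn.Module):
    def __init__(self, dim=512, num_queries=8, vocab_size=1000):
        super().__init__()
        # Learnable queries that each attend to the video tokens and
        # are decoded into one box + one semantic label.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)             # (cx, cy, w, h) in [0, 1]
        self.label_head = nn.Linear(dim, vocab_size)  # noun / word logits

    def forward(self, video_tokens):
        # video_tokens: (B, T*HW, dim) spatio-temporal features from the video encoder.
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        h = self.decoder(q, video_tokens)
        return self.box_head(h).sigmoid(), self.label_head(h)

def auxiliary_loss(pred_boxes, pred_logits, gt_boxes, gt_labels, w_box=1.0, w_cls=1.0):
    # gt_boxes / gt_labels are noisy pseudo-labels from an off-the-shelf
    # hand-object detector and the paired caption. A real system would
    # match predictions to targets (e.g. Hungarian matching); here a
    # fixed ordering is assumed for brevity, and the weights are placeholders.
    box_loss = F.l1_loss(pred_boxes, gt_boxes)
    cls_loss = F.cross_entropy(pred_logits.flatten(0, 1), gt_labels.flatten())
    return w_box * box_loss + w_cls * cls_loss

At inference the auxiliary heads can be dropped or kept only for visualization, so the video-text representation is produced from RGB frames alone, as stated above.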
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France


1. Introduction

In vision-language models, there has been a recent move to explicitly build object awareness into the vision module, either by adding specialized and bespoke components or by using entirely object-centric architectures. The motivation for this comes partly from the attractive compositional nature of objects and their inter-relationships in language, which enables inexhaustible novel combinations [10], [45], and partly from infant cognitive studies that stress the importance of objects in early visual development [29], [56], [60]. Examples in the video domain include explicit internal object representations [2], e.g. RoI-align-pooled [17] features either from a pre-trained region-proposal network (RPN) [2], [52], [57], [62], or from bounding-box coordinates taken as input [19], [42], [48], [71]. This contrasts with the large body of work where standard representations are learnt end-to-end without any explicit factorization into objects/entities, such as dual-encoder vision-language models in the image [21], [49] and video domains [4], [64].
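As a concrete illustration of the RoI-align pooling used by the explicit object-representation approaches cited above, the snippet below pools a fixed-size feature per detector-provided box from a per-frame feature map. The shapes, scale, and box values are illustrative assumptions, not taken from any cited model.

# Sketch only: feature-map size, spatial_scale, and boxes are made-up examples.
import torch
from torchvision.ops import roi_align

B, C, H, W = 2, 256, 14, 14                  # per-frame feature maps from a backbone
frame_feats = torch.randn(B, C, H, W)

# Boxes from a pre-trained RPN / hand-object detector in (x1, y1, x2, y2)
# image coordinates; the first column is the index of the source frame.
boxes = torch.tensor([[0.0,  10.0,  20.0,  80.0,  90.0],
                      [1.0,  30.0,  15.0, 120.0, 100.0]])

# Pool a 7x7 grid per box; spatial_scale maps image coordinates onto the
# 14x14 feature map (14/224 for a 224x224 input).
object_feats = roi_align(frame_feats, boxes, output_size=(7, 7),
                         spatial_scale=14 / 224, aligned=True)
print(object_feats.shape)                    # torch.Size([2, 256, 7, 7])

Each pooled tensor can then be flattened or average-pooled into a per-object token, which is the form in which such features are typically fed to a video transformer.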

References
[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds et al., "Flamingo: a visual language model for few-shot learning", 2022.
[2] Anurag Arnab, Chen Sun and Cordelia Schmid, "Unified graph structured models for video understanding", Proc. ICCV, 2021.
[3] Elad Ben Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, et al., "Bringing image scene structure to video via frame-clip consistency of object tokens", NeurIPS, 2022.
[4] Max Bain, Arsha Nagrani, Gül Varol and Andrew Zisserman, "Frozen in time: A joint video and image encoder for end-to-end retrieval", Proc. ICCV, 2021.
[5] Sven Bambach, Stefan Lee, David J Crandall and Chen Yu, "Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions", Proc. ICCV, pp. 1949-1957, 2015.
[6] Gedas Bertasius, Heng Wang and Lorenzo Torresani, "Is space-time attention all you need for video understanding?", Proc. ICML, 2021.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko, "End-to-end object detection with transformers", Proc. ECCV, 2020.
[8] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, et al., "UNITER: Universal image-text representation learning", Proc. ECCV, 2020.
[9] Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar and Alexander G Schwing, "Mask2Former for video instance segmentation", 2021.
[10] Noam Chomsky and David W Lightfoot, Syntactic Structures, Walter de Gruyter, 1957.
[11] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price et al., "Scaling egocentric vision: The EPIC-KITCHENS dataset", Proc. ECCV, 2018.
[12] Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, et al., "EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations", NeurIPS Datasets and Benchmarks Track, 2022.
[13] Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng et al., "Coarse-to-fine vision-language pre-training with fusion in the backbone", 2022.
[14] Gamaleldin F Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C Mozer and Thomas Kipf, "SAVi++: Towards end-to-end object-centric learning from real-world videos", NeurIPS, 2022.
[15] Philip Gage, "A new algorithm for data compression", C Users Journal, 1994.
[16] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu et al., "Ego4D: Around the world in 3000 hours of egocentric video", Proc. CVPR, 2022.
[17] Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross Girshick, "Mask R-CNN", Proc. ICCV, 2017.
[18] Paul Henderson and Christoph H. Lampert, "Unsupervised object-centric video generation and decomposition in 3D", NeurIPS, 2020.
[19] Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, et al., "Object-region video transformers", 2021.
[20] Roei Herzig, Elad Levi, Huijuan Xu, Hang Gao, Eli Brosh, Xiaolong Wang, et al., "Spatio-temporal action graph networks", Proc. ICCV, 2019.
[21] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, et al., "Scaling up visual and vision-language representation learning with noisy text supervision", Proc. ICML, 2021.
[22] Justin Johnson, Andrej Karpathy and Li Fei-Fei, "DenseCap: Fully convolutional localization networks for dense captioning", Proc. CVPR, 2016.
[23] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion, "MDETR: Modulated detection for end-to-end multi-modal understanding", Proc. ICCV, 2021.
[24] Ranjay Krishna, Ines Chami, Michael Bernstein and Li Fei-Fei, "Referring relationships", Proc. CVPR, 2018.
[25] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma et al., "Visual Genome: Connecting language and vision using crowdsourced dense image annotations", IJCV, 2017.
[26] Matej Kristan, Jiří Matas, Aleš Leonardis, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kämäräinen, Hyung Jin Chang, Martin Danelljan, Luka Cehovin, Alan Lukežič et al., "The ninth visual object tracking VOT2021 challenge results", Proc. ICCV, 2021.
[27] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks", NeurIPS, pp. 1106-1114, 2012.
[28] Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J Guibas, et al., "Panoptic neural fields: A semantic object-aware neural scene representation", Proc. CVPR, 2022.
[29] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum and Samuel J Gershman, "Building machines that learn and think like people", Behavioral and Brain Sciences, 2017.
[30] Junnan Li, Dongxu Li, Caiming Xiong and Steven Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation", Proc. ICML, 2022.
