
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model


Abstract:

We introduce an object-aware decoder for improving the performance of spatio-temporal representations on egocentric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model by: (i) evaluating them for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using the learned representations as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art, even compared to networks trained with far larger batch sizes. We also show that by using noisy image-level detections as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as to ground the words in the associated text descriptions. Overall, we show that the model can act as a drop-in replacement for an egocentric video model to improve performance through visual-text grounding.
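A minimal sketch of the kind of auxiliary training objective the abstract describes: a decoder attends over the video encoder's spatio-temporal tokens and is supervised to predict hand/object boxes and object labels from pseudo-labels and paired captions, while inference uses only the RGB-derived features. This is not the authors' implementation; the module names, query count, loss weights, and the fixed query-to-target assignment (in place of bipartite matching) are illustrative assumptions.

# Sketch only: names, shapes, and the fixed query-to-target assignment are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAwareDecoderSketch(nn.Module):
    def __init__(self, dim=512, num_queries=8, vocab_size=1000):
        super().__init__()
        # Learnable queries that each attend to the video tokens and
        # are decoded into one box + one semantic label.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)             # (cx, cy, w, h) in [0, 1]
        self.label_head = nn.Linear(dim, vocab_size)  # noun / word logits

    def forward(self, video_tokens):
        # video_tokens: (B, T*HW, dim) spatio-temporal features from the video encoder.
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        h = self.decoder(q, video_tokens)
        return self.box_head(h).sigmoid(), self.label_head(h)

def auxiliary_loss(pred_boxes, pred_logits, gt_boxes, gt_labels, w_box=1.0, w_cls=1.0):
    # gt_boxes / gt_labels are noisy pseudo-labels from an off-the-shelf
    # hand-object detector and the paired caption. A real system would
    # match predictions to targets (e.g. Hungarian matching); here a
    # fixed ordering is assumed for brevity, and the weights are placeholders.
    box_loss = F.l1_loss(pred_boxes, gt_boxes)
    cls_loss = F.cross_entropy(pred_logits.flatten(0, 1), gt_labels.flatten())
    return w_box * box_loss + w_cls * cls_loss

At inference the auxiliary heads can be dropped or kept only for visualization, so the video-text representation is produced from RGB frames alone, as stated above.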
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France


1. Introduction

In vision-language models, there has been a recent move to explicitly build object awareness into the vision module, either by adding specialized and bespoke components or by using entirely object-centric architectures. The motivation for this comes partly from the attractive compositional nature of objects and their inter-relationships in language, which enables inexhaustible novel combinations [10], [45], and partly from infant cognitive studies that stress the importance of objects in early visual development [29], [56], [60]. Examples in the video domain include explicit internal object representations [2], e.g. RoI-align-pooled [17] features either from a pre-trained region-proposal network (RPN) [2], [52], [57], [62], or from bounding-box coordinates taken as input [19], [42], [48], [71]. This contrasts with the large body of work where standard representations are learnt end-to-end without any explicit factorization into objects/entities, such as dual-encoder vision-language models in the image [21], [49] and video domains [4], [64].
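As a concrete illustration of the RoI-align pooling used by the explicit object-representation approaches cited above, the snippet below pools a fixed-size feature per detector-provided box from a per-frame feature map. The shapes, scale, and box values are illustrative assumptions, not taken from any cited model.

# Sketch only: feature-map size, spatial_scale, and boxes are made-up examples.
import torch
from torchvision.ops import roi_align

B, C, H, W = 2, 256, 14, 14                  # per-frame feature maps from a backbone
frame_feats = torch.randn(B, C, H, W)

# Boxes from a pre-trained RPN / hand-object detector in (x1, y1, x2, y2)
# image coordinates; the first column is the index of the source frame.
boxes = torch.tensor([[0.0,  10.0,  20.0,  80.0,  90.0],
                      [1.0,  30.0,  15.0, 120.0, 100.0]])

# Pool a 7x7 grid per box; spatial_scale maps image coordinates onto the
# 14x14 feature map (14/224 for a 224x224 input).
object_feats = roi_align(frame_feats, boxes, output_size=(7, 7),
                         spatial_scale=14 / 224, aligned=True)
print(object_feats.shape)                    # torch.Size([2, 256, 7, 7])

Each pooled tensor can then be flattened or average-pooled into a per-object token, which is the form in which such features are typically fed to a video transformer.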

References
[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds et al., "Flamingo: a visual language model for few-shot learning", 2022.
[2] Anurag Arnab, Chen Sun and Cordelia Schmid, "Unified graph structured models for video understanding", Proc. ICCV, 2021.
[3] Elad Ben Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, et al., "Bringing image scene structure to video via frame-clip consistency of object tokens", NeurIPS, 2022.
[4] Max Bain, Arsha Nagrani, Gül Varol and Andrew Zisserman, "Frozen in time: A joint video and image encoder for end-to-end retrieval", Proc. ICCV, 2021.
[5] Sven Bambach, Stefan Lee, David J Crandall and Chen Yu, "Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions", Proc. ICCV, pp. 1949-1957, 2015.
[6] Gedas Bertasius, Heng Wang and Lorenzo Torresani, "Is space-time attention all you need for video understanding?", Proc. ICML, 2021.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko, "End-to-end object detection with transformers", Proc. ECCV, 2020.
[8] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, et al., "UNITER: Universal image-text representation learning", Proc. ECCV, 2020.
[9] Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar and Alexander G Schwing, "Mask2Former for video instance segmentation", 2021.
[10] Noam Chomsky and David W Lightfoot, Syntactic Structures, Walter de Gruyter, 1957.
[11] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price et al., "Scaling egocentric vision: The EPIC-KITCHENS dataset", Proc. ECCV, 2018.
[12] Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, et al., "EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations", NeurIPS Datasets and Benchmarks Track, 2022.
[13] Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng et al., "Coarse-to-fine vision-language pre-training with fusion in the backbone", 2022.
[14] Gamaleldin F Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C Mozer and Thomas Kipf, "SAVi++: Towards end-to-end object-centric learning from real-world videos", NeurIPS, 2022.
[15] Philip Gage, "A new algorithm for data compression", C Users Journal, 1994.
[16] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu et al., "Ego4D: Around the world in 3000 hours of egocentric video", Proc. CVPR, 2022.
[17] Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross Girshick, "Mask R-CNN", Proc. ICCV, 2017.
[18] Paul Henderson and Christoph H. Lampert, "Unsupervised object-centric video generation and decomposition in 3D", NeurIPS, 2020.
[19] Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, et al., "Object-region video transformers", 2021.
[20] Roei Herzig, Elad Levi, Huijuan Xu, Hang Gao, Eli Brosh, Xiaolong Wang, et al., "Spatio-temporal action graph networks", Proc. ICCV, 2019.
[21] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, et al., "Scaling up visual and vision-language representation learning with noisy text supervision", Proc. ICML, 2021.
[22] Justin Johnson, Andrej Karpathy and Li Fei-Fei, "DenseCap: Fully convolutional localization networks for dense captioning", Proc. CVPR, 2016.
[23] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion, "MDETR: Modulated detection for end-to-end multi-modal understanding", Proc. ICCV, 2021.
[24] Ranjay Krishna, Ines Chami, Michael Bernstein and Li Fei-Fei, "Referring relationships", Proc. CVPR, 2018.
[25] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma et al., "Visual Genome: Connecting language and vision using crowdsourced dense image annotations", IJCV, 2017.
[26] Matej Kristan, Jiří Matas, Aleš Leonardis, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kämäräinen, Hyung Jin Chang, Martin Danelljan, Luka Cehovin, Alan Lukežič et al., "The ninth visual object tracking VOT2021 challenge results", Proc. ICCV, 2021.
[27] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks", NeurIPS, pp. 1106-1114, 2012.
[28] Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J Guibas, et al., "Panoptic neural fields: A semantic object-aware neural scene representation", Proc. CVPR, 2022.
[29] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum and Samuel J Gershman, "Building machines that learn and think like people", Behavioral and Brain Sciences, 2017.
[30] Junnan Li, Dongxu Li, Caiming Xiong and Steven Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation", Proc. ICML, 2022.
