1. Introduction
The state of the art in machine learning has achieved exceptional accuracy on many computer vision tasks solely by training models on images. Building on these successes to advance 3D object understanding has great potential to power a wider range of applications, such as augmented reality, robotics, autonomy, and image retrieval. Yet understanding objects in 3D remains challenging due to the lack of large real-world datasets compared to those available for 2D tasks (e.g., ImageNet [8], COCO [22], and Open Images [20]). To empower the research community to continue advancing 3D object understanding, there is a strong need for object-centric video datasets, which capture more of the 3D structure of an object while matching the data format used by many downstream vision tasks (i.e., video or camera streams), to aid in training and benchmarking machine learning models.

The object-centric approach is also consistent with how our brains perceive new objects. For example, when a child wants to learn the shape of a chair, they walk around it and look at it from different angles to pick up information. In other words, “We must also move in order to perceive” [12, 37].