
EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation



Abstract:

Locating 3D objects from a single RGB image via Perspective-n-Points (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, so that 2D-3D point correspondences can be partly learned by backpropagating the gradient w.r.t. object pose. Yet, learning the entire set of unrestricted 2D-3D points from scratch fails to converge with existing approaches, since the deterministic pose is inherently non-differentiable. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose on the SE(3) manifold, essentially bringing categorical Softmax to the continuous domain. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle unifies the existing approaches and resembles the attention mechanism. EPro-PnP significantly outperforms competitive baselines, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation and nuScenes 3D object detection benchmarks.
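To make the probabilistic formulation concrete, the sketch below illustrates the idea in plain NumPy. It is not the paper's implementation (which uses Adaptive Multiple Importance Sampling and backpropagates through a deep network): here the pose likelihood is taken proportional to exp(-½‖f(y)‖²), where f(y) stacks the weighted reprojection residuals of the 2D-3D correspondences, and the KL loss against a Dirac target at the ground-truth pose reduces to a negative log-density whose normalizer is estimated by simple Monte Carlo over pose samples. All function names and the unit-intrinsics camera are assumptions for the illustration.

```python
import numpy as np

def reproj_residuals(pts3d, pts2d, w2d, pose):
    """Weighted reprojection residuals f(y) for one pose hypothesis.
    pose is (R, t) with R a 3x3 rotation and t a translation;
    identity camera intrinsics are assumed for brevity."""
    R, t = pose
    cam = pts3d @ R.T + t               # transform points into the camera frame
    proj = cam[:, :2] / cam[:, 2:3]     # pinhole projection (unit intrinsics)
    return (w2d * (proj - pts2d)).ravel()

def mc_pose_nll(pts3d, pts2d, w2d, gt_pose, pose_samples):
    """Monte Carlo estimate of -log p(y_gt | X): with a Dirac target
    distribution, the KL loss reduces to the negative log-density at the
    ground-truth pose, whose normalizing integral is estimated from the
    sampled poses (the constant proposal density is dropped, which only
    shifts the loss by a constant)."""
    def energy(pose):
        r = reproj_residuals(pts3d, pts2d, w2d, pose)
        return 0.5 * np.dot(r, r)

    e_gt = energy(gt_pose)
    e = np.array([energy(p) for p in pose_samples])
    # log-sum-exp trick for a numerically stable log of the MC normalizer
    m = e.min()
    log_z = -m + np.log(np.mean(np.exp(-(e - m))))
    return e_gt + log_z
```

Because the loss is a proper (unnormalized) log-likelihood rather than a hard argmin pose, it stays differentiable in the 2D-3D coordinates and weights, which is what allows them to be learned end-to-end.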
Date of Conference: 18-24 June 2022
Date Added to IEEE Xplore: 27 September 2022
Conference Location: New Orleans, LA, USA

1. Introduction

Estimating the pose (i.e., position and orientation) of 3D objects from a single RGB image is an important task in computer vision. This field is often subdivided into specific tasks, e.g., 6DoF pose estimation for robot manipulation and 3D object detection for autonomous driving. Although they share the same fundamentals of pose estimation, the different nature of the data leads to a biased choice of methods. Top performers [29], [42], [44] on the 3D object detection benchmarks [6], [14] fall into the category of direct 4DoF pose prediction, leveraging the advances in end-to-end deep learning. On the other hand, the 6DoF pose estimation benchmark [19] is largely dominated by geometry-based methods [20], [46], which exploit the provided 3D object models and achieve stable generalization performance. However, it is quite challenging to bring together the best of both worlds, i.e., to train a geometric model to learn the object pose in an end-to-end manner.

EPro-PnP is a general solution to end-to-end 2D-3D correspondence learning. In this paper, we present two distinct networks trained with EPro-PnP: (a) An off-the-shelf dense correspondence network whose potential is unleashed by end-to-end training, (b) a novel deformable correspondence network that explores new possibilities of fully learnable 2D-3D points.

References
1. Christopher M. Bishop, "Mixture density networks", 1994.
2. Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, et al., "DSAC - differentiable RANSAC for camera localization", CVPR, 2017.
3. Eric Brachmann, Frank Michel, Alexander Krull, Michael Ying Yang, Stefan Gumhold and Carsten Rother, "Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image", CVPR, 2016.
4. Eric Brachmann and Carsten Rother, "Learning less is more - 6D camera localization via 3D surface regression", CVPR, 2018.
5. Mai Bui, Tolga Birdal, Haowen Deng, Shadi Albarqouni, Leonidas Guibas, Slobodan Ilic, et al., "6D camera relocalization in ambiguous scenes via continuous multi-modal inference", ECCV, 2020.
6. Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, et al., "nuScenes: A multimodal dataset for autonomous driving", CVPR, 2020.
7. Dylan Campbell, Liu Liu and Stephen Gould, "Solving the blind perspective-n-point problem end-to-end with robust differentiable geometric optimization", ECCV, 2020.
8. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko, "End-to-end object detection with transformers", ECCV, 2020.
9. Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Celine Teuliere and Thierry Chateau, "Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image", CVPR, 2017.
10. Bo Chen, Alvaro Parra, Jiewei Cao, Nan Li and Tat-Jun Chin, "End-to-end learnable geometric vision by backpropagating PnP optimization", CVPR, 2020.
11. Hansheng Chen, Yuyao Huang, Wei Tian, Zhong Gao and Lu Xiong, "MonoRUn: Monocular 3D object detection by reconstruction and uncertainty propagation", CVPR, 2021.
12. Jean-Marie Cornuet, Jean-Michel Marin, Antonietta Mira and Christian P. Robert, "Adaptive multiple importance sampling", Scandinavian Journal of Statistics, vol. 39, no. 4, pp. 798-812, 2012.
13. Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, et al., "Deformable convolutional networks", ICCV, 2017.
14. Andreas Geiger, Philip Lenz and Raquel Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite", CVPR, 2012.
15. Igor Gilitschenski, Roshni Sahoo, Wilko Schwarting, Alexander Amini, Sertac Karaman and Daniela Rus, "Deep orientation uncertainty learning based on a Bingham loss", ICLR, 2020.
16. Stephen Gould, Richard Hartley and Dylan John Campbell, "Deep declarative networks", IEEE TPAMI, 2021.
17. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, "Deep residual learning for image recognition", CVPR, 2016.
18. Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides and Xiangyu Zhang, "Bounding box regression with uncertainty for accurate object detection", CVPR, 2019.
19. Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, et al., "Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes", ICCV, 2011.
20. Shun Iwase, Xingyu Liu, Rawal Khirodkar, Rio Yokota and Kris M. Kitani, "RePOSE: Fast 6D object pose refinement via deep texture rendering", ICCV, 2021.
21. Alex Kendall and Yarin Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?", NIPS, 2017.
22. Diederik P. Kingma and Max Welling, "Auto-encoding variational Bayes", ICLR, 2014.
23. Peixuan Li, Huaici Zhao, Pengfei Liu and Feidao Cao, "RTM3D: Real-time monocular 3D detection from object keypoints for autonomous driving", ECCV, 2020.
24. Zhigang Li, Gu Wang and Xiangyang Ji, "CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation", ICCV, 2019.
25. Ilya Loshchilov and Frank Hutter, "Decoupled weight decay regularization", ICLR, 2019.
26. Osama Makansi, Eddy Ilg, Ozgun Cicek and Thomas Brox, "Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction", CVPR, 2019.
27. Fabian Manhardt, Diego Martin Arroyo, Christian Rupprecht, Benjamin Busam, Nassir Navab and Federico Tombari, "Explaining the ambiguity of object detection and 6D pose from visual data", ICCV, 2019.
28. Arsalan Mousavian, Dragomir Anguelov, John Flynn and Jana Kosecka, "3D bounding box estimation using deep learning and geometry", CVPR, 2017.
29. Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li and Adrien Gaidon, "Is pseudo-lidar needed for monocular 3D object detection?", ICCV, 2021.
30. Kiru Park, Timothy Patten and Markus Vincze, "Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation", ICCV, 2019.