1. Introduction
While images admit a standard representation in the form of a scalar function uniformly discretized on a grid, the curse of dimensionality has prevented the effective use of analogous representations for learning 3D geometry. Voxel representations have shown some promise at low resolution [10], [20], [35], [57], [62], [69], [74], while hierarchical representations have attempted to reduce the memory footprint required for training [58], [64], [73], but at the significant cost of complex implementations. Rather than representing the volume occupied by a 3D object, one can resort to modeling its surface via a collection of points [1], [19], polygons [31], [56], [71], or surface patches [26]. Alternatively, one might follow Cezanne's advice and “treat nature by means of the cylinder, the sphere, the cone, everything brought into proper perspective”, and approximate 3D geometry as geons [4]: collections of simple-to-interpret geometric primitives [68], [77], and their composition [60], [21].

Hence, one might rightfully start wondering “why do so many representations of 3D data exist, and why would one be more advantageous than another?” One observation is that multiple equivalent representations of 3D geometry exist because real-world applications need to perform different operations and queries on this data [9, Ch. 1]. For example, in computer graphics, points and polygons allow for very efficient rendering on GPUs, volumes allow artists to sculpt geometry without having to worry about tessellation [51] and to assemble geometry by smooth composition [2], and primitives enable highly efficient collision detection [66] and resolution [67].
In computer vision and robotics, analogous trade-offs exist: surface models are essential for the construction of low-dimensional parametric templates for tracking [6], [8], volumetric representations are key to capturing 3D data whose topology is unknown [47], [48], and part-based models provide a natural decomposition of an object into its semantic components. Part-based models create a representation useful for reasoning about extent, mass, contact, and other quantities that are key to describing the scene and planning motions [28], [29].
Our method reconstructs a 3D object from an input image as a collection of convex hulls, which we visualize as an exploded view of the convexes. Notably, CvxNet outputs polygonal mesh representations of convex polytopes without requiring the execution of computationally expensive iso-surfacing (e.g., Marching Cubes). This means the representation output by CvxNet can be readily used for physics simulation [17], as well as many other downstream applications that consume polygonal meshes.
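The reason no iso-surfacing is needed is that a convex body defined by half-space constraints can be meshed exactly via convex duality. A minimal sketch of this idea using SciPy (not the paper's code; the unit-cube half-spaces are a hypothetical example):

```python
# Hedged sketch: mesh a single convex defined by half-spaces
# n_i . x + d_i <= 0, with no grid or marching cubes involved.
import numpy as np
from scipy.spatial import ConvexHull, HalfspaceIntersection

# Example half-spaces for a unit cube; each row is [n_i, d_i].
halfspaces = np.array([
    [ 1., 0., 0., -1.],   #  x <= 1
    [-1., 0., 0., -1.],   # -x <= 1
    [ 0., 1., 0., -1.],   #  y <= 1
    [ 0.,-1., 0., -1.],   # -y <= 1
    [ 0., 0., 1., -1.],   #  z <= 1
    [ 0., 0.,-1., -1.],   # -z <= 1
])

interior = np.zeros(3)  # any point strictly inside the convex
hs = HalfspaceIntersection(halfspaces, interior)
verts = hs.intersections             # exact polytope vertices
faces = ConvexHull(verts).simplices  # triangle indices -> polygonal mesh

print(len(verts), len(faces))
```

The vertices come directly from intersecting the bounding planes, so the output mesh is an exact polytope rather than an approximation extracted from a sampled implicit field.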