1. Introduction
Understanding the complex layout of the 3D world is a crucial ability for applications such as robot navigation, driver assistance systems, and autonomous driving. Recent successes in deep learning-based perception enable pixel-accurate semantic segmentation [3], [4], [33] and (monocular) depth estimation [9], [15], [32] in the perspective view of the scene. Other works like [10], [23], [25] go further, reasoning about occlusions and building richer representations for 3D scene understanding. The representations in these works, however, are typically non-parametric, i.e., they provide a semantic label for each 2D/3D point of the scene, which makes higher-level reasoning difficult for downstream applications.