1. Introduction
Understanding 3D properties of road scenes from single or multiple images is an important and challenging task. Semantic segmentation [2], [3], [41], (monocular) depth estimation [9], [16], [40] and road layout estimation [10], [25], [28], [38] are some well-explored directions for single image 3D scene understanding. Compared to image-based inputs, videos provide the opportunity to exploit more cues such as temporal coherence, dynamics and context [19], yet 3D scene understanding in videos is relatively under-explored, especially with long-term inputs. This work takes a step towards 3D road scene understanding in videos through a holistic consideration of local, global and consistency cues.
Given perspective images that capture a 3D scene, our goal is to predict the layout of complex driving scenes in the top view both accurately and coherently.