1. Introduction
As low-level vision reaches maturity, more and more researchers have started considering high-level scene understanding tasks such as scene segmentation, object detection and single view 3D reconstruction. Given the success of pixel-based approaches for low-level vision (such as image denoising, where each pixel is assigned an intensity value; or stereo reconstruction, where each pixel is assigned a disparity value), the initial expectation was that such models will also be suitable for scene understanding. However, this expectation has turned out to be overly optimistic as features extracted from patches surrounding pixels are prone to noise from background clutter. To address this issue, many researchers have advocated the use of regions, sets of connected pixels that share common characteristics such as color or texture, to extract reliable discriminative features.