1 Introduction
Scene parsing, or recognizing and segmenting objects in an image, is one of the core problems of computer vision. Traditional approaches to object recognition begin by specifying an object model, such as template matching [8], [49], constellations [13], [15], bags of features [19], [24], [44], [45], or shape models [2], [3], [14], etc. These approaches typically work with a fixed number of object categories and require training generative or discriminative models for each category from training data. In the parsing stage, these systems try to align the learned models to the input image and associate object category labels with pixels, windows, edges, or other image representations. Recently, context information has also been carefully modeled to capture the relationship between objects at the semantic level [20], [22]. Encouraging progress has been made by these models on a variety of object recognition and scene parsing tasks.