1 Introduction
Many state-of-the-art image parsing or semantic segmentation methods attempt to compute a labeling of every pixel or segmentation region in an image [2], [4], [7], [14], [15], [19], [20]. Despite their rapidly increasing accuracy, these methods have several limitations. First, they have no notion of object instances: given an image with multiple nearby or overlapping cars, they are likely to produce a single blob of "car" labels instead of separately delineated instances (Figure 1(a)). In addition, pixel labeling methods tend to be more accurate for "stuff" classes that are characterized by local appearance rather than overall shape, such as road, sky, tree, and building. To do better on "thing" classes such as car, cat, person, and vase, as well as to gain the ability to represent object instances, it becomes necessary to incorporate detectors that model overall object shape.