1. Introduction
Densely labeling a scene is a challenging recognition task that is the focus of much recent work [7], [13], [20], [26], [27]. The difficulty stems from several factors. First, the incredible diversity of the visual world means that each region can potentially take on one of hundreds of different labels. Second, the distribution of classes in a typical scene is far from uniform, following a power law (as illustrated in Fig. 8). Consequently, many classes have only a small number of instances even in a large dataset, making it hard to train good classifiers. Third, as noted by Frome et al. [3], a single global distance metric for all descriptors is insufficient to handle the large degree of variation found within a given class. For example, position within the image may sometimes be an important cue for finding people (e.g. when they are walking on a street), while on other occasions position may be irrelevant and color a much better feature (e.g. when the person is close and facing the camera).