1. Introduction
This paper addresses the problem of image parsing, or labeling each pixel in an image with its semantic category. Our goal is to achieve broad coverage - the ability to recognize hundreds or thousands of object classes that commonly occur in everyday street scenes and indoor environments. A major challenge is posed by the highly non-uniform statistics of these classes in realistic scene images. A small number of classes - mainly ones associated with large regions or “stuff,” such as road, sky, trees, and buildings - account for the majority of all image pixels and object instances in such datasets. But a much larger number of “thing” classes - people, cars, dogs, mailboxes, vases, stop signs - occupy only a small percentage of image pixels and have relatively few instances each.