1. Introduction
Representations based on loose collections of invariant appearance descriptors extracted from local image patches have become very popular for texture analysis and visual recognition [1], [5], [7], [9], [12], [17], [18], [19], [24], [26], [27]. They are flexible and relatively easy to construct, they capture a significant proportion of the complex statistics of real images and visual classes in a convenient local form, and they offer good resistance to occlusions, geometric deformations and illumination variations. In particular, the statistics of natural image classes can be encoded by vector quantizing the appearance space and accumulating histograms or signatures of patch appearances based on this coding. Traditionally one codes a dense set of patches (the ‘texton’ representation) [18], but sparser sets based on keypoints or ‘points of interest’ detected by invariant local feature detectors have attracted a lot of interest recently [1], [5], [7], [17], [26], [27]. Keypoints potentially provide greater invariance and more compact coding, but they were not designed to select the most informative regions for classification and, as we show below, dense sampling followed by explicit discriminative feature selection gives significantly better results.

[Figure: Top: sample images from four categories of the ‘Xerox 7’ dataset. Bottom: the image regions assigned to the 100 codewords with maximal inter-category discrimination in this dataset. The codewords represent meaningful object parts, even though they were learned without using the category labels.]
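The encoding step described above — vector quantizing the descriptor space and accumulating a histogram of codeword assignments — can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual pipeline: it uses a toy k-means quantizer and random vectors in place of real patch descriptors, and the helper names (`build_codebook`, `bow_histogram`) are hypothetical.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Toy k-means vector quantizer over patch descriptors (illustrative only)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest codeword (squared Euclidean distance)
        dist = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        # move each codeword to the mean of its assigned descriptors
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers

def bow_histogram(descriptors, centers):
    """Accumulate a normalized histogram of codeword assignments for one image."""
    dist = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dist.argmin(1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Toy demo: two synthetic descriptor clusters standing in for image patches.
rng = np.random.default_rng(1)
descs = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
                   rng.normal(1.0, 0.1, (50, 8))])
codebook = build_codebook(descs, k=4)
h = bow_histogram(descs, codebook)  # histogram entries sum to 1
```

In the dense-sampling approach advocated in the text, `descriptors` would come from every patch on a grid rather than from keypoint detections, and the discriminative selection step would then prune the codebook.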