Abstract:
We address the problem of visual category recognition by learning an image-to-image distance function that attempts to satisfy the following property: the distance between images from the same category should be less than the distance between images from different categories. We use patch-based feature vectors common in object recognition work as a basis for our image-to-image distance functions. Our large-margin formulation for learning the distance functions is similar to formulations used in the machine learning literature on distance metric learning; however, we differ in that we learn local distance functions (a different parameterized function for every image of our training set), whereas typically a single global distance function is learned. This was a novel approach first introduced in Frome, Singer, & Malik, NIPS 2006. In that work we learned the local distance functions independently, and the outputs of these functions could not be compared at test time without the use of additional heuristics or training. Here we introduce a different approach that has the advantage that it learns distance functions that are globally consistent, in that they can be directly compared for purposes of retrieval and classification. The output of the learning algorithm is a set of weights assigned to the image features, which is intuitively appealing in the computer vision setting: some features are more salient than others, and which features are more salient depends on the category, or image, being considered. We train and test using the Caltech 101 object recognition benchmark.
Consider the triplet of images, drawn from the Caltech101 dataset [4], shown in Figure 1. We want to classify a query image $i$, and we have stored exemplar images $j_1$ and $j_2$. Let $D_{j_1 i}$ be the distance from image $i$ to $j_1$, and $D_{j_2 i}$ be the distance from image $i$ to $j_2$, where $D_{j_1 i} < D_{j_2 i}$. Then a nearest neighbor classifier, which assigns the category of the query image based on which of $D_{j_1 i}$ and $D_{j_2 i}$ is smaller, would trivially do the right thing. Note that for this to work, the distance function need not be symmetric; in general $D_{ji} \neq D_{ij}$. To approach this problem, we parameterize the image-to-image distance functions using a weighted linear combination of distances between patch-based shape feature descriptors, such as SIFT [14] or geometric blur [2]. These features characterize image patches by fixed-length vectors, which can be compared using standard vector metrics such as the $L_2$ distance. One possible approach to computing an image-to-image distance is to attempt to solve the correspondence problem by taking into account both distances between feature vectors and the geometric arrangement of their patches (e.g. [1]). However, this is expensive, and recent approaches that use sets of features and the absolute positions of patches provide good approximations that work well in practice. We work in a setting where we approximate correspondence by only the distance between feature vectors. More precisely, given the $m$th of $M$ patch features from image $j$, we find the best-matching patch in image $i$ and let the distance between them be $d_{ji,m}$. The image-to-image distance is defined to be a weighted sum of these distances:
$$D_{ji}=\sum_{m=1}^{M}w_{j,m}\,d_{ji,m} \qquad (1)$$
where the different patch features, indexed by $m$, in image $j$ are assigned possibly different weights $w_{j,m}$. The intuition is that the weights will be high for “relevant” features and low or zero for “irrelevant” features for characterizing the visual category of image $j$. Figure 2 is a visualization of the weights that our algorithm learned for three training images.
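To make Eq. (1) concrete, here is a minimal sketch (not the authors' code) of the elementary distances $d_{ji,m}$ and the weighted image-to-image distance $D_{ji}$, assuming each image is represented as a NumPy array of fixed-length patch descriptors compared under the $L_2$ metric; the function names, descriptor dimension, patch counts, and the random "learned" weight vectors are illustrative placeholders.

```python
import numpy as np

def elementary_distances(feats_j, feats_i):
    """d_{ji,m}: for each of the M patch descriptors of image j, the L2 distance
    to its best-matching (nearest) patch descriptor in image i."""
    diff = feats_j[:, None, :] - feats_i[None, :, :]   # (M, N_i, dim) pairwise differences
    pairwise = np.sqrt((diff ** 2).sum(axis=2))        # (M, N_i) L2 distances
    return pairwise.min(axis=1)                        # (M,) best match per patch of j

def image_to_image_distance(weights_j, feats_j, feats_i):
    """D_{ji} = sum_m w_{j,m} * d_{ji,m}  (Eq. 1); weights_j has one entry per
    patch of image j, many of which may be zero after learning."""
    return float(np.dot(weights_j, elementary_distances(feats_j, feats_i)))

# Toy usage: classify query image i against two stored exemplars j1 and j2.
rng = np.random.default_rng(0)
feats_i  = rng.standard_normal((350, 128))    # query: 350 patch descriptors
feats_j1 = rng.standard_normal((400, 128))    # exemplar j1: 400 patch descriptors
feats_j2 = rng.standard_normal((380, 128))    # exemplar j2: 380 patch descriptors
w_j1, w_j2 = rng.random(400), rng.random(380)  # stand-ins for learned weights

D_j1i = image_to_image_distance(w_j1, feats_j1, feats_i)
D_j2i = image_to_image_distance(w_j2, feats_j2, feats_i)
predicted = "category of j1" if D_j1i < D_j2i else "category of j2"
```

In this toy setup the query is assigned the category of whichever exemplar yields the smaller distance, mirroring the nearest neighbor classifier described above; note that $D_{ji}$ is not symmetric, since it uses the patches and weights of the exemplar $j$.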
Figure 2: Visualizations of the weights learned by our algorithm for one type of shape feature (geometric blur with a 42-pixel radius) for three images from our training set. Each circle is centered at the center point of the feature's patch, and the color of the circle indicates the relative value of the weight. The colors are on Matlab's jet scale, where dark red is the highest weight (most salient feature) and dark blue is the lowest non-zero weight. Features that were assigned a zero weight are not shown. Note that the circles are much smaller than the extent of the features and that the colors are scaled separately for each image. For (a), the algorithm learned zero weights for all but 83 of the 400 small geometric blur features computed for the image, and learned that the most important feature is the patch around the eye of the panda. For (b), it learned that the most useful features are on the breast and tail of the rooster, and assigned zero weights to all but 79 of the roughly 400 small geometric blur features. For (c), it learned that the best features aren't on the leopard at all, but are along the right edge and in the upper-right corner of the image. The Leopards images were drawn from the Corel image set, and they have a thin black border around the image that algorithms can exploit, making it a surprisingly easy category.