I. Introduction
Image categorization has attracted much attention in recent years. Its goal is to categorize a collection of unlabeled images into a set of predefined classes for semantic-level image retrieval. Among various image classification methods, many researchers have developed a set of sophisticated models, to represent the spatial context of the local patches in the images, e.g., hidden conditional random fields [2], constellation model [3], etc. Among them, 2-dimensional hidden Markov model (2-D HMM) has attracted much attention as a classic spatial contextual model [4]–[6]. This model can efficiently capture the spatial context among different patches in the images. In more detail, when using 2-D HMM for image categorization, a model is first learned from a training set of images for each image class. Then this learned model can be used to score the probability of an unlabeled image belonging to this class. However, the images in one class usually have large intra-class variance and this variance often leads to the difficulty in constructing a common spatial contextual model for this class. Fig. 1 illustrates an example of this difficulty. The images in the category “car” have many different views in this example, such as top view, side view, front view, and back view. Each view has a different spatial context of their local patches. These differences between the image spatial contexts bring large intra-class variance for this category. As stated above, the traditional 2-D HMM attempts to use a common model to generate all these images with different spatial structures. Therefore, the depictive ability of a single model is too limited to capture large intra-class variance perfectly. Actually, the above problem also exists in many other spatial-contextual models for image categorization, which attempt to use one common generative model to represent one class, such as HCRF [2] and constellation model [3].
Using one common 2-D HMM to model the category “car” with large intra-class variance. In this example, four different views in “car” make one model inadequate to capture such an intra-class variance.