1. INTRODUCTION
Image classification is a classical problem in computer vision. Among the many methods proposed, Bag-of-Words (BoW) [1], which represents an image as a histogram of its local descriptors, has been widely used and has shown good performance. It is robust to spatial translations of features. However, a histogram by nature discards all information about the spatial layout of the local descriptors, which limits its representational power. Spatial pyramid matching (SPM) [2], an extension of the BoW representation, has been widely applied: it partitions an image into increasingly fine spatial sub-regions at several scales, computes the word histogram inside each sub-region, and then concatenates all the histograms to form a vector representation of the image.
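To make the SPM construction concrete, the following is a minimal sketch of the partition, histogram, and concatenation steps. It assumes each local descriptor has already been assigned a visual-word index, and that level l of the pyramid splits the image into 2^l x 2^l cells. The function name, its parameters, and the toy data are illustrative assumptions, and any per-level weighting or normalization is left out for brevity.

```python
from typing import List, Sequence, Tuple


def spatial_pyramid_histogram(
    points: Sequence[Tuple[float, float]],
    words: Sequence[int],
    width: float,
    height: float,
    vocab_size: int,
    levels: int = 3,
) -> List[int]:
    """Concatenate per-cell visual-word histograms over a spatial pyramid.

    points: (x, y) location of each local descriptor (hypothetical input)
    words:  visual-word index assigned to each descriptor
    levels: number of pyramid levels; level l has 2**l x 2**l cells (assumption)
    """
    feature: List[int] = []
    for level in range(levels):
        cells = 2 ** level
        # One histogram of size vocab_size per cell, stored row by row.
        hist = [0] * (cells * cells * vocab_size)
        for (x, y), w in zip(points, words):
            # Map the descriptor location to its cell; clamp points on the border.
            cx = min(int(x / width * cells), cells - 1)
            cy = min(int(y / height * cells), cells - 1)
            hist[(cy * cells + cx) * vocab_size + w] += 1
        feature.extend(hist)
    return feature


# Toy example: three descriptors in a 100 x 100 image, vocabulary of two words.
points = [(10, 10), (90, 10), (90, 90)]
words = [0, 1, 1]
feature = spatial_pyramid_histogram(points, words, 100, 100, vocab_size=2, levels=2)
```

With a single level, the result reduces to the plain BoW histogram. The finer levels record which sub-region each visual word falls in, which is the spatial layout information that the global histogram discards.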