1. Introduction
In image retrieval, many state-of-the-art search engines rely on the Bag-of-Words (BoW) framework [15]. In this model, affine-invariant Hessian regions [10] are detected, and a local descriptor such as SIFT [9] is extracted from each of these affine regions. The local descriptors are then quantized to visual words according to a pre-trained dictionary, which is usually generated by the K-means algorithm, or by approximate K-means [13] for efficiency. Each visual word is then weighted by its inverse document frequency (IDF) [15], a scheme borrowed from text retrieval. Finally, an image is represented as a sparse BoW vector, and an inverted index [17] is used for fast access.
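The pipeline above can be illustrated with a minimal sketch. It is a toy illustration, not the implementation used by any of the cited systems: the function names (`quantize`, `build_index`, `search`) are hypothetical, descriptors are short tuples rather than 128-dimensional SIFT vectors, the dictionary is assumed to be given (for example, K-means centroids), the IDF form log(N / n_w) is one common choice, and vector normalization is omitted for brevity.

```python
import math
from collections import Counter, defaultdict


def quantize(descriptor, dictionary):
    """Return the index of the nearest visual word (squared Euclidean distance)."""
    return min(
        range(len(dictionary)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, dictionary[k])),
    )


def build_index(images, dictionary):
    """Build IDF weights and an inverted index.

    images: dict mapping image id -> list of local descriptors (assumed format).
    Returns (idf, inverted), where inverted maps a visual word to a list of
    (image id, tf * idf) postings.
    """
    # Term frequency of each visual word in each image.
    tf = {
        img: Counter(quantize(d, dictionary) for d in descs)
        for img, descs in images.items()
    }
    # Document frequency: number of images containing each visual word.
    df = Counter(word for counts in tf.values() for word in counts)
    n_images = len(images)
    idf = {word: math.log(n_images / df[word]) for word in df}

    inverted = defaultdict(list)
    for img, counts in tf.items():
        for word, count in counts.items():
            inverted[word].append((img, count * idf[word]))
    return idf, inverted


def search(query_descs, dictionary, idf, inverted):
    """Score database images by the dot product of tf-idf weighted BoW vectors."""
    query_tf = Counter(quantize(d, dictionary) for d in query_descs)
    scores = defaultdict(float)
    for word, count in query_tf.items():
        # Only images sharing a visual word with the query are visited.
        for img, weight in inverted.get(word, []):
            scores[img] += count * idf[word] * weight
    return sorted(scores.items(), key=lambda item: -item[1])


# Toy usage with 2-D "descriptors" and a three-word dictionary.
dictionary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
images = {
    "a": [(0.0, 1.0), (1.0, 0.0)],
    "b": [(9.0, 9.0), (0.0, 9.0)],
    "c": [(10.0, 11.0), (1.0, 9.0)],
}
idf, inverted = build_index(images, dictionary)
ranking = search([(0.0, 0.5)], dictionary, idf, inverted)
```

Because the BoW vectors are sparse, the inverted index lets a query touch only the images that share at least one visual word with it, which is what makes this representation fast to search at scale.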