I. Introduction
Many fundamental tasks in multimedia and computer vision, including image classification, annotation, and retrieval, rely on the efficient extraction of discriminative features, especially in the big data era. Recently, many researchers have focused on sparse coding-based representations for problems related to image/video classification [1]–[5], annotation [6], and concept detection [7]. Compared to vector quantization (VQ) in the traditional bag-of-words (BoW) model [8], sparse coding greatly reduces the reconstruction error of local features and can capture the salient properties of images. Combined with spatial pyramid matching [9], the sparse coding algorithm achieves state-of-the-art performance even with simple linear classifiers.

On the other hand, one significant limitation of the sparse coding representation is the huge computational cost of coding local features, especially when a large codebook is used. Typically, it takes more than one second to solve the sparse coding problem for a single image, which makes sparse coding features difficult to extract in real-world applications.

To tackle this problem, some researchers have attempted to enforce locality on the local features in order to reduce the effective size of the codebook. For example, Yu et al. [3] empirically found that sparse coding results tend to be local and accordingly proposed the local coordinate coding (LCC) scheme, which uses a subset of codewords to perform sparse coding. Wang et al. [2] further proposed the locality-constrained linear coding (LLC) scheme, which explicitly incorporates a locality constraint instead of the sparsity constraint to obtain an analytical solution. In addition to locality, other researchers have approached the problem either by using supervised codebook learning [10]–[12] to improve classification accuracy even with small codebooks, or by utilizing higher-order sparse coding based on a small codebook [5], [13].
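To make the locality idea concrete, the following is a minimal sketch of LLC-style coding in the spirit of Wang et al. [2]: a feature is encoded over only its k nearest codewords, and the code is obtained analytically by solving a small constrained least-squares system rather than an iterative sparse-coding optimization. The function name, the regularization constant, and the matrix sizes are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def llc_code(x, codebook, k=5):
    """Illustrative LLC-style coding: encode feature x over its k nearest
    codewords with an analytical least-squares solution (sketch, not the
    original implementation)."""
    # x: (d,) local feature; codebook: (M, d) matrix of M codewords
    dists = np.linalg.norm(codebook - x, axis=1)
    idx = np.argsort(dists)[:k]            # locality: keep k nearest codewords
    B = codebook[idx]                      # (k, d) local base
    z = B - x                              # shift codewords to the feature's origin
    C = z @ z.T                            # (k, k) local covariance
    C += 1e-4 * np.trace(C) * np.eye(k)    # small regularizer for stability (assumed value)
    w = np.linalg.solve(C, np.ones(k))     # analytical solution of the local system
    w /= w.sum()                           # enforce the sum-to-one constraint
    code = np.zeros(len(codebook))
    code[idx] = w                          # all other codewords get zero weight
    return code
```

Because only a k-by-k system is solved per feature, the cost is essentially independent of the full codebook size M, which is the source of LLC's speed advantage over standard sparse coding.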