I. Introduction
Over the past decade, local feature descriptors have made significant success in addressing many problems in multimedia processing, i.e. image classification [1], [2], object categorization [3] –[5], and image/video analysis [6]–[8], etc. There are mainly three kinds of local feature descriptors – shape, color and texture descriptors. After extracting a set of local features in an image, a feature encoding step is usually applied to cast them into a fixed-length feature vector for image representation [9]. Two widely used feature encoding techniques are the Bag-of-Words (BoW) [10] and Fisher Vector (FV) [11].