1. Introduction
The task of fine-grained classification is to recognize subordinate categories belonging to the same superordinate category [42], [39], [32], [25]. The major challenge is that fine-grained objects share a similar overall appearance and differ only subtly in highly localized regions. To find these discriminative regions effectively and accurately, some previous approaches rely on humans in the loop [4], [7], [40], or require semantic part annotations [30], [2], [1], [3], [12], [47], [48] or 3D models [29], [25]. These methods are effective, but they require extra keypoint, part, or 3D annotations from humans, which are often expensive to obtain. On the other hand, recent research on mining discriminative mid-level visual elements [9], [36], [8], [21], [27] automatically finds discriminative patches or regions from a huge pool of candidates and uses the responses of those elements as a mid-level representation for classification. However, this approach has mainly been applied to scene classification and rarely to fine-grained classification. This is probably because the discriminative patches for fine-grained categories must be localized more accurately than those for scene classification.