I. Introduction
Fine-grained image classification aims to recognize hundreds of subcategories within the same basic-level category, and lies in the continuum between basic-level image classification (e.g., object recognition [2], [3]) and identification of individuals (e.g., face recognition [4], [5]). It is one of the most significant and challenging open problems in computer vision, for two reasons: (1) Large variance within the same subcategory. As shown in the first row of Fig. 1, the four images all belong to the subcategory “Laysan Albatross”, but they differ in pose, viewpoint, feathers and so on, so even human beings can easily misclassify them into different subcategories. (2) Small variance among different subcategories. As shown in the second row of Fig. 1, the four images belong to different subcategories, but they all show black birds that look similar, so it is hard even for human beings to distinguish “Fish Crow” from the other three subcategories. Subcategories in the same basic-level category look similar in global appearance but differ in a few discriminative regions of the object, such as the head. Localizing these key discriminative regions is therefore crucial for fine-grained image classification. Recently, methods based on discriminative localization have achieved great progress [6]–[12].
Fig. 1. Examples from the CUB-200-2011 dataset [1]. The first row shows large variance within the same subcategory, and the second row shows small variance among different subcategories.