1. Introduction
Object recognition is one of the basic issues in computer vision. It has made great progress in recent years with the rapid development of deep learning approaches [18, 33, 30, 13], where large numbers of labeled images are required, such as ImageNet [28]. However, collecting and annotating large numbers of images are difficult, especially for fine-grained categories in specific domains. Moreover, such supervised learning approaches can only recognize a fixed number of categories, which is not flexible. In contrast, humans can learn from only a few samples or even recognize unseen objects. Therefore, learning visual classifiers with no need of human annotation is becoming a hot topic in recent years.