1. INTRODUCTION
Recognizing fine-grained categories (e.g. bird species [2], flower types [13], car models [10]) by computer vision techniques has attracted extensive attention. The task is very challenging as fine-grained image recognition should be capable of localizing and representing the subtle visual differences within subordinate categories. Early applications of deep learning in this task build traditional multistage frameworks upon convolutional neural network (CNN) features; more recent CNN-based approaches can be roughly divided into two categories: localization-classification and end-to-end feature encoding.