1. Introduction
Fine-grained visual classification (FGVC) was first introduced to the vision community almost two decades ago with the landmark paper of [2]. It raised a critical question that was largely overlooked at the time: can machines match humans at recognising objects at a fine-grained level (e.g., a "flamingo" rather than a "bird")? Great strides have been made over the years, from the conventional part-based models [51], [14], [1], [3] to the recent surge of deep models that explicitly or implicitly tackle part learning, with or without strong supervision [26], [34], [52], [55], [57], [48]. Without exception, the focus has been on mining fine-grained discriminative features to improve classification performance.