1. Introduction
Over the past two decades, fine-grained visual classification (FGVC) has made significant progress in recognizing sub-categories of objects belonging to the same class. This progress has been demonstrated in various domains, such as recognizing cars [32], [57], aircraft [36], birds [50], [48], and foods [39], with many works surpassing human experts in numerous application scenarios [34], [56], [19], [53], [13], [4], [5], [16], [14]. However, previous efforts on FGVC have remained largely limited to a single-view paradigm, in which only the visual content of a single static image is considered. This paradigm may be sufficient for coarse-grained classification, where inter-class differences are easily captured, such as distinguishing a coupe from other vehicles by its streamlined body, engine, or headlamps. Fine-grained classification, however, presents a different challenge: discriminative clues are rare and often lie in subtle structural differences that a single static view does not easily capture. For instance, to distinguish between different Ford sedans, one can rely only on subtle differences in the design of the headlights. Consequently, for single-view approaches, an image/view that lacks discriminative clues is indistinguishable at the fine-grained level, which fundamentally limits the model’s theoretical performance.