I. Introduction
Fine-grained visual recognition (FGVC) is a fundamental task in image understanding. Its principal objective is to discriminate among visually similar objects, species, brands, and comparable entities, assigning them to highly specific categories that are subsets of broader, coarse-grained classes [1]. A prevalent paradigm among existing FGVC methods is the part-driven approach. These methods, whether explicitly or implicitly, aim to identify and exploit informative components, often referred to as parts, to represent a fine-grained category [2], [3], [4], [5]. In this context, parts typically refer to localized regions or patches within an image [6], [7], [8]. Part-driven approaches have yielded promising results, especially on object-centric images, such as those of different dog breeds or aircraft types, which are typically captured under constrained conditions and from favorable front-view perspectives.