Multi-View Active Fine-Grained Visual Recognition


Abstract:

Despite years of remarkable progress, fine-grained visual classification (FGVC) is still limited to recognizing 2D images. Recognizing objects in the physical world (i.e., a 3D environment) poses a unique challenge: discriminative information is present not only in visible local regions but also in other, unseen views. Therefore, in addition to finding the distinguishable part in the current view, efficient and accurate recognition requires inferring the critical perspective with minimal glances. For example, a person might recognize a "Ford sedan" from a glance at its side and then know that looking at the front can help tell which model it is. In this paper, towards FGVC in the real physical world, we put forward the problem of multi-view active fine-grained visual recognition (MAFR) and complete this study in three steps: (i) a multi-view, fine-grained vehicle dataset is collected as the testbed; (ii) a pilot experiment is designed to validate the need for and research value of MAFR; (iii) a policy-gradient-based framework, along with a dynamic exiting strategy, is proposed to achieve efficient recognition with active view selection. Our comprehensive experiments demonstrate that the proposed method outperforms previous multi-view recognition works and can extend existing state-of-the-art FGVC methods and advanced neural networks to become "FGVC experts" in the 3D environment. Our code is available at https://github.com/PRIS-CV/MAFR.
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France


1. Introduction

In the past two decades, fine-grained visual classification (FGVC) has made significant progress in recognizing sub-categories of objects belonging to the same class. This progress has been demonstrated in various domains, such as recognizing cars [32], [57], aircraft [36], birds [50], [48], and foods [39], with many outstanding works surpassing human experts in a range of application scenarios [34], [56], [19], [53], [13], [4], [5], [16], [14]. However, previous efforts on FGVC have remained largely limited to a single-view paradigm, in which only the visual content within a single static image is considered. This paradigm may be sufficient for coarse-grained classification, where inter-class differences are easily captured, such as distinguishing a coupe from other vehicles by its streamlined body, seductive engine, or headlamps. Fine-grained classification presents a different challenge: discriminative clues are rare and often lie in subtle structural differences that a single static view does not easily capture. For instance, to distinguish between different Ford sedans, one can rely only on subtle differences in the design of the headlights. Consequently, for single-view approaches, an image or view without discriminative clues is completely indistinguishable at the fine-grained level, which fundamentally limits the model's theoretical performance.
