
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification



Abstract:

Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potential for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address the above issues, we propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions, which improve feature discrimination with background suppression. As a result, our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our method. The code is available at https://github.com/924973292/EDITOR.
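The abstract describes a clear data flow: a shared encoder tokenizes each modality, a selection module keeps object-centric tokens, and an aggregation module fuses them into a single ReID feature. The minimal sketch below illustrates only that flow; the class names, dimensions, the top-k saliency scoring standing in for SFTS, and the plain self-attention fusion standing in for HMA are our assumptions, not the authors' implementation (see the linked repository for that), and the BCC/OCFR losses are omitted.

```python
# Minimal sketch of the described pipeline: shared encoder -> per-modality
# token selection -> cross-modal aggregation. Names, shapes, and heuristics
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    """Keep the top-k tokens ranked by a learned per-token saliency score
    (a simple stand-in for the paper's Spatial-Frequency Token Selection)."""

    def __init__(self, dim: int, keep: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # assumed scoring head
        self.keep = keep

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.score(tokens).squeeze(-1)                  # (B, N)
        idx = scores.topk(self.keep, dim=1).indices              # kept token indices
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                             # (B, keep, dim)


class MultiModalReID(nn.Module):
    """Shared Transformer encoder, token selection per modality, then a
    fusion encoder standing in for hierarchical aggregation (HMA)."""

    def __init__(self, dim: int = 256, keep: int = 32, num_ids: int = 100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=2)
        fuse = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(fuse, num_layers=1)
        self.select = TokenSelector(dim, keep)
        self.classifier = nn.Linear(dim, num_ids)                # ID logits for training

    def forward(self, rgb, nir, tir):
        # Each input: (B, N, dim) patch tokens from a shared tokenizer.
        kept = [self.select(self.shared_encoder(x)) for x in (rgb, nir, tir)]
        fused = self.fuse(torch.cat(kept, dim=1))                # within/cross-modal interaction
        feat = fused.mean(dim=1)                                 # pooled ReID feature
        return feat, self.classifier(feat)


if __name__ == "__main__":
    B, N, D = 2, 128, 256
    model = MultiModalReID(dim=D)
    feat, logits = model(torch.randn(B, N, D), torch.randn(B, N, D), torch.randn(B, N, D))
    print(feat.shape, logits.shape)  # torch.Size([2, 256]) torch.Size([2, 100])
```

In the paper, the BCC and OCFR terms would be added to the training objective on top of the usual ReID losses; they are left out here to keep the data flow readable.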
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA

1. Introduction

Object re-identification (ReID) aims to retrieve specific objects (e.g., persons, vehicles) across non-overlapping cameras. Over the past few decades, object ReID has advanced significantly. However, traditional object ReID with single-modal input encounters substantial challenges [17], particularly in complex visual scenarios such as extreme illumination, thick fog and low image resolution. These conditions can cause noticeable distortions in critical object regions, disrupting the retrieval process [53]. Therefore, there has been a notable shift toward multi-modal approaches in recent years, capitalizing on diverse data sources to enhance feature robustness for practical applications [43], [44], [53]. However, as illustrated in Fig. 1, previous multi-modal ReID methods typically extract global features from all regions of images in different modalities and subsequently aggregate them. These methods present two key limitations: (1) within individual modalities, backgrounds introduce additional noise [37], especially in challenging visual scenarios; (2) across different modalities, backgrounds add overhead to reducing modality gaps, which may amplify the difficulty of aggregating features [15]. Hence, our method prioritizes the selection of object-centric information, aiming to preserve the diverse features of different modalities while minimizing background interference.

Figure 1: Comparison of different methods and token selections. (a) Framework of previous methods; (b) Framework of our proposed EDITOR; (c) RGB images; (d) Spatial-based token selection; (e) Multi-modal frequency transform; (f) Frequency-based token selection; (g) Selected tokens in the NIR modality; (h) Selected tokens in the TIR modality.
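Panels (d)-(f) contrast spatial and frequency cues for choosing tokens. As a rough illustration of the frequency side, the sketch below scores each patch token by its high-frequency energy over the 2-D token grid and keeps the top-k; the crude FFT high-pass filter and the scoring are assumptions for illustration, not the paper's SFTS module.

```python
# Rough illustration of frequency-based token scoring (cf. panels (e)-(f)):
# rank patch tokens by high-frequency energy over the 2-D token grid.
# The FFT high-pass filter below is an assumed, simplified heuristic.
import torch


def high_frequency_scores(tokens: torch.Tensor, grid: int, cutoff: int = 2) -> torch.Tensor:
    """tokens: (B, grid*grid, dim) patch tokens -> (B, grid*grid) scores."""
    B, N, D = tokens.shape
    fmap = tokens.transpose(1, 2).reshape(B, D, grid, grid)  # token grid as feature map
    spec = torch.fft.fft2(fmap)
    spec[..., :cutoff, :cutoff] = 0                          # suppress low frequencies (crude)
    high = torch.fft.ifft2(spec).real                        # high-pass residual
    return high.pow(2).sum(dim=1).reshape(B, N)              # per-token energy


def select_tokens(tokens: torch.Tensor, grid: int, keep: int) -> torch.Tensor:
    idx = high_frequency_scores(tokens, grid).topk(keep, dim=1).indices
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)                             # (B, keep, dim)


if __name__ == "__main__":
    toks = torch.randn(2, 16 * 16, 256)                      # 16x16 patch grid
    print(select_tokens(toks, grid=16, keep=32).shape)       # torch.Size([2, 32, 256])
```

Object boundaries tend to carry more high-frequency energy than flat background regions, which is the intuition behind scoring tokens this way; the paper combines such frequency cues with spatial ones in SFTS.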
