
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification



Abstract:

Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potential for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address the above issues, we propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions, which improve feature discrimination with background suppression. As a result, our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our methods. The code is available at https://github.com/924973292/EDITOR.
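
To make the pipeline described above concrete, here is a minimal, runnable sketch of the overall flow: one shared Transformer encoder tokenizes every modality, a token-selection step keeps a fraction of object-centric tokens per modality, and the selected tokens are pooled and fused. This is an illustrative assumption rather than the released EDITOR code (see the repository linked above): the class name SketchEDITOR, the generic nn.TransformerEncoder standing in for the shared vision Transformer, the norm-based token scoring standing in for SFTS, and mean pooling plus concatenation standing in for HMA are all placeholders.

import torch
import torch.nn as nn

class SketchEDITOR(nn.Module):
    # Hypothetical skeleton, NOT the authors' implementation:
    # shared encoder -> per-modality token selection -> pooling -> fusion.
    def __init__(self, dim=768, depth=2, keep_ratio=0.5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.keep_ratio = keep_ratio

    def forward(self, modality_tokens):
        # modality_tokens: dict {name: [B, N, D]} of patch embeddings, e.g. RGB/NIR/TIR.
        fused = []
        for name, tok in modality_tokens.items():
            tok = self.shared_encoder(tok)                 # shared backbone for all modalities
            score = tok.norm(dim=-1)                       # stand-in importance score per token
            k = max(1, int(self.keep_ratio * tok.size(1)))
            idx = score.topk(k, dim=1).indices             # keep the top-k tokens per image
            sel = torch.gather(tok, 1, idx.unsqueeze(-1).expand(-1, -1, tok.size(-1)))
            fused.append(sel.mean(dim=1))                  # pooled per-modality feature
        return torch.cat(fused, dim=-1)                    # fused multi-modal representation

# Usage on dummy inputs: three modalities, 128 patch tokens each.
feats = SketchEDITOR()({m: torch.randn(2, 128, 768) for m in ("rgb", "nir", "tir")})
print(feats.shape)  # torch.Size([2, 2304])

In a full ReID setup, the fused feature would additionally be supervised with identity and metric losses (the paper adds the BCC and OCFR losses on top for background suppression); those are omitted here for brevity.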
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA


Author affiliations:
School of Future Technology, School of Artificial Intelligence, Dalian University of Technology, China
Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, China
School of Computer Science and Technology, Anhui University, China

1. Introduction

Object re-identification (ReID) aims to retrieve specific objects (e.g., persons, vehicles) across non-overlapping cameras. Over the past few decades, object ReID has advanced significantly. However, traditional object ReID with single-modal input encounters substantial challenges [17], particularly in complex visual scenarios such as extreme illumination, thick fog and low image resolution. Such conditions can cause noticeable distortions in critical object regions, disrupting the retrieval process [53]. Therefore, there has been a notable shift toward multi-modal approaches in recent years, capitalizing on diverse data sources to enhance feature robustness for practical applications [43], [44], [53]. However, as illustrated in Fig. 1, previous multi-modal ReID methods typically extract global features from all regions of images in different modalities and subsequently aggregate them. These methods present two key limitations: (1) Within individual modalities, backgrounds introduce additional noise [37], especially in challenging visual scenarios. (2) Across different modalities, backgrounds add overhead to reducing modality gaps, which may amplify the difficulty of aggregating features [15]. Hence, our method prioritizes the selection of object-centric information, aiming to preserve the diverse features of different modalities while minimizing background interference.

Figure 1. Comparison of different methods and token selections. (a) Framework of previous methods; (b) Framework of our proposed EDITOR; (c) RGB images; (d) Spatial-based token selection; (e) Multi-modal frequency transform; (f) Frequency-based token selection; (g) Selected tokens in the NIR modality; (h) Selected tokens in the TIR modality.
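
Panels (d)-(f) of the figure contrast spatial-based and frequency-based token selection; the sketch below illustrates how the two cues might be combined to pick object-centric tokens from a grid of patch tokens. It is a simplified assumption rather than the paper's SFTS module: the function name select_object_tokens, the feature-norm spatial score, and the high-frequency-energy score obtained by zeroing the DC component of a 2-D FFT over the token grid are all illustrative choices.

import torch

def select_object_tokens(tokens, grid_hw, keep_ratio=0.5):
    # tokens: [B, N, D] patch embeddings; grid_hw: (H, W) layout of the N tokens.
    B, N, D = tokens.shape
    H, W = grid_hw
    assert H * W == N

    # Spatial cue: assume tokens with larger activations are more object-centric.
    spatial_score = tokens.norm(dim=-1)                        # [B, N]

    # Frequency cue: backgrounds tend to be smooth, so measure the high-frequency
    # energy left after removing the DC term of a 2-D FFT over the token grid.
    grid = tokens.transpose(1, 2).reshape(B, D, H, W)
    freq = torch.fft.fft2(grid, norm="ortho")
    freq[..., 0, 0] = 0                                        # drop the DC (lowest-frequency) term
    recon = torch.fft.ifft2(freq, norm="ortho").real
    freq_score = recon.reshape(B, D, N).norm(dim=1)            # [B, N]

    # Normalize both cues to [0, 1], fuse them, and keep the top-k tokens.
    def norm01(x):
        lo, hi = x.amin(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)
        return (x - lo) / (hi - lo + 1e-6)
    score = norm01(spatial_score) + norm01(freq_score)
    k = max(1, int(keep_ratio * N))
    idx = score.topk(k, dim=1).indices
    selected = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return selected, idx

# Usage: a 16x8 token grid, keeping half of the tokens; in a multi-modal setting
# the same routine would be applied to each modality (RGB, NIR, TIR).
sel, idx = select_object_tokens(torch.randn(2, 16 * 8, 768), grid_hw=(16, 8))
print(sel.shape)  # torch.Size([2, 64, 768])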

