
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification


Abstract:

Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potential for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address the above issues, we propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions, which improve feature discrimination with background suppression. As a result, our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our methods. The code is available at https://github.com/924973292/EDITOR.
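To make the token-selection step concrete, below is a minimal PyTorch sketch of how object-centric tokens could be picked using both a spatial cue and a frequency cue, as the SFTS description suggests. This is not the authors' implementation (see the linked repository for that); the function name, the use of [CLS] attention as the spatial score, the FFT-based frequency score, and the equal weighting of the two cues are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_tokens(patch_tokens, cls_attn, patches, keep_ratio=0.5):
    """Keep the top-k object-centric tokens per image (hedged sketch).

    patch_tokens: (B, N, D) patch embeddings from a shared ViT
    cls_attn:     (B, N) attention of the [CLS] token over patches (spatial cue)
    patches:      (B, N, P, P) grayscale image patches (for the frequency cue)
    """
    # Frequency cue: energy of the non-DC components of each patch
    # (roughly, its texture/edge content), computed with a 2-D FFT.
    mag = torch.fft.fft2(patches).abs()
    mag[..., 0, 0] = 0.0                                 # drop the DC component
    freq_score = mag.mean(dim=(-2, -1))                  # (B, N)

    # Normalize and combine the two cues (equal weighting is an assumption).
    score = F.normalize(cls_attn, dim=-1) + F.normalize(freq_score, dim=-1)

    k = max(1, int(patch_tokens.size(1) * keep_ratio))
    idx = score.topk(k, dim=-1).indices                  # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
    return torch.gather(patch_tokens, 1, idx)            # (B, k, D)
```

Running the same selection independently on the RGB, NIR and TIR token sets would preserve each modality's own salient regions, which is the behavior Fig. 1 (d)-(h) illustrates.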
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA


1. Introduction

Object re-identification (ReID) aims to retrieve specific objects (e.g., persons, vehicles) across non-overlapping cameras. Over the past few decades, object ReID has advanced significantly. However, traditional object ReID with single-modal input encounters substantial challenges [17], particularly in complex visual scenarios such as extreme illumination, thick fog and low image resolution. These conditions can cause noticeable distortions in critical object regions, leading to disruptions during the retrieval process [53]. Therefore, there has been a notable shift toward multi-modal approaches in recent years, capitalizing on diverse data sources to enhance feature robustness for practical applications [43], [44], [53]. However, as illustrated in Fig. 1, previous multi-modal ReID methods typically extract global features from all regions of images in different modalities and subsequently aggregate them. These methods present two key limitations: (1) within individual modalities, backgrounds introduce additional noise [37], especially in challenging visual scenarios; (2) across different modalities, backgrounds introduce overhead in reducing modality gaps, which may amplify the difficulty of aggregating features [15]. Hence, our method prioritizes the selection of object-centric information, aiming to preserve the diverse features of different modalities while minimizing background interference.

Figure 1. Comparison of different methods and token selections. (a) Framework of previous methods; (b) Framework of our proposed EDITOR; (c) RGB images; (d) Spatial-based token selection; (e) Multi-modal frequency transform; (f) Frequency-based token selection; (g) Selected tokens in the NIR modality; (h) Selected tokens in the TIR modality.
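As a rough illustration of how the selected tokens could then interact within and across modalities (the role the abstract assigns to the HMA module), the following PyTorch sketch runs attention over the concatenated multi-modal tokens while masking out unselected background positions. The class name, the single attention layer, and the masked mean pooling are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MaskedAggregation(nn.Module):
    """Hedged sketch of masked aggregation: attention attends only to the
    selected (object-centric) tokens; unselected background positions are
    excluded via the key padding mask."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens, keep_mask):
        # tokens:    (B, N, D) concatenated tokens from all modalities
        # keep_mask: (B, N) bool, True where a token was selected by SFTS
        out, _ = self.attn(tokens, tokens, tokens,
                           key_padding_mask=~keep_mask)  # ignore background keys
        # Pool only the selected tokens into one descriptor per image.
        w = keep_mask.unsqueeze(-1).float()
        return (out * w).sum(dim=1) / w.sum(dim=1).clamp_min(1.0)
```

Stacking such layers, first per modality and then over the concatenated token set, would give the hierarchical within- and cross-modality interaction the abstract describes.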

