1. Introduction
Object re-identification (ReID) aims to retrieve specific objects (e.g., persons, vehicles) across non-overlapping cameras. Over the past few decades, object ReID has advanced significantly. However, traditional object ReID with single-modal input faces substantial challenges [17], particularly in complex visual scenarios such as extreme illumination, thick fog, and low image resolution. These conditions can cause noticeable distortions in critical object regions, disrupting the retrieval process [53]. Therefore, there has been a notable shift toward multi-modal approaches in recent years, which capitalize on diverse data sources to enhance feature robustness in practical applications [43, 44, 53]. However, as illustrated in Fig. 1, previous multi-modal ReID methods typically extract global features from all regions of images in different modalities and subsequently aggregate them. These methods present two key limitations: (1) Within individual modalities, backgrounds introduce additional noise [37], especially in challenging visual scenarios. (2) Across different modalities, backgrounds add overhead to reducing modality gaps, which may amplify the difficulty of aggregating features [15]. Hence, our method prioritizes the selection of object-centric information, aiming to preserve the diverse features of different modalities while minimizing background interference.
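To make this contrast concrete, the following minimal sketch compares the two fusion strategies: pooling over all patch tokens of each modality (background included) versus pooling over an object-centric top-k subset before cross-modal concatenation. The tensor shapes, the top-k rule, and the per-token saliency scores are illustrative assumptions for demonstration, not the actual components of EDITOR.

import torch

def aggregate_global(tokens_per_modality):
    """Previous-style fusion: mean-pool ALL patch tokens (background
    included) in each modality, then concatenate across modalities."""
    pooled = [t.mean(dim=1) for t in tokens_per_modality]  # (B, D) each
    return torch.cat(pooled, dim=-1)                       # (B, 3*D)

def aggregate_selected(tokens_per_modality, scores_per_modality, k=32):
    """Object-centric fusion: keep only the top-k highest-scoring tokens
    per modality before pooling, suppressing background tokens.
    The saliency scores are assumed given here; any spatial- or
    frequency-based criterion could produce them."""
    pooled = []
    for tokens, scores in zip(tokens_per_modality, scores_per_modality):
        idx = scores.topk(k, dim=1).indices                # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        selected = tokens.gather(1, idx)                   # (B, k, D)
        pooled.append(selected.mean(dim=1))                # (B, D)
    return torch.cat(pooled, dim=-1)                       # (B, 3*D)

# Toy usage: RGB / NIR / TIR token sequences from a ViT-style backbone.
B, N, D = 2, 128, 768
tokens = [torch.randn(B, N, D) for _ in range(3)]          # three modalities
scores = [torch.rand(B, N) for _ in range(3)]              # per-token saliency
print(aggregate_global(tokens).shape)                      # torch.Size([2, 2304])
print(aggregate_selected(tokens, scores).shape)            # torch.Size([2, 2304])

Both variants yield a feature of the same dimensionality; the difference is only which tokens contribute, which is precisely where background noise enters the global-aggregation pipeline.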
Figure 1. Comparison of different methods and token selections. (a) Framework of previous methods; (b) Framework of our proposed EDITOR; (c) RGB images; (d) Spatial-based token selection; (e) Multi-modal frequency transform; (f) Frequency-based token selection; (g) Selected tokens in the NIR modality; (h) Selected tokens in the TIR modality.
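As a rough illustration of the frequency-based selection shown in panels (e) and (f), one could score image patches by their high-frequency energy after a 2D FFT, on the intuition that object regions carry richer structure than flat backgrounds. The cutoff and scoring rule below are hypothetical stand-ins, not the multi-modal frequency transform used in the paper.

import torch

def high_freq_patch_scores(image, patch=16, cutoff=4):
    """image: (C, H, W). Returns one score per non-overlapping patch,
    measuring spectral energy outside a centered low-frequency block.
    `cutoff` is an illustrative assumption."""
    C, H, W = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    # patches: (C, H//patch, W//patch, patch, patch)
    spec = torch.fft.fftshift(torch.fft.fft2(patches).abs(), dim=(-2, -1))
    # Zero out the centered low-frequency block; keep high frequencies.
    c = patch // 2
    mask = torch.ones(patch, patch)
    mask[c - cutoff:c + cutoff, c - cutoff:c + cutoff] = 0
    energy = (spec * mask).sum(dim=(-2, -1)).sum(dim=0)    # sum over channels
    return energy.flatten()                                # (num_patches,)

scores = high_freq_patch_scores(torch.randn(3, 256, 128))
print(scores.shape)  # torch.Size([128]) for a 16x8 patch grid

Scores of this kind could feed the top-k selection sketched above; the paper's own transform operates across modalities, which this single-image toy does not attempt to reproduce.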