BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification


Abstract:

Text-based person re-identification (TBPReID) aims to retrieve person images described by a given textual query. In this task, effectively aligning images and texts both globally and locally is a crucial challenge. Recent works have achieved high performance by solving Masked Language Modeling (MLM) to align image and text parts. However, they perform only uni-directional (i.e., from image to text) local-matching, leaving room for improvement by introducing local-matching in the opposite direction (i.e., from text to image). In this work, we introduce the Bidirectional Local-Matching (BiLMa) framework, which jointly optimizes MLM and Masked Image Modeling (MIM) in TBPReID model training. With this framework, our model is trained so that the labels of randomly masked image and text tokens are predicted from the unmasked tokens. In addition, to narrow the semantic gap between image and text in MIM, we propose Semantic MIM (SemMIM), in which the labels of masked image tokens are automatically given by a state-of-the-art human parser. Experimental results demonstrate that our BiLMa framework with SemMIM achieves state-of-the-art Rank@1 and mAP scores on three benchmarks.
Date of Conference: 02-06 October 2023
Date Added to IEEE Xplore: 25 December 2023
Conference Location: Paris, France

1. Introduction

Text-based person re-identification (TBPReID) [11] aims to retrieve a target person from an image pool given a textual query. Since text queries are more user-friendly than image queries, TBPReID is increasingly expected to benefit various surveillance and public-safety applications. The existing literature focuses on how to align images and texts globally [23], [22] and/or locally [10], [4]. In particular, recent works have demonstrated the importance of image-text local-matching [15], [20], and state-of-the-art (SOTA) methods [8], [12], [1] employ Masked Language Modeling (MLM) to align parts between image and text.
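
To make the joint MLM and MIM objective from the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes text tokens are integer vocabulary ids and that each image token carries an integer body-part label of the kind a human parser might supply for SemMIM. The names `random_mask`, `masked_loss`, `uniform_predictor`, and `joint_loss`, the `MASK` id, the masking ratio, and the unweighted sum of the two losses are all hypothetical choices made for illustration; a real model would replace the uniform predictor with a cross-modal encoder.

```python
import math
import random

MASK = -1  # hypothetical id marking a masked position


def random_mask(tokens, ratio, rng):
    """Mask a random subset of positions.

    Returns the masked sequence and a {position: original label} dict.
    """
    n_mask = max(1, round(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]
        masked[p] = MASK
    return masked, targets


def masked_loss(predict, masked, context, targets):
    """Mean cross-entropy over the masked positions only."""
    losses = []
    for position, label in targets.items():
        probs = predict(masked, context, position)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


def uniform_predictor(num_classes):
    """Stand-in for a learned model: always predicts a uniform distribution."""
    def predict(masked, context, position):
        return [1.0 / num_classes] * num_classes
    return predict


def joint_loss(text_ids, part_labels, vocab_size, num_parts, ratio=0.3, seed=0):
    """Sum of a text-side (MLM) and an image-side (MIM) masked-prediction loss."""
    rng = random.Random(seed)
    masked_text, text_targets = random_mask(text_ids, ratio, rng)
    masked_parts, part_targets = random_mask(part_labels, ratio, rng)
    # MLM: predict masked words from the unmasked words plus the image tokens.
    mlm = masked_loss(uniform_predictor(vocab_size), masked_text, masked_parts, text_targets)
    # MIM: predict masked image-token labels from the unmasked image tokens plus the text.
    mim = masked_loss(uniform_predictor(num_parts), masked_parts, masked_text, part_targets)
    return mlm + mim
```

Calling `joint_loss([3, 1, 4, 1, 5], [0, 1, 1, 2, 2, 0], vocab_size=8, num_parts=4)` masks a few positions in each modality and returns the summed loss. Because the stand-in predictor is uniform, the value only reflects the number of classes on each side; the sketch shows the data flow of bidirectional masked prediction rather than any learned behavior.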
