1. Introduction
Text-based person re-identification (TBPReID) [11] aims to retrieve a target person from an image pool given a textual query. Because text queries are more user-friendly than image queries, TBPReID is increasingly expected to benefit applications in surveillance and public safety. Existing literature focuses on aligning images and texts globally [23], [22] and/or locally [10], [4]. In particular, recent works have demonstrated the importance of local image-text matching [15], [20], and state-of-the-art (SOTA) methods [8], [12], [1] employ Masked Language Modeling (MLM) to align corresponding parts of images and texts.