I. Introduction
Image-text retrieval is a fundamental task in vision and language [6]. Given a query in one modality (text or image), the goal is to retrieve the most semantically similar samples from a database of the other modality (image or text). The biggest challenge is to narrow the semantic gap between the two modalities so that the similarity of image-text pairs can be measured accurately.