1. Introduction
With the rapid growth of multimedia data, cross-modal retrieval has become a compelling topic in the multimodal learning community thanks to its flexibility in retrieving semantically relevant samples across distinct modalities, e.g., using an image to query text [6], [16]. However, most existing methods require cleanly annotated training data, which are expensive and time-consuming to collect. Although some unsupervised multimodal learning methods can alleviate this labeling burden, their performance is usually much worse than that of their supervised counterparts [60]. To balance performance and labeling cost, semi-supervised multimodal learning methods have been proposed to exploit labeled and unlabeled data simultaneously for learning common discriminative representations [61], [17]. However, semi-supervised approaches still require a certain amount of cleanly annotated data to achieve reasonable performance.