
Learning Cross-Modal Retrieval with Noisy Labels


Abstract:

Cross-modal retrieval has recently been emerging with the help of deep multimodal learning. However, even for unimodal data, collecting large-scale well-annotated data is expensive and time-consuming, not to mention the additional challenges posed by multiple modalities. Although crowd-sourced annotation, e.g., Amazon's Mechanical Turk, can be utilized to mitigate the labeling cost, non-expert annotation inevitably introduces noise into the labels. To tackle this challenge, this paper presents a general Multi-modal Robust Learning framework (MRL) for learning with multimodal noisy labels, which mitigates noisy samples and correlates distinct modalities simultaneously. Specifically, we propose a Robust Clustering loss (RC) to make the deep networks focus on clean samples instead of noisy ones. In addition, a simple yet effective multimodal loss function, called Multimodal Contrastive loss (MC), is proposed to maximize the mutual information between different modalities, thus alleviating the interference of noisy samples and the cross-modal discrepancy. Extensive experiments on four widely used multimodal datasets demonstrate the effectiveness of the proposed approach in comparison with 14 state-of-the-art methods.
Date of Conference: 20-25 June 2021
Date Added to IEEE Xplore: 02 November 2021
Conference Location: Nashville, TN, USA
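
The page does not give the exact formulation of the Multimodal Contrastive (MC) loss described in the abstract, but a common way to maximize mutual information between modalities is an InfoNCE-style contrastive objective over paired embeddings. The sketch below is a hypothetical illustration only: the function name, temperature value, and the assumption of paired image/text embeddings are ours, not the paper's.

```python
# Hypothetical InfoNCE-style cross-modal contrastive loss (not the paper's
# exact MC formulation). It illustrates maximizing a lower bound on the
# mutual information between paired image and text embeddings.
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) embeddings of paired image/text samples."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric cross-entropy: matched pairs (the diagonal) are the positives.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Under this kind of objective, matched image/text pairs are pulled together while mismatched pairs in the batch are pushed apart, which is one standard way to correlate distinct modalities in a shared embedding space.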

1. Introduction

With the rapid growth of multimedia data, cross-modal retrieval has become a compelling topic in the multimodal learning community due to its flexibility in retrieving semantically relevant samples across distinct modalities, e.g., using an image to query text [6], [16]. However, most existing methods require cleanly annotated training data, which is expensive and time-consuming to collect. Although some unsupervised multimodal learning methods can mitigate this labeling pressure, their performance is usually much worse than that of their supervised counterparts [60]. To balance performance and labeling cost, semi-supervised multimodal learning methods have been proposed to simultaneously utilize labeled and unlabeled data to learn common discriminative representations [61], [17]. However, semi-supervised approaches still require a certain amount of cleanly annotated data to reach reasonable performance.

