Multimodal Contrastive Training for Visual Representation Learning


Abstract:

We develop an approach to learning visual representations that embraces multimodal data, driven by a combination of intra- and inter-modal similarity preservation objectives. Unlike existing visual pre-training methods, which solve a proxy prediction task in a single domain, our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously, hence improving the quality of learned visual representations. By including multimodal training in a unified framework with different types of contrastive losses, our method can learn more powerful and generic visual features. We first train our model on COCO and evaluate the learned visual representations on various downstream tasks including image classification, object detection, and instance segmentation. For example, the visual representations pre-trained on COCO by our method achieve state-of-the-art top-1 validation accuracy of 55.3% on ImageNet classification, under the common transfer protocol. We also evaluate our method on the large-scale Stock images dataset and show its effectiveness on multi-label image tagging and cross-modal retrieval tasks.
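The abstract gives no formulas, so the following is only a minimal sketch of how intra- and inter-modal contrastive terms might be combined in one objective. It assumes an InfoNCE-style loss over in-batch negatives; the function names (`info_nce`, `multimodal_loss`), the temperature of 0.07, and the weights `w_intra` and `w_inter` are illustrative assumptions, not the authors' implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    queries[i] and keys[i] form the positive pair; every other key in the
    batch acts as a negative. The temperature value is an assumption.
    """
    queries = [normalize(q) for q in queries]
    keys = [normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def multimodal_loss(img_a, img_b, txt_a, txt_b, w_intra=1.0, w_inter=1.0):
    """Hypothetical weighted sum of intra- and inter-modal contrastive terms.

    img_a / img_b: embeddings of two augmented views of the same images.
    txt_a / txt_b: embeddings of two views of the paired captions.
    """
    # Intra-modal: agreement between two views within each modality.
    intra = info_nce(img_a, img_b) + info_nce(txt_a, txt_b)
    # Inter-modal: image-to-text and text-to-image agreement.
    inter = info_nce(img_a, txt_a) + info_nce(txt_a, img_a)
    return w_intra * intra + w_inter * inter


if __name__ == "__main__":
    # Toy 2-D embeddings for a batch of two image-caption pairs.
    images = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[0.9, 0.1], [0.1, 0.9]]
    print("aligned pairs:  ", info_nce(images, captions))
    print("shuffled pairs: ", info_nce(images, captions[::-1]))
    print("combined loss:  ", multimodal_loss(images, images, captions, captions))
```

In this toy batch the loss is lower when each image is paired with its own caption than when the captions are shuffled, which is the behaviour a similarity-preserving objective is meant to reward.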
Date of Conference: 20-25 June 2021
Date Added to IEEE Xplore: 02 November 2021
Conference Location: Nashville, TN, USA
Cites in Papers - IEEE (36)

1.
Haiyan Lan, Shujun Li, Mingjie Xie, Xuanjia Zhao, Hongning Liu, Pengming Feng, Dongli Xu, Guangjun He, Jian Guan, "Band Prompting Aided SAR and Multi-Spectral Data Fusion Framework for Local Climate Zone Classification", ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1-5, 2025.
2.
Zahid Ur Rahman, Ju-Hwan Lee, Dang Thanh Vu, Iqbal Murtza, Jin-Young Kim, "DuCo-Net: Dual-Contrastive Learning Network for Medical Report Retrieval Leveraging Enhanced Encoders and Augmentations", IEEE Access, vol.13, pp.27462-27476, 2025.
3.
Yutong Hu, Qingwu Hu, Jiayuan Li, "CMINet: A Unified Cross-Modal Integration Framework for Crop Classification From Satellite Image Time Series", IEEE Transactions on Geoscience and Remote Sensing, vol.63, pp.1-13, 2025.
4.
Rui Cai, Shichao Pei, Xiangliang Zhang, "Zero-Shot Relational Learning for Multimodal Knowledge Graphs", 2024 IEEE International Conference on Big Data (BigData), pp.499-508, 2024.
5.
Justin Chung, Chenrui Zhang, Tingting Chen, "Mobility Scooter Riding Behavior Stability Analysis Based on Multimodal Contrastive Learning", 2024 IEEE International Conference on Big Data (BigData), pp.6439-6445, 2024.
6.
Ali Rasekh, Reza Heidari, Amir Hosein Haji Mohammad Rezaie, Parsa Sharifi Sedeh, Zahra Ahmadi, Prasenjit Mitra, Wolfgang Nejdl, "Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction", IEEE Access, vol.12, pp.174107-174121, 2024.
7.
Ruixuan Liu, Hong Kyu Lee, Sivasubramanium V Bhavani, Xiaoqian Jiang, Lucila Ohno-Machado, Li Xiong, "Patient-Centered and Practical Privacy to Support AI for Healthcare", 2024 IEEE 6th International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA), pp.265-272, 2024.
8.
Vaishnavi Khindkar, Vineeth Balasubramanian, Chetan Arora, Anbumani Subramanian, C.V. Jawahar, "Can Reasons Help Improve Pedestrian Intent Estimation? A Cross-Modal Approach", 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.11515-11522, 2024.
9.
Yunkang Zhang, Ziyu Wu, Zhen Liang, Fangting Xie, Quan Wan, Mingjie Zhao, Xiaohui Cai, "Contrastive Learning-Based User Identification with Limited Data on Smart Textiles", 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp.2820-2825, 2024.
10.
Qian Zhang, Lin Zhang, Ran Song, Runmin Cong, Yonghuai Liu, Wei Zhang, "Learning Common Semantics via Optimal Transport for Contrastive Multi-View Clustering", IEEE Transactions on Image Processing, vol.33, pp.4501-4515, 2024.
11.
Lixia Ji, Shijie Xiao, Bingzhi Xu, Han Zhang, "Transferrable DP-Adapter Tuning: A Privacy-Preserving Multimodal Parameter-Efficient Fine-Tuning Framework", 2024 IEEE 24th International Conference on Software Quality, Reliability and Security (QRS), pp.471-482, 2024.
12.
Ye Zhu, Yu Wu, Nicu Sebe, Yan Yan, "Vision + X: A Survey on Multimodal Learning in the Light of Data", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.46, no.12, pp.9102-9122, 2024.
13.
Yan Feng, Alexander Carballo, Yingjie Niu, Kazuya Takeda, "Contrasting Disentangled Partial Observations for Pedestrian Action Prediction", 2024 IEEE Intelligent Vehicles Symposium (IV), pp.2828-2833, 2024.
14.
Vedant Dave, Fotios Lygerakis, Elmar Rueckert, "Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training", 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.8013-8020, 2024.
15.
Bin Liang, Lin Gui, Yulan He, Erik Cambria, Ruifeng Xu, "Fusion and Discrimination: A Multimodal Graph Contrastive Learning Framework for Multimodal Sarcasm Detection", IEEE Transactions on Affective Computing, vol.15, no.4, pp.1874-1888, 2024.
16.
Haochen Han, Qinghua Zheng, Minnan Luo, Kaiyao Miao, Feng Tian, Yan Chen, "Noise-Tolerant Learning for Audio-Visual Action Recognition", IEEE Transactions on Multimedia, vol.26, pp.7761-7774, 2024.
17.
Jiahao Zheng, Yu Tang, Anthony Huang, Dapeng Wu, "Hierarchical Multivariate Representation Learning for Face Sketch Recognition", IEEE Transactions on Emerging Topics in Computational Intelligence, vol.8, no.2, pp.2037-2049, 2024.
18.
Meiyu Liang, Yawen Li, Yang Yu, Xiaowen Cao, Zhe Xue, Ang Li, Kangkang Lu, "Structures Aware Fine-Grained Contrastive Adversarial Hashing for Cross-Media Retrieval", IEEE Transactions on Knowledge and Data Engineering, vol.36, no.7, pp.3514-3528, 2024.
19.
Yifei Zhang, Chang Liu, Yu Zhou, Weiping Wang, Qixiang Ye, Xiangyang Ji, "Beyond Instance Discrimination: Relation-Aware Contrastive Self-Supervised Learning", IEEE Transactions on Multimedia, vol.26, pp.4628-4640, 2024.
20.
Cong Ma, Xu Han, Linghui Wu, Yaping Zhang, Yang Zhao, Yu Zhou, Chengqing Zong, "Modal Contrastive Learning Based End-to-End Text Image Machine Translation", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.32, pp.2153-2165, 2024.
21.
Jiaming Liu, Yue Wu, Maoguo Gong, Zhixiao Liu, Qiguang Miao, Wenping Ma, "Inter-Modal Masked Autoencoder for Self-Supervised Learning on Point Clouds", IEEE Transactions on Multimedia, vol.26, pp.3897-3908, 2024.
22.
Zhangxiang Shi, Tianzhu Zhang, Xi Wei, Feng Wu, Yongdong Zhang, "Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching", IEEE Transactions on Image Processing, vol.33, pp.1326-1337, 2024.
23.
Jun Long, Junkun Hong, Zidong Wang, Tingxuan Chen, Yunfei Chen, Liu Yang, "SPHASE: Multi-Modal and Multi-Branch Surgical Phase Segmentation Framework based on Temporal Convolutional Network", 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp.586-593, 2023.
24.
Ziyang Luo, Pu Zhao, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang, "LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval", 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp.11172-11183, 2023.
25.
Xiangyang Zhu, Renrui Zhang, Bowei He, Aojun Zhou, Dong Wang, Bin Zhao, Peng Gao, "Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement", 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp.2605-2615, 2023.
26.
Burak Uzkent, Amanmeet Garg, Wentao Zhu, Keval Doshi, Jingru Yi, Xiaolong Wang, Mohamed Omar, "Dynamic Inference with Grounding Based Vision and Language Models", 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.2624-2633, 2023.
27.
Tiantian Gong, Junsheng Wang, Liyan Zhang, "Rethink Pair-Wise Self-Supervised Cross-Modal Retrieval From A Contrastive Learning Perspective", ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1-5, 2023.
28.
Qing-Ling Guan, Yuze Zheng, Lei Meng, Li-Quan Dong, Qun Hao, "Improving the Generalization of Visual Classification Models Across IoT Cameras via Cross-Modal Inference and Fusion", IEEE Internet of Things Journal, vol.10, no.18, pp.15835-15846, 2023.
29.
Liu Yang, Zhenjie Wu, Junkun Hong, Jun Long, "MCL: A Contrastive Learning Method for Multimodal Data Fusion in Violence Detection", IEEE Signal Processing Letters, vol.30, pp.408-412, 2023.
30.
Jeonghyeok Do, Munchurl Kim, "Multi-modal Transformer for Indoor Human Action Recognition", 2022 22nd International Conference on Control, Automation and Systems (ICCAS), pp.1155-1160, 2022.

Cites in Papers - Other Publishers (44)

1.
Saman Motamed, Danda Pani Paudel, Luc Van Gool, "Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models", Computer Vision – ECCV 2024, vol.15073, pp.116, 2025.
2.
Bo Li, Zhiwei Xu, Jing Yun, Jiatai Wang, "Balancing Complementarity and Consistency via Delayed Activation in Incomplete Multi-view Clustering", Pattern Recognition and Computer Vision, vol.15040, pp.531, 2025.
3.
Huiqun Wang, Yiping Bao, Panwang Pan, Zeming Li, Xiao Liu, Ruijie Yang, Di Huang, "Multi-modal Relation Distillation for Unified 3D Representation Learning", Computer Vision – ECCV 2024, vol.15091, pp.364, 2025.
4.
Yihan Zhao, Wei Xi, Gairui Bai, Xinhui Liu, Jizhong Zhao, "Robust Contrastive Learning Against Audio-Visual Noisy Correspondence", Pattern Recognition and Computer Vision, vol.15035, pp.526, 2025.
5.
Seokha Moon, Hyun Woo, Hongbeen Park, Haeji Jung, Reza Mahjourian, Hyung-gun Chi, Hyerin Lim, Sangpil Kim, Jinkyu Kim, "VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions", Computer Vision – ECCV 2024, vol.15064, pp.361, 2025.
6.
Yeon-Seung Choo, Boeun Kim, Hyun-Sik Kim, Yong-Suk Park, "Supervised Contrastive Learning for 3D Cross-Modal Retrieval", Applied Sciences, vol.14, no.22, pp.10322, 2024.
7.
Shichao Wu, Yongru Wang, Yushan Jiang, Qianyi Zhang, Jingtai Liu, "CRATI: Contrastive representation-based multimodal sound event localization and detection", Knowledge-Based Systems, pp.112692, 2024.
8.
Huinian Li, Baoyu Chen, Jingjia Chen, Shuting Li, Feiyong He, Hu Yingbiao, "ITIMCA: Image-Text Information and Cross-Attention for Multi-Modal Cassava Leaf Disease Classification Based on a Novel Multi-Modal Dataset in Natural Environments", Crop Protection, pp.106981, 2024.
9.
Kaihuai Ran, Yongxin Yang, Ye Yan, Xiaoming Jiang, "Automated diagnosis of renal sinus invasion in renal cell carcinoma with contrastive learning and proxy task", International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), pp.259, 2024.
10.
Guojia An, Jing Sun, Yuhan Yang, Fuming Sun, "Enhancing Collaborative Information with Contrastive Learning for Session-based Recommendation", Information Processing & Management, vol.61, no.4, pp.103738, 2024.
11.
Biagio Grasso, Valerio La Gatta, Vincenzo Moscato, Giancarlo Sperlì, "KERMIT: Knowledge-EmpoweRed model in harmful meme deTection", Information Fusion, pp.102269, 2024.
12.
Yi Li, Qingmeng Zhu, Hao He, Ziyin Gu, Changwen Zheng, "MOC: Multi-modal Sentiment Analysis via Optimal Transport and Contrastive Interactions", Neural Information Processing, vol.14448, pp.439, 2024.
13.
Ke Wang, Yanmin Zhu, Tianzi Zang, Chunyang Wang, Kuan Liu, Peibo Ma, "Multi-aspect Graph Contrastive Learning for Review-enhanced Recommendation", ACM Transactions on Information Systems, vol.42, no.2, pp.1, 2024.
14.
Xiaohan Xing, Zhen Chen, Yuenan Hou, Yixuan Yuan, "Gradient modulated contrastive distillation of low-rank multi-modal knowledge for disease diagnosis", Medical Image Analysis, pp.102874, 2023.
15.
Qinglang Guo, Yong Liao, Zhe Li, Shenglin Liang, "Multi-Modal Representation via Contrastive Learning with Attention Bottleneck Fusion and Attentive Statistics Features", Entropy, vol.25, no.10, pp.1421, 2023.
16.
Yuxi Liu, Zhenhao Zhang, Shaowen Qin, Flora D. Salim, Antonio Jimeno Yepes, "Contrastive Learning-Based Imputation-Prediction Networks for In-hospital Mortality Risk Modeling Using EHRs", Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track, vol.14174, pp.428, 2023.
17.
Yeonju Park, Sangmin Woo, Sumin Lee, Muhammad Adi Nugroho, Changick Kim, "Cross-modal alignment and translation for missing modality action recognition", Computer Vision and Image Understanding, pp.103805, 2023.
18.
Son D. Dao, He Zhao, Dinh Phung, Jianfei Cai, "Contrastively enforcing distinctiveness for multi-label image classification", Neurocomputing, pp.126605, 2023.
19.
Sijie Mai, Ying Zeng, Haifeng Hu, "Learning from the global view: Supervised contrastive learning of multimodal representation", Information Fusion, pp.101920, 2023.
20.
Yiyi Cao, Lei Chen, Yuan Yuan, Guangling Sun, "Cucumber disease recognition with small samples using image-text-label-based multi-modal language model", Computers and Electronics in Agriculture, vol.211, pp.107993, 2023.
21.
Lingling Xu, Haoran Xie, Zongxi Li, Fu Lee Wang, Weiming Wang, Qing Li, "Contrastive Learning Models for Sentence Representations", ACM Transactions on Intelligent Systems and Technology, vol.14, no.4, pp.1, 2023.
22.
Heesoo Won, Byungkook Oh, Hyeongjun Yang, Kyong-Ho Lee, "Cross-modal contrastive learning for aspect-based recommendation", Information Fusion, pp.101858, 2023.
23.
Yujue Cai, Xiubao Sui, Guohua Gu, "Multi-modal multi-task feature fusion for RGBT tracking", Information Fusion, pp.101816, 2023.
24.
Cheng Huang, Jinrong Cui, Yulu Fu, Dong Huang, Min Zhao, Lusi Li, "Incomplete multi-view clustering network via nonlinear manifold embedding and probability-induced loss", Neural Networks, 2023.
25.
Semih Gunel, Florian Aymanns, Sina Honari, Pavan Ramdya, Pascal Fua, "Overcoming the Domain Gap in Neural Action Representations", International Journal of Computer Vision, vol.131, no.3, pp.813, 2023.
26.
Sijie Mai, Ya Sun, Ying Zeng, Haifeng Hu, "Excavating multimodal correlation for representation learning", Information Fusion, vol.91, pp.542, 2023.
27.
Jiatai Wang, Zhiwei Xu, Xuewen Yang, Dongjin Guo, Limin Liu, "Self-supervised image clustering from multiple incomplete views via constrastive complementary generation", IET Computer Vision, vol.17, no.2, pp.189, 2023.
28.
Abdullah-Al-Zubaer Imran, Sen Wang, Debashish Pal, Sandeep Dutta, Evan Zucker, Adam Wang, "Multimodal Contrastive Learning for Prospective Personalized Estimation of CT Organ Dose", Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, vol.13431, pp.634, 2022.
29.
Georgios Paraskevopoulos, Petros Pistofidis, Georgios Banoutsos, Efthymios Georgiou, Vassilis Katsouros, "Multimodal Classification of Safety-Report Observations", Applied Sciences, vol.12, no.12, pp.5781, 2022.
30.
Jiangfeng Li, Zijian Zhang, Bowen Wang, Qinpei Zhao, Chenxi Zhang, "Inter- and Intra-Modal Contrastive Hybrid Learning Framework for Multimodal Abstractive Summarization", Entropy, vol.24, no.6, pp.764, 2022.