
Rethinking Multi-View Representation Learning via Distilled Disentangling


Abstract:

Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain, highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To this end, we propose an innovative framework for multi-view representation learning, which incorporates a technique we term ‘distilled disentangling’. Our method introduces the concept of masked cross-view prediction, enabling the extraction of compact, high-quality view-consistent representations from various sources without incurring extra computational overhead. Additionally, we develop a distilled disentangling module that efficiently filters out consistency-related information from multi-view representations, resulting in purer view-specific representations. This approach significantly reduces redundancy between view-consistent and view-specific representations, enhancing the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover, we find that reducing the dimensionality of view-consistent representations relative to that of view-specific representations further refines the quality of the combined representations. Our code is accessible at: https://github.com/Guanzhou-Ke/MRDD.
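As a rough illustration of the masked cross-view prediction idea described in the abstract, the PyTorch sketch below masks a random fraction of one view's features and reconstructs the masked portion from the other view's encoding. This is a minimal toy example, not the authors' implementation; the module names, dimensions, and simple linear encoders are assumptions made for brevity (the released code at the repository above is authoritative).

```python
# Toy sketch of masked cross-view prediction (not the authors' implementation).
# A random fraction of view-1 features is masked; the masked part is predicted
# from view-2's encoding, encouraging the encodings to capture shared content.
import torch
import torch.nn as nn

class MaskedCrossViewPredictor(nn.Module):
    def __init__(self, in_dim=128, hidden_dim=64, mask_ratio=0.7):
        super().__init__()
        self.mask_ratio = mask_ratio              # the paper reports higher ratios help
        self.encoder_v1 = nn.Linear(in_dim, hidden_dim)
        self.encoder_v2 = nn.Linear(in_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def forward(self, x1, x2):
        mask = (torch.rand_like(x1) < self.mask_ratio).float()   # 1 = masked
        z1 = torch.relu(self.encoder_v1(x1 * (1.0 - mask)))
        z2 = torch.relu(self.encoder_v2(x2))
        x1_pred = self.decoder(z2)                # predict view 1 from view 2
        recon_loss = ((x1_pred - x1) * mask).pow(2).mean()
        return z1, z2, recon_loss

# Toy usage: two 128-d views of the same batch of 32 samples.
x1, x2 = torch.randn(32, 128), torch.randn(32, 128)
z1, z2, loss = MaskedCrossViewPredictor()(x1, x2)
print(loss.item())
```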
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA

Funding Agency:

Beijing Jiaotong University
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Nanjing University of Science and Technology
Singapore Management University

1. Introduction

Multi-view representation learning (MvRL) [46] forms the cornerstone of various multi-view applications, such as video understanding [7], [17], 3D rendering [62], and cross-modal retrieval [35]. In the MvRL context, “views” commonly refer to the distinct angles from which cameras capture an object, or to different data descriptors, such as the histogram of oriented gradients (HOG) [9] and the scale-invariant feature transform (SIFT) [32]. The success of multi-view applications relies on effectively leveraging the information shared among views (consistency) and the information distinctive to each view (specificity). However, learning high-quality view-consistent and view-specific representations from multiple sources remains an open challenge.
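To make the notion of descriptor-based “views” concrete, the short Python sketch below builds two views of the same image, one from HOG and one from SIFT descriptors. It is an illustration only: the file name and the mean-pooling of SIFT descriptors are assumptions, and this is not the preprocessing pipeline used in the paper.

```python
# Illustration only: build two descriptor-based "views" of a single image.
# Requires scikit-image and opencv-python; "example.jpg" is a placeholder path.
import cv2
import numpy as np
from skimage.feature import hog

image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)

# View 1: HOG feature vector (gradient-orientation histograms over local cells).
view_hog = hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

# View 2: SIFT keypoint descriptors, mean-pooled into one 128-d vector.
sift = cv2.SIFT_create()
_, descriptors = sift.detectAndCompute(image, None)
view_sift = descriptors.mean(axis=0) if descriptors is not None else np.zeros(128)

print(view_hog.shape, view_sift.shape)   # two heterogeneous views of one object
```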

Figure 1. Existing multi-view representation learning methods show high inter-view correlations. We estimate the mutual information between the view-consistent and view-specific representations of three baseline MvRL models (DVIB [3], CONAN [18], and Multi-VAE [56]) and of our method, using MINE [4] under the same settings across five datasets.
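MINE [4] trains a small “statistics network” to maximize the Donsker-Varadhan lower bound on mutual information. The sketch below is a generic re-implementation of that bound for two batches of representations; it is not the evaluation script behind the figure, and the network size, optimizer, and toy data are assumptions.

```python
# Generic MINE-style mutual information estimate (Donsker-Varadhan bound).
# A sketch for illustration; not the evaluation code behind the figure.
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    def __init__(self, dim_x, dim_y, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1)).squeeze(1)

def mine_lower_bound(T, x, y):
    joint = T(x, y).mean()                              # E_p(x,y)[T(x,y)]
    y_shuffled = y[torch.randperm(y.size(0))]           # break pairing -> marginals
    marginal = torch.logsumexp(T(x, y_shuffled), dim=0) - math.log(x.size(0))
    return joint - marginal                             # lower bound on I(X;Y)

# Toy usage: MI between consistent (zc) and specific (zs) representation batches.
zc, zs = torch.randn(256, 32), torch.randn(256, 64)
T = StatisticsNetwork(32, 64)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = -mine_lower_bound(T, zc, zs)                 # maximize the bound
    loss.backward()
    opt.step()
print("estimated MI lower bound:", -loss.item())
```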

