1. Introduction
Multi-view representation learning (MvRL) [46] forms the cornerstone of various multi-view applications, such as video understanding [7], [17], 3D rendering [62], and cross-modal retrieval [35]. In the MvRL context, "views" commonly refer either to the distinct angles from which cameras capture an object or to different data descriptors, such as the histogram of oriented gradients (HOG) [9] and the scale-invariant feature transform (SIFT) [32]. The success of multi-view applications relies on effectively leveraging shared information (consistency) among views and distinctive information (specificity) within each view. However, learning high-quality view-consistent and view-specific representations from multiple sources remains an open challenge.
Existing multi-view representation learning methods exhibit high inter-view correlations. We estimate the mutual information of the multi-view consistency and specificity representations learned by three baseline MvRL models (DVIB [3], CONAN [18], and Multi-VAE [56]) and by our method, using MINE [4] under the same settings across five datasets.
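For context on the measurement itself, the following is a minimal sketch of how a MINE-style estimator bounds the mutual information between two sets of paired representations via the Donsker-Varadhan objective. The network architecture, dimensions, and variable names are illustrative assumptions and do not reflect the exact configuration used in our experiments.

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """Statistics network T(x, y) for the Donsker-Varadhan bound."""
    def __init__(self, dim_x: int, dim_y: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1))

def mine_lower_bound(T, x, y):
    """Donsker-Varadhan lower bound on I(X; Y):
    E_{p(x,y)}[T(x,y)] - log E_{p(x)p(y)}[exp(T(x,y'))]."""
    joint_term = T(x, y).mean()                    # paired samples approximate the joint
    y_shuffled = y[torch.randperm(y.size(0))]      # shuffling breaks pairing (product of marginals)
    marginal_term = torch.logsumexp(T(x, y_shuffled), dim=0) - math.log(x.size(0))
    return joint_term - marginal_term

# Hypothetical embeddings standing in for two learned representations.
x_repr = torch.randn(512, 64)
y_repr = torch.randn(512, 64)

T = StatisticsNetwork(64, 64)
optimizer = torch.optim.Adam(T.parameters(), lr=1e-4)
for _ in range(500):                               # maximize the bound w.r.t. T's parameters
    optimizer.zero_grad()
    loss = -mine_lower_bound(T, x_repr, y_repr)
    loss.backward()
    optimizer.step()

print("MI estimate (nats):", mine_lower_bound(T, x_repr, y_repr).item())
```

The converged value of the bound serves as the mutual information estimate for a given pair of representations; comparing such estimates across methods is what underlies the inter-view correlation comparison described above.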