I. Introduction
As multi-view data, such as images, textual descriptions, and videos, continues to grow rapidly, there is increasing demand for multi-view learning approaches that serve a wide range of applications, including multimedia retrieval [1], [2], [3], [4], image annotation [5], heterogeneous face recognition [6], and cross-view retrieval [7], [8]. A fundamental challenge in multi-view learning is measuring the similarity between samples from different views, a difficulty commonly known as the ‘heterogeneous gap’ [1], [9], [10].