I. Introduction
A landscape can be painted or photographed. Different recording methods and different sensors yield rich multi-view data. Recently, multi-view image retrieval has gained significant importance owing to its substantial commercial value. Matching multi-view data, however, is a formidable challenge because of the inherent heterogeneous gap between views. A typical example of the multi-view data matching problem is Face Sketch Recognition (FSR), which plays a crucial role in law enforcement [1]. When a crime occurs, law enforcement officials often find that the crime scene surveillance video has poor resolution, or that the surveillance equipment is missing or has been vandalized by the suspect. These factors make it difficult to obtain a clear facial image of the suspect. In such cases, a common approach is to create a sketch of the suspect based on witnesses' descriptions. The sketch is then compared with the police mugshot database, one entry at a time. A successfully matched mugshot is pivotal to a criminal investigation. However, the heterogeneous gap between photos and sketches makes FSR difficult, and conventional face recognition methods are unable to bridge this gap [2]. Over the past few decades, many scholars have devoted their efforts to FSR. The proposed methods mostly fall into two categories: photo-sketch generation (PSG) [3], [4], [5], [6], [7] and multi-view representation learning (MRL) [8], [9], [10], [11].