1. Introduction
Sketch and text are the two most common input modalities for image retrieval [11], [59]. Which one is preferable depends on the retrieval problem, and in particular on whether fine-grained distinctions are required [18], [59], [60], [69]. For inter-category retrieval, text is the dominant modality, as exemplified by widely used platforms such as Google Images. For fine-grained image retrieval, however, sketches take the spotlight [11], [59], [60]: they promise to capture fine-grained visual cues that are cumbersome or even impossible to express in text [11]. Research in this domain predominantly focuses on exploiting the unique qualities of sketches, such as style [54] and abstraction [32], among others [4], [18], [59].