1 Introduction
Many definitions of intelligence have been formulated by psychologists and learning researchers over the years. Despite their differences, they all identify the ability to adapt and achieve goals under a wide range of conditions as a key component [1]. Artificial intelligence inherits these definitions, with the most recent research demonstrating the importance of knowledge transfer and domain generalization [18]. Indeed, in many practical applications the underlying distributions of training (i.e., source) and test (i.e., target) data inevitably differ, calling for robust and adaptable solutions.

When dealing with visual domains, most current strategies are based on supervised learning. These methods search for semantic spaces that capture the essential knowledge in the data regardless of the specific appearance of the input images: some decouple image style from the shared object content [7], others generate new samples [75] or impose adversarial conditions to reduce feature discrepancy [46], [48].

With the analogous aim of obtaining general-purpose feature embeddings, an alternative research direction is pursued by self-supervised learning, which captures visual invariances and regularities by solving tasks that need no data annotation, such as recognizing image orientation [30] or colorizing images [84]; a minimal sketch of the former is given below. Unlabeled data are largely available and by their very nature less prone to bias (there is no labeling bias issue [72]), so they seem the perfect candidate for providing visual information independent of specific domain styles. However, their potential has not been fully exploited: existing self-supervised approaches often come with tailored architectures that need dedicated fine-tuning strategies to re-engineer the acquired knowledge [60]. Moreover, they are mainly applied to real-world photos, without considering cross-domain scenarios with images of paintings or sketches.
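To make the idea of an annotation-free pretext task concrete, the sketch below trains a network to predict by which multiple of 90 degrees an unlabeled image was rotated, in the spirit of the orientation-recognition task of [30]. The tiny model, random data, and hyperparameters are illustrative assumptions, not the configuration used in the cited work.

```python
# Sketch of a self-supervised pretext task: the rotation index (0..3) acts
# as a free label derived from the data itself, so no human annotation is
# needed. Architecture and training details are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationNet(nn.Module):
    """Tiny convolutional classifier over the 4 rotation classes."""
    def __init__(self, num_rotations: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_rotations)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def rotate_batch(images: torch.Tensor):
    """Rotate each image by a random multiple of 90 degrees.

    Returns the rotated batch and the rotation indices, which serve as
    self-supervised labels.
    """
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, labels)]
    )
    return rotated, labels

model = RotationNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One illustrative training step on a random batch standing in for
# unlabeled photos.
images = torch.randn(16, 3, 64, 64)
rotated, labels = rotate_batch(images)
loss = F.cross_entropy(model(rotated), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Solving such a task forces the features to encode object shape and layout rather than domain-specific appearance, which is precisely the property sought for generalizing across domains.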