1. Introduction
Learning representations that transfer successfully to a variety of downstream tasks without manual human annotation has been a long-standing goal in machine learning [17], [48]. Self-supervised learning (SSL) aims to learn such representations discriminatively through pretext tasks such as identifying the relative position of image patches [14] or solving jigsaw puzzles [36]. The recent success of SSL methods [23], [6], [8] builds on contrastive learning, where representations are learned in a latent space that is invariant to various image transformations such as cropping, blurring, and colour jittering. Contrastively learned representations have been shown to achieve performance on par with their supervised counterparts when transferred to various vision tasks, including image classification, object detection, and semantic segmentation [7], [22], and have been extended to medical imaging [1] as well as multi-view [45] and multi-modal learning [37].
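To make the contrastive objective concrete, the following is a minimal NumPy sketch of a loss in the style of NT-Xent / InfoNCE, where two augmented views of each image form a positive pair and all other images in the batch serve as negatives. The function name, batch layout, and temperature value are illustrative, not taken from any specific method cited above.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross-entropy (NT-Xent) sketch.

    z1, z2: (N, d) embeddings of two augmented views of the same N images.
    Each embedding is pulled toward its positive (the other view of the
    same image) and pushed away from the 2N - 2 negatives in the batch.
    """
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize rows
    sim = z @ z.T / temperature                        # scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    # Row i's positive sits at index i + N (mod 2N).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Numerically stable log-softmax over each row.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Because the loss is computed on L2-normalized embeddings, it depends only on angles in the latent space, which is what makes the representation invariant to the nuisance transformations applied during augmentation.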