1. Introduction
Self-supervised learning (SSL) approaches learn generic feature representations from data in the absence of any external supervision. These approaches often solve an instance discrimination pretext task, in which multiple transformations of the same image are required to produce similar learned features. Recent SSL methods have shown remarkable promise in global tasks such as image classification, where simple classifiers are trained on the features learned via instance discrimination [1], [2], [4], [17], [18]. However, global feature-learning SSL approaches do not explicitly retain spatial information, which renders them ill-suited for semi-global tasks such as object detection and instance and semantic segmentation [37], [43].