1. Introduction
Visual representation learning is crucial for many computer vision tasks, including image classification [9], [50], [27], [30], tagging [16], [23], object detection [17], [47], [40], and semantic and instance segmentation [41], [26]. Supervised pre-training on large-scale datasets [9] yields useful visual features that lead to state-of-the-art performance on these tasks. Yet the fine-grained class labeling effort that such datasets require [9] is prohibitively expensive. Self-supervised learning methods [4], [12], [59], [25], [5], [6] do not require any annotations, but they still require either extremely large training sets or longer training schedules.