1. Introduction
Unsupervised representation learning is highly successful in natural language processing, e.g., as shown by GPT [50], [51] and BERT [12]. But supervised pre-training remains dominant in computer vision, where unsupervised methods generally lag behind. The reason may stem from differences in their respective signal spaces. Language tasks have discrete signal spaces (words, sub-word units, etc.) on which tokenized dictionaries can be built, and unsupervised learning can be based on these dictionaries. Computer vision, in contrast, must further concern itself with dictionary building [54], [9], [5], because the raw signal lies in a continuous, high-dimensional space and is not structured for human communication (unlike words, for example).
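To make this contrast concrete, consider the minimal sketch below (an illustrative aside, not part of the original argument; the toy vocabulary and image shape are assumptions): a sentence maps naturally onto discrete indices into a finite dictionary, whereas an image is a large tensor of continuous values with no ready-made tokenization, so any "dictionary" for vision must itself be learned.

```python
import numpy as np

# Language: the signal space is discrete, so a tokenized
# dictionary exists essentially by construction.
# (Toy vocabulary assumed for illustration.)
vocab = {"unsupervised": 0, "learning": 1, "works": 2}
sentence = "unsupervised learning works"
tokens = [vocab[word] for word in sentence.split()]
print(tokens)  # [0, 1, 2] -- discrete indices into a finite dictionary

# Vision: the raw signal is continuous and high-dimensional.
# A 224x224 RGB image is ~150k real values with no natural
# token boundaries, so a dictionary must be learned from data.
image = np.random.rand(224, 224, 3).astype(np.float32)
print(image.size)  # 150528 continuous values, no finite vocabulary
```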