1. Introduction
Representation learning aims to extract latent or semantic information from raw data. Typically, a model is first pre-trained on a large-scale annotated dataset [34] and then fine-tuned on a smaller dataset for a downstream task [25]. As models grow larger and deeper [26], [29], they need ever more annotated data, to the point that supervised pre-training is no longer viable.