1. Introduction
In recent years, large-scale pre-trained models [5], [13], [27], [30], [37], [55], [65], [91] have swept a variety of computer vision tasks with their strong performance. To adequately train large models with billions of parameters, researchers design various annotation-free pretext tasks and collect sufficiently large amounts of data from diverse modalities and sources. In general, existing large-scale pre-training strategies fall into three types: supervised learning [20], [65] on pseudo-labeled data (e.g., JFT-300M [85]), weakly supervised learning [37], [55] on web-crawled image-text pairs (e.g., LAION-400M [56]), and self-supervised learning [5], [13], [27], [30], [91] on unlabeled images. Supported by massive data, each of these strategies has its own advantages and has proven effective for large models on different tasks. In pursuit of stronger representations, some recent approaches [47], [78], [81] combine the advantages of these strategies by applying different proxy tasks at different training stages, significantly pushing the performance boundaries of various vision tasks.
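To make the weakly supervised branch concrete: methods trained on web-crawled image-text pairs typically optimize a symmetric contrastive objective that pulls each image toward its own caption and pushes it away from the other captions in the batch. The sketch below is a minimal NumPy illustration of such a CLIP-style loss; the function name, shapes, and temperature value are illustrative assumptions, not details taken from the papers cited above.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (illustrative sketch)."""
    # L2-normalize embeddings so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Pairwise similarity logits; matching image-text pairs sit on the diagonal.
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Row-wise softmax cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
img_batch = rng.normal(size=(4, 8))  # toy batch: 4 pairs, 8-dim embeddings
txt_batch = rng.normal(size=(4, 8))
loss = clip_style_loss(img_batch, txt_batch)
```

In practice the embeddings come from separate image and text encoders trained jointly, and the temperature is usually a learned parameter; this toy version only shows the shape of the objective.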