1. Introduction
Learning generalized representations from unlabeled data is challenging across many fields, yet Self-Supervised Learning (SSL) has recently demonstrated remarkable success in learning semantically invariant representations without labels [40], [41], [53]. SSL methods fall into two main types according to the pretext task: generative and discriminative. Generative SSL reconstructs altered or distorted data back to its original form [9], [28], [31], [59], [65], [71], whereas early discriminative SSL predicted hand-designed labels, yielding task-specific representations with limited generalizability [25], [57], [75]. More recent discriminative SSL trains the model to identify similarities and differences between pairs of augmented examples [7], [10], [11], [26], [29], [74], as sketched below. The success of SSL in deep image models has spurred progress in other data modalities [52], [53], [54], [61], [62] and in attention-based models such as transformers [8], [12], [49], [72].

Recent discriminative SSL aims to learn content- and semantic-invariant representations that are robust to data augmentations. However, the learned representations can become unstable when a single subtle factor of the data is shifted to a value not covered by the training augmentations. Since incorporating every possible subtle change during training is prohibitively expensive, insights are needed to uncover the root cause of this instability and to find a solution that prevents performance deterioration at inference time. Figure 1 summarizes this deterioration effect.
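To make the pair-based discriminative objective concrete, the following is a minimal sketch of a SimCLR-style NT-Xent contrastive loss in PyTorch; the function name, temperature value, and embedding sizes are illustrative assumptions and are not taken from the methods cited above.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over a batch of augmented pairs.

    z1[i] and z2[i] are embeddings of two augmentations of the same input;
    every other embedding in the batch serves as a negative.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2n, d), unit-norm
    sim = z @ z.t() / temperature                       # scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # exclude self-similarity
    # The positive for row i (i < n) sits at index i + n, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage example with random tensors standing in for encoder outputs of two views.
if __name__ == "__main__":
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    print(nt_xent_loss(z1, z2).item())
```

Because the objective only pulls together views produced by the chosen augmentations, factors of variation that the augmentations never touch are left unconstrained, which is the instability discussed above.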