1. INTRODUCTION
Self-supervised learning (SSL) has emerged as a significant breakthrough in speech processing, offering an appealing strategy for harnessing large amounts of unlabeled data. SSL-based models have proven effective in downstream tasks such as automatic speech recognition (ASR) [1]–[7], and they have also been successfully applied to other speech tasks, including speech emotion recognition (SER) [8]–[10]. Recognizing emotions from speech plays a major role in natural human-computer interaction [11], [12], so improving SER solutions is important before they are deployed in practical applications. Although recent SER studies have reported significant gains from SSL-based models, very few have examined how these models perform under domain conditions that differ from those seen during training. Generalization across different domain conditions remains one of the major barriers to a deployable SER system; hence, it is necessary to assess the performance of SSL-based models across varying domain conditions.