I. Introduction
Speech emotion recognition (SER) is a subdiscipline of affective computing, the field concerned with the research and development of systems that can identify and interpret human emotions expressed through speech and other modalities [1]. The objective of SER is to automatically identify and categorize emotions expressed in speech, enabling applications such as clinical diagnostics, psychological research, and human-computer interaction [2]. Human communication conveys not only information but also emotion, which is vital for interpreting the speaker's intentions, disposition, and emotional state. The range of emotions expressed in speech is broad, encompassing fear, surprise, happiness, sadness, and anger, as well as more complex emotional states [3]. Accurately recognizing these emotions from speech signals is difficult owing to the considerable variability in linguistic content, speaker characteristics, cultural influences, and environmental conditions.