I. Introduction
In human-machine interaction (HMI), we are still far from fully natural communication with machines, largely because it is difficult for machines to interpret paralinguistic information conveyed in spoken language, such as emotion. Speech emotion recognition (SER), which aims to classify a speaker's emotional state from speech signals, is therefore an essential task for making HMI more natural and realistic. Although SER has been widely studied and continues to attract researchers' attention, the performance of SER systems developed so far remains relatively low, especially for spontaneous conversational speech. Consequently, improving SER performance is a crucial problem to be solved in the HMI research area.