Improve Accuracy of Speech Emotion Recognition with Attention Head Fusion


Abstract:

Speech Emotion Recognition (SER) refers to the use of machines to recognize a speaker's emotions from their speech. SER has broad application prospects in fields such as criminal investigation and medical care. However, the complexity of emotion makes it hard to recognize, and current SER models still do not recognize human emotions accurately. In this paper, we propose a multi-head self-attention based method, which we call head fusion, to improve the recognition accuracy of SER. With this method, an attention layer can generate attention maps with multiple attention points instead of conventional attention maps with a single attention point. We implemented an attention-based convolutional neural network (ACNN) model with this method and conducted experiments and evaluations on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, obtaining 76.18% weighted accuracy (WA) and 76.36% unweighted accuracy (UA) on improvised data, an improvement of about 6% over the previous state-of-the-art SER model.
Date of Conference: 06-08 January 2020
Date Added to IEEE Xplore: 12 March 2020
Conference Location: Las Vegas, NV, USA

I. Introduction

Today, machine speech recognition services such as Automatic Speech Recognition (ASR) are widely used in society, and machines can readily recognize what humans are saying. As shown in Fig. 1, similar to ASR, Speech Emotion Recognition (SER) uses machines to recognize humans' emotions as they speak.
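As a rough illustration of the head-fusion idea summarized in the abstract, the sketch below computes multi-head self-attention over a sequence of speech feature frames and then fuses the per-head attention maps into a single map that can carry multiple attention points. This is a minimal NumPy sketch under our own assumptions (fusion by averaging the per-head maps; the function name `head_fusion_attention` and all parameter names are illustrative), not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head_fusion_attention(x, wq, wk, wv, n_heads):
    """Multi-head self-attention whose per-head attention maps are
    fused (here: averaged) into one map with multiple attention points.

    x: (T, d) array of T feature frames of dimension d.
    wq, wk, wv: (d, d) projection matrices.
    Returns (output, fused_map) with shapes (T, d) and (T, T).
    """
    T, d = x.shape
    dh = d // n_heads
    q, k, v = x @ wq, x @ wk, x @ wv
    # Split queries/keys into heads: (n_heads, T, dh).
    q = q.reshape(T, n_heads, dh).transpose(1, 0, 2)
    k = k.reshape(T, n_heads, dh).transpose(1, 0, 2)
    # Per-head attention maps: (n_heads, T, T), each row a distribution.
    maps = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
    # Head fusion: average the maps; each head can attend to a different
    # frame, so the fused map keeps several attention peaks at once.
    fused = maps.mean(axis=0)
    return fused @ v, fused

# Usage: five 8-dimensional frames, two heads.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, fused = head_fusion_attention(x, wq, wk, wv, n_heads=2)
```

Averaging is only one possible fusion rule; taking an element-wise maximum over heads would likewise preserve multiple attention points in a single map.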
