Multimodal Speech Emotion Recognition Using Audio and Text


Abstract:

Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
Date of Conference: 18-21 December 2018
Date Added to IEEE Xplore: 14 February 2019
Conference Location: Athens, Greece
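
The dual recurrent encoder idea described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the abstract only says that two RNNs encode the audio and text sequences and that their information is combined, so the plain tanh recurrent cell, the concatenation of final hidden states, the random untrained weights, all dimensions, and the names RNNEncoder and DualEncoderClassifier are assumptions made for illustration.

```python
import math
import random

# The four emotion classes named in the abstract.
EMOTIONS = ["angry", "happy", "sad", "neutral"]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class RNNEncoder:
    """Plain tanh RNN (assumption: the abstract does not specify the cell type)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.w_in = rand_matrix(hidden_dim, input_dim, rng)
        self.w_rec = rand_matrix(hidden_dim, hidden_dim, rng)
        self.bias = [0.0] * hidden_dim

    def encode(self, sequence):
        """Read the sequence step by step and return the final hidden state."""
        h = [0.0] * self.hidden_dim
        for x in sequence:
            a = matvec(self.w_in, x)
            r = matvec(self.w_rec, h)
            h = [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, self.bias)]
        return h


class DualEncoderClassifier:
    """Two encoders, one per modality, fused by concatenation (an assumption)."""

    def __init__(self, audio_dim, text_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.audio_enc = RNNEncoder(audio_dim, hidden_dim, rng)
        self.text_enc = RNNEncoder(text_dim, hidden_dim, rng)
        self.w_out = rand_matrix(len(EMOTIONS), 2 * hidden_dim, rng)

    def predict_proba(self, audio_frames, token_vectors):
        """Encode both modalities, concatenate, and score the four classes."""
        fused = self.audio_enc.encode(audio_frames) + self.text_enc.encode(token_vectors)
        return softmax(matvec(self.w_out, fused))


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical inputs: 20 audio frames of 6 features, 5 tokens of 8 dimensions.
    audio = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(20)]
    text = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
    model = DualEncoderClassifier(audio_dim=6, text_dim=8, hidden_dim=4)
    print(dict(zip(EMOTIONS, model.predict_proba(audio, text))))
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how two sequences of different lengths and feature sizes can be reduced to fixed-size vectors and combined before a single four-way prediction.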

1. Introduction

Recently, deep learning algorithms have successfully addressed problems in various fields, such as image classification, machine translation, speech recognition, text-to-speech generation and other machine-learning-related areas [1, 2, 3]. Similarly, substantial improvements in performance have been obtained when deep learning algorithms have been applied to statistical speech processing [4]. These fundamental improvements have led researchers to investigate additional topics related to human nature, which have long been objects of study. One such topic involves understanding human emotions and reflecting them through machine intelligence, such as emotional dialogue models [5, 6].
