Introduction
Speech emotion recognition (SER) has gained popularity within the research community over the past few decades and has potential applications in Human-Computer Interaction, multimedia, and biomedical systems, to name a few. Speech is one of the main channels through which we relay information to our surroundings, so a detailed analysis of speech signals is necessary. Emotion is a vital piece of information that speech signals carry in addition to the verbal content. Any human-computer interface must be able to capture the underlying emotion of human speech, since the same sentence can carry different meanings depending on the emotion.
SER has great potential in speech-enabled interfaces such as Artificial Intelligence enabled voice assistants, which can track emotions and changes in their pattern to predict psychological changes or signs of mental stress and depression. This can be extended to the medical field for the detection of Autism and Parkinson's Disease, as well as for adjuvant treatment and diagnosis. Educational software and smart classrooms could use SER to monitor student mental health. Automated vehicles could prevent accidents and ensure safer driving environments by analyzing the driver's speech and judging whether the driver is fit to drive. The entire SER process has two indispensable phases, namely feature extraction and classification. For the former, acoustic features can be divided into two broad categories. The first is temporal features such as signal energy, zero-crossing rate, maximum amplitude, and minimum energy. Converting these temporal features to the frequency domain using Fourier Transforms yields spectral features, which include the spectral centroid, Mel spectrograms, Mel Frequency Cepstral Coefficients (MFCC), spectral flux, Shifted Delta Cepstral Coefficients (SDCC), spectral density, and chroma-stft. Since both deep learning and image classification have attained great heights in the past few decades, our paper proposes to bring image classification using 2D convolutional neural network (CNN) models to the field of SER. Hence, the chosen feature is the Mel spectrogram of the audio data.
In recent years, ensemble models that fuse the prediction scores of different constituent models have come into practice. In this paper, we propose an ensemble model that assigns fuzzy ranks with the help of a modified Gompertz function. The audio data come from the Surrey Audio-Visual Expressed Emotion (SAVEE), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the Berlin Database of Emotional Speech (EmoDB) datasets. Ensemble learning assigns ranks to the predictions of the individual models, thereby providing superior results to any single model, and it simultaneously mitigates imbalance and correlation problems. The fuzzy rank approach is a fusion technique that produces the final classification by assigning adaptive weights to the multi-class confidence scores of the constituent models. The Gompertz function, named after Benjamin Gompertz, is a sigmoid function in the time domain that grows slowest at the beginning and end of a given period. It was originally formulated to model human mortality, which rises exponentially with age before saturating asymptotically. In the TLEFuzzyNet model, we choose three pre-trained transfer learning CNN models: ResNet18, GoogLeNet, and Inception_v3. Fuzzy ranks are assigned to the prediction scores of the three models on the test set using the Gompertz function, which yields better accuracy than the constituent models, making this a novel approach.
A. Motivation and Contributions
Since modern technology is increasingly driven by gesture and voice automation, SER holds the key to designing state-of-the-art human-computer interaction interfaces. From voice-enabled security devices and authentication systems to automated vehicle environments, emotion can be analyzed to prevent identity mismatch or accidents. The medical sciences can also benefit from classifying emotion in patients' speech, for treating Parkinson's disease and Autism to name a few. Key points regarding our proposed model are as follows:
A huge amount of data is required to build an end-to-end model. Due to the lack of sufficient labelled audio data, we use three pre-trained transfer learning models as the constituent models of our ensemble pipeline: Inception_v3, GoogLeNet, and ResNet18.
Training the audio samples on a single model for classification may lead to the problem of imbalance. The ensemble approach aggregates the opinions of all the individual models, thereby decreasing noise and giving better, less biased prediction scores. This makes it a novel approach for SER.
A modified Gompertz function is employed to assign fuzzy ranks to the decision scores of the individual models. The prediction scores rarely go as low as zero, and the Gompertz function saturates exponentially to an asymptote. This differs from, and is more efficient than, traditional ensemble pipelines, since fuzzy rank based fusion assigns adaptive priority weights to the prediction scores of individual models.
We have compared our accuracy and performance with other modern approaches for SER and shown that the TLEFuzzyNet model outperforms them, establishing the novelty of our framework.
The TLEFuzzyNet model has been trained and tested on three open-source datasets: SAVEE, EmoDB, and RAVDESS. We have achieved state-of-the-art accuracy compared to contemporary machine learning and deep learning approaches.
Literature Review
SER has seen a great deal of research over the past few decades. A number of machine learning models have been proposed, including hidden Markov models [38], Support Vector Machines [41], Decision Trees, and Random Forests [37].
Dellaert et al. [9] used smoothing spline approximation on the pitch contour for feature extraction, with pattern recognition techniques employed for classification. Noroozi et al. [36] applied machine learning models such as a multi-class Support Vector Machine, Random Forest, and single-layered adaptive boosting on the SAVEE and Polish databases, obtaining 75.71% and 87.91% testing accuracy on the respective databases with Random Forest models. By reducing the feature size to 14, they were able to reduce computational time as well.
Nicholson et al. [35] used a special type of neural network called the one-class-in-one neural network, which predicted the best emotion using 12 LPC and Delta LPC parameters and attained an accuracy of 50%. Deshmukh et al. [11] worked with Mel-frequency cepstral coefficient (MFCC) features and trained a Support Vector Machine (SVM) on their dataset, achieving 80% accuracy. Slimi et al. [45] generated Mel spectrograms of the audio data and resized them to 150 by 66. These resized spectrograms were flattened and fed to a neural network with a single hidden layer, with classification performed by a softmax layer. Their proposed model achieved an accuracy of 81.82% after 1970 epochs of training on the EmoDB database. Badshah et al. [1] fed Mel spectrograms of the speech data to deep convolutional neural networks in two experiments: in the first, they trained a CNN model from scratch, and in the second, they fine-tuned a pre-trained AlexNet model, which gave a test accuracy of 84.3%. Etienne et al. [13] tried to capture the long-term dependencies of speech using a CNN+LSTM architecture, where the CNN extracted high-level features from spectrograms and the recurrent LSTM layers extracted relations among the temporal features.
Zehra et al. [54] used an ensemble approach combining results from a decision tree (J48), sequential minimal optimization (SMO), and random forest (RF). Their main objective was to design robotic systems that could analyze and detect emotions not only in in-corpus but also in cross-corpus data. Khan and Roy [23] developed both gender-dependent and gender-independent models by training a Naive Bayes classifier on pitch and MFCC features using the EmoDB database, achieving an accuracy of 95.20%.
Dataset
Since SER is a classification problem, the availability of sufficient, correctly labelled data is of prime concern. Different individuals may perceive the same speech as conveying different emotions, so it is important to choose a dataset labelled by the database creators themselves. We work with simulated databases in which actors or speakers simulate the required emotion by reading from a set of sentences. In this work, we use the SAVEE, EmoDB [3], and RAVDESS [30] datasets, which contain accurately labelled, noise-free audio samples. All three databases are publicly accessible, and the class-wise distribution of audio samples is tabulated in Table 1.
A. SAVEE
The SAVEE dataset contains standard TIMIT sentences recorded by four male actors. All audio samples were recorded, processed, and labelled using high-quality equipment in a visual and media laboratory. The dataset has 480 audio samples in .wav format and is divided into seven emotion classes: anger, fear, disgust, surprise, sad, happy, and neutral.
B. EmoDB
EmoDB [3] is a German emotional speech dataset of 535 utterances in .wav format, created by the Institute of Communication Science, Technical University of Berlin, Germany. There are five male and five female speakers, and seven labelled emotion classes: anger, boredom, neutral, happiness, anxiety, sadness, and disgust. The sentences the speakers utter are everyday phrases that can be spoken with a variety of emotions.
C. RAVDESS
RAVDESS [30] is a facial and vocal expression database in North American English, in which 24 professional actors vocalize statements in a North American accent. There are eight emotion classes: calm, happy, sad, angry, fearful, surprise, disgust, and neutral. All classes except neutral are uttered at two levels of emotional intensity. 247 participants rated the database, with 72 participants retesting to validate the reliability of the labels.
Methodology
The proposed framework is organized into the following subsections: data augmentation, feature extraction, transfer learning, and finally fuzzy rank assignment using the Gompertz function for the final prediction. A diagrammatic workflow of the entire TLEFuzzyNet model, from feature extraction to final prediction, is given in Figure 1.
A. Augmentation
The size of the dataset is a crucial deciding factor for any deep learning model. Too little data prevents deep learning models from accurately mapping inputs to the ground-truth labels, and inadequate data can also cause high variance in predictions on test data. To overcome this problem, we use augmentation to create multiple training samples from the limited audio files in the SAVEE and EmoDB databases. The technique we employ is time-shifting, whose equation is as follows:\begin{equation*} X_{new} = X[n \pm s]\tag{1}\end{equation*} where $s$ is the number of samples by which the signal is shifted.
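As an illustration of Eq. (1), the following is a minimal numpy sketch of time-shifting. The maximum shift is an illustrative parameter rather than a value specified here, and this version wraps samples around the ends instead of zero-padding:

```python
import numpy as np

def time_shift(x: np.ndarray, max_shift: int) -> np.ndarray:
    """Shift a waveform left or right by a random number of samples,
    implementing X_new = X[n +/- s] of Eq. (1). `max_shift` (in
    samples) is an illustrative parameter."""
    s = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(x, s)  # samples wrap around the ends of the signal

# Example: one augmented copy of a 3-second clip at 16 kHz,
# shifted by at most 0.3 seconds in either direction.
x = np.random.randn(48000)             # placeholder waveform
x_aug = time_shift(x, max_shift=4800)
```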
B. Feature Extraction
The choice of effective features from the speech data is very important for the TLEFuzzyNet model to attain state-of-the-art performance. There is a wide range of important speech features, such as pitch, energy, Mel spectrograms, Mel-frequency cepstral coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), and modulation spectral features (MSFs), to name a few. Since we use Convolutional Neural Network (CNN) models to train our datasets, the Mel spectrogram is the ideal choice of feature.
Spectrograms are generated by applying Fourier Transforms to sound signals: the signal is divided into short time segments, and the Fourier Transform is applied to each segment individually. The result is a frequency-versus-time plot in which the amplitude at each frequency is denoted by the color of the spectrogram. However, humans perceive frequency logarithmically rather than linearly. The Mel scale addresses this by mapping perceived pitch to the measured frequency of a tone. Figure 2 plots Mel pitch against frequency; the slope of the curve decreases as frequency increases, showing that higher frequencies are harder to differentiate than lower ones.
The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another.
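As a sketch of this feature-extraction step, a log-scaled Mel spectrogram can be computed with librosa. The number of Mel bands is an illustrative default rather than a value stated here:

```python
import librosa
import numpy as np

def mel_spectrogram_image(path: str, n_mels: int = 128) -> np.ndarray:
    """Load an audio file and return its log-scaled Mel spectrogram.
    n_mels and the default hop length are illustrative choices."""
    y, sr = librosa.load(path, sr=None)        # keep native sampling rate
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)  # dB scale for visualization

# The dB-scaled array can be rendered (e.g. with
# librosa.display.specshow) and saved as an RGB image
# so that it can serve as input to the 2D CNN models.
```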
C. Transfer Learning
The non-availability of sufficient data makes it difficult to train deep neural network models from scratch. Transfer learning solves this problem by reusing models that have been pre-trained on huge datasets: the weights of a network trained on one task are used to initialize a model that takes a totally new set of data as input. The last few layers of the pre-trained model are fine-tuned by re-training them on the new dataset, which achieves better performance with less training time. Transfer learning is used extensively in image classification because obtaining millions of images for a database is not always feasible. CNNs are memory- and compute-intensive, which makes them difficult to train with limited resources; not having to re-train the entire model therefore saves training time and system resources.
The three databases are split into training, validation, and testing sets in the ratio 8:1:1. The training and validation sets are used to fine-tune the transfer learning models on our Mel spectrogram data; the testing set remains unseen by the models during training. The GoogLeNet, ResNet18, and Inception_v3 models are loaded from the PyTorch Model Zoo, pretrained on ImageNet weights, and fine-tuned using the Adam [24] optimizer. The confidence scores on the testing split are generated for all three models and stored as a CSV file for use in the fuzzy rank-based ensemble step.
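A minimal sketch of this setup, assuming the torchvision implementations of the three backbones; the layer names follow the torchvision API:

```python
import torch.nn as nn
from torchvision import models

def build_models(num_classes: int):
    """Load the three ImageNet-pretrained backbones and replace their
    classification heads with num_classes-way layers."""
    googlenet = models.googlenet(pretrained=True)
    resnet18 = models.resnet18(pretrained=True)
    inception = models.inception_v3(pretrained=True)

    # Freeze the backbone weights; only the new heads will be trained.
    for m in (googlenet, resnet18, inception):
        for p in m.parameters():
            p.requires_grad = False

    # Replace the 1000-way ImageNet classifier in each model.
    googlenet.fc = nn.Linear(googlenet.fc.in_features, num_classes)
    resnet18.fc = nn.Linear(resnet18.fc.in_features, num_classes)
    inception.fc = nn.Linear(inception.fc.in_features, num_classes)
    return googlenet, resnet18, inception
```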
The three transfer learning models are described below as follows:
1) GoogLeNet
GoogLeNet is a 22-layer CNN model and one of the popular networks of the Inception architecture, developed by researchers at Google. It takes as input images of shape $224 \times 224 \times 3$.
Tabular representation of GoogLeNet CNN transfer learning model used in the present work.
2) Inception_v3
Inception_v3 is a popular 2D CNN model with strong classification performance. It extends the GoogLeNet architecture and makes extensive use of batch normalization [17] throughout the network. Inputs to this model are of size $299 \times 299 \times 3$.
Pictorial representation of Inception_v3 CNN transfer learning model used in the present work.
3) ResNet18
ResNet, or Residual Network, introduced in 2015, is an architecture that uses residual mapping and is very effective against the “degradation problem” in deep networks. The residual learning approach eases the optimization of the CNN model. Like most popular image recognition CNN models, ResNet-18 is pretrained on the ImageNet dataset. It takes as input images of size $224 \times 224 \times 3$.
Tabular representation of ResNet-18 CNN transfer learning model used in the present work.
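As an illustration of the residual mapping described above, the following is a minimal sketch of a ResNet-style basic block in PyTorch; the channel count and layer arrangement are simplified relative to the full ResNet-18 architecture:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: the input is added back to the
    convolutional output (identity shortcut), so the stacked
    layers only need to learn the residual F(x) = H(x) - x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: F(x) + x
```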
Each dataset is trained separately on the three models: 50 epochs on SAVEE, 10 epochs on RAVDESS, and 25 epochs on EmoDB. The epoch counts are determined experimentally to prevent overfitting on the sample data. A batch size of 16 is used with a learning rate of 0.001 and the Adam optimizer. Figure 6 shows a training batch used to classify the emotions of the corresponding audio data.
A training batch of 16 spectrograms from the RAVDESS dataset with labels denoting the respective classes present in the database. The labels are mapped as follows: ‘f’ for fear, ‘d’ for disgust, ‘h’ for happiness, ‘s’ for sad, ‘n’ for neutral, ‘su’ for surprised, ‘a’ for angry, and ‘c’ for calm.
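A minimal sketch of the fine-tuning loop under the stated hyperparameters (Adam, learning rate 0.001, with the batch size of 16 supplied by the DataLoader); this is an illustration under those assumptions, not the exact training script used here:

```python
import torch
import torch.nn as nn

def fine_tune(model, train_loader, epochs: int, device: str = "cuda"):
    """Fine-tune the classification head of a frozen backbone."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    # Only the unfrozen (head) parameters are updated.
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=0.001)

    for _ in range(epochs):
        model.train()
        for specs, labels in train_loader:
            specs, labels = specs.to(device), labels.to(device)
            optimizer.zero_grad()
            out = model(specs)
            # Inception_v3 returns a namedtuple with auxiliary logits
            # in training mode; keep only the main logits here.
            logits = out.logits if hasattr(out, "logits") else out
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
```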
D. Fuzzy Rank Ensemble
In the literature, traditional ensemble techniques give equal priority to the classification scores of all constituent models, and pre-computed weights are used for the classifiers. The main issue with such an ensemble is that the static weights are difficult to modify at the stage where test samples are classified. In the proposed fuzzy rank-based ensemble approach, by contrast, each base classifier's prediction scores are taken into account for every individual test case separately. This yields enhanced, more accurate classification scores, and, being a dynamic process, it does not require a new set of weights for different test datasets.
The Gompertz function describes a time series with the slowest growth at the beginning and end of a time period. Originally used to describe the mortality rate as a function of age, it is now used extensively in biology: it can describe the growth of a population, a cancerous tumor, or a bacterial colony, as well as the number of people affected during an epidemic. The function is given by:\begin{equation*} f(t) = ae^{-e^{b-ct}}\tag{2}\end{equation*} where $a$ is the asymptote and $b$ and $c$ control the displacement along the time axis and the growth rate, respectively.
In our proposed method, we use a redesigned version of the Gompertz function. Considering $N$ constituent classifiers and $L$ emotion classes, let $S_{l}^{(n)}$ denote the confidence score that the $n$-th classifier assigns to the $l$-th class, normalized so that:\begin{equation*} \sum _{l=1}^{L}S_{l}^{(n)} = 1;\quad \forall n,\quad n=1,2,3, {\dots },N\tag{3}\end{equation*}
The prediction scores $S_{l}^{(n)}$ are converted into fuzzy ranks $R_{l}^{(n)}$ using the re-parameterized Gompertz function:\begin{align*}&R_{l}^{(n)} = 1-e ^{-e ^{-2\times S_{l}^{(n)}}} \\&\forall l,n;\quad n= 1,2,\ldots,N;\quad l=1,2,\ldots,L\tag{4}\end{align*} so that a higher confidence score produces a lower (better) rank value.
Corresponding to each class in the dataset, there can be at most $N$ fuzzy ranks, one per classifier. For each test sample $\mathbf{X}$, the fuzzy rank sum $FRS_{l}$ and the averaged complemented confidence factor sum $CCFS_{l}$ are accumulated over the classifiers whose top-$K$ rank set $K^{(i)}$ contains class $l$, with penalty values $P_{l}^{R}$ and $P_{l}^{CF}$ applied otherwise; here $CF_{l}^{(i)}$ denotes the complement of the confidence score of the $i$-th classifier for class $l$. The final predicted class minimizes the product of the two sums:\begin{align*} FRS_{l}=&\sum _{i=1}^{N}\begin{cases} R_{l}^{(i)}, & \text{if } R_{l}^{(i)}\in K^{(i)}\\ P_{l}^{R}, & \text{otherwise} \end{cases}\tag{5}\\ CCFS_{l}=&\frac {1}{N}\sum _{i=1}^{N}\begin{cases} CF_{l}^{(i)}, & \text{if } R_{l}^{(i)}\in K^{(i)}\\ P_{l}^{CF}, & \text{otherwise} \end{cases}\tag{6}\\ class(\mathbf {X})=&\mathop{\arg\min}_{l}\left \{{ FRS_{l} \times CCFS_{l} }\right \} \quad \forall l=1,2,\ldots,L\tag{7}\end{align*}
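To make Eqs. (4)-(7) concrete, the following is a minimal sketch of the fusion step for a single test sample, assuming softmax scores from the $N$ classifiers, a hypothetical top-$K$ of 2, and illustrative penalty values (the rank and complement of a zero confidence score); the exact penalty settings may differ:

```python
import numpy as np

def gompertz_fuzzy_fusion(scores: np.ndarray, top_k: int = 2) -> int:
    """Fuse one sample's scores from N classifiers over L classes.
    scores: array of shape (N, L), each row a softmax distribution.
    Returns the index of the predicted class, per Eq. (7)."""
    N, L = scores.shape
    # Eq. (4): fuzzy ranks via the re-parameterized Gompertz function;
    # higher confidence yields a lower (better) rank value.
    R = 1.0 - np.exp(-np.exp(-2.0 * scores))
    CF = 1.0 - scores  # complement of the confidence score

    # Illustrative penalties: the rank/complement of a zero score.
    P_R = 1.0 - np.exp(-np.exp(0.0))   # ~0.632
    P_CF = 1.0

    FRS = np.zeros(L)   # fuzzy rank sum, Eq. (5)
    CCFS = np.zeros(L)  # complemented confidence factor sum, Eq. (6)
    for n in range(N):
        top = set(np.argsort(R[n])[:top_k].tolist())  # k best ranks
        for l in range(L):
            if l in top:
                FRS[l] += R[n, l]
                CCFS[l] += CF[n, l]
            else:
                FRS[l] += P_R
                CCFS[l] += P_CF
    CCFS /= N
    return int(np.argmin(FRS * CCFS))  # Eq. (7)

# Example: three classifiers, four classes; two of the three
# classifiers favor class 0, so the fused decision is class 0.
scores = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.6, 0.2, 0.1, 0.1],
                   [0.2, 0.5, 0.2, 0.1]])
print(gompertz_fuzzy_fusion(scores))   # -> 0
```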
Results and Discussion
In this section, we provide tabular data for the results obtained on the three aforementioned datasets, along with a detailed explanation of the evaluation metrics and the performance of the CNN transfer learning models and the final ensemble model. We compare our work with previous studies and show that our method attains state-of-the-art performance on the SER problem using an ensemble of deep 2D CNN models.
A. Evaluation Metrics
To assess the performance of the TLEFuzzyNet model, we consider F1-score, Precision, Recall, and Accuracy as our evaluation metrics. A majority of past studies have used Accuracy as the standard performance metric, so we provide a comparative study between the TLEFuzzyNet model and previous models for the SER problem.
The mentioned evaluation metrics can be calculated using basic parameters such as True Positives, True Negatives, False Positives, and False Negatives. The corresponding formulas are as follows:
Accuracy:\begin{equation*} Accuracy=\frac {\sum _{i}M_{ii} }{\sum _{i} \sum _{j}M_{ij} }\tag{8}\end{equation*}
Precision:\begin{equation*} Precision_{i}=\frac {M_{ii} }{\sum _{j}M_{ji} }\tag{9}\end{equation*}
Recall:\begin{equation*} Recall_{i}=\frac {M_{ii} }{\sum _{j}M_{ij} }\tag{10}\end{equation*}
F1 Score:\begin{equation*} F1\:Score_{i}=\frac {2}{\frac {1}{Precision_{i}} + \frac {1}{Recall_{i}}}\tag{11}\end{equation*}
where $M$ is the confusion matrix and $M_{ij}$ counts the samples of true class $i$ predicted as class $j$.
We use Precision, Recall, and F1 scores because of the asymmetrical distribution of samples in our databases. In the upcoming sections, we give a comparative study against previous SER models, which include both deep learning and machine learning classifiers.
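As an illustration, the metrics of Eqs. (8)-(11) can be computed directly from a confusion matrix; a minimal numpy sketch, assuming rows index true classes and columns predicted classes, with no guard against empty classes:

```python
import numpy as np

def per_class_metrics(M: np.ndarray):
    """Compute Eqs. (8)-(11) from confusion matrix M, where M[i, j]
    counts samples of true class i predicted as class j."""
    diag = np.diag(M).astype(float)
    accuracy = diag.sum() / M.sum()           # Eq. (8)
    precision = diag / M.sum(axis=0)          # Eq. (9): column sums
    recall = diag / M.sum(axis=1)             # Eq. (10): row sums
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (11)
    return accuracy, precision, recall, f1
```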
B. Performance of Constituent Models
Each of the three constituent models is loaded from the PyTorch Model Zoo, pretrained on the ImageNet [10] dataset. All model weights are frozen except the classification layers. The classification head originally has a softmax output of size (1, 1000), which is replaced with a layer of size (1, num_of_classes) and fine-tuned. Each model is trained for exactly 50 epochs on SAVEE, 25 epochs on EmoDB, and 10 epochs on RAVDESS, after which the weights with the best validation accuracy are retained. The Adam optimizer is used for gradient descent with a learning rate of 0.001.
For the EmoDB dataset, the Inception_v3 model achieved 98.76% accuracy, the ResNet18 model 98.77%, and the GoogLeNet model 99.08%. After assigning fuzzy ranks to the classification scores, the ensemble model predicts the labels of the corresponding testing set with 99.38% accuracy, higher than any of the constituent models.
For the RAVDESS dataset, the Inception_v3, ResNet18, and GoogLeNet models achieved individual classification accuracies of 92.07%, 95.29%, and 97.24% respectively. The ensemble model, however, assigned fuzzy ranks so as to classify the corresponding testing set with a final accuracy of 99.66%. These results demonstrate the ability of the ensemble approach to compensate for the errors of each CNN model and generate more accurate classification scores.
Figure 10 shows the learning curves on the SAVEE dataset for the Inception_v3, GoogLeNet, and ResNet18 models respectively. The graphs show that the models stop learning and reach their maximum accuracy around the 10th epoch. Similarly, for the EmoDB dataset (Figure 11) and the RAVDESS dataset (Figure 12), training reaches a point, again around the 10th epoch, beyond which accuracy stops improving. All learning curves were plotted using TensorBoard in Python.
Graph showing the training vs. validation accuracy plotted over 50 epochs on the SAVEE dataset.
Graph showing the training vs. validation accuracy plotted over 25 epochs on the EmoDB dataset.
Graph showing the training vs. validation accuracy plotted over 10 epochs on the RAVDESS dataset.
C. Performance of Ensemble Model
The ensemble model assigns fuzzy ranks to the classification scores produced by the three CNN transfer learning models described in Section IV-C. The classification scores from the transfer learning phase are stored for each sample in the testing set, and in this phase we assign fuzzy ranks to the top-$K$ predicted classes of each classifier.
To visualize the classification performance of our ensemble model, we use the receiver operating characteristic (ROC) curve. Although conventionally employed for binary classification, the ROC curve can be extended to multi-class classification through the one-vs-all method, in which the ground-truth class is treated as one label and all other classes as a collective second label. The ROC curve measures a model's ability to distinguish between classes: the area under the curve is proportional to the accuracy with which a class is correctly classified. The True Positive Rate is plotted against the False Positive Rate in the ROC curve.
A model achieves perfect separation when the area under the ROC curve is 1: it can distinguish the multiple classes with 100% accuracy. Conversely, a ROC curve with an area near 0 means the model mispredicts every sample, indicating a poor model. The ROC curves for our ensemble model on the RAVDESS, EmoDB, and SAVEE datasets are given in Figs. 13 to 15. In Fig. 13, the area under the ROC curve approaches 1, consistent with the 99.66% accuracy on the RAVDESS dataset.\begin{align*} TPR=&\frac {TP}{TP+FN}\tag{12}\\ FPR=&\frac {FP}{TN+FP}\tag{13}\end{align*}
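A sketch of the one-vs-all ROC computation described above, assuming scikit-learn is available and that the fused class scores have been oriented so that higher values indicate higher confidence:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def one_vs_all_roc(y_true: np.ndarray, y_score: np.ndarray,
                   n_classes: int) -> dict:
    """Compute per-class ROC curves and AUCs.
    y_true: integer labels of shape (num_samples,).
    y_score: class scores of shape (num_samples, n_classes)."""
    # Binarize labels: each class vs. all the others.
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    curves = {}
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
        curves[c] = (fpr, tpr, auc(fpr, tpr))  # per-class AUC
    return curves
```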
D. Comparison With Other Frameworks
In this section, we present a detailed comparison between the performance of past frameworks and our proposed TLEFuzzyNet model. The datasets we use are open source and widely used in speech processing and speech emotion recognition research. Since research in SER has been evolving for the last few decades, there is a wide range of deep learning and machine learning models to compare the TLEFuzzyNet model against. Tables 3–5 give a tabular comparison of the TLEFuzzyNet model with other state-of-the-art approaches.
The datasets we have worked with are popular SER datasets. As Table 1 shows, both EmoDB and SAVEE are relatively small compared to the RAVDESS dataset. Additionally, EmoDB is an imbalanced dataset, with relatively more data corresponding to the anger class.
In the case of the RAVDESS dataset, the previous frameworks listed in Table 5 classify the different emotions with decent accuracy in the range of 70%-85%, possibly owing to the comparatively large size of the dataset. Our framework does not depend entirely on the transfer learning phase for classification: the classification scores are fed to the fuzzy rank ensemble model, which provides the final classification results after assigning ranks to the top-$K$ predictions of each constituent model.
Conclusion
This paper proposed an ensemble learning based framework for SER using transfer learning 2D CNN models. Models pre-trained on huge image datasets were found to extract essential features from Mel spectrograms of audio data, thereby converting the task of speech processing and recognition into a computer vision task. The TLEFuzzyNet model combines transfer learning, CNNs, and a fuzzy rank based ensemble approach built on the Gompertz function. Since the datasets used are not of very large scale, transfer learning was a good choice for training the deep convolutional neural networks. The dynamic assignment of ranks to the classifiers makes it possible to make predictions on new datasets without initializing a new set of weights for the ensemble phase of the framework, and the fuzzy ranking algorithm compensates for the errors of each individual CNN classifier. The experimental results show that the TLEFuzzyNet model achieves state-of-the-art accuracies of 98.57%, 99.38%, and 99.66% on the three benchmark datasets SAVEE, EmoDB, and RAVDESS respectively, demonstrating the promise of transfer learning and ensemble approaches for SER.
There are few areas where the TLEFuzzyNet model can be improved which are as follows:
In our CNN models, we have used traditional Mel spectrograms, which greatly increase computation due to the (224, 224) input image sizes of the transfer learning models. In the future, smaller feature sets such as MFCCs, or feature vectors from smaller neural network architectures, could be used instead.
The generalization of the framework can be improved with better data augmentation techniques such as voice conversion using generative models [51] and speed perturbation.