Introduction
Speech emotion recognition (SER) has gained popularity within the research community over the past few decades and has potential applications in Human-Computer Interaction, multimedia, and biomedical systems, to name a few. Speech is one of the main channels through which we relay information to our surroundings, so a detailed analysis of speech signals is necessary. Emotion is a vital piece of information that speech signals carry in addition to the verbal content. Any human-computer interface must be able to capture the underlying emotion of human speech, since the same sentence can carry different meanings depending on the emotion.
SER has great potential in speech-enabled interfaces such as Artificial Intelligence enabled voice assistants, which can track emotions and changes in their pattern to predict psychological changes or signs of mental stress and depression. This can be extended to the medical field for the detection of Autism and Parkinson's Disease, as well as for adjuvant treatment and diagnosis. Educational software and smart classrooms could use SER to monitor student mental health. Automated vehicles could prevent accidents and ensure safer driving environments by analyzing the driver's speech and judging whether the driver is fit to drive. The entire SER process has two indispensable phases, namely feature extraction and classification. For the former, acoustic features can be divided into two broad categories. The first is temporal features such as signal energy, zero-crossing rate, maximum amplitude, and minimum energy. Converting these temporal features to the frequency domain using Fourier Transforms yields spectral features, which include the spectral centroid, Mel spectrograms, Mel Frequency Cepstral Coefficients (MFCC), spectral flux, Shifted Delta Cepstral Coefficients (SDCC), spectral density, and chroma-stft. Since both deep learning and image classification have attained great heights in the past few decades, our paper proposes to bring image classification using 2D convolutional neural network (CNN) models to the field of SER. Hence, the chosen feature is the Mel spectrogram of the audio data.
In recent years, ensemble models that fuse the prediction scores of different constituent models have come into practice. In this paper, we propose an ensemble model that assigns fuzzy ranks with the help of a modified Gompertz function. The audio data come from the Surrey Audio-Visual Expressed Emotion (SAVEE), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the Berlin Database of Emotional Speech (EmoDB) datasets. Ensemble learning assigns ranks to the predictions of the individual models, thereby providing superior results to any single model, and it simultaneously mitigates imbalance and correlation problems. The fuzzy rank approach is a fusion technique that produces the final classification by assigning adaptive weights to the multi-class confidence scores of the constituent models. The Gompertz function, named after Benjamin Gompertz, is a sigmoid function in the time domain that grows slowest at the beginning and end of a given period. It was originally formulated to model human mortality, which rises exponentially with age before saturating asymptotically. In the TLEFuzzyNet model, we choose three pre-trained transfer learning CNN models: ResNet18, GoogLeNet, and Inception_v3. Fuzzy ranks are assigned to the prediction scores of the three models on the test set using the Gompertz function, which yields better accuracy than the constituent models, making this a novel approach.
A. Motivation and Contributions
Since modern technology is increasingly driven by gesture and voice automation, SER holds the key to designing state-of-the-art human-computer interaction interfaces. From voice-enabled security devices and authentication systems to automated vehicle environments, emotion can be analyzed to prevent identity mismatch or accidents. The medical sciences can also benefit from classifying emotion in patients' speech, for treating Parkinson's disease and Autism to name a few. Key points regarding our proposed model are as follows:
A huge amount of data is required to build an end-to-end model. Due to the lack of sufficient labelled audio data, we use three pre-trained transfer learning models as the constituent models of our ensemble pipeline: Inception_v3, GoogLeNet, and ResNet18.
Training the audio samples on a single model for classification may lead to the problem of imbalance. The ensemble approach aggregates the opinions of all the individual models, thereby decreasing noise and giving better, less biased prediction scores. This makes it a novel approach for SER.
A modified Gompertz function is employed to assign fuzzy ranks to the decision scores of the individual models. The prediction scores rarely go as low as zero, and the Gompertz function saturates exponentially to an asymptote. This differs from, and is more efficient than, traditional ensemble pipelines, since fuzzy rank based fusion assigns adaptive priority weights to the prediction scores of individual models.
We have compared our accuracy and performance with other modern approaches for SER and shown that the TLEFuzzyNet model outperforms them, establishing the novelty of our framework.
The TLEFuzzyNet model has been trained and tested on three open-source datasets: SAVEE, EmoDB, and RAVDESS. We have achieved state-of-the-art accuracy compared to contemporary machine learning and deep learning approaches.
Literature Review
SER has seen a great deal of research over the past few decades. A number of machine learning models have been proposed, including hidden Markov models [38], Support Vector Machines [41], Decision Trees, and Random Forests [37].
Dellaert et al. [9] used smoothing spline approximation on the pitch contour for feature extraction, with pattern recognition techniques employed for classification. Noroozi et al. [36] applied machine learning models such as a multi-class Support Vector Machine, Random Forest, and single-layered adaptive boosting on the SAVEE and Polish databases, obtaining 75.71% and 87.91% testing accuracy on the respective databases with Random Forest models. By reducing the feature size to 14, they were able to reduce computational time as well.
Nicholson et al. [35] used a special type of neural network called the one-class-in-one neural network, which predicted the best emotion using 12 LPC and Delta LPC parameters and attained an accuracy of 50%. Deshmukh et al. [11] worked with Mel-frequency cepstral coefficient (MFCC) features and trained a Support Vector Machine (SVM) on their dataset, achieving 80% accuracy. Slimi et al. [45] generated Mel spectrograms of the audio data and resized them to 150 by 66. These resized spectrograms were flattened and fed to a neural network with a single hidden layer, with classification performed by a softmax layer. Their proposed model achieved an accuracy of 81.82% after 1970 epochs of training on the EmoDB database. Badshah et al. [1] fed Mel spectrograms of the speech data to deep convolutional neural networks in two experiments: in the first, they trained a CNN model from scratch, and in the second, they fine-tuned a pre-trained AlexNet model, which gave a test accuracy of 84.3%. Etienne et al. [13] tried to capture the long-term dependencies of speech using a CNN+LSTM architecture, where the CNN extracted high-level features from spectrograms and the recurrent LSTM layers extracted relations among the temporal features.
Zehra et al. [54] used an ensemble approach combining results from a decision tree (J48), sequential minimal optimization (SMO), and random forest (RF). Their main objective was to design robotic systems that could analyze and detect emotions not only in in-corpus but also in cross-corpus data. Khan and Roy [23] developed both gender-dependent and gender-independent models by training a Naive Bayes classifier on pitch and MFCC features using the EmoDB database, achieving an accuracy of 95.20%.
Dataset
Since SER is a classification problem, the availability of sufficient, correctly labelled data is of prime concern. Different individuals may perceive the same speech as conveying different emotions, so it is important to choose a dataset labelled by the database creators themselves. We work with simulated databases in which actors or speakers simulate the required emotion by reading from a set of sentences. In this work, we use the SAVEE, EmoDB [3], and RAVDESS [30] datasets, which contain accurately labelled, noise-free audio samples. All three databases are publicly accessible, and the class-wise distribution of audio samples is tabulated in Table 1.
A. SAVEE
The SAVEE dataset contains standard TIMIT sentences recorded by four male actors. All audio samples were recorded, processed, and labelled using high-quality equipment in a visual and media laboratory. The dataset has 480 audio samples in .wav format and is divided into seven emotion classes: anger, fear, disgust, surprise, sad, happy, and neutral.
B. EmoDB
EmoDB [3] is a German emotional speech dataset of 535 utterances in .wav format, created by the Institute of Communication Science, Technical University of Berlin, Germany. There are five male and five female speakers, and seven labelled emotion classes: anger, boredom, neutral, happiness, anxiety, sadness, and disgust. The sentences the speakers utter are everyday phrases that can be spoken with a variety of emotions.
C. RAVDESS
RAVDESS [30] is a facial and vocal expression database in North American English, in which 24 professional actors vocalize statements in a North American accent. There are eight emotion classes: calm, happy, sad, angry, fearful, surprise, disgust, and neutral. All classes except neutral are uttered at two levels of emotional intensity. 247 participants rated the database, with 72 participants retesting to validate the reliability of the labels.
Methodology
The proposed framework is organized into the following subsections: data augmentation, feature extraction, transfer learning, and finally fuzzy rank assignment using the Gompertz function for the final prediction. A diagrammatic workflow of the entire TLEFuzzyNet model, from feature extraction to final prediction, is given in Figure 1.
A. Augmentation
The size of the dataset is a crucial deciding factor for any deep learning model. Too little data prevents deep learning models from accurately mapping inputs to the ground-truth labels, and inadequate data can also cause high variance in predictions on test data. To overcome this problem, we use augmentation to create multiple training samples from the limited audio files in the SAVEE and EmoDB databases. The technique we employ is time-shifting, whose equation is as follows:\begin{equation*} X_{new} = X[n \pm s]\tag{1}\end{equation*} where $s$ is the number of samples by which the signal is shifted.
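As an illustration of Eq. (1), the following is a minimal numpy sketch of time-shifting. The maximum shift is an illustrative parameter rather than a value specified here, and this version wraps samples around the ends instead of zero-padding:

```python
import numpy as np

def time_shift(x: np.ndarray, max_shift: int) -> np.ndarray:
    """Shift a waveform left or right by a random number of samples,
    implementing X_new = X[n +/- s] of Eq. (1). `max_shift` (in
    samples) is an illustrative parameter."""
    s = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(x, s)  # samples wrap around the ends of the signal

# Example: one augmented copy of a 3-second clip at 16 kHz,
# shifted by at most 0.3 seconds in either direction.
x = np.random.randn(48000)             # placeholder waveform
x_aug = time_shift(x, max_shift=4800)
```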
B. Feature Extraction
The choice of effective features from the speech data is very important for the TLEFuzzyNet model to attain state-of-the-art performance. There is a wide range of important speech features, such as pitch, energy, Mel spectrograms, Mel-frequency cepstral coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), and modulation spectral features (MSFs), to name a few. Since we use Convolutional Neural Network (CNN) models to train our datasets, the Mel spectrogram is the ideal choice of feature.
Spectrograms are generated by applying Fourier Transforms to sound signals: the signal is divided into short time segments, and the Fourier Transform is applied to each segment individually. The result is a frequency-versus-time plot in which the amplitude at each frequency is denoted by the color of the spectrogram. However, humans perceive frequency logarithmically rather than linearly. The Mel scale addresses this by mapping perceived pitch to the measured frequency of a tone. Figure 2 plots Mel pitch against frequency; the slope of the curve decreases as frequency increases, showing that higher frequencies are harder to differentiate than lower ones.
The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another.
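As a sketch of this feature-extraction step, a log-scaled Mel spectrogram can be computed with librosa. The number of Mel bands is an illustrative default rather than a value stated here:

```python
import librosa
import numpy as np

def mel_spectrogram_image(path: str, n_mels: int = 128) -> np.ndarray:
    """Load an audio file and return its log-scaled Mel spectrogram.
    n_mels and the default hop length are illustrative choices."""
    y, sr = librosa.load(path, sr=None)        # keep native sampling rate
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)  # dB scale for visualization

# The dB-scaled array can be rendered (e.g. with
# librosa.display.specshow) and saved as an RGB image
# so that it can serve as input to the 2D CNN models.
```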
C. Transfer Learning
The non-availability of sufficient data makes it difficult to train deep neural network models from scratch. Transfer learning solves this problem by reusing models that have been pre-trained on huge datasets: the weights of a network trained on one task are used to initialize a model that takes a totally new set of data as input. The last few layers of the pre-trained model are fine-tuned by re-training them on the new dataset, which achieves better performance with less training time. Transfer learning is used extensively in image classification because obtaining millions of images for a database is not always feasible. CNNs are memory- and compute-intensive, which makes them difficult to train with limited resources; not having to re-train the entire model therefore saves training time and system resources.
The three databases are split into training, validation, and testing sets in the ratio 8:1:1. The training and validation sets are used to fine-tune the transfer learning models on our Mel spectrogram data; the testing set remains unseen by the models during training. The GoogLeNet, ResNet18, and Inception_v3 models are loaded from the PyTorch Model Zoo, pretrained on ImageNet weights, and fine-tuned using the Adam [24] optimizer. The confidence scores on the testing split are generated for all three models and stored as a CSV file for use in the fuzzy rank-based ensemble step.
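A minimal sketch of this setup, assuming the torchvision implementations of the three backbones; the layer names follow the torchvision API:

```python
import torch.nn as nn
from torchvision import models

def build_models(num_classes: int):
    """Load the three ImageNet-pretrained backbones and replace their
    classification heads with num_classes-way layers."""
    googlenet = models.googlenet(pretrained=True)
    resnet18 = models.resnet18(pretrained=True)
    inception = models.inception_v3(pretrained=True)

    # Freeze the backbone weights; only the new heads will be trained.
    for m in (googlenet, resnet18, inception):
        for p in m.parameters():
            p.requires_grad = False

    # Replace the 1000-way ImageNet classifier in each model.
    googlenet.fc = nn.Linear(googlenet.fc.in_features, num_classes)
    resnet18.fc = nn.Linear(resnet18.fc.in_features, num_classes)
    inception.fc = nn.Linear(inception.fc.in_features, num_classes)
    return googlenet, resnet18, inception
```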
The three transfer learning models are described below as follows:
1) GoogLeNet
GoogLeNet is a 22-layer CNN model and one of the popular networks of the Inception architecture, developed by researchers at Google. It takes as input images of shape $224 \times 224 \times 3$.
Tabular representation of GoogLeNet CNN transfer learning model used in the present work.
2) Inception_v3
Inception_v3 is a popular 2D CNN model with strong classification performance. It extends the GoogLeNet architecture and makes extensive use of batch normalization [17] throughout the network. Inputs to this model are of size $299 \times 299 \times 3$.
Pictorial representation of Inception_v3 CNN transfer learning model used in the present work.
3) ResNet18
ResNet, or Residual Network, introduced in 2015, is an architecture that uses residual mapping and is very effective against the “degradation problem” in deep networks. The residual learning approach eases the optimization of the CNN model. Like most popular image recognition CNN models, ResNet-18 is pretrained on the ImageNet dataset. It takes as input images of size $224 \times 224 \times 3$.
Tabular representation of ResNet-18 CNN transfer learning model used in the present work.
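As an illustration of the residual mapping described above, the following is a minimal sketch of a ResNet-style basic block in PyTorch; the channel count and layer arrangement are simplified relative to the full ResNet-18 architecture:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: the input is added back to the
    convolutional output (identity shortcut), so the stacked
    layers only need to learn the residual F(x) = H(x) - x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: F(x) + x
```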
Each dataset is trained separately on the three models: 50 epochs on SAVEE, 10 epochs on RAVDESS, and 25 epochs on EmoDB. The epoch counts are determined experimentally to prevent overfitting on the sample data. A batch size of 16 is used with a learning rate of 0.001 and the Adam optimizer. Figure 6 shows a training batch used to classify the emotions of the corresponding audio data.
A training batch of 16 spectrograms from the RAVDESS dataset with labels denoting the respective classes present in the database. The labels are mapped as follows: ‘f’ for fear, ‘d’ for disgust, ‘h’ for happiness, ‘s’ for sad, ‘n’ for neutral, ‘su’ for surprised, ‘a’ for angry, and ‘c’ for calm.
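A minimal sketch of the fine-tuning loop under the stated hyperparameters (Adam, learning rate 0.001, with the batch size of 16 supplied by the DataLoader); this is an illustration under those assumptions, not the exact training script used here:

```python
import torch
import torch.nn as nn

def fine_tune(model, train_loader, epochs: int, device: str = "cuda"):
    """Fine-tune the classification head of a frozen backbone."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    # Only the unfrozen (head) parameters are updated.
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=0.001)

    for _ in range(epochs):
        model.train()
        for specs, labels in train_loader:
            specs, labels = specs.to(device), labels.to(device)
            optimizer.zero_grad()
            out = model(specs)
            # Inception_v3 returns a namedtuple with auxiliary logits
            # in training mode; keep only the main logits here.
            logits = out.logits if hasattr(out, "logits") else out
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
```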
D. Fuzzy Rank Ensemble
In the literature, traditional ensemble techniques give equal priority to the classification scores of all constituent models, and pre-computed weights are used for the classifiers. The main issue with such an ensemble is that the static weights are difficult to modify at the stage where test samples are classified. In the proposed fuzzy rank-based ensemble approach, by contrast, each base classifier's prediction scores are taken into account for every individual test case separately. This yields enhanced, more accurate classification scores, and, being a dynamic process, it does not require a new set of weights for different test datasets.
The Gompertz function describes a time series with the slowest growth at the beginning and end of a time period. Originally used to describe the mortality rate as a function of age, it is now used extensively in biology: it can describe the growth of a population, a cancerous tumor, or a bacterial colony, as well as the number of people affected during an epidemic. The function is given by:\begin{equation*} f(t) = ae^{-e^{b-ct}}\tag{2}\end{equation*} where $a$ is the asymptote and $b$ and $c$ control the displacement along the time axis and the growth rate, respectively.
In our proposed method, we use a redesigned version of the Gompertz function. Considering $N$ constituent classifiers and $L$ emotion classes, let $S_{l}^{(n)}$ denote the confidence score that the $n$-th classifier assigns to the $l$-th class, normalized so that:\begin{equation*} \sum _{l=1}^{L}S_{l}^{(n)} = 1;\quad \forall n,\quad n=1,2,3, {\dots },N\tag{3}\end{equation*}
The prediction scores $S_{l}^{(n)}$ are converted into fuzzy ranks $R_{l}^{(n)}$ using the re-parameterized Gompertz function:\begin{align*}&R_{l}^{(n)} = 1-e ^{-e ^{-2\times S_{l}^{(n)}}} \\&\forall l,n;\quad n= 1,2,\ldots,N;\quad l=1,2,\ldots,L\tag{4}\end{align*} so that a higher confidence score produces a lower (better) rank value.
Corresponding to each class in the dataset, there can be at most $N$ fuzzy ranks, one per classifier. For each test sample $\mathbf{X}$, the fuzzy rank sum $FRS_{l}$ and the averaged complemented confidence factor sum $CCFS_{l}$ are accumulated over the classifiers whose top-$K$ rank set $K^{(i)}$ contains class $l$, with penalty values $P_{l}^{R}$ and $P_{l}^{CF}$ applied otherwise; here $CF_{l}^{(i)}$ denotes the complement of the confidence score of the $i$-th classifier for class $l$. The final predicted class minimizes the product of the two sums:\begin{align*} FRS_{l}=&\sum _{i=1}^{N}\begin{cases} R_{l}^{(i)}, & \text{if } R_{l}^{(i)}\in K^{(i)}\\ P_{l}^{R}, & \text{otherwise} \end{cases}\tag{5}\\ CCFS_{l}=&\frac {1}{N}\sum _{i=1}^{N}\begin{cases} CF_{l}^{(i)}, & \text{if } R_{l}^{(i)}\in K^{(i)}\\ P_{l}^{CF}, & \text{otherwise} \end{cases}\tag{6}\\ class(\mathbf {X})=&\mathop{\arg\min}_{l}\left \{{ FRS_{l} \times CCFS_{l} }\right \} \quad \forall l=1,2,\ldots,L\tag{7}\end{align*}
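To make Eqs. (4)-(7) concrete, the following is a minimal sketch of the fusion step for a single test sample, assuming softmax scores from the $N$ classifiers, a hypothetical top-$K$ of 2, and illustrative penalty values (the rank and complement of a zero confidence score); the exact penalty settings may differ:

```python
import numpy as np

def gompertz_fuzzy_fusion(scores: np.ndarray, top_k: int = 2) -> int:
    """Fuse one sample's scores from N classifiers over L classes.
    scores: array of shape (N, L), each row a softmax distribution.
    Returns the index of the predicted class, per Eq. (7)."""
    N, L = scores.shape
    # Eq. (4): fuzzy ranks via the re-parameterized Gompertz function;
    # higher confidence yields a lower (better) rank value.
    R = 1.0 - np.exp(-np.exp(-2.0 * scores))
    CF = 1.0 - scores  # complement of the confidence score

    # Illustrative penalties: the rank/complement of a zero score.
    P_R = 1.0 - np.exp(-np.exp(0.0))   # ~0.632
    P_CF = 1.0

    FRS = np.zeros(L)   # fuzzy rank sum, Eq. (5)
    CCFS = np.zeros(L)  # complemented confidence factor sum, Eq. (6)
    for n in range(N):
        top = set(np.argsort(R[n])[:top_k].tolist())  # k best ranks
        for l in range(L):
            if l in top:
                FRS[l] += R[n, l]
                CCFS[l] += CF[n, l]
            else:
                FRS[l] += P_R
                CCFS[l] += P_CF
    CCFS /= N
    return int(np.argmin(FRS * CCFS))  # Eq. (7)

# Example: three classifiers, four classes; two of the three
# classifiers favor class 0, so the fused decision is class 0.
scores = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.6, 0.2, 0.1, 0.1],
                   [0.2, 0.5, 0.2, 0.1]])
print(gompertz_fuzzy_fusion(scores))   # -> 0
```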
Results and Discussion
In this section, we provide tabular data for the results obtained on the three aforementioned datasets, along with a detailed explanation of the evaluation metrics and the performance of the CNN transfer learning models and the final ensemble model. We compare our work with previous studies and show that our method attains state-of-the-art performance on the SER problem using an ensemble of deep 2D CNN models.
A. Evaluation Metrics
To assess the performance of the TLEFuzzyNet model, we consider F1-score, Precision, Recall, and Accuracy as our evaluation metrics. A majority of past studies have used Accuracy as the standard performance metric, so we provide a comparative study between the TLEFuzzyNet model and previous models for the SER problem.
The mentioned evaluation metrics can be calculated using basic parameters such as True Positives, True Negatives, False Positives, and False Negatives. The corresponding formulas are as follows:
Accuracy:\begin{equation*} Accuracy=\frac {\sum _{i}M_{ii} }{\sum _{i} \sum _{j}M_{ij} }\tag{8}\end{equation*}
Precision:\begin{equation*} Precision_{i}=\frac {M_{ii} }{\sum _{j}M_{ji} }\tag{9}\end{equation*}
Recall:\begin{equation*} Recall_{i}=\frac {M_{ii} }{\sum _{j}M_{ij} }\tag{10}\end{equation*}
F1 Score:\begin{equation*} F1\:Score_{i}=\frac {2}{\frac {1}{Precision_{i}} + \frac {1}{Recall_{i}}}\tag{11}\end{equation*}
where $M$ is the confusion matrix and $M_{ij}$ counts the samples of true class $i$ predicted as class $j$.
We use Precision, Recall, and F1 scores because of the asymmetrical distribution of samples in our databases. In the upcoming sections, we give a comparative study against previous SER models, which include both deep learning and machine learning classifiers.
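As an illustration, the metrics of Eqs. (8)-(11) can be computed directly from a confusion matrix; a minimal numpy sketch, assuming rows index true classes and columns predicted classes, with no guard against empty classes:

```python
import numpy as np

def per_class_metrics(M: np.ndarray):
    """Compute Eqs. (8)-(11) from confusion matrix M, where M[i, j]
    counts samples of true class i predicted as class j."""
    diag = np.diag(M).astype(float)
    accuracy = diag.sum() / M.sum()           # Eq. (8)
    precision = diag / M.sum(axis=0)          # Eq. (9): column sums
    recall = diag / M.sum(axis=1)             # Eq. (10): row sums
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (11)
    return accuracy, precision, recall, f1
```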
B. Performance of Constituent Models
Each of the three constituent models is loaded from the PyTorch Model Zoo, pretrained on the ImageNet [10] dataset. All model weights are frozen except the classification layers. The classification head originally has a softmax output of size (1, 1000), which is replaced with a layer of size (1, num_of_classes) and fine-tuned. Each model is trained for exactly 50 epochs on SAVEE, 25 epochs on EmoDB, and 10 epochs on RAVDESS, after which the weights with the best validation accuracy are retained. The Adam optimizer is used for gradient descent with a learning rate of 0.001.
For the EmoDB dataset, the Inception_v3 model achieved 98.76% accuracy, the ResNet18 model 98.77%, and the GoogLeNet model 99.08%. After assigning fuzzy ranks to the classification scores, the ensemble model predicts the labels of the corresponding testing set with 99.38% accuracy, higher than any of the constituent models.
For the RAVDESS dataset, the Inception_v3, ResNet18, and GoogLeNet models achieved individual classification accuracies of 92.07%, 95.29%, and 97.24% respectively. The ensemble model, however, assigned fuzzy ranks so as to classify the corresponding testing set with a final accuracy of 99.66%. These results demonstrate the ability of the ensemble approach to compensate for the errors of each CNN model and generate more accurate classification scores.
Figure 10 shows the learning curves on the SAVEE dataset for the Inception_v3, GoogLeNet, and ResNet18 models respectively. The graphs show that the models stop learning and reach their maximum accuracy around the 10th epoch. Similarly, for the EmoDB dataset (Figure 11) and the RAVDESS dataset (Figure 12), training reaches a point, again around the 10th epoch, beyond which accuracy stops improving. All learning curves were plotted using TensorBoard in Python.
Graph showing the training vs. validation accuracy plotted over 50 epochs on the SAVEE dataset.
Graph showing the training vs. validation accuracy plotted over 25 epochs on the EmoDB dataset.
Graph showing the training vs. validation accuracy plotted over 10 epochs on the RAVDESS dataset.
C. Performance of Ensemble Model
The ensemble model assigns fuzzy ranks to the classification scores produced by the three CNN transfer learning models described in Section IV-C. The classification scores from the transfer learning phase are stored for each sample in the testing set, and in this phase we assign fuzzy ranks to the top-$K$ predicted classes of each classifier.
To visualize the classification performance of our ensemble model, we use the receiver operating characteristic (ROC) curve. Although conventionally employed for binary classification, the ROC curve can be extended to multi-class classification through the one-vs-all method, in which the ground-truth class is treated as one label and all other classes as a collective second label. The ROC curve measures a model's ability to distinguish between classes: the area under the curve is proportional to the accuracy with which a class is correctly classified. The True Positive Rate is plotted against the False Positive Rate in the ROC curve.
A model achieves perfect separation when the area under the ROC curve is 1: it can distinguish the multiple classes with 100% accuracy. Conversely, a ROC curve with an area near 0 means the model mispredicts every sample, indicating a poor model. The ROC curves for our ensemble model on the RAVDESS, EmoDB, and SAVEE datasets are given in Figs. 13 to 15. In Fig. 13, the area under the ROC curve approaches 1, consistent with the 99.66% accuracy on the RAVDESS dataset.\begin{align*} TPR=&\frac {TP}{TP+FN}\tag{12}\\ FPR=&\frac {FP}{TN+FP}\tag{13}\end{align*}
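A sketch of the one-vs-all ROC computation described above, assuming scikit-learn is available and that the fused class scores have been oriented so that higher values indicate higher confidence:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def one_vs_all_roc(y_true: np.ndarray, y_score: np.ndarray,
                   n_classes: int) -> dict:
    """Compute per-class ROC curves and AUCs.
    y_true: integer labels of shape (num_samples,).
    y_score: class scores of shape (num_samples, n_classes)."""
    # Binarize labels: each class vs. all the others.
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    curves = {}
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
        curves[c] = (fpr, tpr, auc(fpr, tpr))  # per-class AUC
    return curves
```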
D. Comparison With Other Frameworks
In this section, we present a detailed comparison between the performance of past frameworks and our proposed TLEFuzzyNet model. The datasets we use are open source and widely used in speech processing and speech emotion recognition research. Since research in SER has been evolving for the last few decades, there is a wide range of deep learning and machine learning models to compare the TLEFuzzyNet model against. Tables 3–5 give a tabular comparison of the TLEFuzzyNet model with other state-of-the-art approaches.
The datasets we have worked with are popular SER datasets. As Table 1 shows, both EmoDB and SAVEE are relatively small compared to the RAVDESS dataset. Additionally, EmoDB is an imbalanced dataset, with relatively more data corresponding to the anger class.
In the case of the RAVDESS dataset, the previous frameworks listed in Table 5 classify the different emotions with decent accuracy in the range of 70%-85%, possibly owing to the comparatively large size of the dataset. Our framework does not depend entirely on the transfer learning phase for classification: the classification scores are fed to the fuzzy rank ensemble model, which provides the final classification results after assigning ranks to the top-$K$ predictions of each constituent model.
Conclusion
This paper proposed an ensemble learning based framework for SER using transfer learning 2D CNN models. Models pre-trained on huge image datasets were found to extract essential features from Mel spectrograms of audio data, thereby converting the task of speech processing and recognition into a computer vision task. The TLEFuzzyNet model combines transfer learning, CNNs, and a fuzzy rank based ensemble approach built on the Gompertz function. Since the datasets used are not of very large scale, transfer learning was a good choice for training the deep convolutional neural networks. The dynamic assignment of ranks to the classifiers makes it possible to make predictions on new datasets without initializing a new set of weights for the ensemble phase of the framework, and the fuzzy ranking algorithm compensates for the errors of each individual CNN classifier. The experimental results show that the TLEFuzzyNet model achieves state-of-the-art accuracies of 98.57%, 99.38%, and 99.66% on the three benchmark datasets SAVEE, EmoDB, and RAVDESS respectively, demonstrating the promise of transfer learning and ensemble approaches for SER.
There are few areas where the TLEFuzzyNet model can be improved which are as follows:
In our CNN models, we have used traditional Mel spectrograms, which greatly increase computation due to the (224, 224) input image sizes of the transfer learning models. In the future, smaller feature sets such as MFCCs, or feature vectors from smaller neural network architectures, could be used instead.
The generalization of the framework can be improved with better data augmentation techniques such as voice conversion using generative models [51] and speed perturbation.