Abstract:
Automated recognition of continuous emotions in audio-visual data is a growing area of study that aids in understanding human-machine interaction. Training such systems presupposes human annotation of the data. The annotation process, however, is laborious and expensive, given that several human ratings are required for every data sample to compensate for the subjectivity of emotion perception. As a consequence, labelled data for emotion recognition are rare, and the existing corpora are small compared to other state-of-the-art deep learning datasets. In this study, we explore different ways in which existing emotion annotations can be utilised more effectively to exploit the available labelled information to the fullest. To reach this objective, we exploit individual raters’ opinions by employing an ensemble of rater-specific models, one for each annotator, thereby reducing the loss of information that is a byproduct of annotation aggregation; we find that individual models can indeed infer subjective opinions. Furthermore, we explore the fusion of such ensemble predictions using different fusion techniques. Our ensemble model with only two annotators outperforms the regular arousal baseline on the test set of the MuSe-CaR corpus. While no considerable improvements on valence could be obtained, using all annotators increases the prediction performance on arousal by up to .07 Concordance Correlation Coefficient (absolute improvement) on the test set, with the ensemble trained solely on rater-specific models and fused by an attention-enhanced Long Short-Term Memory Recurrent Neural Network.
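For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of the Concordance Correlation Coefficient in its standard textbook form, together with a naive mean fusion of rater-specific predictions. The function names `ccc` and `fuse_mean` are hypothetical, and the mean fusion is only a simple stand-in for illustration; it is not the attention-enhanced LSTM fusion described in the abstract.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance Correlation Coefficient between two equal-length sequences.

    Standard definition with population (biased) statistics:
    2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")
    mean_p = fmean(pred)
    mean_g = fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    # Assumption: return 0.0 when both sequences are constant and identical
    # in mean, where the coefficient is otherwise undefined (0 / 0).
    return 0.0 if denom == 0 else 2 * cov / denom


def fuse_mean(rater_predictions):
    """Average per-timestep predictions of several rater-specific models.

    Illustrative stand-in only; the abstract describes learned fusion instead.
    """
    return [fmean(step) for step in zip(*rater_predictions)]


if __name__ == "__main__":
    # Hypothetical toy predictions from two rater-specific models.
    fused = fuse_mean([[0.1, 0.4, 0.7], [0.3, 0.6, 0.9]])
    print(ccc(fused, [0.2, 0.5, 0.8]))
```

Unlike Pearson correlation, the coefficient penalises a shift in mean or scale between prediction and gold standard, so a prediction that tracks the annotation perfectly but with a constant offset scores below 1.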
Published in: 2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)
Date of Conference: 28 September 2021 - 01 October 2021
Date Added to IEEE Xplore: 10 January 2022