A Novel end-to-end Speech Emotion Recognition Network with Stacked Transformer Layers

Abstract:

Speech emotion recognition (SER) aims to automatically recognize the emotional category of a given speech utterance. The performance of an SER system relies heavily on the effectiveness of the global representation expressed at the utterance level. To extract such a global feature effectively, most recent SER architectures adopt a pipeline with two key modules: feature extraction and aggregation. Although various module designs have brought impressive progress, SER remains a challenging task. In contrast to those previous works, we propose a novel strategy for global SER feature extraction that applies an additional enhancement module on top of the current SER pipeline. To verify its effect, we propose an end-to-end SER architecture in which multiple stacked transformer layers are explored to enhance the aggregated global feature. The architecture is evaluated on IEMOCAP, and the results strongly substantiate the effectiveness of our proposal. In terms of weighted accuracy on four emotion categories, the proposed SER system outperforms prior art by a large margin, a relative improvement of 20%. Our code and the pre-trained SER models are made publicly available.
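
The abstract reports results as weighted accuracy and as a relative improvement without defining either. As a minimal sketch, assuming the convention common in SER work (weighted accuracy is the overall fraction of correctly classified utterances; unweighted accuracy is the mean of per-class recalls), the quantities could be computed as follows. The label names, the toy predictions, and the 0.60 and 0.72 scores are hypothetical values chosen for illustration; none of them come from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall, so every class counts equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def relative_improvement(baseline, proposed):
    """Gain of `proposed` over `baseline`, as a fraction of the baseline."""
    return (proposed - baseline) / baseline


# Hypothetical four-class toy data (illustrative labels only).
y_true = ["angry", "happy", "sad", "neutral", "neutral", "neutral", "sad", "happy"]
y_pred = ["angry", "sad", "sad", "neutral", "neutral", "happy", "sad", "happy"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)

# Hypothetical scores: a baseline at 0.60 and a system at 0.72.
gain = relative_improvement(0.60, 0.72)
```

Under this reading, a relative improvement of 20% means the new score is about 1.2 times the baseline score, not 20 percentage points higher.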

Published in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Date of Conference: 06-11 June 2021

Date Added to IEEE Xplore: 13 May 2021

DOI: 10.1109/ICASSP39728.2021.9414314

Conference Location: Toronto, ON, Canada

1. INTRODUCTION

Speech emotion, as one kind of meta-information apart from text, plays an important role in understanding speakers’ psychology and responses. The relevant research area, called speech emotion recognition (SER), aims to automatically recognize the emotional category of a given speech utterance. Since emotions are usually conveyed in a subtle and variable way, it has been challenging to identify emotion embeddings, as representations of an utterance, that can be used to classify emotion categories effectively.
