
Transformer Based Unsupervised Pre-Training for Acoustic Representation Learning


Abstract:

Recently, a variety of acoustic tasks and related applications have emerged. For many of these tasks, the amount of labeled data may be limited. To handle this problem, we propose an unsupervised pre-training method using a Transformer-based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments have been conducted on three kinds of acoustic tasks: speech emotion recognition, sound event detection and speech translation. All the experiments show that pre-training on a task's own training data can significantly improve performance. With a larger pre-training corpus combining the MuST-C, Librispeech and ESC-US datasets, the UAR for speech emotion recognition further improves by an absolute 4.3% on the IEMOCAP dataset. For sound event detection, the F1 score further improves by an absolute 1.5% on the DCASE2018 task5 development set and 2.1% on the evaluation set. For speech translation, the BLEU score further improves by a relative 12.2% on the En-De dataset and 8.4% on the En-Fr dataset.
Date of Conference: 06-11 June 2021
Date Added to IEEE Xplore: 13 May 2021
Conference Location: Toronto, ON, Canada

1. INTRODUCTION

The goal of acoustic representation learning is to transform raw or surface features into high-level features that are more accessible to acoustic tasks [1]. Making acoustic representations more general and robust is critical to improving the performance of acoustic tasks. However, the amount of labeled data for a specific acoustic task may be limited, so the learned representations can be less robust and performance can be vulnerable to unseen data. On the other hand, there exists a wide variety of acoustic tasks, ranging from speaker verification and speech recognition to event and scene detection. With supervised learning, a representation learned for one task may be less suited to another. It is therefore worthwhile to explore how to utilize all kinds of datasets to learn a general and robust representation for all kinds of acoustic tasks.
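The abstract describes pre-training a Transformer-based encoder without labels to obtain such a general representation. The sketch below (PyTorch) illustrates one common realization of that idea, masked-frame reconstruction on log-mel features; the feature dimension, layer sizes, masking probability, and the reconstruction objective itself are illustrative assumptions and not the paper's confirmed configuration.

```python
# Minimal sketch of unsupervised pre-training for an acoustic Transformer
# encoder via masked-frame reconstruction. All hyperparameters here
# (80-dim log-mel input, 4 layers, 15% masking, L2 loss) are assumptions.
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, feat_dim)  # reconstruction head

    def forward(self, x):                   # x: (batch, time, feat_dim)
        h = self.encoder(self.in_proj(x))   # high-level representation
        return self.out_proj(h), h

def pretrain_step(model, feats, optim, mask_prob=0.15):
    """Mask random frames and train the encoder to reconstruct them."""
    mask = torch.rand(feats.shape[:2], device=feats.device) < mask_prob
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    recon, _ = model(corrupted)
    # L2 reconstruction loss computed on the masked frames only.
    loss = ((recon - feats) ** 2)[mask].mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```

After pre-training on unlabeled audio, the representation `h` returned by the encoder would be fed to a task-specific head (e.g., an emotion classifier or translation decoder) and fine-tuned on the limited labeled data.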
