1. INTRODUCTION
End-to-end (E2E) automatic speech recognition (ASR) systems map the speech acoustic signal to a text transcription with a single sequence-to-sequence neural model, without decomposing the problem into separate components such as lexicon modeling, acoustic modeling, and language modeling as in traditional ASR architectures [1]. E2E ASR has received considerable attention because its training and inference procedures are simpler than those of traditional HMM-based systems, which require a hand-crafted pronunciation dictionary and a complex decoding pipeline based on finite state transducers (FSTs) [2]. One of the earliest E2E ASR models is the connectionist temporal classification (CTC) model [3], which maps acoustic frames to output labels independently at each frame. To obtain better results, CTC outputs typically need to be rescored with external language models [4]. The conditional independence assumption in CTC was addressed by the recurrent neural network transducer (RNNT) model [5], [6], which shows better performance in streaming settings. Attention-based encoder-decoder networks yield state-of-the-art results for offline ASR [7], [8], [9]. These networks are trained with sequence-to-sequence and/or CTC losses to learn the true data distribution.
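To make the conditional independence assumption concrete, the following is a minimal sketch of the standard CTC factorization; the notation ($\mathbf{x}$, $\mathbf{y}$, $\mathbf{a}$, $\mathcal{B}$) is assumed here for illustration and is not taken from this paper:
\begin{equation}
  P(\mathbf{y} \mid \mathbf{x})
    = \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y})}
      \prod_{t=1}^{T} P(a_t \mid \mathbf{x})
\end{equation}
Here $\mathbf{a}$ ranges over frame-level alignments (including blanks) that collapse to the label sequence $\mathbf{y}$ under the mapping $\mathcal{B}$. Because each frame label $a_t$ is conditioned only on the acoustics $\mathbf{x}$ and not on previously emitted labels, CTC cannot model label dependencies directly; this is the assumption that RNNT and attention-based models relax by conditioning on the label history.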