1. INTRODUCTION
End-to-end (E2E) automatic speech recognition (ASR) systems have made tremendous progress over the last few years, achieving word error rates (WER) that match or improve upon those of conventional ASR systems on several common benchmarks [1], [2], [3], [4]. Typical E2E systems consist of a single neural network that transforms input audio into a sequence of output tokens, such as characters or word-pieces, that can be readily converted into the final word sequence. Examples of such models include connectionist temporal classification (CTC) [5], attention-based encoder-decoder models [6] such as listen-attend-spell (LAS) [7], the recurrent neural network transducer (RNN-T) [8], and other notable variants [9], [10], [11].
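To make the single-network formulation concrete, the sketch below shows a toy CTC-style E2E model in PyTorch: one network maps audio feature frames directly to per-frame token logits and is trained with the CTC loss [5]. The architecture, feature dimensions, and hyperparameters here are illustrative assumptions, not those of any of the cited systems.

```python
# Minimal sketch (assumed shapes/hyperparameters) of a CTC-based E2E ASR model:
# a single network maps audio features to per-frame token logits.
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    def __init__(self, num_feats=80, hidden=256, vocab=100):
        super().__init__()
        # Encoder: stacked bidirectional LSTM over audio feature frames.
        self.encoder = nn.LSTM(num_feats, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        # Projection to token logits; +1 class for the CTC blank symbol.
        self.out = nn.Linear(2 * hidden, vocab + 1)

    def forward(self, feats):            # feats: (batch, time, num_feats)
        enc, _ = self.encoder(feats)
        return self.out(enc)             # logits: (batch, time, vocab + 1)

model = CTCModel()
loss_fn = nn.CTCLoss(blank=0)

feats = torch.randn(4, 200, 80)            # dummy batch of audio features
targets = torch.randint(1, 101, (4, 30))   # dummy token targets (0 = blank)
# nn.CTCLoss expects log-probabilities shaped (time, batch, classes).
log_probs = model(feats).log_softmax(-1).transpose(0, 1)
loss = loss_fn(log_probs, targets,
               input_lengths=torch.full((4,), 200),
               target_lengths=torch.full((4,), 30))
```

At inference time, a greedy decode of such a model takes the argmax token at each frame, collapses repeated tokens, and removes blanks; the surviving token sequence is then mapped to the final word sequence.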