1. INTRODUCTION
End-to-end (E2E) automatic speech recognition (ASR) systems have made tremendous progress over the last few years, achieving word error rates (WER) that match or improve upon those of conventional ASR systems on several common benchmarks [1], [2], [3], [4]. Typical E2E systems consist of a single neural network that transforms input audio into a sequence of output tokens, such as characters or word-pieces, that can be readily converted into the final word sequence. Examples of such models include connectionist temporal classification (CTC) [5], attention-based encoder-decoder models [6] such as listen-attend-spell (LAS) [7], the recurrent neural network transducer (RNN-T) [8], and other notable variants [9], [10], [11].
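To make the single-network formulation concrete, the sketch below shows a toy CTC-style E2E model in PyTorch: one network maps audio feature frames directly to per-frame token logits and is trained with the CTC loss [5]. The architecture, feature dimensions, and hyperparameters here are illustrative assumptions, not those of any of the cited systems.

```python
# Minimal sketch (assumed shapes/hyperparameters) of a CTC-based E2E ASR model:
# a single network maps audio features to per-frame token logits.
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    def __init__(self, num_feats=80, hidden=256, vocab=100):
        super().__init__()
        # Encoder: stacked bidirectional LSTM over audio feature frames.
        self.encoder = nn.LSTM(num_feats, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        # Projection to token logits; +1 class for the CTC blank symbol.
        self.out = nn.Linear(2 * hidden, vocab + 1)

    def forward(self, feats):            # feats: (batch, time, num_feats)
        enc, _ = self.encoder(feats)
        return self.out(enc)             # logits: (batch, time, vocab + 1)

model = CTCModel()
loss_fn = nn.CTCLoss(blank=0)

feats = torch.randn(4, 200, 80)            # dummy batch of audio features
targets = torch.randint(1, 101, (4, 30))   # dummy token targets (0 = blank)
# nn.CTCLoss expects log-probabilities shaped (time, batch, classes).
log_probs = model(feats).log_softmax(-1).transpose(0, 1)
loss = loss_fn(log_probs, targets,
               input_lengths=torch.full((4,), 200),
               target_lengths=torch.full((4,), 30))
```

At inference time, a greedy decode of such a model takes the argmax token at each frame, collapses repeated tokens, and removes blanks; the surviving token sequence is then mapped to the final word sequence.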