1. INTRODUCTION
End-to-end (E2E) automatic speech recognition (ASR) systems map the speech acoustic signal to a text transcription with a single sequence-to-sequence neural model, without decomposing the problem into separate components such as lexicon modeling, acoustic modeling, and language modeling as in traditional ASR architectures [1]. E2E ASR has received considerable attention because its training and inference procedures are simpler than those of traditional HMM-based systems, which require a hand-crafted pronunciation dictionary and a complex decoding pipeline based on finite state transducers (FSTs) [2]. One of the earliest E2E ASR models is the connectionist temporal classification (CTC) model [3], which maps acoustic frames to output labels independently at each frame. To obtain better results, CTC outputs typically need to be rescored with external language models [4]. The conditional independence assumption in CTC was addressed by the recurrent neural network transducer (RNNT) model [5], [6], which shows better performance in streaming settings. Attention-based encoder-decoder networks yield state-of-the-art results for offline ASR [7], [8], [9]. These networks are trained with sequence-to-sequence and/or CTC losses to learn the true data distribution.
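To make the conditional independence assumption concrete, the following is a minimal sketch of the standard CTC factorization; the notation ($\mathbf{x}$, $\mathbf{y}$, $\mathbf{a}$, $\mathcal{B}$) is assumed here for illustration and is not taken from this paper:
\begin{equation}
  P(\mathbf{y} \mid \mathbf{x})
    = \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y})}
      \prod_{t=1}^{T} P(a_t \mid \mathbf{x})
\end{equation}
Here $\mathbf{a}$ ranges over frame-level alignments (including blanks) that collapse to the label sequence $\mathbf{y}$ under the mapping $\mathcal{B}$. Because each frame label $a_t$ is conditioned only on the acoustics $\mathbf{x}$ and not on previously emitted labels, CTC cannot model label dependencies directly; this is the assumption that RNNT and attention-based models relax by conditioning on the label history.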