1. INTRODUCTION
Neural end-to-end text-to-speech (TTS) synthesis models have made significant progress in generating high-fidelity speech with a simplified pipeline [1]–[4]. Such systems usually adopt an encoder-decoder neural network architecture [5] that maps a given text sequence to a sequence of acoustic features. More recent advances in such models enable the use of crowd-sourced data by disentangling and controlling different attributes such as speaker identity, noise, recording channels, and prosody [6]–[8]. The focus of this paper, prosody, is a collection of attributes including fundamental frequency (F0), energy, and duration [9]. Efforts have been made to model and control these attributes by factorizing latent attributes (e.g., prosody) from observed attributes (e.g., speaker). Although most of these works use utterance-level latent representations that capture the salient features of an utterance [6], [10]–[12], fine-grained prosody aligned with the phone sequence can be captured using techniques recently proposed in [13]. That model provides localized prosody control, achieving greater variability and higher robustness to speaker perturbations.