1. INTRODUCTION
Voice conversion (VC) aims to modify speech uttered by a source speaker so that it sounds as if it were spoken by a target speaker, without changing the linguistic content. The conventional approach usually requires parallel data from the source and target speakers. The parallel data is first aligned by dynamic time warping, and a mapping function is then trained to convert speech from the source to the target speaker. Several statistical conversion models have been proposed, including Gaussian mixture models (GMMs) [1], [2], neural networks [3], [4], [5], [6], and non-negative matrix factorization [7], [8]. Recently, sequence-to-sequence modeling [9], [10], which adopts an encoder-decoder architecture with attention, has been studied for the voice conversion task. This method can convert duration appropriately, which is quite difficult with conventional methods.
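To make the alignment step of the conventional pipeline concrete, the following is a minimal sketch of dynamic time warping over two sequences of feature frames. The function name `dtw_align`, the Euclidean frame distance, and the unconstrained three-way step pattern are assumptions chosen for illustration and are not taken from the cited works; practical systems typically operate on spectral features and may add path constraints.

```python
import math


def dtw_align(source, target):
    """Align two sequences of feature frames with dynamic time warping.

    Returns (total_cost, path), where path is a list of (source_index,
    target_index) pairs. Illustrative sketch only: Euclidean distance and
    an unconstrained step pattern are assumptions.
    """
    n, m = len(source), len(target)
    # cost[i][j]: minimum cumulative distance aligning source[:i] with target[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance source only
                cost[i][j - 1],      # advance target only
            )

    # Backtrack from the end to recover the frame-level alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Toy one-dimensional "frames"; the target holds its middle frame longer.
source_frames = [[0.0], [1.0], [2.0]]
target_frames = [[0.0], [1.0], [1.0], [2.0]]
total_cost, alignment = dtw_align(source_frames, target_frames)

# Aligned frame pairs that a mapping function could then be trained on.
training_pairs = [(source_frames[s], target_frames[t]) for s, t in alignment]
```

The aligned pairs serve as frame-level training examples for the mapping function. Because the alignment fixes which source frame maps to which target frame, the conversion model itself does not learn to change duration, which is the limitation that sequence-to-sequence models address.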