I. Introduction
Voice conversion is a technique for transforming the voice of one speaker so that it is perceived as the voice of another. Many transformation approaches have been proposed, including vector quantization [1]–[3], Gaussian mixture models [4]–[7], pitch-synchronous overlap-add [8], artificial neural networks [9], and multiple functions [10], [11]. All of these techniques share two stages: training and transformation. In the training stage, the voice conversion system gathers information on the voices of the source and target speakers and automatically formulates voice conversion rules. This requires a process called data alignment, in which a relationship between the acoustic parameter spaces of the two speakers is estimated. The transformation stage then applies the mapping obtained during training to modify the source voice so that it matches the characteristics of the target speaker.

Most methods proposed in the literature assume the availability of parallel training sentences, referred to as a text-dependent corpus, for the source and target speakers. In these approaches, the source and target voices can be aligned using, for example, dynamic time warping [4]. For research purposes, the requirement of parallel speech databases is not prohibitive, but from the viewpoint of practical applications it is inconvenient and sometimes hard to fulfill. In some applications it may even be impossible to obtain parallel speech corpora, e.g., in cross-lingual voice conversion, where the source and target speakers speak different languages. To address this problem, text-independent voice conversion techniques that use nonparallel databases have been developed.
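As a concrete illustration of the alignment step, the following is a minimal sketch of dynamic time warping over two parameter sequences. It is not taken from the cited systems: the function name and the use of scalar features are hypothetical simplifications, whereas a real voice conversion system would align multidimensional spectral feature vectors (e.g., cepstral frames) with a vector distance.

```python
def dtw_align(source, target, dist=lambda a, b: abs(a - b)):
    """Return the total alignment cost and the warping path that pairs
    source frame indices with target frame indices."""
    n, m = len(source), len(target)
    INF = float("inf")
    # cost[i][j] = minimal cumulative distance aligning source[:i] with target[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # source frame repeated
                                 cost[i][j - 1],      # target frame repeated
                                 cost[i - 1][j - 1])  # both frames advance
    # Backtrack from (n, m) to recover the frame pairing.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        # Move to the cheapest predecessor cell.
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

cost, path = dtw_align([1, 2, 3], [1, 2, 2, 3])
# The path pairs each source frame with one or more target frames,
# absorbing the timing difference between the two utterances.
```

In a training pipeline, the returned path would be used to collect paired source/target feature vectors, from which the conversion function (e.g., a Gaussian mixture mapping) is estimated.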