Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders
An effective approach to non-parallel voice conversion (VC) is to utilize
deep neural networks (DNNs), specifically variational auto encoders (VAEs), to
model the latent structure of speech in an unsupervised manner. A previous
study has confirmed the effectiveness of VAE using the STRAIGHT spectra for
VC. However, VAE using other types of spectral features such as mel-cepstral
coefficients (MCCs), which are related to human perception and have been
widely used in VC, has not been properly investigated. Instead of using one
specific type of spectral feature, it is expected that VAE may benefit from
using multiple types of spectral features simultaneously, thereby improving
the capability of VAE for VC. To this end, we propose a novel VAE framework
(called cross-domain VAE, CDVAE) for VC. Specifically, the proposed framework
utilizes both STRAIGHT spectra and MCCs by explicitly regularizing multiple
objectives in order to constrain the behavior of the learned encoder and
decoder. Experimental results demonstrate that the proposed CDVAE framework
outperforms the conventional VAE framework in terms of subjective tests.
Comment: Accepted to ISCSLP 201
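The idea of training on two feature types at once can be illustrated with a minimal sketch. The abstract does not list the individual objectives, so everything below is an assumption made for illustration: one encoder and one decoder per feature type, a reconstruction term for every encoder/decoder pairing (within-domain and cross-domain), and a KL term per encoder. The names `cross_domain_loss`, `kl_to_standard_normal`, and the toy lambdas are hypothetical, and the latent mean is used directly instead of a sampled latent to keep the example deterministic.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL divergence from a diagonal Gaussian N(mu, exp(log_var)) to N(0, I)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def mean_squared_error(predicted, target):
    """Average squared difference between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def cross_domain_loss(features, encoders, decoders, speaker):
    """Sum KL terms and all within- and cross-domain reconstruction errors.

    ASSUMPTION: this combination of terms is illustrative only; it is not
    taken from the paper.
    """
    total = 0.0
    for source, encode in encoders.items():
        mu, log_var = encode(features[source])
        total += kl_to_standard_normal(mu, log_var)
        # Decode the same latent code into every feature domain.
        for target, decode in decoders.items():
            total += mean_squared_error(decode(mu, speaker), features[target])
    return total

# Toy stand-ins for one frame of STRAIGHT spectrum and MCC features.
features = {"spectrum": [1.0, 1.0, 1.0, 1.0], "mcc": [1.0, 1.0]}
# Hypothetical encoders: return (latent mean, latent log-variance).
encoders = {
    "spectrum": lambda x: ([sum(x) / len(x)], [0.0]),
    "mcc": lambda x: ([sum(x) / len(x)], [0.0]),
}
# Hypothetical decoders: take a latent code and a speaker code.
decoders = {
    "spectrum": lambda z, spk: [z[0] + spk] * 4,
    "mcc": lambda z, spk: [z[0] + spk] * 2,
}

loss = cross_domain_loss(features, encoders, decoders, speaker=0.0)
print(loss)
```

In a real system the lambdas would be neural networks and the speaker code would be swapped at conversion time; the sketch only shows how a shared latent code can be constrained by several objectives simultaneously.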
Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences
Speaking rate refers to the average number of phonemes within some unit time,
while the rhythmic patterns refer to duration distributions for realizations of
different phonemes within different phonetic structures. Both are key
components of prosody in speech, which is different for different speakers.
Models such as the cycle-consistent adversarial network (Cycle-GAN) and the
variational auto-encoder (VAE) have been successfully applied to voice conversion tasks
without parallel data. However, due to the neural network architectures and
feature vectors chosen for these approaches, the length of the predicted
utterance has to be fixed to that of the input utterance, which limits the
flexibility in mimicking the speaking rates and rhythmic patterns for the
target speaker. On the other hand, sequence-to-sequence learning models have been
used to remove the above length constraint, but parallel training data are needed.
In this paper, we propose an approach utilizing a sequence-to-sequence model
trained with an unsupervised Cycle-GAN to perform the transformation between the
phoneme posteriorgram sequences for different speakers. In this way, the length
constraint mentioned above is removed to offer rhythm-flexible voice conversion
without requiring parallel data. Preliminary evaluation on two datasets showed
very encouraging results.
Comment: 8 pages, 6 figures, Submitted to SLT 201
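The two prosodic quantities defined at the start of this abstract can be made concrete with a small sketch. It assumes a phoneme-level alignment is available as (phoneme, start, end) tuples in seconds; the function names and the toy alignment are hypothetical, and the sketch groups durations by phoneme only, ignoring the surrounding phonetic structure that the abstract also mentions.

```python
from collections import defaultdict

def speaking_rate(segments):
    """Average number of phonemes per second over the aligned span."""
    total_time = segments[-1][2] - segments[0][1]
    return len(segments) / total_time

def durations_by_phoneme(segments):
    """Collect the observed durations (in seconds) for each phoneme label."""
    durations = defaultdict(list)
    for phoneme, start, end in segments:
        durations[phoneme].append(end - start)
    return dict(durations)

# Toy alignment (hypothetical values): phoneme, start time, end time.
segments = [
    ("HH", 0.00, 0.08),
    ("AH", 0.08, 0.20),
    ("L", 0.20, 0.27),
    ("OW", 0.27, 0.50),
]

rate = speaking_rate(segments)
rhythm = durations_by_phoneme(segments)
print(rate, rhythm)
```

A conversion model whose output length is tied to the input length cannot change either quantity, which is the limitation the abstract attributes to earlier Cycle-GAN and VAE approaches.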