Adversarially Trained Autoencoders for Parallel-Data-Free Voice Conversion

Authors: Elibol, Oguz H.; Keskin, Gokce; Ocal, Orhan; Ramchandran, Kannan; Stephenson, Cory; Thomas, Anil
Publication date: 09/05/2019

We present a method for converting voices among a set of speakers. The method trains multiple autoencoder paths that share a single speaker-independent encoder and use a separate speaker-dependent decoder for each speaker. The autoencoders are trained with an additional adversarial loss, provided by an auxiliary classifier, that guides the encoder output toward speaker independence. Training is unsupervised in the sense that it requires neither the same utterances from all speakers nor time alignment over phonemes. Because a single encoder is used, the method can generalize to converting the voice of speakers outside the training set to speakers in the training dataset. We present subjective tests corroborating the performance of our method.
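The architecture described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `VoiceConversionSketch`, the tiny linear layers, the feature dimensions, and the specific loss form (reconstruction error minus a weighted classifier cross-entropy) are all assumptions chosen for illustration, since the abstract does not give these details.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free dense layer given as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class VoiceConversionSketch:
    """Hypothetical toy model: one shared encoder, one decoder per speaker."""

    def __init__(self, n_speakers, feat_dim=4, code_dim=2, seed=0):
        rng = random.Random(seed)
        self.encoder = random_matrix(code_dim, feat_dim, rng)      # shared by all speakers
        self.decoders = [random_matrix(feat_dim, code_dim, rng)    # one per speaker
                         for _ in range(n_speakers)]
        self.classifier = random_matrix(n_speakers, code_dim, rng) # auxiliary speaker classifier

    def encode(self, frame):
        return linear(self.encoder, frame)

    def decode(self, code, speaker):
        return linear(self.decoders[speaker], code)

    def convert(self, frame, target_speaker):
        # No source-speaker identity is needed, which is why a shared
        # encoder can also accept frames from unseen speakers.
        return self.decode(self.encode(frame), target_speaker)

    def losses(self, frame, speaker, adv_weight=1.0):
        code = self.encode(frame)
        recon = self.decode(code, speaker)
        recon_loss = sum((a - b) ** 2 for a, b in zip(recon, frame)) / len(frame)
        probs = softmax(linear(self.classifier, code))
        classifier_loss = -math.log(probs[speaker])  # minimized by the classifier
        # Assumed adversarial form: the autoencoder is rewarded when the
        # classifier fails to identify the speaker from the code.
        autoencoder_loss = recon_loss - adv_weight * classifier_loss
        return autoencoder_loss, classifier_loss


model = VoiceConversionSketch(n_speakers=3)
frame = [0.1, 0.2, 0.3, 0.4]
converted = model.convert(frame, target_speaker=2)
ae_loss, clf_loss = model.losses(frame, speaker=0)
```

In a real training loop the classifier and the autoencoder would be updated in alternation on their respective losses; the sketch only computes the two quantities and omits gradient updates.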

arXiv.org e-Print Archive

Crossref

Nonparallel Emotional Speech Conversion

Authors: Chakraborty, Deep; Gao, Jian; Olaleye, Olaitan; Tembine, Hamidou
Publication venue: International Speech Communication Association
Publication date: 13/04/2020

We propose a nonparallel data-driven emotional speech conversion method. It enables the transfer of emotion-related characteristics of a speech signal while preserving the speaker's identity and linguistic content. Most existing approaches require parallel data and time alignment, which are not available in most real applications. We achieve nonparallel training based on an unsupervised style transfer technique, which learns a translation model between two distributions instead of a deterministic one-to-one mapping between paired examples. The conversion model consists of an encoder and a decoder for each emotion domain. We assume that the speech signal can be decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting the content code of the source speech and the style code of the target emotion, then recombining them. We tested our method on a nonparallel corpus with four emotions. Both subjective and objective evaluations show the effectiveness of our approach.

Comment: Published in INTERSPEECH 2019, 5 pages, 6 figures. Simulation available at http://www.jian-gao.org/emoga
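The extract-and-recombine step described in this abstract can be sketched with a toy stand-in. This is not the authors' learned model: here the "style code" is assumed to be the mean and standard deviation of a one-dimensional feature sequence and the "content code" is the normalized sequence, whereas the paper learns both codes with neural encoders. The function names `encode`, `decode`, and `convert_emotion` are hypothetical.

```python
import statistics


def encode(features):
    """Split a feature sequence into (content_code, style_code).

    Toy assumption: style is the sequence's mean and spread,
    content is what remains after normalizing them away.
    """
    mean = statistics.fmean(features)
    std = statistics.pstdev(features) or 1.0  # avoid dividing by zero
    content = [(v - mean) / std for v in features]
    return content, (mean, std)


def decode(content, style):
    """Rebuild a feature sequence from a content code and a style code."""
    mean, std = style
    return [c * std + mean for c in content]


def convert_emotion(source, target_reference):
    """Keep the source content, take the style from the target emotion."""
    content, _ = encode(source)
    _, target_style = encode(target_reference)
    return decode(content, target_style)


source = [1.0, 2.0, 3.0]             # e.g. a pitch contour in the source emotion
target_reference = [10.0, 20.0, 30.0]  # e.g. a contour in the target emotion
converted = convert_emotion(source, target_reference)
```

The converted sequence keeps the shape of the source contour while adopting the statistics of the target reference, which mirrors, in a very reduced form, the idea of swapping style codes while holding content fixed.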

arXiv.org e-Print Archive

Crossref