55 research outputs found
Adversarially Trained Autoencoders for Parallel-Data-Free Voice Conversion
We present a method for converting voices among a set of speakers. Our
method trains multiple autoencoder paths that share a single
speaker-independent encoder and use multiple speaker-dependent decoders. The
autoencoders are trained with the addition of an adversarial loss, provided
by an auxiliary classifier, that guides the output of the encoder to be
speaker-independent. Training is unsupervised in the sense that it requires
neither collecting the same utterances from the speakers nor time-aligning
over phonemes. Because it uses a single encoder, our method generalizes to
converting the voices of out-of-training speakers to speakers in the
training dataset. We present subjective tests corroborating the performance
of our method.
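The architecture described above (a shared speaker-independent encoder, per-speaker decoders, and an auxiliary speaker classifier supplying an adversarial loss) can be sketched roughly as follows. This is a minimal NumPy illustration only: the linear layers, dimensions, loss weight `lam`, and the specific entropy-based adversarial term are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM, LATENT_DIM, N_SPEAKERS = 40, 16, 3

# Shared, speaker-independent encoder (a single linear map for illustration).
W_enc = rng.standard_normal((LATENT_DIM, FEAT_DIM)) * 0.1
# One speaker-dependent decoder per target speaker.
W_dec = [rng.standard_normal((FEAT_DIM, LATENT_DIM)) * 0.1
         for _ in range(N_SPEAKERS)]
# Auxiliary speaker classifier operating on the latent code.
W_cls = rng.standard_normal((N_SPEAKERS, LATENT_DIM)) * 0.1

def encode(x):
    return np.tanh(W_enc @ x)

def decode(z, speaker):
    return W_dec[speaker] @ z

def classify(z):
    logits = W_cls @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()

def total_loss(x, speaker, lam=0.5):
    """Reconstruction loss plus an adversarial term for the encoder."""
    z = encode(x)
    recon = np.mean((decode(z, speaker) - x) ** 2)
    p = classify(z)
    # One possible adversarial objective: log(N) minus the entropy of the
    # speaker posterior, minimised when the classifier cannot tell the
    # speakers apart, i.e. when the latent code is speaker-independent.
    adv = np.sum(p * np.log(p + 1e-9)) + np.log(N_SPEAKERS)
    return recon + lam * adv

x = rng.standard_normal(FEAT_DIM)   # one frame of speech features
loss = total_loss(x, speaker=0)
```

In a full training loop the classifier would be updated to predict the speaker while the encoder is updated against this adversarial term; only the loss composition is shown here.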
The Zero Resource Speech Challenge 2019: TTS without T
We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text, or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery dataset) and align them to the voice recordings in a way that works best for synthesizing novel utterances from novel speakers, similar to the target speaker's voice. We describe the metrics used for evaluation, a baseline system consisting of unsupervised subword unit discovery plus a standard TTS system, and a topline TTS using gold phoneme transcriptions. We present an overview of the 19 submitted systems from 10 teams and discuss the main results.
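The baseline pipeline described in this abstract (unsupervised subword unit discovery followed by synthesis in the target voice) can be sketched in miniature as follows. The use of k-means over acoustic frames as the unit-discovery step and centroid playback as a stand-in "synthesizer" are illustrative assumptions only, not the challenge's actual baseline implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(frames, k, iters=20):
    """Toy unsupervised unit discovery: cluster acoustic frames into k units."""
    centroids = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid (its discovered unit).
        d = np.linalg.norm(frames[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of the frames assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = frames[labels == j].mean(axis=0)
    return centroids, labels

# Stand-in for the Unit Discovery dataset: random 13-dim feature "frames".
unit_frames = rng.standard_normal((200, 13))
centroids, _ = kmeans(unit_frames, k=8)

def transcribe(frames, centroids):
    """Encode a novel utterance as a sequence of discovered unit IDs."""
    d = np.linalg.norm(frames[:, None] - centroids[None], axis=-1)
    return d.argmin(axis=1)

def synthesize(unit_ids, voice_centroids):
    """Toy 'TTS': emit the target voice's centroid frame for each unit ID."""
    return voice_centroids[unit_ids]

novel = rng.standard_normal((50, 13))       # utterance from a novel speaker
unit_ids = transcribe(novel, centroids)
output_frames = synthesize(unit_ids, centroids)
```

The point of the sketch is the two-stage structure: no text or labels are used anywhere; the unit inventory is induced from audio alone and then drives synthesis.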
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.
- …