Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders
Unsupervised representation learning of speech has attracted keen interest in
recent years, as is evident, for example, in the wide interest in the
ZeroSpeech challenges. This work presents a new method for learning frame-level
representations based on WaveNet auto-encoders. Of particular interest in the
ZeroSpeech Challenge 2019 were models with discrete latent variables such as the
Vector Quantized Variational Auto-Encoder (VQVAE). However, these models
generate speech with relatively poor quality. In this work we aim to address
this with two approaches: first, WaveNet is used as the decoder to generate
waveform data directly from the latent representation; second, the low
complexity of latent representations is improved with two alternative
disentanglement learning methods, namely instance normalization and sliced
vector quantization. The method was developed and tested in the context of the
recent ZeroSpeech challenge 2020. The system output submitted to the challenge
obtained the top position for naturalness (Mean Opinion Score 4.06), top
position for intelligibility (Character Error Rate 0.15), and third position
for the quality of the representation (ABX test score 12.5). These results and
further analysis in this paper illustrate that the quality of the converted
speech and the acoustic unit representation can be well balanced.
Comment: To be presented in Interspeech 202
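The two building blocks named in the abstract above, instance normalization and vector quantization, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy latents, and the two-entry codebook are hypothetical, and plain nearest-neighbour quantization is used here rather than the paper's sliced variant, whose details the abstract does not give.

```python
import math


def instance_normalize(frames, eps=1e-5):
    """Normalize each feature channel over time within a single utterance.

    frames: list of frames, each a list of floats (time x channels).
    """
    n_frames = len(frames)
    n_channels = len(frames[0])
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for c in range(n_channels):
        column = [frame[c] for frame in frames]
        mean = sum(column) / n_frames
        var = sum((x - mean) ** 2 for x in column) / n_frames
        scale = math.sqrt(var + eps)
        for t in range(n_frames):
            out[t][c] = (frames[t][c] - mean) / scale
    return out


def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector (squared L2)."""
    indices = []
    for frame in frames:
        distances = [
            sum((x - y) ** 2 for x, y in zip(frame, code)) for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


# Hypothetical toy data: 4 frames, 2 channels with very different scales.
latents = [[1.0, 10.0], [3.0, 30.0], [1.0, 10.0], [3.0, 30.0]]
codebook = [[-1.0, -1.0], [1.0, 1.0]]

normalized = instance_normalize(latents)
units = quantize(normalized, codebook)
print(units)
```

Normalizing per utterance removes channel-wise offset and scale before the frames are mapped to discrete units, which is one plausible reading of why such a step could help separate utterance-level characteristics from unit identity.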
Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling
This study addresses unsupervised subword modeling, i.e., learning feature
representations that can distinguish subword units of a language. The proposed
approach adopts a two-stage bottleneck feature (BNF) learning framework,
consisting of autoregressive predictive coding (APC) as a front-end and a
DNN-BNF model as a back-end. APC pretrained features are set as input features
to a DNN-BNF model. A language-mismatched ASR system is used to provide
cross-lingual phone labels for DNN-BNF model training. Finally, BNFs are
extracted as the subword-discriminative feature representation. A second aim of
this work is to investigate how robust the approach is to
different amounts of training data. The results on Libri-light and the
ZeroSpeech 2017 databases show that APC is effective in front-end feature
pretraining. Our whole system outperforms the state of the art on both
databases. Cross-lingual phone labels for English data produced by a Dutch ASR
system outperform those produced by a Mandarin ASR system, possibly because
Dutch is more similar to English than Mandarin is. Our system is less sensitive
to the amount of training data when more than 50 hours are available. APC pretraining
leads to a reduction of needed training material from over 5,000 hours to
around 200 hours with little performance degradation.
Comment: 5 pages, 3 figures. Accepted for publication in INTERSPEECH 2020,
Shanghai, Chin
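The front-end idea in the abstract above, autoregressive predictive coding, can be sketched as a loss computation: a model sees the frames up to time t and is scored on how well it predicts the frame a fixed number of steps ahead. The sketch below is an illustration under assumptions, not the paper's system: the L1 distance, the `shift` parameter, and the `copy_last_frame` stand-in predictor are all chosen here for illustration, whereas a real APC model would use a trained autoregressive network.

```python
def apc_l1_loss(frames, predict, shift=3):
    """Average L1 error when predicting the frame `shift` steps ahead.

    frames:  list of feature frames (each a list of floats).
    predict: callable taking the history frames[:t + 1] and returning a
             predicted frame (hypothetical stand-in for a trained model).
    """
    if len(frames) <= shift:
        raise ValueError("need more frames than the prediction shift")
    total = 0.0
    count = 0
    for t in range(len(frames) - shift):
        history = frames[: t + 1]      # frames the model is allowed to see
        predicted = predict(history)   # guess for the frame at t + shift
        target = frames[t + shift]
        total += sum(abs(p - y) for p, y in zip(predicted, target))
        count += 1
    return total / count


def copy_last_frame(history):
    """Trivial baseline predictor: repeat the most recent frame."""
    return history[-1]


# Hypothetical toy sequence: 5 frames with 2 features each.
frames = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loss = apc_l1_loss(frames, copy_last_frame, shift=2)
print(loss)
```

A larger shift makes the copy-last-frame baseline worse, which is the intuition behind predicting several steps ahead: the model has to capture longer-range structure instead of exploiting local smoothness.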
The Zero Resource Speech Challenge 2019: TTS without T
We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery dataset) and align them to the voice recordings in a way that works best for the purpose of synthesizing novel utterances from novel speakers, similar to the target speaker's voice. We describe the metrics used for evaluation, a baseline system consisting of unsupervised subword unit discovery plus a standard TTS system, and a topline TTS using gold phoneme transcriptions. We present an overview of the 19 submitted systems from 10 teams and discuss the main results.