GenerTTS: Pronunciation Disentanglement for Timbre and Style
  Generalization in Cross-Lingual Text-to-Speech

Cong, Yahuan; Lin, Haopeng; Liu, Shichao; Ma, Zejun; Ren, Yi; Wang, Chunfeng; Yin, Xiang; Zhang, Haoyu

Publication date: 27 June 2023

Abstract

Cross-lingual timbre- and style-generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference timbre or style that was never seen in the target language during training. It faces two challenges: 1) timbre and pronunciation are correlated, because multilingual speech from a single speaker is usually hard to obtain; 2) style and pronunciation are entangled, because speech style contains both language-agnostic and language-specific parts. To address these challenges, we propose GenerTTS, which makes two main contributions: 1) we design a HuBERT-based information bottleneck to disentangle timbre from pronunciation/style; 2) we minimize the mutual information between style and language to discard the language-specific information in the style embedding. Experiments indicate that GenerTTS outperforms baseline systems in style similarity and pronunciation accuracy, and enables cross-lingual timbre and style generalization.

Comment: Accepted by INTERSPEECH 202
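
The abstract does not say how the mutual information between style and language is estimated, so the following is only a toy illustration of the quantity itself, not the authors' method. It assumes style has been reduced to discrete codes (a hypothetical simplification; the paper works with a style embedding) and computes a plug-in estimate of I(S; L) from paired samples. The function name and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(S; L) in nats from (style_code, language) pairs."""
    n = len(pairs)
    joint = Counter(pairs)                   # counts of each (style, language) pair
    style = Counter(s for s, _ in pairs)     # marginal counts of style codes
    lang = Counter(l for _, l in pairs)      # marginal counts of languages
    mi = 0.0
    for (s, l), c in joint.items():
        # p(s, l) * log( p(s, l) / (p(s) * p(l)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (style[s] * lang[l]))
    return mi


# Hypothetical toy data: integer style codes paired with language tags.
leaky = [(0, "en"), (0, "en"), (1, "zh"), (1, "zh")]  # code reveals the language
clean = [(0, "en"), (1, "en"), (0, "zh"), (1, "zh")]  # code independent of language

print(mutual_information(leaky))  # log(2): the style code fully identifies the language
print(mutual_information(clean))  # 0.0: no language information in the style code
```

In this toy setting, driving the estimate toward zero corresponds to the stated goal of discarding language-specific information from the style representation. A continuous embedding would need a different, learned estimator.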

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2306.15304

Last time updated on 02/07/2023