Cross-lingual timbre- and style-generalizable text-to-speech (TTS) aims to
synthesize speech in a target language with a reference timbre or style that
was never seen in that language during training. It faces two challenges: 1)
timbre and pronunciation are correlated since multilingual speech of a specific
speaker is usually hard to obtain; 2) style and pronunciation are mixed because
the speech style contains language-agnostic and language-specific parts. To
address these challenges, we propose GenerTTS, which makes two main
contributions: 1) we design a HuBERT-based information bottleneck
to disentangle timbre from pronunciation and style; 2) we minimize the mutual
information between style and language to discard the language-specific
information in the style embedding. Experiments indicate that GenerTTS
outperforms baseline systems in terms of style similarity and pronunciation
accuracy, and enables cross-lingual timbre and style generalization.

Comment: Accepted by INTERSPEECH 202