14 research outputs found
GenTranslate : Large Language Models are Generative Multilingual Speech and Machine Translators
17 pages. This work is open sourced at: https://github.com/YUCHEN005/GenTranslat
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech
This paper proposes Virtuoso, a massively multilingual speech-text joint
semi-supervised learning framework for text-to-speech synthesis (TTS) models.
Existing multilingual TTS typically supports tens of languages, which are a
small fraction of the thousands of languages in the world. One difficulty to
scale multilingual TTS to hundreds of languages is collecting high-quality
speech-text paired data in low-resource languages. This study extends Maestro,
a speech-text joint pretraining framework for automatic speech recognition
(ASR), to speech generation tasks. To train a TTS model from various types of
speech and text data, different training schemes are designed to handle
supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and
unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS
models trained on Virtuoso can achieve significantly better naturalness and
intelligibility than baseline ones in seen languages, and 2) they can
synthesize reasonably intelligible and naturally sounding speech for unseen
languages where no high-quality paired TTS data is available.Comment: Submitted to ICASSP 202