3 research outputs found
Translatotron 3: Speech to Speech Translation with Monolingual Data
This paper presents Translatotron 3, a novel approach to training a direct
speech-to-speech translation model from monolingual speech-text datasets only
in a fully unsupervised manner. Translatotron 3 combines a masked autoencoder,
unsupervised embedding mapping, and back-translation to achieve this goal.
Experimental results in speech-to-speech translation tasks between Spanish and
English show that Translatotron 3 outperforms a baseline cascade system,
reporting 18.14 BLEU points improvement on the synthesized
Unpaired-Conversational dataset. In contrast to supervised approaches that
necessitate real paired data, which is unavailable, or specialized modeling to
replicate para-/non-linguistic information, Translatotron 3 showcases its
capability to retain para-/non-linguistic information such as pauses, speaking
rates, and speaker identity. Audio samples can be found on our website
http://google-research.github.io/lingvo-lab/translatotron
LMs with a Voice: Spoken Language Modeling beyond Speech Tokens
We present SPECTRON, a novel approach to adapting pre-trained language models
(LMs) to perform speech continuation. By leveraging pre-trained speech
encoders, our model generates both text and speech outputs, with the entire
system trained end-to-end and operating directly on spectrograms. Training
the entire model in the spectrogram domain simplifies our speech continuation
system versus existing cascade methods which use discrete speech
representations. We further show our method surpasses existing spoken language
models both in semantic content and speaker preservation while also benefiting
from the knowledge transferred from pre-existing models. Audio samples can be
found on our website https://michelleramanovich.github.io/spectron/spectro