Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
Direct speech-to-speech translation (S2ST) with discrete self-supervised
representations has achieved remarkable accuracy, but is unable to preserve the
speaker timbre of the source speech during translation. Meanwhile, the scarcity
of high-quality speaker-parallel data poses a challenge for learning style
transfer between source and target speech. We propose an S2ST framework that
performs style transfer with an acoustic language model built on discrete units
from a self-supervised model and a neural codec. The acoustic language model
leverages self-supervised in-context learning, acquiring the ability to
transfer style without relying on any speaker-parallel data and thereby
overcoming the data-scarcity issue. By using extensive training data, our model
achieves zero-shot
cross-lingual style transfer on previously unseen source languages. Experiments
show that our model generates translated speech with high fidelity and style
similarity. Audio samples are available at http://stylelm.github.io/.
Comment: 5 pages, 1 figure. Submitted to ICASSP 2024.
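
The sketch below illustrates the pipeline the abstract describes: translated
semantic units supply the content, codec tokens of the source speech serve as
an in-context style prompt, and an acoustic language model continues that
prompt so the output inherits the source timbre. Every name here
(SemanticTranslator, NeuralCodec, AcousticLM, token counts, vocabulary sizes)
is a hypothetical stand-in for exposition, not the authors' actual API or
architecture.

```python
# Minimal, hypothetical sketch of prompt-based style transfer for S2ST.
# All components are illustrative stubs, not the paper's implementation.
import random
from typing import List

class SemanticTranslator:
    """Stand-in for the direct S2ST front end: source speech ->
    discrete self-supervised units in the target language."""
    def translate(self, wave: List[float]) -> List[int]:
        # Real systems predict HuBERT-style unit sequences; we fake one.
        return [random.randrange(500) for _ in range(len(wave) // 320)]

class NeuralCodec:
    """Stand-in for a neural audio codec (tokenizer and decoder)."""
    def encode(self, wave: List[float]) -> List[int]:
        return [random.randrange(1024) for _ in range(len(wave) // 320)]
    def decode(self, tokens: List[int]) -> List[float]:
        return [t / 1024.0 for t in tokens]

class AcousticLM:
    """Stand-in for the acoustic language model. Conditioned on a style
    prompt (codec tokens of the source speech) plus the translated
    semantic units, it continues the prompt, so generated codec tokens
    inherit the prompt speaker's timbre -- in-context style transfer,
    with no speaker-parallel data involved."""
    def generate(self, prompt: List[int], units: List[int]) -> List[int]:
        context = prompt + units  # [style prompt ; content units]
        # Dummy deterministic continuation standing in for AR sampling.
        return [hash((i, tuple(context[-4:]))) % 1024
                for i, _ in enumerate(units)]

def styled_s2st(source_wave: List[float]) -> List[float]:
    translator, codec, lm = SemanticTranslator(), NeuralCodec(), AcousticLM()
    units = translator.translate(source_wave)  # what to say (translated)
    prompt = codec.encode(source_wave)         # how to say it (source timbre)
    acoustic = lm.generate(prompt, units)      # content in the source style
    return codec.decode(acoustic)              # codec tokens -> waveform

if __name__ == "__main__":
    fake_wave = [0.0] * 16000  # one second of "audio" at 16 kHz
    out = styled_s2st(fake_wave)
    print(f"generated {len(out)} output samples")
```

Because the style prompt comes from the source utterance itself rather than a
paired target-speaker recording, this setup needs no speaker-parallel data,
which is the data-scarcity point the abstract makes.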