Expressive speech-to-speech translation (S2ST) is a key research topic in
seamless communication, aiming to preserve both the semantics and the
speaker's vocal style in translated speech. Early works synthesized
style-aligned speech in order to learn a direct mapping from source speech to
the target speech spectrogram. Dispensing with style-aligned data, recent
studies leverage advances in language modeling (LM) and build cascaded LMs
over semantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single
speech language model for expressive S2ST. We decompose the complex
source-to-target speech mapping into intermediate generation steps with
chain-of-thought prompting. The model is first guided to translate the target
semantic content and then to transfer the speaker's vocal style to
multi-stream acoustic units. Evaluated on Spanish-to-English and
Hungarian-to-English translations,
SeamlessExpressiveLM outperforms cascaded LMs in both semantic quality and
style transfer, while also achieving better parameter efficiency.
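
To illustrate the chain-of-thought decomposition described above, the
following is a minimal PyTorch sketch, not the paper's implementation: all
names (TinySpeechLM, greedy_extend, the SEP token, and the token-id ranges)
are hypothetical, and the multi-stream acoustic units are flattened into a
single stream for brevity. The point it shows is that one decoder-only LM
handles both intermediate steps: it first extends the prompt with target
semantic tokens, and then, conditioned on the speaker's acoustic prompt,
emits target acoustic units.

```python
# Hypothetical sketch of single-LM chain-of-thought S2ST decoding.
import torch
import torch.nn as nn


class TinySpeechLM(nn.Module):
    """Decoder-only LM over a shared vocabulary of semantic + acoustic units."""

    def __init__(self, vocab_size=2048, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len) of unit ids
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.lm_head(self.backbone(self.embed(tokens), mask=causal))


@torch.no_grad()
def greedy_extend(model, seq, n_new):
    """Append n_new greedily decoded tokens to the running sequence."""
    for _ in range(n_new):
        next_tok = model(seq)[:, -1, :].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq


model = TinySpeechLM().eval()
SEP = torch.tensor([[0]])  # hypothetical separator id

src_semantic = torch.randint(1, 1024, (1, 20))     # stand-in source semantic units
src_acoustic = torch.randint(1024, 2048, (1, 30))  # stand-in speaker style prompt

# Step 1 (chain of thought): translate semantics before touching acoustics.
seq = greedy_extend(model, torch.cat([src_semantic, SEP], dim=1), n_new=20)

# Step 2: condition on the acoustic prompt and emit target acoustic units,
# transferring the speaker's vocal style within the same model.
seq = torch.cat([seq, SEP, src_acoustic, SEP], dim=1)
target_units = greedy_extend(model, seq, n_new=60)
```

A cascaded approach would train a separate LM for each step; here the same
parameters serve both stages, which is where the parameter-efficiency gain
over cascaded LMs comes from.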