SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Chen, Sanyuan; Chen, Zhuo; Eskimez, Sefik Emre; Kanda, Naoyuki; Li, Jinyu; Liu, Shujie; Tang, Min; Thakker, Manthan; Wang, Xiaofei; Yoshioka, Takuya

Authors: Sanyuan Chen, Zhuo Chen, Sefik Emre Eskimez, Naoyuki Kanda, Jinyu Li, Shujie Liu, Min Tang, Manthan Thakker, Xiaofei Wang, Takuya Yoshioka
Publication date: 13 August 2023
Abstract

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way to leverage textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2308.06873

Last time updated on 18/08/2023