CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation
Better disentanglement of speech representations is essential to improve the quality of voice conversion. Recently, contrastive learning has been applied successfully to voice conversion based on speaker labels. However, model performance degrades when converting between similar speakers. Hence, we propose an augmented negative-sample selection strategy to address this issue. Specifically, we create hard negative samples with the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, for fine-grained modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct the augmented contrastive learning on global style. Experimental results show that the proposed method outperforms previous work on voice conversion tasks.
Comment: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)
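The hard-negative idea above can be sketched with a plain InfoNCE-style contrastive loss. The `fuse_speakers` interpolation below is an assumption standing in for the paper's speaker fusion module, not the authors' implementation:

```python
import numpy as np

def l2norm(x):
    """Normalize embeddings to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse_speakers(anchor_emb, other_emb, alpha=0.5):
    # Hypothetical "speaker fusion": interpolate two speaker embeddings
    # to synthesize a hard negative that lies close to the anchor speaker.
    return l2norm(alpha * anchor_emb + (1 - alpha) * other_emb)

def info_nce(anchor, positive, negatives, tau=0.1):
    # Standard InfoNCE: pull the anchor toward the positive sample and
    # push it away from the negatives; tau is the temperature.
    anchor, positive = l2norm(anchor), l2norm(positive)
    negatives = l2norm(negatives)
    pos = np.exp(anchor @ positive / tau)
    neg = np.exp(negatives @ anchor / tau).sum()
    return -np.log(pos / (pos + neg))
```

Because the fused negative is geometrically closer to the anchor than the raw negative, it yields a larger loss and thus a stronger training signal for the speaker encoder.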
Controllable Accented Text-to-Speech Synthesis
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). Accented TTS synthesis is challenging because L2 differs from L1 in terms of both phonetic rendering and prosody pattern. Furthermore, there is no easy way to control the accent intensity of an utterance. In this work, we propose a neural TTS architecture that allows us to control the accent and its intensity during inference. This is achieved through three novel mechanisms: 1) an accent variance adaptor to model the complex accent variance with three prosody-controlling factors, namely pitch, energy, and duration; 2) an accent intensity modeling strategy to quantify the accent intensity; 3) a consistency constraint module to encourage the TTS system to render the expected accent intensity at a fine level. Experiments show that the proposed system outperforms the baseline models in terms of accent rendering and intensity control. To the best of our knowledge, this is the first study of accented TTS synthesis with explicit intensity control.
Comment: To be submitted for possible journal publication
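One simple way to realize explicit intensity control is to interpolate per-phoneme prosody between the native (L1) and accented (L2) predictions. This is a minimal sketch under that assumption, not the paper's exact intensity modeling strategy:

```python
import numpy as np

def render_prosody(l1_prosody, l2_prosody, intensity):
    # Hypothetical intensity control: linearly interpolate per-phoneme
    # prosody features (e.g. rows of pitch, energy, duration) between
    # the L1 and L2 predictions. intensity=0 -> native, intensity=1 -> fully accented.
    intensity = np.clip(intensity, 0.0, 1.0)
    return (1.0 - intensity) * l1_prosody + intensity * l2_prosody
```

Intermediate values of `intensity` then produce speech whose pitch, energy, and duration lie between the two accent renderings.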
Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time or parameters. Previous studies usually use a speaker encoder to extract a global fixed speaker embedding from reference speech, and several attempts have explored variable-length speaker embeddings. However, these approaches neglect to transfer the personal pronunciation characteristics related to phoneme content, leading to poor speaker similarity in terms of detailed speaking styles and pronunciation habits. To improve the speaker encoder's ability to model personal pronunciation characteristics, we propose content-dependent fine-grained speaker embeddings for zero-shot speaker adaptation. Local content embeddings and speaker embeddings are first extracted from the reference speech. Instead of modeling their temporal relations, a reference attention module is introduced to model the content relevance between the reference speech and the input text, and to generate a fine-grained speaker embedding for each phoneme encoder output. Experimental results show that our proposed method improves the speaker similarity of synthesized speech, especially for unseen speakers.
Comment: Submitted to Interspeech 202
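The reference attention module described above can be illustrated as ordinary scaled dot-product cross-attention, with phoneme encoder outputs as queries and the reference speech's local content and speaker embeddings as keys and values. This is a sketch under assumed dimensions and scaling, not the authors' exact module:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(phoneme_queries, content_keys, speaker_values):
    # Cross-attention: each phoneme encoder output (query) attends over
    # the local content embeddings of the reference speech (keys) and
    # pools the matching local speaker embeddings (values), producing a
    # content-dependent fine-grained speaker embedding per phoneme.
    d = phoneme_queries.shape[-1]
    weights = softmax(phoneme_queries @ content_keys.T / np.sqrt(d), axis=-1)
    return weights @ speaker_values
```

Because the attention weights depend on content similarity rather than time alignment, a phoneme can pool speaker information from any frame of the reference speech that carries similar content.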