Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis
Recent advances in text-to-speech have significantly improved the expressiveness of synthesized speech. However, it remains challenging to generate speech with a contextually appropriate and coherent speaking style for multi-sentence text in audiobooks. In this paper, we propose a context-aware coherent speaking style prediction method for audiobook speech synthesis. To predict the style embedding of the current utterance, we design a hierarchical transformer-based context-aware style predictor with a mixture attention mask, which considers both text-side context information and speech-side style information from preceding utterances. Based on this, we can generate long-form speech with coherent style and prosody, sentence by sentence. Objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that our proposed model generates speech with a more expressive and coherent speaking style than the baselines, in both single-sentence and multi-sentence tests.
Comment: Accepted by ICASSP 202
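As a rough illustration of the mechanism this abstract describes, the following is a minimal PyTorch sketch of a hierarchical, context-aware style predictor in which text-side context tokens and speech-side style tokens share one transformer under a mixture attention mask. Every name, shape, and hyperparameter here (ContextAwareStylePredictor, d_model=256, placing the current utterance at position 0) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a hierarchical transformer that
# predicts the current utterance's style embedding from text-side context
# and the style embeddings of previously synthesized utterances.
import torch
import torch.nn as nn

class ContextAwareStylePredictor(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        text_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Lower level: encode sentence embeddings of the textual context.
        self.text_encoder = nn.TransformerEncoder(text_layer, n_layers)
        mix_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Upper level: jointly attend over text and speech-style tokens.
        self.mix_encoder = nn.TransformerEncoder(mix_layer, n_layers)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, text_ctx, style_ctx, mix_mask=None):
        # text_ctx:  (B, N, d) sentence embeddings, current sentence first
        # style_ctx: (B, M, d) style embeddings of preceding utterances
        h_text = self.text_encoder(text_ctx)
        tokens = torch.cat([h_text, style_ctx], dim=1)        # (B, N+M, d)
        # mix_mask stands in for the paper's "mixture attention mask": an
        # (N+M, N+M) boolean mask controlling which text/style positions
        # each token may attend to.
        h = self.mix_encoder(tokens, mask=mix_mask)
        return self.out(h[:, 0])  # predicted style of the current utterance
```

In such a setup, the predicted style embedding for sentence i would condition the acoustic model, and the realized style of sentence i would be appended to style_ctx before predicting sentence i+1, which is what yields the sentence-by-sentence coherence the abstract claims.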
StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis
The expressive quality of synthesized speech for audiobooks is limited by generalized model architectures and the unbalanced style distribution of the training data. To address these issues, in this paper we propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis. Firstly, a text style encoder is pre-trained on a large amount of unlabeled text-only data. Secondly, a spectrogram style extractor based on VQ-VAE is pre-trained in a self-supervised manner on abundant audio data covering complex style variations. Then a novel architecture with two encoder-decoder paths is specially designed to model pronunciation and high-level style expressiveness respectively, under the guidance of the style extractor. Both objective and subjective evaluations demonstrate that our proposed method effectively improves the naturalness and expressiveness of the synthesized speech in audiobook synthesis, especially for character-role and out-of-domain scenarios.
Comment: Accepted to ICASSP 202
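Since the second pre-training stage follows standard VQ-VAE practice, a compact PyTorch sketch of a self-supervised spectrogram style extractor may help. The GRU encoder/decoder, codebook size, and straight-through loss below are generic VQ-VAE conventions; none of these choices are taken from the paper.

```python
# Minimal sketch of VQ-VAE pre-training for a spectrogram style extractor,
# using generic conventions (GRU encoder/decoder, straight-through
# estimator); all hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQStyleExtractor(nn.Module):
    def __init__(self, n_mels=80, d_style=128, codebook_size=256, beta=0.25):
        super().__init__()
        self.encoder = nn.GRU(n_mels, d_style, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, d_style)
        self.decoder = nn.GRU(d_style, d_style, batch_first=True)
        self.proj = nn.Linear(d_style, n_mels)
        self.beta = beta

    def forward(self, mel):                       # mel: (B, T, n_mels)
        z_e, _ = self.encoder(mel)                # continuous style latents
        # Quantize each frame to its nearest codeword.
        dist = torch.cdist(z_e, self.codebook.weight.unsqueeze(0))
        z_q = self.codebook(dist.argmin(-1))
        # Straight-through estimator: copy gradients past the quantizer.
        z_st = z_e + (z_q - z_e).detach()
        h, _ = self.decoder(z_st)
        recon = self.proj(h)
        loss = (F.mse_loss(recon, mel)                        # reconstruction
                + F.mse_loss(z_q, z_e.detach())               # codebook update
                + self.beta * F.mse_loss(z_e, z_q.detach()))  # commitment
        return z_st, loss
```

After pre-training, the quantized latents z_st could serve as the style targets that steer the two-path synthesis architecture, consistent with the abstract's "guidance of the style extractor".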
Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis
This paper proposes an expressive speech synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into separate representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases to capture diverse prosodic representations. A trade-off therefore arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted in the previous step. Both qualitative and quantitative evaluations demonstrate the effectiveness of the proposed approach. Audio samples are available on our demo page.
Comment: Submitted to ICASSP 202
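To make the two-stage idea concrete, here is a small PyTorch sketch of a prior network that pools a text encoding into a single utterance-level vector and uses it to regress the phoneme-level posterior latents from the first stage. All names and dimensions (PriorNetwork, d_utt=64, mean pooling, the L2 objective) are illustrative assumptions, not the paper's design.

```python
# Minimal sketch of the second stage: a prior network that, given text,
# forms an utterance-level vector and predicts the phoneme-level posterior
# latents extracted in the first stage. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PriorNetwork(nn.Module):
    def __init__(self, d_text=256, d_utt=64, d_latent=16):
        super().__init__()
        self.text_encoder = nn.GRU(d_text, d_text, batch_first=True)
        # Bottleneck: one utterance-level representation per input text.
        self.utt_proj = nn.Linear(d_text, d_utt)
        # Regress one latent per phoneme from text state + utterance vector.
        self.head = nn.Linear(d_text + d_utt, d_latent)

    def forward(self, phoneme_emb):               # (B, T, d_text)
        h, _ = self.text_encoder(phoneme_emb)
        utt = self.utt_proj(h.mean(dim=1))        # (B, d_utt)
        utt_exp = utt.unsqueeze(1).expand(-1, h.size(1), -1)
        return self.head(torch.cat([h, utt_exp], dim=-1))    # (B, T, d_latent)

# Fit the prior against the (detached) posterior latents z_post from stage 1:
#   loss = F.mse_loss(prior(phoneme_emb), z_post.detach())
```

Because the prior is trained separately against frozen posterior latents, the utterance-level bottleneck can be made informative without hurting the disentanglement of the first-stage token-level space, which is exactly the trade-off the abstract identifies.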