Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model

Lam, Max W. Y.; Li, Xiang; Liu, Songxiang; Meng, Helen; Weng, Chao; Wu, Zhiyong

Authors: Max W. Y. Lam, Xiang Li, Songxiang Liu, Helen Meng, Chao Weng, Zhiyong Wu
Publication date: 7 October 2023

Abstract

Expressive human speech abounds with rich and flexible prosodic variation. The prosody predictors in existing expressive speech synthesis methods mostly produce deterministic predictions, learned by directly minimizing the norm of the prosody prediction error. The unimodal nature of such predictions leads to a mismatch with the ground-truth distribution and harms the model's ability to make diverse predictions. We therefore propose a novel prosody predictor based on the denoising diffusion probabilistic model, taking advantage of its high-quality generative modeling and training stability. Experimental results confirm that the proposed prosody predictor outperforms the deterministic baseline in both the expressiveness and the diversity of its predictions, with even fewer network parameters.

Comment: Proceedings of Interspeech 2023 (doi: 10.21437/Interspeech.2023-715), demo site at https://thuhcsi.github.io/interspeech2023-DiffVar

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2305.16749

Last updated on 29/05/2023