Expressive human speech generally abounds with rich and flexible speech
prosody variations. The speech prosody predictors in existing expressive speech
synthesis methods mostly produce deterministic predictions, which are learned
by directly minimizing the norm of prosody prediction error. Its unimodal
nature leads to a mismatch with ground truth distribution and harms the model's
ability in making diverse predictions. Thus, we propose a novel prosody
predictor based on the denoising diffusion probabilistic model to take
advantage of its high-quality generative modeling and training stability.
Experiment results confirm that the proposed prosody predictor outperforms the
deterministic baseline on both the expressiveness and diversity of prediction
results with even fewer network parameters.Comment: Proceedings of Interspeech 2023 (doi: 10.21437/Interspeech.2023-715),
demo site at https://thuhcsi.github.io/interspeech2023-DiffVar