This paper proposes an expressive singing voice synthesis system that
introduces explicit vibrato modeling and a latent energy representation.
Because vibrato is an inherent characteristic of human singing, it is essential
to the naturalness of synthesized singing. Hence, a deep learning-based vibrato model is
introduced to control the likeliness, rate, depth, and phase of vibrato in
singing. The vibrato likeliness is the probability that vibrato is present, and
modeling it helps improve the naturalness of the singing voice. Because
existing singing corpora contain no annotations for vibrato likeliness, we
adopt a novel labeling method that produces these labels automatically.
Meanwhile, the power spectrogram of
audio contains rich information that can improve the expressiveness of singing.
We therefore propose an autoencoder-based latent energy bottleneck feature for
expressive singing voice synthesis. Experimental results on the open dataset
NUS48E show that both the vibrato modeling and the latent energy representation
significantly improve the expressiveness of the singing voice. Audio samples
are available on the demo website
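
To illustrate how the four vibrato parameters named above could interact, the
following is a minimal sketch and not the model proposed in this paper. It
assumes that vibrato can be approximated as a sinusoidal pitch deviation whose
amplitude is gated by the likeliness. The function name `apply_vibrato`, the
100 Hz frame rate, and the choice of cents as the unit of depth are assumptions
made for this example only.

```python
import numpy as np


def apply_vibrato(f0_hz, likeliness, rate_hz, depth_cents, phase_rad,
                  frame_rate_hz=100.0):
    """Modulate a frame-level F0 contour with a sinusoidal vibrato.

    Hypothetical helper for illustration only.
    f0_hz       : F0 per frame, in Hz
    likeliness  : per-frame value in [0, 1]; 0 disables vibrato on that frame
    rate_hz     : vibrato oscillation rate, in Hz
    depth_cents : peak pitch deviation, in cents (assumed unit)
    phase_rad   : initial phase of the oscillation, in radians
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    likeliness = np.asarray(likeliness, dtype=float)

    # Time stamp of each frame, in seconds.
    t = np.arange(len(f0_hz)) / frame_rate_hz

    # Pitch deviation in cents, scaled frame by frame by the likeliness.
    deviation_cents = likeliness * depth_cents * np.sin(
        2.0 * np.pi * rate_hz * t + phase_rad
    )

    # 1200 cents correspond to one octave, i.e. a factor of 2 in frequency.
    return f0_hz * 2.0 ** (deviation_cents / 1200.0)


# Example: a flat 220 Hz note with vibrato switched on in its second half.
flat_f0 = np.full(200, 220.0)
gate = np.concatenate([np.zeros(100), np.ones(100)])
modulated_f0 = apply_vibrato(flat_f0, gate, rate_hz=5.5,
                             depth_cents=50.0, phase_rad=0.0)
```

In this sketch the first half of `modulated_f0` stays at 220 Hz because the
likeliness there is zero, while the second half oscillates within 50 cents of
220 Hz.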