SponTTS: modeling and transferring spontaneous style for TTS
Spontaneous speaking style differs markedly from other speaking styles due to
various spontaneous phenomena (e.g., filled pauses, prolongation) and
substantial prosody variation (e.g., diverse pitch and duration patterns,
occasional non-verbal speech such as a smile), which makes spontaneous style
difficult to model and predict. Moreover, the scarcity of high-quality
spontaneous data constrains spontaneous speech generation for speakers without
such data. To address these problems, we propose SponTTS, a two-stage
approach based on neural bottleneck (BN) features to model and transfer
spontaneous style for TTS. In the first stage, we adopt a Conditional
Variational Autoencoder (CVAE) to capture spontaneous prosody from BN features
and incorporate spontaneous phenomena through a spontaneous-phenomena embedding
prediction loss. In addition, we introduce a flow-based predictor that predicts
a latent spontaneous-style representation from text, enriching the prosody and
context-specific spontaneous phenomena during inference. In the second stage,
we adopt a VITS-like module to transfer the
spontaneous style learned in the first stage to the target speakers.
Experiments demonstrate that SponTTS is effective in modeling spontaneous style
and transferring it to target speakers, generating spontaneous speech with high
naturalness, expressiveness, and speaker similarity. A zero-shot
spontaneous-style TTS test further verifies the generalization and
robustness of SponTTS in generating spontaneous speech for unseen speakers.

Comment: 5 pages, 3 figures, accepted by ICASSP 2024
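To make the first-stage objective concrete, below is a minimal PyTorch sketch of the idea as the abstract describes it: a CVAE-style posterior that encodes BN features into a latent spontaneous-style vector, trained with a KL term plus a spontaneous-phenomena embedding prediction loss. All module names, layer sizes, and the choice of an L2 phenomena loss are illustrative assumptions, not the authors' implementation; the flow-based text prior and the VITS-like second stage are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpontaneousPosterior(nn.Module):
    """Encodes BN features into a Gaussian posterior over a latent
    spontaneous-style vector, with a head that predicts a
    spontaneous-phenomena embedding from that latent (a sketch)."""

    def __init__(self, bn_dim=256, hidden=128, latent_dim=16, phenomena_dim=8):
        super().__init__()
        self.encoder = nn.GRU(bn_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # Hypothetical head: predicts an embedding of spontaneous phenomena
        # (filled pauses, prolongation, ...) from the style latent.
        self.phenomena_head = nn.Linear(latent_dim, phenomena_dim)

    def forward(self, bn_feats):
        # bn_feats: (batch, frames, bn_dim); the final GRU state serves as
        # an utterance-level summary.
        _, h = self.encoder(bn_feats)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return z, mu, logvar, self.phenomena_head(z)


def cvae_losses(mu, logvar, phenomena_pred, phenomena_target):
    # KL divergence to a unit Gaussian prior (standard CVAE term).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # The "spontaneous phenomena embedding prediction loss" from the abstract,
    # realized here as an L2 loss against a reference embedding (an assumption).
    phenomena = F.mse_loss(phenomena_pred, phenomena_target)
    return kl, phenomena


# Toy usage with random tensors standing in for BN features and targets.
model = SpontaneousPosterior()
bn = torch.randn(4, 120, 256)        # (batch, frames, bn_dim)
target = torch.randn(4, 8)           # stand-in phenomena embeddings
z, mu, logvar, pred = model(bn)
kl, phenomena = cvae_losses(mu, logvar, pred, target)
total = kl + phenomena               # loss weighting left out for brevity
```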