NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS
Expressive text-to-speech (TTS) can synthesize a new speaking style by
imitating the prosody and timbre of a reference audio, which faces the following
challenges: (1) The highly dynamic prosody information in the reference audio
is difficult to extract, especially when the reference audio contains
background noise. (2) The TTS system should generalize well to
unseen speaking styles. In this paper, we present a
\textbf{no}ise-\textbf{r}obust \textbf{e}xpressive TTS model (NoreSpeech),
which can robustly transfer speaking style in a noisy reference utterance to
synthesized speech. Specifically, our NoreSpeech includes several components:
(1) a novel DiffStyle module, which leverages powerful probabilistic denoising
diffusion models to learn noise-agnostic speaking style features from a teacher
model by knowledge distillation; (2) a VQ-VAE block, which maps the style
features into a controllable quantized latent space for improving the
generalization of style transfer; and (3) a straightforward but effective
parameter-free text-style alignment module, which enables NoreSpeech to
transfer style to a textual input from a length-mismatched reference utterance.
Experiments demonstrate that NoreSpeech is more effective than previous
expressive TTS models in noisy environments. Audio samples and code are
available at:
\href{http://dongchaoyang.top/NoreSpeech\_demo/}{http://dongchaoyang.top/NoreSpeech\_demo/}
Comment: Submitted to ICASSP202
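The abstract does not spell out the VQ-VAE block, so the following is only a minimal PyTorch sketch of the kind of vector quantizer that maps continuous style features into a discrete, controllable codebook space; the class name, tensor shapes, codebook size, and loss weights are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class StyleVectorQuantizer(nn.Module):
    """Minimal VQ-VAE-style quantizer (hypothetical sketch): snaps continuous
    style features to the nearest codebook entry, giving a discrete latent."""
    def __init__(self, num_codes=128, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, style):                       # style: (batch, frames, dim)
        flat = style.reshape(-1, style.size(-1))    # (batch * frames, dim)
        # Squared L2 distance from every frame to every codebook vector.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2.0 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)
        quantized = self.codebook(idx).view_as(style)
        # Codebook and commitment losses, as in standard VQ-VAE training.
        vq_loss = ((quantized - style.detach()).pow(2).mean()
                   + 0.25 * (quantized.detach() - style).pow(2).mean())
        # Straight-through estimator so gradients reach the style encoder.
        quantized = style + (quantized - style).detach()
        return quantized, vq_loss
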
Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
Expressive human speech generally abounds with rich and flexible speech
prosody variations. The speech prosody predictors in existing expressive speech
synthesis methods mostly produce deterministic predictions, which are learned
by directly minimizing the norm of prosody prediction error. Its unimodal
nature leads to a mismatch with the ground-truth distribution and harms the model's
ability to make diverse predictions. Thus, we propose a novel prosody
predictor based on the denoising diffusion probabilistic model to take
advantage of its high-quality generative modeling and training stability.
Experimental results confirm that the proposed prosody predictor outperforms the
deterministic baseline in both the expressiveness and diversity of its predictions,
while using even fewer network parameters.
Comment: Proceedings of Interspeech 2023 (doi: 10.21437/Interspeech.2023-715),
demo site at https://thuhcsi.github.io/interspeech2023-DiffVar
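As a rough illustration of the denoising-diffusion idea behind such a prosody predictor, here is a minimal PyTorch sketch of a standard DDPM training step with an epsilon-prediction objective; the denoiser network, the conditioning shapes, and the noise schedule are hypothetical placeholders rather than the authors' actual model.

import torch
import torch.nn.functional as F

def ddpm_prosody_loss(denoiser, prosody, text_cond, num_steps=1000,
                      beta_min=1e-4, beta_max=0.02):
    """One DDPM training step for a prosody predictor (assumed shapes):
    prosody:   (batch, frames) ground-truth prosody targets (e.g., pitch),
    text_cond: (batch, frames, dim) text-encoder features used as conditioning,
    denoiser:  hypothetical network that predicts the injected noise."""
    device = prosody.device
    betas = torch.linspace(beta_min, beta_max, num_steps, device=device)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    # Sample a random diffusion step per utterance and corrupt the prosody.
    t = torch.randint(0, num_steps, (prosody.size(0),), device=device)
    a_bar = alphas_cumprod[t].unsqueeze(-1)          # (batch, 1)
    noise = torch.randn_like(prosody)
    noisy = a_bar.sqrt() * prosody + (1.0 - a_bar).sqrt() * noise

    # Standard epsilon-prediction objective: the denoiser is trained to
    # recover the noise given the noisy prosody, the step, and the text.
    return F.mse_loss(denoiser(noisy, t, text_cond), noise)
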
SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias
Generative adversarial network (GAN)-based neural vocoders have been widely
used in audio synthesis tasks due to their high generation quality, efficient
inference, and small computation footprint. However, it is still challenging to
train a universal vocoder which can generalize well to out-of-domain (OOD)
scenarios, such as unseen speaking styles, non-speech vocalization, singing,
and musical pieces. In this work, we propose SnakeGAN, a GAN-based universal
vocoder, which can synthesize high-fidelity audio in various OOD scenarios.
SnakeGAN takes a coarse-grained signal generated by a differentiable digital
signal processing (DDSP) model as prior knowledge and aims to recover a
high-fidelity waveform from a mel-spectrogram. We introduce periodic
nonlinearities into the generator through the Snake activation function and an
anti-aliased representation, which further brings the desired inductive
bias for audio synthesis and significantly improves the extrapolation capacity
for universal vocoding in unseen scenarios. To validate the effectiveness of
our proposed method, we train SnakeGAN with only speech data and evaluate its
performance for various OOD distributions with both subjective and objective
metrics. Experimental results show that SnakeGAN significantly outperforms the
compared approaches and can generate high-fidelity audio samples including
unseen speakers with unseen styles, singing voices, instrumental pieces, and
nonverbal vocalizations.
Comment: Accepted by ICME 202
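For reference, the Snake activation mentioned above has the closed form snake(x) = x + (1/alpha) * sin^2(alpha * x); below is a minimal PyTorch sketch of a per-channel learnable version, where the module layout and the epsilon guard are illustrative choices rather than SnakeGAN's exact implementation.

import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation, snake(x) = x + (1/alpha) * sin^2(alpha * x), which
    injects a learnable periodic inductive bias into the generator."""
    def __init__(self, channels, alpha_init=1.0):
        super().__init__()
        # One learnable frequency parameter per channel.
        self.alpha = nn.Parameter(alpha_init * torch.ones(1, channels, 1))

    def forward(self, x):                   # x: (batch, channels, time)
        # Small epsilon keeps the reciprocal finite if alpha approaches zero.
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)
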