DiffWave: A Versatile Diffusion Model for Audio Synthesis
In this work, we propose DiffWave, a versatile diffusion probabilistic model
for conditional and unconditional waveform generation. The model is
non-autoregressive, and converts a white noise signal into a structured
waveform through a Markov chain with a constant number of steps at synthesis.
It is efficiently trained by optimizing a variant of the variational bound on the
data likelihood. DiffWave produces high-fidelity audio in different waveform
generation tasks, including neural vocoding conditioned on mel spectrogram,
class-conditional generation, and unconditional generation. We demonstrate that
DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44
versus 4.43), while synthesizing orders of magnitude faster. In particular, it
significantly outperforms autoregressive and GAN-based waveform models in the
challenging unconditional generation task in terms of audio quality and sample
diversity from various automatic and human evaluations.
Comment: ICLR 2021 (oral
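The synthesis procedure described in this abstract (white noise refined into a waveform over a fixed number of Markov chain steps) can be illustrated with a generic reverse-diffusion loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the linear noise schedule and its values, and the `dummy_predictor` stand-in for a trained network are all hypothetical and chosen only for illustration.

```python
import numpy as np

def reverse_diffusion(predict_noise, length, num_steps=50,
                      beta_min=1e-4, beta_max=0.05, seed=0):
    """Generic DDPM-style sampler: start from white noise, denoise step by step.

    predict_noise(x, t) is a hypothetical stand-in for a trained network
    that estimates the noise contained in x at step t.
    """
    rng = np.random.default_rng(seed)
    beta = np.linspace(beta_min, beta_max, num_steps)  # assumed linear schedule
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)

    x = rng.standard_normal(length)  # white noise at the final diffusion step
    for t in range(num_steps - 1, -1, -1):
        eps = predict_noise(x, t)
        # Mean of the reverse transition given the predicted noise.
        mean = (x - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])
        if t > 0:
            # Add fresh noise on every step except the last one.
            sigma = np.sqrt((1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * beta[t])
            x = mean + sigma * rng.standard_normal(length)
        else:
            x = mean
    return x

def dummy_predictor(x, t):
    """Placeholder for a trained model: always predicts zero noise."""
    return np.zeros_like(x)

waveform = reverse_diffusion(dummy_predictor, length=16000)
```

The loop runs a constant `num_steps` iterations regardless of the waveform length, and each iteration updates all samples at once, which is what "non-autoregressive" refers to. With the placeholder predictor the output is still noise; a trained, optionally conditioned network would be needed to obtain audio.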
Using Cyclic Noise as the Source Signal for Neural Source-Filter-based Speech Waveform Model
Neural source-filter (NSF) waveform models generate speech waveforms by
morphing sine-based source signals through dilated convolution in the time
domain. Although the sine-based source signals help the NSF models to produce
voiced sounds with specified pitch, the sine shape may constrain the generated
waveform when the target voiced sounds are less periodic. In this paper, we
propose a more flexible source signal called cyclic noise, a quasi-periodic
noise sequence given by the convolution of a pulse train and a static random
noise with a trainable decaying rate that controls the signal shape. We further
propose a masked spectral loss to guide the NSF models to produce periodic
voiced sounds from the cyclic noise-based source signal. Results from a
large-scale listening test demonstrated the effectiveness of the cyclic noise
and the masked spectral loss on speaker-independent NSF models in
copy-synthesis experiments on the CMU ARCTIC database.
Comment: Submitted to Interspeech 202
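The cyclic noise signal described in this abstract (a pulse train convolved with a decaying static random noise) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the function name, the constant F0, the exponential decay parameterization, the kernel length, and the fixed (rather than trained) decay value are all hypothetical.

```python
import numpy as np

def cyclic_noise(f0_hz, duration_s, sample_rate=16000,
                 decay=2.0, kernel_len=400, seed=0):
    """Convolve a pulse train with an exponentially decaying noise kernel.

    Assumptions for illustration: constant F0, a fixed decay rate (the
    abstract describes it as trainable), and decay measured per pitch period.
    """
    rng = np.random.default_rng(seed)
    n = int(duration_s * sample_rate)
    period = int(round(sample_rate / f0_hz))  # pitch period in samples

    # Pulse train: one unit impulse at the start of every pitch period.
    pulses = np.zeros(n)
    pulses[::period] = 1.0

    # Static random noise shaped by a decaying envelope; a larger decay
    # gives a shorter, more pulse-like kernel.
    t = np.arange(kernel_len)
    kernel = rng.standard_normal(kernel_len) * np.exp(-decay * t / period)

    # Each pulse triggers one copy of the kernel; truncate to the signal length.
    return np.convolve(pulses, kernel)[:n]

noise = cyclic_noise(f0_hz=100.0, duration_s=0.1)
```

With a constant F0 the result repeats exactly once the first kernel has fully entered the signal; a time-varying F0 contour would change the pulse spacing and make the signal quasi-periodic, as the abstract describes.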
A Survey on Neural Speech Synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize
intelligible and natural speech given text, is a hot research topic in speech,
language, and machine learning communities and has broad applications in the
industry. With the development of deep learning and artificial intelligence,
neural network-based TTS has significantly improved the quality of synthesized
speech in recent years. In this paper, we conduct a comprehensive survey on
neural TTS, aiming to provide a good understanding of current research and
future trends. We focus on the key components in neural TTS, including text
analysis, acoustic models and vocoders, and several advanced topics, including
fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS.
We further summarize resources related to TTS (e.g., datasets, open-source
implementations) and discuss future research directions. This survey can serve
both academic researchers and industry practitioners working on TTS.
Comment: A comprehensive survey on TTS, 63 pages, 18 tables, 7 figures, 457
reference