Sampling-based speech parameter generation using moment-matching networks
This paper presents sampling-based speech parameter generation using
moment-matching networks for Deep Neural Network (DNN)-based speech synthesis.
Although people never produce exactly the same speech twice, even when
expressing the same linguistic and para-linguistic information, typical
statistical speech synthesis always produces identical speech, i.e., there is
no inter-utterance variation in synthetic speech. To give synthetic speech natural
inter-utterance variation, this paper builds DNN acoustic models that make it
possible to randomly sample speech parameters. The DNNs are trained so that
they make the moments of generated speech parameters close to those of natural
speech parameters. Since the variation of speech parameters is compressed into
a low-dimensional simple prior noise vector, our algorithm has lower
computation cost than direct sampling of speech parameters. As the first step
towards generating synthetic speech that has natural inter-utterance variation,
this paper investigates whether or not the proposed sampling-based generation
deteriorates synthetic speech quality. In the evaluation, we compare the speech
quality of conventional maximum-likelihood-based generation and the proposed
sampling-based generation. The results demonstrate that the proposed generation
causes no degradation in speech quality.

Comment: Submitted to INTERSPEECH 201
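Moment matching of this kind is often implemented as a maximum mean discrepancy (MMD) loss between batches of generated and natural parameters. A minimal NumPy sketch of a biased MMD estimator, assuming a Gaussian kernel with an illustrative bandwidth (the paper's actual kernel and training setup are not specified here):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two sets of row vectors."""
    # Pairwise squared Euclidean distances via the expansion |x - y|^2
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(gen, nat, sigma=1.0):
    """Biased estimate of squared maximum mean discrepancy (MMD)."""
    return (gaussian_kernel(gen, gen, sigma).mean()
            + gaussian_kernel(nat, nat, sigma).mean()
            - 2 * gaussian_kernel(gen, nat, sigma).mean())

rng = np.random.default_rng(0)
nat = rng.normal(0.0, 1.0, size=(256, 4))    # "natural" parameter batch
close = rng.normal(0.0, 1.0, size=(256, 4))  # samples from a matching model
far = rng.normal(3.0, 1.0, size=(256, 4))    # samples from a mismatched model
assert mmd2(close, nat) < mmd2(far, nat)     # matched moments give lower loss
```

During training, `gen` would be produced by the DNN from the low-dimensional prior noise vector, and the MMD would be minimised with respect to the network parameters.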
Robust model training and generalisation with Studentising flows
Normalising flows are tractable probabilistic models that leverage the power
of deep learning to describe a wide parametric family of distributions, all
while remaining trainable using maximum likelihood. We discuss how these
methods can be further improved based on insights from robust (in particular,
resistant) statistics. Specifically, we propose to endow flow-based models with
fat-tailed latent distributions such as the multivariate Student's t, as a simple
drop-in replacement for the Gaussian distribution used by conventional
normalising flows. While robustness brings many advantages, this paper explores
two of them: 1) We describe how using fatter-tailed base distributions can give
benefits similar to gradient clipping, but without compromising the asymptotic
consistency of the method. 2) We also discuss how robust ideas lead to models
with reduced generalisation gap and improved held-out data likelihood.
Experiments on several different datasets confirm the efficacy of the proposed
approach in both regards.

Comment: 9 pages, 8 figures, accepted for publication at INNF+ 2020 (Second
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit
Likelihood Models)
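Point 1 can be illustrated by comparing the score functions (gradients of the negative log-density) of the two base distributions: the Gaussian score grows without bound in the sample value, while the Student's t score saturates, giving an effect similar to gradient clipping. A small NumPy sketch, with nu = 3 as an illustrative choice of degrees of freedom:

```python
import numpy as np

def gauss_score(x):
    """d/dx of the standard-normal negative log-density: grows linearly in x."""
    return x

def student_t_score(x, nu=3.0):
    """d/dx of the Student's t negative log-density: bounded in x."""
    return (nu + 1.0) * x / (nu + x**2)

x = np.linspace(-50, 50, 1001)
# The Gaussian score is unbounded on outliers, while the Student's t score
# saturates, so extreme samples cannot produce arbitrarily large gradients.
assert np.abs(gauss_score(x)).max() == 50.0
assert np.abs(student_t_score(x)).max() < 2.0
```

Unlike hard gradient clipping, this bound comes from the model's own likelihood, so maximum-likelihood training remains asymptotically consistent.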
Duration modeling using DNN for Arabic speech synthesis
Duration modeling is a key task for every parametric speech synthesis system. Though such parametric systems have been adapted to many languages, no special attention has been paid to explicitly handling Arabic speech characteristics. In Arabic, phoneme duration plays a distinctive role because of consonant gemination and vowel quantity, so precise modeling of sound durations is critical. In this paper we compare several models of phoneme duration (including the duration models of the HTS and MERLIN toolkits), and we propose a new approach that relies on a set of models, each one being optimal for a given phoneme class (e.g., simple consonants, geminated consonants, short vowels, and long vowels). An objective evaluation carried out on a set of test sentences shows that the proposed approach leads to more accurate modeling of phoneme durations.
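The class-specific routing described above can be sketched as follows. The phoneme classes follow the abstract, but the per-class models, durations, and API below are toy stand-ins for the trained DNNs, not the paper's actual implementation:

```python
import numpy as np

class MeanDurationModel:
    """Toy stand-in for a trained per-class duration DNN:
    predicts the class's mean training duration."""
    def __init__(self, durations):
        self.mean = float(np.mean(durations))
    def predict(self, features=None):
        return self.mean

# One model per phoneme class, trained on hypothetical durations in seconds.
training = {
    "simple_consonant":    [0.06, 0.07, 0.08],
    "geminated_consonant": [0.13, 0.15, 0.14],
    "short_vowel":         [0.05, 0.06, 0.05],
    "long_vowel":          [0.12, 0.11, 0.13],
}
models = {cls: MeanDurationModel(durs) for cls, durs in training.items()}

def predict_duration(phoneme_class, features=None):
    """Route each phoneme to the model that is optimal for its class."""
    return models[phoneme_class].predict(features)

# Gemination and vowel quantity are reflected in the class-specific models.
assert predict_duration("geminated_consonant") > predict_duration("simple_consonant")
assert predict_duration("long_vowel") > predict_duration("short_vowel")
```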
Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis
Neural source-filter (NSF) models are deep neural networks that produce
waveforms given input acoustic features. They use dilated-convolution-based
neural filter modules to filter sine-based excitation for waveform generation,
which is different from WaveNet and flow-based models. One of the NSF models,
the harmonic-plus-noise NSF (h-NSF) model, uses separate pairs of source and
neural filters to generate harmonic and noise waveform components. It is close
to WaveNet in terms of speech quality while being superior in generation speed.
The h-NSF model can be improved further. While h-NSF merges the harmonic
and noise components using pre-defined digital low- and high-pass filters, it
is well known that the maximum voice frequency (MVF) that separates the
periodic and aperiodic spectral bands is time-variant. Therefore, we propose a
new h-NSF model with time-variant and trainable MVF. We parameterize the
digital low- and high-pass filters as windowed-sinc filters and predict their
cut-off frequency (i.e., MVF) from the input acoustic features. Our experiments
demonstrated that the new model can predict a good trajectory of the MVF and
produce high-quality speech for a text-to-speech synthesis system.

Comment: Accepted by Speech Synthesis Workshop 201
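The windowed-sinc parameterisation can be sketched in NumPy as follows; the tap count, Hamming window, and spectral-inversion high-pass are illustrative implementation choices, not taken from the paper:

```python
import numpy as np

def windowed_sinc_lowpass(cutoff, n_taps=63):
    """FIR low-pass filter taps via the windowed-sinc method.

    `cutoff` is the normalised cut-off frequency in (0, 0.5), playing the
    role of the predicted MVF."""
    n = np.arange(n_taps) - (n_taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n)  # ideal low-pass impulse response
    h *= np.hamming(n_taps)                   # truncate smoothly with a window
    return h / h.sum()                        # normalise to unit DC gain

def highpass_from_lowpass(h_low):
    """Complementary high-pass via spectral inversion of the low-pass."""
    h_high = -h_low
    h_high[(len(h_low) - 1) // 2] += 1.0
    return h_high

h_lp = windowed_sinc_lowpass(0.1)  # cutoff would be predicted per frame
h_hp = highpass_from_lowpass(h_lp)
# The two filters sum to a (delayed) identity, so the harmonic and noise
# branches recombine without spectral gaps or overlap.
combined = h_lp + h_hp
center = (len(combined) - 1) // 2
assert np.isclose(combined[center], 1.0)
assert np.abs(np.delete(combined, center)).max() == 0.0
```

Because the taps are a differentiable function of `cutoff`, the MVF can be predicted by a network and trained end-to-end with the rest of the model.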
Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis
Neural waveform models such as WaveNet have demonstrated better performance
than conventional vocoders for statistical parametric speech synthesis. As an
autoregressive (AR) model, WaveNet is limited by a slow sequential waveform
generation process. Some new models that use the inverse-autoregressive flow
(IAF) can generate a whole waveform in a one-shot manner. However, these
IAF-based models require sequential transformation during training, which
severely slows down the training speed. Other models such as Parallel WaveNet
and ClariNet bring together the benefits of AR and IAF-based models and train
an IAF model by transferring the knowledge from a pre-trained AR teacher to an
IAF student without any sequential transformation. However, both models require
additional training criteria, and their implementation is prohibitively
complicated.
We propose a framework for neural source-filter (NSF) waveform modeling that
uses neither AR nor IAF-based approaches. This framework requires only three
components for waveform generation: a source module that generates a sine-based
signal as excitation, a non-AR dilated-convolution-based filter module that
transforms the excitation into a waveform, and a conditional module that
pre-processes the acoustic features for the source and filter modules. This
framework minimizes spectral-amplitude distances for model training, which can
be efficiently implemented by using short-time Fourier transform routines.
Under this framework, we designed three NSF models and compared them with
WaveNet. It was demonstrated that the NSF models generated waveforms at least
100 times faster than WaveNet, and the quality of the synthetic speech from the
best NSF model was better than or as good as that from WaveNet.

Comment: Accepted to IEEE/ACM TASLP. Note: this paper describes follow-up
work to our ICASSP paper. Based on the h-NSF introduced in this work, we
proposed an h-sinc-NSF model and published a third paper at SSW 10
(https://www.isca-speech.org/archive/SSW_2019/pdfs/SSW10_O_1-1.pdf).
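The spectral-amplitude training distance can be sketched with standard FFT routines; the frame length, hop, window, and log floor below are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def log_spectral_amplitude_distance(x, y, frame_len=256, hop=128):
    """Frame-wise log spectral-amplitude distance between two waveforms,
    computed with short-time Fourier transform routines."""
    win = np.hanning(frame_len)

    def log_mag(sig):
        frames = [sig[i:i + frame_len] * win
                  for i in range(0, len(sig) - frame_len + 1, hop)]
        spec = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
        return np.log(spec + 1e-8)  # small floor avoids log(0)

    return float(np.mean((log_mag(x) - log_mag(y)) ** 2))

t = np.arange(4096) / 16000.0
target = np.sin(2 * np.pi * 200 * t)   # 200 Hz reference tone
same = np.sin(2 * np.pi * 200 * t)     # identical waveform
detuned = np.sin(2 * np.pi * 400 * t)  # spectrally different waveform
assert log_spectral_amplitude_distance(target, same) == 0.0
assert log_spectral_amplitude_distance(target, detuned) > 0.0
```

Because the distance depends only on spectral amplitudes, it avoids the sequential computation of AR models and can be evaluated over all frames in parallel.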