Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system
Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to improve quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used as excitation model inputs differ from the original features with which the model was trained. Furthermore, due to errors in predicting the vocal tract filter, the original excitation waveforms do not provide perfect reconstruction of the speech waveform even if the excitation is predicted without error. To address these issues and to make the excitation model more robust against errors in acoustic modeling, this paper proposes two modifications to the excitation model training scheme. First, the excitation model is trained in a connected manner, with inputs generated by the acoustic model. Second, the target glottal waveforms are re-estimated by performing glottal inverse filtering with the predicted vocal tract filters. The results show that both modifications improve performance measured by MSE and MFCC distortion, and slightly improve the subjective quality of the synthetic speech.
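A minimal sketch of the connected training idea, assuming frame-level PyTorch models (the `acoustic_model` and `excitation_model` below are hypothetical stand-ins for the paper's actual architectures, and all sizes are illustrative):

```python
import torch

# Hypothetical stand-ins for the paper's models: both map frame-level
# features to frame-level outputs. Shapes and sizes are illustrative only.
acoustic_model = torch.nn.Linear(300, 60)     # linguistic -> acoustic features
excitation_model = torch.nn.Linear(60, 400)   # acoustic -> glottal waveform frame

optimizer = torch.optim.Adam(excitation_model.parameters(), lr=1e-4)

def connected_training_step(linguistic, target_glottal):
    """One step of 'connected' training: the excitation model sees the
    acoustic model's *predictions*, not the original acoustic features,
    so training conditions match those at synthesis time."""
    with torch.no_grad():  # acoustic model is pre-trained and kept frozen here
        predicted_acoustic = acoustic_model(linguistic)
    predicted_glottal = excitation_model(predicted_acoustic)
    loss = torch.nn.functional.mse_loss(predicted_glottal, target_glottal)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random data standing in for one minibatch of frames. In the paper the
# targets would be re-estimated by glottal inverse filtering with the
# predicted vocal tract filters.
linguistic = torch.randn(32, 300)
target_glottal = torch.randn(32, 400)
print(connected_training_step(linguistic, target_glottal))
```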
Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks
The state-of-the-art in text-to-speech synthesis has recently improved
considerably due to novel neural waveform generation methods, such as WaveNet.
However, these methods suffer from their slow sequential inference process,
while their parallel versions are difficult to train and even more expensive
computationally. Meanwhile, generative adversarial networks (GANs) have
achieved impressive results in image generation and are making their way into
audio applications; parallel inference is among their attractive properties. By
adopting recent advances in GAN training techniques, this investigation studies
waveform generation for TTS in two domains (speech signal and glottal
excitation). Listening test results show that while direct waveform generation
with GAN is still far behind WaveNet, a GAN-based glottal excitation model can
achieve quality and voice similarity on par with a WaveNet vocoder.
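A minimal sketch of the adversarial setup, assuming simple fully connected PyTorch networks (`G` and `D` are illustrative stand-ins; the paper's pitch-synchronous multi-scale convolutional architecture and its conditioning on acoustic features are not reproduced here):

```python
import torch
from torch import nn

# Toy 1-D generator and discriminator over fixed-length excitation frames.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 400))
D = nn.Sequential(nn.Linear(400, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_excitation):
    batch = real_excitation.size(0)
    z = torch.randn(batch, 100)
    fake = G(z)

    # Discriminator: push real frames toward 1, generated frames toward 0.
    d_loss = bce(D(real_excitation), torch.ones(batch, 1)) \
           + bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator. Note all samples of a frame are
    # produced in one parallel forward pass, the property that makes GANs
    # attractive next to sequential WaveNet inference.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

print(gan_step(torch.randn(32, 400)))
```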
Statistical parametric speech synthesis based on sinusoidal models
This study focuses on improving the quality of statistical speech synthesis based on sinusoidal
models. Vocoders play a crucial role during the parametrisation and reconstruction process,
so we first conduct an experimental comparison of a broad range of the leading vocoder types.
Although our study shows that, for analysis/synthesis, sinusoidal models with complex amplitudes
can generate higher-quality speech than source-filter models, the component
sinusoids are correlated with each other, and the number of parameters is high and varies
from frame to frame, which constrains their application to statistical speech synthesis.
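For orientation, the standard sinusoidal model referred to here represents a frame of speech as a sum of sinusoidal components (the notation below is mine, not the thesis's):

```latex
s(t) = \Re\left\{ \sum_{k=1}^{K} A_k \, e^{\,j\left(2\pi f_k t + \phi_k\right)} \right\}
```

where A_k, f_k and \phi_k are the amplitude, frequency and phase of the k-th component (f_k = k f_0 in the harmonic case). Because K and the parameter values change from frame to frame, the representation does not map neatly onto fixed-dimensional statistical models, which motivates the model proposed next.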
Therefore, we propose a perceptually based dynamic sinusoidal model (PDM), which reduces
and fixes the number of components typically used in the standard sinusoidal model.
Then, in order to apply the proposed vocoder with an HMM-based speech synthesis system
(HTS), two strategies for modelling sinusoidal parameters have been compared. In the first
method (DIR parameterisation), features extracted from the fixed- and low-dimensional PDM
are statistically modelled directly. In the second method (INT parameterisation), we convert
both the static amplitude and dynamic slope of all the harmonics of a signal, which we term
the Harmonic Dynamic Model (HDM), to intermediate parameters, regularised cepstral coefficients
(RDC), for modelling. Our results show that HDM with intermediate parameters can
generate quality comparable to STRAIGHT.
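A sketch of what such an intermediate parameterisation might look like, assuming a penalised least-squares fit of cepstral coefficients to harmonic log amplitudes (the function name, sizes, and regulariser below are my assumptions for illustration, not the thesis's exact estimator):

```python
import numpy as np

def harmonics_to_rdc(log_amps, f0, fs=16000, order=30, lam=5e-4):
    """Fit regularised cepstral coefficients (RDC) to harmonic log
    amplitudes by penalised least squares: an illustrative take on the
    INT parameterisation."""
    k = np.arange(1, len(log_amps) + 1)
    freqs = k * f0 / fs                      # normalised harmonic frequencies
    i = np.arange(order + 1)
    # Basis: log|A(f)| ~ c0 + 2 * sum_i c_i * cos(2*pi*i*f)
    M = np.cos(2 * np.pi * np.outer(freqs, i))
    M[:, 1:] *= 2.0
    R = lam * np.diag(i.astype(float) ** 2)  # regulariser smooths the envelope
    return np.linalg.solve(M.T @ M + R, M.T @ log_amps)

# Example: 40 harmonics of a 200 Hz voice with a noisy downward tilt.
log_amps = np.linspace(0.0, -4.0, 40) + 0.1 * np.random.randn(40)
print(harmonics_to_rdc(log_amps, f0=200.0)[:5])
```

The resulting coefficient vector has fixed dimensionality regardless of how many harmonics the frame contains, which is what makes it suitable for statistical modelling.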
As correlations between features in the dynamic model cannot be modelled satisfactorily
by a typical HMM-based system with diagonal covariance, we have applied and tested a deep
neural network (DNN) for modelling features from these two methods. To fully exploit DNN
capabilities, we investigate ways to combine INT and DIR at the level of both DNN modelling
and waveform generation. For DNN training, we propose to use multi-task learning to
model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. We
conclude from our results that sinusoidal models are indeed highly suited for statistical parametric
synthesis. The proposed method outperforms the state-of-the-art STRAIGHT-based
equivalent when used in conjunction with DNNs.
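A minimal sketch of the multi-task arrangement, assuming a shared PyTorch trunk with two regression heads (all names and sizes are illustrative; the thesis's network and feature dimensions differ):

```python
import torch
from torch import nn

# Shared trunk with two heads: cepstra (INT) as the primary task and
# log amplitudes (DIR) as the secondary task. Sizes are placeholders.
trunk = nn.Sequential(nn.Linear(300, 512), nn.Tanh(), nn.Linear(512, 512), nn.Tanh())
head_int = nn.Linear(512, 31)    # RDC cepstra (primary task)
head_dir = nn.Linear(512, 60)    # log amplitudes (secondary task)

params = list(trunk.parameters()) + list(head_int.parameters()) + list(head_dir.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

def multitask_step(linguistic, cepstra, log_amps, secondary_weight=0.5):
    h = trunk(linguistic)
    loss = nn.functional.mse_loss(head_int(h), cepstra) \
         + secondary_weight * nn.functional.mse_loss(head_dir(h), log_amps)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(multitask_step(torch.randn(32, 300), torch.randn(32, 31), torch.randn(32, 60)))
```

Weighting the secondary (DIR) loss below 1.0 keeps the cepstral (INT) task primary while still letting the log-amplitude targets regularise the shared layers.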
To further improve the voice quality, phase features generated from the proposed vocoder
also need to be parameterised and integrated into statistical modelling. Here, an alternative
statistical model referred to as the complex-valued neural network (CVNN), which treats complex coefficients as a whole, is proposed to model complex amplitude explicitly. A complex-valued
back-propagation algorithm using a logarithmic minimisation criterion which includes
both amplitude and phase errors is used as a learning rule. Three parameterisation methods
are studied for mapping text to acoustic features: RDC / real-valued log amplitude, complex-valued
amplitude with minimum phase and complex-valued amplitude with mixed phase. Our
results show the potential of using CVNNs for modelling both real and complex-valued acoustic
features. Overall, this thesis has established competitive alternative vocoders for speech
parametrisation and reconstruction. The application of the proposed vocoders to various acoustic
models (HMM / DNN / CVNN) clearly demonstrates that they are compelling choices for
statistical parametric speech synthesis.
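For reference, a logarithmic minimisation criterion of the kind used above for CVNN training can be written as follows (this is a common formulation from the complex-valued network literature; the thesis's exact criterion may differ):

```latex
E = \frac{1}{2}\sum_{k} \left| \ln \frac{y_k}{d_k} \right|^2
  = \frac{1}{2}\sum_{k} \left[ \left( \ln\lvert y_k\rvert - \ln\lvert d_k\rvert \right)^2
  + \left( \arg y_k - \arg d_k \right)^2 \right]
```

where y_k and d_k are the predicted and target complex amplitudes. The two terms penalise log-amplitude and phase errors respectively, which is why such a criterion suits joint amplitude-and-phase modelling.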
Overcoming the limitations of statistical parametric speech synthesis
When work on this thesis began, statistical parametric speech synthesis (SPSS)
using hidden Markov models (HMMs) was the dominant synthesis paradigm within the
research community. SPSS systems are effective at generalising across the linguistic
contexts present in training data to account for inevitable unseen linguistic contexts at
synthesis-time, making these systems flexible and their performance stable. However,
HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning
that, despite great progress, the speech output is rarely mistaken for natural speech.
The literature contains many hypotheses about the causes of reduced synthesis quality
in HMM speech synthesis, and about the improvements consequently required. However, until this
thesis, these hypothesised causes were rarely tested.
This thesis makes two types of contributions to the field of speech synthesis; each
of these appears in a separate part of the thesis. Part I introduces a methodology for
testing hypothesised causes of limited quality within HMM speech synthesis systems.
This investigation aims to identify what causes these systems to fall short of natural
speech. Part II uses the findings from Part I of the thesis to make informed improvements
to speech synthesis.
The usual approach taken to improve synthesis systems is to attribute reduced synthesis
quality to a hypothesised cause. A new system is then constructed with the aim
of removing that hypothesised cause. However, this is typically done without prior testing
to verify that the hypothesised cause actually reduces quality. As such, even if improvements
in synthesis quality are observed, there is no way of knowing whether a major underlying
issue or merely a minor one has been fixed. In contrast, I perform a
wide range of perceptual tests in Part I of the thesis to discover what the real underlying
causes of reduced quality in HMM synthesis are and the level to which they contribute.
Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements
to synthesis quality. Two well-motivated improvements to standard HMM
synthesis are investigated. The first follows from Part I's finding that averaging
across differing linguistic contexts, a practice typically performed during decision-tree
regression in HMM synthesis, is a major contributing factor to reduced synthesis quality.
Therefore, a system which removes averaging across differing linguistic contexts and instead
averages only across matching linguistic contexts (called rich-context synthesis) is
investigated. The second of the motivated
improvements follows the finding that the parametrisation (i.e., vocoding) of
speech, standard practice in SPSS, introduces a noticeable drop in quality before any
modelling is even performed. Therefore, the hybrid synthesis paradigm is investigated:
hybrid systems aim to remove the effect of vocoding by using SPSS to inform the selection
of units in a unit-selection system. Both of the motivated improvements applied
in Part II are found to make significant gains in synthesis quality, demonstrating the
benefit of the style of perceptual testing conducted in this thesis.
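To make the rich-context idea concrete, here is a toy contrast between pooling over partially matching contexts (a crude stand-in for decision-tree clustering) and pooling only over fully matching contexts (all data and context keys below are invented for illustration):

```python
from collections import defaultdict
import numpy as np

# Each training example pairs a (hypothetical) linguistic context tuple
# with an acoustic feature vector.
training_data = [
    (("a", "stressed", "phrase-final"), np.array([1.0, 2.0])),
    (("a", "stressed", "phrase-medial"), np.array([3.0, 4.0])),
    (("a", "stressed", "phrase-final"), np.array([1.2, 2.2])),
]

def averaged_model(data):
    """Standard practice: pool *differing* contexts that clustering groups
    together, here crudely approximated by keying on the first feature only."""
    pools = defaultdict(list)
    for ctx, feats in data:
        pools[ctx[0]].append(feats)
    return {key: np.mean(vecs, axis=0) for key, vecs in pools.items()}

def rich_context_model(data):
    """Rich-context alternative: average only across *matching* full contexts."""
    pools = defaultdict(list)
    for ctx, feats in data:
        pools[ctx].append(feats)
    return {key: np.mean(vecs, axis=0) for key, vecs in pools.items()}

print(averaged_model(training_data))      # one blurred average per coarse class
print(rich_context_model(training_data))  # distinct averages per full context
```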
In search of the optimal acoustic features for statistical parametric speech synthesis
In the Statistical Parametric Speech Synthesis (SPSS) paradigm, speech is generally
represented as acoustic features and the waveform is generated by a vocoder. A comprehensive
summary of state-of-the-art vocoding techniques is presented, highlighting
their characteristics, advantages, and drawbacks, primarily when used in SPSS. We
conclude that state-of-the-art vocoding methods are suboptimal and are a cause of significant loss of quality, even though numerous vocoders have been proposed in the last
decade. In fact, it seems that the most complicated methods perform worse than simpler
ones based on more robust analysis/synthesis algorithms. Typical methods, based on
the source-filter or sinusoidal models, rely on excessive simplifying assumptions. They
perform what we call an "extreme decomposition" of speech (e.g., source+filter or
sinusoids+noise), which we believe to be a major drawback. Problems include: difficulties
in the estimation of components; modelling of complex non-linear mechanisms; a lack
of ground truth. In addition, the statistical dependence that exists between stochastic
and deterministic components of speech is not modelled.
We start by improving just the waveform generation stage of SPSS, using standard
acoustic features. We propose a new method of waveform generation tailored for SPSS,
based on neither source-filter separation nor sinusoidal modelling. The proposed waveform
generator avoids unnecessary assumptions and decompositions as far as possible,
and uses only the fundamental frequency and spectral envelope as acoustic features. A
very small speech database is used as a source of base speech signals which are subsequently
"reshaped" to match the specifications output by the acoustic model in the
SPSS framework. All of this is done without any decomposition, such as source+filter
or harmonics+noise. A comprehensive description of the waveform generation process
is presented, along with implementation issues. Two SPSS voices, a female and a male,
were built to test the proposed method by using a standard TTS toolkit, Merlin. In
a subjective evaluation, listeners preferred the proposed waveform generator over a
state-of-the-art vocoder, STRAIGHT.
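A toy sketch of what "reshaping" a base signal toward target acoustic features might involve, here for a single pitch period: resampling to the target pitch period and imposing a target magnitude envelope. This is my simplification for illustration only; the thesis's generator works differently and, as stated, avoids such decompositions:

```python
import numpy as np

def reshape_frame(base_period, target_env, target_f0, fs=16000):
    """Illustrative 'reshaping' of one base pitch period: resample it to
    the target period length, then impose the target magnitude envelope
    in the frequency domain while keeping the base signal's phase."""
    target_len = int(round(fs / target_f0))
    # Resample the base period to the target pitch period length.
    x = np.interp(np.linspace(0, len(base_period) - 1, target_len),
                  np.arange(len(base_period)), base_period)
    # Impose the target spectral envelope on the magnitude only.
    spec = np.fft.rfft(x)
    env = np.interp(np.linspace(0, 1, len(spec)),
                    np.linspace(0, 1, len(target_env)), target_env)
    unit = spec / np.maximum(np.abs(spec), 1e-9)  # unit-magnitude phase factors
    return np.fft.irfft(unit * env, n=target_len)

period = np.sin(2 * np.pi * np.arange(80) / 80)   # toy base pitch period
out = reshape_frame(period, target_env=np.hanning(64) + 0.1, target_f0=220.0)
print(out.shape)
```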
Even though the proposed "waveform reshaping" generator produces higher speech
quality than STRAIGHT, the improvement is not large enough. Consequently, we propose
a new acoustic representation, whose implementation involves feature extraction
and waveform generation, i.e., a complete vocoder. The new representation encodes
the complex spectrum derived from the Fourier Transform in a way explicitly designed
for SPSS, rather than for speech coding or copy-synthesis. The feature set comprises
four feature streams describing magnitude spectrum, phase spectrum, and fundamental
frequency; all of these are represented by real numbers. It avoids heuristic or unstable
phase-unwrapping methods. The new feature extraction does not attempt to
decompose the speech structure, and thus the "phasiness" and "buzziness" found in a
typical vocoder, such as STRAIGHT, are dramatically reduced. Our method works at
a lower frame rate than a typical vocoder. To demonstrate the proposed method, two
DNN-based voices, a male and a female, were built using the Merlin toolkit. Subjective
comparisons were performed with a state-of-the-art baseline. The proposed vocoder
substantially outperformed the baseline for both voices and under all configurations
tested. Furthermore, several enhancements were made over the original design, which
are beneficial for either sound quality or compatibility with other tools. In addition to
its use in SPSS, the proposed vocoder is also demonstrated performing join smoothing
in unit selection-based systems, and it can be used for voice conversion or automatic
speech recognition.
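As an illustration of how real-valued streams can encode a complex spectrum without phase unwrapping, a frame's phase can be kept as cosine/sine pairs (this toy round-trip is my sketch; the thesis's four streams are designed differently):

```python
import numpy as np

def extract_streams(frame):
    """Analyse one frame into real-valued streams derived from the complex
    spectrum. Keeping phase as cosine/sine pairs sidesteps phase unwrapping."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.maximum(np.abs(spec), 1e-9))   # magnitude stream
    phase = np.angle(spec)
    return log_mag, np.cos(phase), np.sin(phase)       # all real-valued

def resynthesise(log_mag, cos_p, sin_p, n):
    """Rebuild the complex spectrum from the real-valued streams and invert."""
    spec = np.exp(log_mag) * (cos_p + 1j * sin_p)
    return np.fft.irfft(spec, n=n)

frame = np.random.randn(256)
streams = extract_streams(frame)
y = resynthesise(*streams, n=256)
# Round-trips to the windowed frame, up to numerical precision.
print(np.allclose(y, frame * np.hanning(256), atol=1e-8))
```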