Statistical parametric speech synthesis based on sinusoidal models
This study focuses on improving the quality of statistical speech synthesis based on sinusoidal
models. Vocoders play a crucial role during the parametrisation and reconstruction process,
so we first conduct an experimental comparison of a broad range of the leading vocoder types.
Although our study shows that, for analysis / synthesis, sinusoidal models with complex amplitudes
can generate higher-quality speech than source-filter ones, the component
sinusoids are correlated with each other, and the number of parameters is high and varies
from frame to frame, which constrains their application to statistical speech synthesis.
Therefore, we first propose a perceptually based dynamic sinusoidal model (PDM) to decrease
and fix the number of components typically used in the standard sinusoidal model.
Then, in order to use the proposed vocoder in an HMM-based speech synthesis system
(HTS), two strategies for modelling sinusoidal parameters are compared. In the first
method (DIR parameterisation), features extracted from the fixed- and low-dimensional PDM
are statistically modelled directly. In the second method (INT parameterisation), we convert
both static amplitude and dynamic slope from all the harmonics of a signal, which we term
the Harmonic Dynamic Model (HDM), to intermediate parameters (regularised cepstral coefficients
(RDC)) for modelling. Our results show that HDM with intermediate parameters can
generate speech of comparable quality to STRAIGHT.
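
To make the "static amplitude and dynamic slope" terminology concrete, the sketch below synthesises one frame from a dynamic sinusoidal model of the assumed form s(n) = Re( sum_k (a_k + n * b_k) * exp(j * 2*pi * f_k * n / fs) ). The abstract does not give this formula; the function name, argument names and frame layout are illustrative assumptions, not the thesis implementation.

```python
import cmath
import math


def synthesise_frame(static_amps, dynamic_slopes, freqs_hz, sample_rate, half_len):
    """Synthesise samples for n = -half_len .. half_len (frame centred at n = 0).

    static_amps    : complex static amplitudes a_k (hypothetical layout)
    dynamic_slopes : complex dynamic slopes b_k, one per sinusoid
    freqs_hz       : sinusoid frequencies f_k in Hz
    """
    frame = []
    for n in range(-half_len, half_len + 1):
        total = 0j
        for a_k, b_k, f_k in zip(static_amps, dynamic_slopes, freqs_hz):
            phase = 2.0 * math.pi * f_k * n / sample_rate
            # The amplitude varies linearly within the frame: a_k + n * b_k.
            total += (a_k + n * b_k) * cmath.exp(1j * phase)
        frame.append(total.real)
    return frame


# One 100 Hz component with unit static amplitude and zero slope: a plain cosine.
cosine_frame = synthesise_frame([1 + 0j], [0j], [100.0], 16000, 2)
```

With every slope set to zero the model reduces to a standard sinusoidal model with complex amplitudes; the slope term lets the amplitude change within a frame.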
As correlations between features in the dynamic model cannot be modelled satisfactorily
by a typical HMM-based system with diagonal covariance, we have applied and tested a deep
neural network (DNN) for modelling features from these two methods. To fully exploit DNN
capabilities, we investigate ways to combine INT and DIR at the level of both DNN modelling
and waveform generation. For DNN training, we propose to use multi-task learning to
model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. We
conclude from our results that sinusoidal models are indeed highly suited for statistical parametric
synthesis. The proposed method outperforms the state-of-the-art STRAIGHT-based
equivalent when used in conjunction with DNNs.
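
The multi-task setup described above can be sketched as one shared hidden layer feeding two output heads, with the secondary loss down-weighted. This is a minimal illustration only: the single tanh layer, the parameter names in `params`, the mean-squared-error losses and the `secondary_weight` value are all assumptions, since the abstract does not specify the network or the loss weighting.

```python
import math


def dense(inputs, weights, biases):
    """Affine layer: one output per (weight row, bias) pair."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(linguistic_features, params):
    """Shared hidden layer followed by two task-specific linear heads."""
    hidden = [math.tanh(h) for h in
              dense(linguistic_features, params["w_shared"], params["b_shared"])]
    cepstra = dense(hidden, params["w_cep"], params["b_cep"])    # primary task (INT)
    log_amps = dense(hidden, params["w_amp"], params["b_amp"])   # secondary task (DIR)
    return cepstra, log_amps


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(pred_cep, ref_cep, pred_amp, ref_amp, secondary_weight=0.5):
    """Primary loss plus a weighted secondary loss (the weight is a hypothetical choice)."""
    return mse(pred_cep, ref_cep) + secondary_weight * mse(pred_amp, ref_amp)
```

Because both heads read the same hidden representation, gradients from the secondary task also shape the features used by the primary task, which is the usual motivation for multi-task training.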
To further improve the voice quality, phase features generated from the proposed vocoder
also need to be parameterised and integrated into statistical modelling. Here, an alternative
statistical model referred to as the complex-valued neural network (CVNN), which treats complex coefficients as a whole, is proposed to model complex amplitude explicitly. A complex-valued
back-propagation algorithm using a logarithmic minimisation criterion which includes
both amplitude and phase errors is used as a learning rule. Three parameterisation methods
are studied for mapping text to acoustic features: RDC / real-valued log amplitude, complex-valued
amplitude with minimum phase and complex-valued amplitude with mixed phase. Our
results show the potential of using CVNNs for modelling both real and complex-valued acoustic
features. Overall, this thesis has established competitive alternative vocoders for speech
parametrisation and reconstruction. Applying the proposed vocoders with various acoustic
models (HMM / DNN / CVNN) demonstrates that they are a compelling choice for
statistical parametric speech synthesis.
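
As an illustration of a logarithmic criterion that includes both amplitude and phase errors, the sketch below uses one common form, E = 0.5 * [ (ln|y| - ln|t|)^2 + (arg y - arg t)^2 ], for a single complex output y and target t. The abstract does not state the exact criterion, so this form, the phase wrapping and the function name are assumptions.

```python
import cmath
import math


def log_error(predicted, target):
    """Logarithmic error for one complex value: amplitude term plus phase term.

    Both arguments must be non-zero, because log(0) is undefined.
    """
    amp_err = math.log(abs(predicted)) - math.log(abs(target))
    phase_err = cmath.phase(predicted) - cmath.phase(target)
    # Wrap the phase difference into (-pi, pi] so that nearly equal phases
    # on either side of the branch cut are not penalised as far apart.
    phase_err = math.atan2(math.sin(phase_err), math.cos(phase_err))
    return 0.5 * (amp_err ** 2 + phase_err ** 2)


# A doubling of amplitude with no phase difference gives a pure amplitude error.
amplitude_only = log_error(2 + 0j, 1 + 0j)
```

Working in the log domain separates the two error sources: the real part of log(y) - log(t) is the log-amplitude error and the imaginary part is the phase error.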
- …