Parametric speech synthesis has received increased attention in recent years following
the development of statistical HMM-based speech synthesis. However, the speech
produced using this method still does not sound as natural as human speech and there
is limited parametric flexibility to replicate voice quality aspects, such as breathiness.

The hypothesis of this thesis is that speech naturalness and voice quality can be
more accurately replicated by an HMM-based speech synthesiser using an acoustic glottal
source model, the Liljencrants-Fant (LF) model, to represent the source component
of speech instead of the traditional impulse train.

Two different analysis-synthesis methods were developed in this thesis to integrate
the LF-model into a baseline HMM-based speech synthesiser, which is based on the
popular HTS system and uses the STRAIGHT vocoder. The first method,
which is called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model
signal through a glottal post-filter to obtain the source signal and then generating
speech by passing this source signal through the spectral envelope filter. The system
which uses the GPF method (HTS-GPF system) is similar to the baseline system,
but it uses a different source signal instead of the impulse train used by STRAIGHT.
The second method, called Glottal Spectral Separation (GSS), generates speech by
passing the LF-model signal through the vocal tract filter. The major advantage of the
synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic
properties of the LF-model parameters are automatically learnt by the HMMs.
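
The source-filter idea behind both methods can be illustrated with a minimal sketch, in which an excitation signal (an impulse train or a glottal pulse train) is passed through an all-pole filter standing in for the vocal tract. This is not the thesis implementation: the LF-model is replaced here by a much simpler Rosenberg-style pulse, the vocal tract is reduced to a single resonance, and all function names and parameter values (`rosenberg_pulse`, `resonator`, the 500 Hz formant, the 100 Hz pitch) are assumptions chosen only for illustration.

```python
import math


def impulse_train(n_samples, period):
    """Excitation with a unit impulse at the start of every pitch period."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def rosenberg_pulse(period, open_quotient=0.6, speed_quotient=2.0):
    """One period of a Rosenberg-style glottal flow pulse.

    Assumption: this is a simple stand-in for a glottal source model,
    NOT the LF-model used in the thesis.
    """
    n_open = int(round(open_quotient * period))            # open-phase length
    n_rise = int(round(n_open * speed_quotient / (1.0 + speed_quotient)))
    n_fall = n_open - n_rise
    pulse = []
    for n in range(period):
        if n < n_rise:                                      # opening phase
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / n_rise)))
        elif n < n_open:                                    # closing phase
            pulse.append(math.cos(0.5 * math.pi * (n - n_rise) / n_fall))
        else:                                               # closed phase
            pulse.append(0.0)
    return pulse


def glottal_pulse_train(n_samples, period):
    """Excitation built by repeating the glottal pulse every pitch period."""
    pulse = rosenberg_pulse(period)
    return [pulse[n % period] for n in range(n_samples)]


def resonator(freq_hz, bandwidth_hz, fs):
    """Denominator coefficients [a1, a2] of a two-pole resonance."""
    r = math.exp(-math.pi * bandwidth_hz / fs)              # pole radius < 1
    theta = 2.0 * math.pi * freq_hz / fs                    # pole angle
    return [-2.0 * r * math.cos(theta), r * r]


def all_pole_filter(x, a):
    """Apply y[n] = x[n] - sum_k a[k-1] * y[n-k] (a toy vocal tract filter)."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


fs = 16000                        # sampling rate in Hz (illustrative)
period = 160                      # 100 Hz pitch at 16 kHz (illustrative)
a = resonator(500.0, 80.0, fs)    # one formant-like resonance (illustrative)

speech_impulse = all_pole_filter(impulse_train(800, period), a)
speech_glottal = all_pole_filter(glottal_pulse_train(800, period), a)
```

Both outputs share the same filter and differ only in the excitation, which is the component that the GPF and GSS methods replace with an LF-model signal.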

In this thesis, an initial perceptual experiment was conducted to compare the LF-model
to the impulse train. The results showed that the LF-model was significantly
better, both in terms of speech naturalness and replication of two basic voice qualities
(breathy and tense). In a second perceptual evaluation, the HTS-LF system was better
than the baseline system, although the difference between the two was smaller than
expected. A third experiment was conducted to evaluate the HTS-GPF system
and an improved HTS-LF system, in terms of speech naturalness, voice similarity
and intelligibility. The results showed that the HTS-GPF system performed similarly
to the baseline. However, the HTS-LF system was significantly outperformed by the
baseline. Finally, acoustic measurements were performed on the synthetic speech to
investigate the speech distortion in the HTS-LF system. The results indicated that a
problem in replicating the rapid variations of the vocal tract filter parameters at transitions
between voiced and unvoiced sounds is the most significant cause of speech
distortion. This problem encourages future work to further improve the system