Statistical parametric text-to-speech synthesis currently relies on predefined and highly
controlled prompts read in a “neutral” voice. This thesis presents work on utilising
recordings of free conversation for the purpose of filled pause synthesis and as an
inspiration for improved general modelling of speech for text-to-speech synthesis purposes.
A corpus of both standard prompts and free conversation is presented and the
potential usefulness of conversational speech as the basis for text-to-speech voices
is validated. Additionally, through psycholinguistic experimentation it is shown that
filled pauses can have subconscious benefits for the listener but that current
text-to-speech voices cannot replicate these effects. A method for pronunciation variant
forced alignment is presented in order to obtain a more accurate automatic speech
segmentation, which is otherwise particularly poor for spontaneously produced speech.
This pronunciation variant alignment is utilised not only to create a more accurate underlying
acoustic model, but also as the driving force behind creating more natural
pronunciation prediction at synthesis time. While this improves both the standard and
spontaneous voices, the naturalness of spontaneous speech based voices still lags behind
the quality of voices based on standard read prompts. Thus, the synthesis of filled
pauses is investigated in relation to specific phonetic modelling of filled pauses and
through techniques for the mixing of standard prompts with spontaneous utterances in
order to retain the higher quality of standard speech based voices while still utilising
the spontaneous speech for filled pause modelling. A method for predicting where to
insert filled pauses in the speech stream is also developed and presented, relying on
an analysis of human filled pause usage and a mix of language modelling methods.
The method achieves an insertion accuracy in close agreement with human usage. The
various approaches are evaluated and their improvements documented throughout the
thesis. Finally, the resulting filled pause quality is assessed through a repetition
of the psycholinguistic experiments and an evaluation of the combination of all
developed methods.