Conventional synthetic voices can synthesise neutral read aloud speech well. However, to
make synthetic speech suitable for a wider range of applications, the voices need
to express more than just word identity. We need to develop voices that can take part
in a conversation and express, for example, agreement, disagreement and hesitation in a natural
and believable manner.
In speech synthesis there are currently two dominating frameworks: unit selection
and HMM-based speech synthesis. Both frameworks utilise recordings of human
speech to build synthetic voices. Although the content of the recordings determines
the segmental and prosodic phenomena that can be synthesised, surprisingly
little research has been done on utilising the corpus to extend the limited behaviour
of conventional synthetic voices. In this thesis we show how natural sounding
conversational characteristics can be added to both unit selection and HMM-based
synthetic voices, by adding speech from a spontaneous conversation to the voices.
We recorded a spontaneous conversation, and by manually transcribing and selecting
utterances we obtained approximately two thousand utterances from it. These conversational
utterances were rich in conversational speech phenomena, but they lacked
the general coverage that allows unit selection and HMM-based synthesis techniques
to synthesise high quality speech. Therefore, we investigated a number of approaches
to blending the two types of speech in the synthetic voices, in which the conversational
utterances were augmented with conventional read aloud speech.
The synthetic voices that contained conversational speech were contrasted with
conventional voices without conversational speech. The perceptual evaluations showed
that the conversational voices were generally perceived by listeners as having a more
conversational style than the conventional voices. This conversational style was largely
due to the conversational voices’ ability to synthesise utterances that contained conversational
speech phenomena in a more natural manner than the conventional voices.
Additionally, we conducted an experiment showing that natural sounding conversational
characteristics in synthetic speech can convey pragmatic information to a listener, in our
case an impression of certainty or uncertainty about a topic. We conclude
that the limited behaviour of conventional synthetic voices can be
enriched by utilising conversational speech in both unit selection and HMM-based
speech synthesis.