Recent development of the HMM-based speech synthesis system (HTS)
A statistical parametric approach to speech synthesis based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach, the spectrum, excitation, and duration of speech are simultaneously modeled by context-dependent HMMs, and speech waveforms are generated from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named "HMM-based speech synthesis system (HTS)" to provide a research and development platform for statistical parametric speech synthesis. This paper describes recent developments of HTS in detail, as well as future release plans.
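For orientation, the two objectives at the core of this framework can be written compactly; this is a textbook-style sketch rather than the paper's full formulation (HTS adds refinements such as multi-space probability distributions and explicit duration models):

    Training:   \hat{\lambda} = \arg\max_{\lambda} P(O \mid W, \lambda)
    Synthesis:  \hat{o} = \arg\max_{o} P(o \mid w, \hat{\lambda}),  subject to  o = W_d c

where O are the observed speech parameters, W the training transcriptions, w the text to be synthesised, and W_d the window matrix tying the static features c to their dynamic (delta) counterparts, which is what makes the generated parameter trajectories smooth.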
Double Articulation Analyzer with Prosody for Unsupervised Word and Phoneme Discovery
Infants acquire words and phonemes from unsegmented speech signals using segmentation cues, such as distributional, prosodic, and co-occurrence cues. Many pre-existing computational models of this process focus on distributional or prosodic cues alone. This paper proposes a nonparametric Bayesian probabilistic generative model called the prosodic hierarchical Dirichlet process hidden language model (Prosodic HDP-HLM). Prosodic HDP-HLM, an extension of HDP-HLM, considers both prosodic and distributional cues within a single integrative generative model. We conducted three experiments on different types of datasets and demonstrated the validity of the proposed method. The results show that the prosodic double articulation analyzer (Prosodic DAA) successfully uses prosodic cues and outperforms a method that uses distributional cues alone. The main contributions of this study are as follows: 1) we develop a probabilistic generative model for time-series data including prosody that potentially has a double articulation structure; 2) we propose the Prosodic DAA by deriving the inference procedure for Prosodic HDP-HLM and show that the Prosodic DAA can discover words directly from continuous human speech signals using statistical and prosodic information in an unsupervised manner; 3) we show that prosodic cues contribute more to word segmentation when word frequencies are naturally distributed, i.e., follow Zipf's law.
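Schematically, the double articulation assumption posits two latent layers between the signal and the lexicon; the following is a simplified rendering of that structure, not the paper's exact factorization:

    w_{1:J} ~ HDP language model                  (latent word sequence)
    w_j = (l_{j,1}, ..., l_{j,K_j})               (each word is a sequence of phoneme-like units)
    y_t ~ p(y_t | l)                              (acoustic features emitted by the active unit)
    f_t ~ p(f_t | l)                              (parallel prosodic features, e.g. F0-based cues)

The parallel prosodic stream f_t is what distinguishes Prosodic HDP-HLM from HDP-HLM: boundary-related prosodic evidence can sharpen the posterior over word segmentations beyond what distributional statistics alone provide.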
Prosody takes over: towards a prosodically guided dialog system
The domain of the speech recognition and dialog system EVAR is train timetable inquiries. We observed that in real human-human dialogs, the customer very often interrupts while the officer is transmitting the information. Many of these interruptions are just repetitions of the time of day given by the officer, and the functional role of these interruptions is often determined by prosodic cues only. An important result of experiments in which naive persons used the EVAR system is that it is hard to follow a train connection given via speech synthesis. In this case it is even more important than in human-human dialogs that the user has the opportunity to interact during the answer phase. Therefore we extended the dialog module to allow the user to repeat the time of day, and we added a prosody module that guides the continuation of the dialog by analyzing the intonation contour of this utterance.
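As a hypothetical illustration of the kind of decision such a prosody module makes (not EVAR's actual implementation; all names below are invented for the sketch), the boundary tone of the user's repeated time of day can be classified from the slope of the final stretch of the F0 contour:

    import numpy as np

    def classify_boundary_tone(f0_hz, voiced, tail_fraction=0.3):
        # Keep only voiced frames; F0 is undefined elsewhere.
        f0 = np.asarray(f0_hz, dtype=float)[np.asarray(voiced, dtype=bool)]
        n_tail = max(2, int(len(f0) * tail_fraction))
        tail = np.log(f0[-n_tail:])            # log-F0 approximates perceived pitch
        slope = np.polyfit(np.arange(n_tail), tail, 1)[0]
        return "rising" if slope > 0 else "falling"

    def next_dialog_act(tone):
        # A rising repetition is treated as a clarification question,
        # a falling one as a confirmation that lets the answer continue.
        return "repeat_information" if tone == "rising" else "continue_answer"

Under this toy rule, a question-like rising contour on the repeated time of day triggers a repetition or elaboration of the connection, while a falling contour lets the system proceed with its answer.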
Syntactic surprisal affects spoken word duration in conversational contexts
We present results of a novel experiment investigating speech production in conversational data, linking speech rate to information density. We provide the first evidence for an association between syntactic surprisal and word duration in recorded speech. Using the AMI corpus, which contains transcriptions of focus group meetings with precise word durations, we show that word durations correlate with syntactic surprisal estimated from the incremental Roark parser, over and above simpler measures such as word duration estimated from a state-of-the-art text-to-speech system and word frequencies, and that the syntactic surprisal estimates are better predictors of word durations than a simpler version of surprisal based on trigram probabilities. This result supports the uniform information density (UID) hypothesis and points the way to more realistic artificial speech generation.
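For reference, the measures being compared differ only in how the conditional probability of a word is estimated; these are the standard definitions, into which the paper's concrete estimators (Roark parser, trigram model) are plugged:

    s(w_i)      = -\log_2 P(w_i \mid w_1, ..., w_{i-1})                      (surprisal)
    s_tri(w_i)  = -\log_2 P(w_i \mid w_{i-2}, w_{i-1})                       (trigram approximation)
    s_syn(w_i)  = -\log_2 [ P(w_1 ... w_i) / P(w_1 ... w_{i-1}) ]            (from a parser's prefix probabilities)

Under the UID hypothesis, speakers lengthen high-surprisal words, spreading information more evenly over time.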
Building and Designing Expressive Speech Synthesis
We know there is something special about speech. Our voices are not just a means of communicating; they also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact, support, and mediate our social relationships with 1) each other, 2) digital information, and, increasingly, 3) AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future "spoken language will provide a natural conversational interface between human beings and so-called intelligent systems" [Moore 2017, p. 283]. A considerable amount of previous research has tested this assumption, with mixed results; as pointed out, "voice interfaces have become notorious for fostering frustration and failure" [Nass and Brave 2005, p. 6].

It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and the underlying intelligence of the system, which is something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. The challenge is even greater in evaluation and in characterising full systems that have made use of expressive speech. Furthermore, when designing an interaction with a SIA, we must consider not only how SIAs should speak but how much, and whether they should even speak at all.

These considerations cannot be ignored. Any speech synthesis used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion, and an intonational model. Dimensions like accent and personality (cross-speaker parameters) as well as vocal style, emotion, and intonation during an interaction (within-speaker parameters) need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived, and its assumed ability to perform a task or function. To ignore these is to blindly accept a set of design decisions that disregards the complex effect speech has on the user's successful interaction with a system. Thus expressive speech synthesis is a key design component in SIAs.

This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building, and evaluation of such artificial speech. The debates and literature within this topic are vast and fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics, and human-computer interaction (HCI), to name a few.
It is not our aim to synthesise these areas, but to give the reader a scaffold and a starting point by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development. Yet, before these are expanded upon, we must first try to define what we actually mean by expressive speech.
Synthesis and Evaluation of Conversational Characteristics in Speech Synthesis
Conventional synthetic voices can synthesise neutral read-aloud speech well, but to make synthetic speech suitable for a wider range of applications, the voices need to express more than just word identity. We need to develop voices that can partake in a conversation and express, e.g., agreement, disagreement, and hesitation in a natural and believable manner.

In speech synthesis there are currently two dominant frameworks: unit selection and HMM-based speech synthesis. Both frameworks utilise recordings of human speech to build synthetic voices. Despite the fact that the content of the recordings determines the segmental and prosodic phenomena that can be synthesised, surprisingly little research has been done on utilising the corpus to extend the limited behaviour of conventional synthetic voices. In this thesis we show how natural-sounding conversational characteristics can be added to both unit selection and HMM-based synthetic voices, by adding speech from a spontaneous conversation to the voices.

We recorded a spontaneous conversation, and by manually transcribing and selecting utterances we obtained approximately two thousand utterances from it. These conversational utterances were rich in conversational speech phenomena, but they lacked the general coverage that allows unit selection and HMM-based synthesis techniques to synthesise high-quality speech. Therefore we investigated a number of blending approaches in the synthetic voices, where the conversational utterances were augmented with conventional read-aloud speech.
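As a toy illustration of one way blended data can steer a unit-selection voice (the thesis evaluates several concrete blending approaches; every name here is hypothetical), a style tag on each unit can enter the target cost so that conversational units are preferred when a conversational rendition is requested:

    from dataclasses import dataclass

    @dataclass
    class Unit:
        phone: str
        style: str        # "read" or "conversational"
        join_cost: float  # stand-in for the usual acoustic join cost

    def target_cost(unit, requested_style, style_weight=1.0):
        # Penalise units whose speaking style differs from the request.
        return style_weight * (0.0 if unit.style == requested_style else 1.0)

    def select(candidates, requested_style):
        # Greedy stand-in for the Viterbi search over the candidate lattice.
        return min(candidates, key=lambda u: target_cost(u, requested_style) + u.join_cost)

    candidates = [Unit("a", "read", 0.2), Unit("a", "conversational", 0.5)]
    print(select(candidates, "conversational").style)   # -> conversational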
The synthetic voices that contained conversational speech were contrasted with conventional voices without conversational speech. The perceptual evaluations showed that the conversational voices were generally perceived by listeners as having a more conversational style than the conventional voices. This conversational style was largely due to the conversational voices' ability to synthesise utterances containing conversational speech phenomena in a more natural manner than the conventional voices. Additionally, we conducted an experiment showing that natural-sounding conversational characteristics in synthetic speech can convey pragmatic information, in our case an impression of certainty or uncertainty about a topic, to a listener. The conclusion drawn is that the limited behaviour of conventional synthetic voices can be enriched by utilising conversational speech in both unit selection and HMM-based speech synthesis.