458 research outputs found
Towards Hierarchical Prosodic Prominence Generation in TTS Synthesis
We address the problem of identifying (from text) and generating pitch accents in HMM-based English TTS synthesis. We show, through a large-scale perceptual test, that a large improvement in the binary discrimination between pitch-accented and non-accented words has no effect on the quality of the speech generated by the system. On the other hand, adding a third accent type that emphatically marks words conveying "contrastive" focus (automatically identified from text) has beneficial effects on the synthesized speech. These results support accounts of prosodic prominence that treat the prosodic patterns of utterances as hierarchically structured, and point out the limits of flattening such structure into a simple accent/non-accent distinction. Index Terms: speech synthesis, HMM, pitch accents, focus detection
Glottal Source and Prosodic Prominence Modelling in HMM-based Speech Synthesis for the Blizzard Challenge 2009
This paper describes the CSTR entry for the Blizzard Challenge 2009. The work focused on modifying two parts of the Nitech 2005 HTS speech synthesis system to improve naturalness and contextual appropriateness. The first part incorporated an implementation of the Liljencrants-Fant (LF) glottal source model. The second part focused on improving the synthesis of prosodic prominence, including emphasis, through context-dependent phonemes. Emphasis was assigned to the synthesised test sentences based on a handful of theory-based rules. The two parts (LF-model and prosodic prominence) were not combined and were hence evaluated separately. The results on naturalness for the LF-model showed that it is not yet perceived as being as natural as the benchmark HTS system for neutral speech. The results for the prosodic prominence modelling showed that it was perceived as being as contextually appropriate as the benchmark HTS system, despite a low naturalness score. The Blizzard Challenge evaluation has provided valuable information on the status of our work, and continued work will begin with analysing why our modifications resulted in reduced naturalness compared to the benchmark HTS system.
Identifying prosodic prominence patterns for English text-to-speech synthesis
This thesis proposes to improve and enrich the expressiveness of English Text-to-Speech (TTS) synthesis by identifying and generating natural patterns of prosodic
prominence.
In most state-of-the-art TTS systems, the prediction from text of prosodic prominence relations between words in an utterance relies on features that only very loosely account for the combined effects of syntax, semantics, word informativeness and salience on prosodic prominence.
To improve prosodic prominence prediction we first follow the classic approach in which prosodic prominence patterns are flattened into binary sequences of pitch-accented and pitch-unaccented words. We propose and motivate statistical and syntactic-dependency-based features that are complementary to the most predictive features proposed in previous work on automatic pitch accent prediction, and show their utility on both read and spontaneous speech.
Different accentuation patterns can be associated with the same sentence. This variability raises the question of how to evaluate pitch accent predictors when multiple patterns are allowed. We carry out a study of the variability of prosodic symbols on a speech corpus in which different speakers read the same text, and propose an information-theoretic definition of the optionality of symbolic prosodic events that leads to a novel evaluation metric in which prosodic variability is incorporated as a factor affecting prediction accuracy. We additionally propose a method to take advantage of the optionality of prosodic events in unit-selection speech synthesis.
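The idea of an entropy-based optionality measure folded into an evaluation metric can be sketched concretely. The sketch below is a hypothetical formalization, not the thesis's actual definitions: it scores a word's optionality as the normalized entropy of the accent labels it received across speakers, then discounts prediction errors on highly optional words.

```python
import math
from collections import Counter

def optionality(labels):
    """Normalized entropy of the accent labels one word received across
    speakers: 0.0 when all speakers agree, 1.0 when labels are uniform."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))

def weighted_accuracy(predictions, speaker_labels):
    """Score each prediction against the majority label, weighting each
    word by its agreement (1 - optionality), so that errors on words
    whose accentuation varies freely across speakers count less."""
    score, weight_sum = 0.0, 0.0
    for pred, labels in zip(predictions, speaker_labels):
        w = 1.0 - optionality(labels)
        majority = Counter(labels).most_common(1)[0][0]
        score += w * (1.0 if pred == majority else 0.0)
        weight_sum += w
    return score / weight_sum if weight_sum else 0.0

# Toy corpus: three words, accent labels from four speakers each.
labels = [["acc", "acc", "acc", "acc"],      # full agreement: weight 1
          ["acc", "none", "acc", "none"],    # fully optional: weight 0
          ["none", "none", "none", "acc"]]   # mostly agreed
preds = ["acc", "none", "none"]
acc = weighted_accuracy(preds, labels)
```

Under this weighting, the disagreement on the second word does not penalize the predictor at all, which is the intended effect of treating variability as optionality rather than as error.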
To better account for the tight links between the prosodic prominence of a word and the discourse/sentence context, part of this thesis goes beyond the accent/no-accent dichotomy and is devoted to a novel task: the automatic detection of contrast, where contrast is meant as a relation (in the sense of Information Structure) that ties two words that explicitly contrast with each other. This task is mainly motivated by the fact that contrastive words tend to be prosodically marked with particularly prominent pitch accents. The identification of contrastive word pairs is achieved by combining lexical information, syntactic information (which mainly aims to identify the syntactic parallelism that often activates contrast) and semantic information (mainly drawn from the WordNet semantic lexicon) within a Support Vector Machines classifier.
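The pipeline described here — lexical, syntactic, and semantic features for a candidate word pair fed to a classifier — can be illustrated schematically. Everything below is a toy stand-in: the synset table replaces WordNet, the features are illustrative, and a plain perceptron substitutes for the thesis's SVM to keep the sketch dependency-free.

```python
# Tiny hand-made "synset" table standing in for the WordNet lexicon.
SYNSETS = {"cat": {"animal"}, "dog": {"animal"}, "table": {"furniture"}}

def pair_features(w1, pos1, w2, pos2, synsets=SYNSETS):
    """Illustrative feature vector for a candidate contrastive pair:
    syntactic (matching POS, a crude proxy for parallelism),
    lexical (identical surface form), semantic (shared synset)."""
    same_pos = 1.0 if pos1 == pos2 else 0.0
    same_word = 1.0 if w1 == w2 else 0.0
    shared = 1.0 if synsets.get(w1, set()) & synsets.get(w2, set()) else 0.0
    return [same_pos, same_word, shared, 1.0]  # last entry is a bias term

def classify(w, x):
    """Linear decision rule: +1 = contrastive pair, -1 = not."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(data, epochs=20):
    """Plain perceptron update; an SVM would learn a similar linear
    separator on these features with a max-margin objective."""
    w = [0.0] * 4
    for _ in range(epochs):
        for x, y in data:
            if classify(w, x) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Hand-labelled toy pairs: +1 = contrastive, -1 = not.
train = [
    (pair_features("cat", "NN", "dog", "NN"), 1),     # parallel, same category
    (pair_features("cat", "NN", "runs", "VB"), -1),   # no parallelism
    (pair_features("dog", "NN", "table", "NN"), -1),  # parallel but unrelated
]
w = train_perceptron(train)
```

On this toy data the learned weights pick out the conjunction of syntactic parallelism and semantic relatedness, which mirrors the intuition that "the CAT ran but the DOG stayed" activates contrast while "dog"/"table" does not.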
Once we have identified patterns of prosodic prominence, we propose methods to incorporate this information into TTS synthesis and test its impact on the naturalness of synthetic speech through large-scale perceptual experiments. The results of these experiments cast doubt on the utility of a simple accent/no-accent distinction in Hidden Markov Model based speech synthesis, while highlighting the importance of contrastive accents.
Altering speech synthesis prosody through real time natural gestural control
A significant amount of research has been, and continues to be, undertaken into generating expressive prosody within speech synthesis. Separately, recent developments in HMM-based synthesis (specifically pHTS, developed at the University of Mons) provide a platform for reactive speech synthesis, able to react in real time to its surroundings or to user interaction.
Considering both of these elements, this project explores whether it is possible to generate superior prosody in a speech synthesis system, using natural gestural controls, in real time. Building on a previous piece of work undertaken at The University of Edinburgh, a system is constructed in which a user may apply a variety of prosodic effects in real time through natural gestures recognised by a Microsoft Kinect sensor. Recognised gestures are mapped to prosodic adjustments through a series of hand-crafted rules (based on data gathered from preliminary experiments), though machine learning techniques are also considered within this project and recommended for future iterations of the work.
Two sets of formal experiments are implemented, both of which suggest that, with further development, the system may work successfully in a real-world environment. Firstly, user tests show that subjects can learn to control the device successfully, adding prosodic effects to the intended words in the majority of cases with practice; results are likely to improve further as buffering issues are resolved. Secondly, listening tests show that the prosodic effects currently implemented significantly increase perceived naturalness, and in some cases are able to alter the semantic perception of a sentence in an intended way.
Alongside this paper, a demonstration video of the project may be found on the accompanying CD, or online at http://tinyurl.com/msc-synthesis. The reader is advised to view this demonstration as a way of understanding how the system functions and sounds in action.
Hesitations in Spoken Dialogue Systems
Betz S. Hesitations in Spoken Dialogue Systems. Bielefeld: Universität Bielefeld; 2020
Akustische Phonetik und ihre multidisziplinären Aspekte
The aim of this book is to honor the multidisciplinary work of Doz. Dr. Sylvia Moosmüller (†) in the field of acoustic phonetics. The essays in this volume range from sociophonetics, language diagnostics and dialectology to language technology. They thus exemplify the breadth of acoustic phonetics, which has been shaped by influences from the humanities and the technical sciences since its beginnings.
- …