458 research outputs found

    Towards Hierarchical Prosodic Prominence Generation in TTS Synthesis

    Get PDF
    We address the problem of identification (from text) and generation of pitch accents in HMM-based English TTS synthesis. We show, through a large scale perceptual test, that a large improvement of the binary discrimination between pitch accented and non-accented words has no effect on the quality of the speech generated by the system. On the other side adding a third accent type that emphatically marks words that convey ”contrastive” focus (automatically identified from text) produces beneficial effects on the synthesized speech. These results support the accounts on prosodic prominence that consider the prosodic patterns of utterances as hierarchical structured and point out the limits of a flattening of such structure resulting from a simple accent/non-accent distinction. Index Terms: speech synthesis, HMM, pitch accents, focus detection 1

    Glottal Source and Prosodic Prominence Modelling in HMM-based Speech Synthesis for the Blizzard Challenge 2009

    Get PDF
    This paper describes the CSTR entry for the Blizzard Challenge 2009. The work focused on modifying two parts of the Nitech 2005 HTS speech synthesis system to improve naturalness and contextual appropriateness. The first part incorporated an implementation of the Linjencrants-Fant (LF) glottal source model. The second part focused on improving synthesis of prosodic prominence including emphasis through context dependent phonemes. Emphasis was assigned to the synthesised test sentences based on a handful of theory based rules. The two parts (LF-model and prosodic prominence) were not combined and hence evaluated separately. The results on naturalness for the LF-model showed that it is not yet perceived as natural as the Benchmark HTS system for neutral speech. The results for the prosodic prominence modelling showed that it was perceived as contextually appropriate as the Benchmark HTS system, despite a low naturalness score. The Blizzard challenge evaluation has provided valuable information on the status of our work and continued work will begin with analysing why our modifications resulted in reduced naturalness compared to the Benchmark HTS system

    Identifying prosodic prominence patterns for English text-to-speech synthesis

    Get PDF
    This thesis proposes to improve and enrich the expressiveness of English Text-to-Speech (TTS) synthesis by identifying and generating natural patterns of prosodic prominence. In most state-of-the-art TTS systems the prediction from text of prosodic prominence relations between words in an utterance relies on features that very loosely account for the combined effects of syntax, semantics, word informativeness and salience, on prosodic prominence. To improve prosodic prominence prediction we first follow up the classic approach in which prosodic prominence patterns are flattened into binary sequences of pitch accented and pitch unaccented words. We propose and motivate statistic and syntactic dependency based features that are complementary to the most predictive features proposed in previous works on automatic pitch accent prediction and show their utility on both read and spontaneous speech. Different accentuation patterns can be associated to the same sentence. Such variability rises the question on how evaluating pitch accent predictors when more patterns are allowed. We carry out a study on prosodic symbols variability on a speech corpus where different speakers read the same text and propose an information-theoretic definition of optionality of symbolic prosodic events that leads to a novel evaluation metric in which prosodic variability is incorporated as a factor affecting prediction accuracy. We additionally propose a method to take advantage of the optionality of prosodic events in unit-selection speech synthesis. To better account for the tight links between the prosodic prominence of a word and the discourse/sentence context, part of this thesis goes beyond the accent/no-accent dichotomy and is devoted to a novel task, the automatic detection of contrast, where contrast is meant as a (Information Structure’s) relation that ties two words that explicitly contrast with each other. This task is mainly motivated by the fact that contrastive words tend to be prosodically marked with particularly prominent pitch accents. The identification of contrastive word pairs is achieved by combining lexical information, syntactic information (which mainly aims to identify the syntactic parallelism that often activates contrast) and semantic information (mainly drawn from the Word- Net semantic lexicon), within a Support Vector Machines classifier. Once we have identified patterns of prosodic prominence we propose methods to incorporate such information in TTS synthesis and test its impact on synthetic speech naturalness trough some large scale perceptual experiments. The results of these experiments cast some doubts on the utility of a simple accent/no-accent distinction in Hidden Markov Model based speech synthesis while highlight the importance of contrastive accents

    Altering speech synthesis prosody through real time natural gestural control

    Get PDF
    A significant amount of research has been and continues to be undertaken into generating expressive prosody within speech synthesis. Separately, recent developments in HMM-based synthesis (specifically pHTS, developed at University of Mons) provide a platform for reactive speech synthesis, able to react in real time to surroundings or user interaction. Considering both of these elements, this project explores whether it is possible to generate superior prosody in a speech synthesis system, using natural gestural controls, in real time. Building on a previous piece of work undertaken at The University of Edinburgh, a system is constructed in which a user may apply a variety of prosodic effects in real time through natural gestures, recognised by a Microsoft Kinect sensor. Gestures are recognised and prosodic adjustments made through a series of hand-crafted rules (based on data gathered from preliminary experiments), though machine learning techniques are also considered within this project and recommended for future iterations of the work. Two sets of formal experiments are implemented, both of which suggest that - under further development - the system developed may work successfully in a real world environment. Firstly, user tests show that subjects can learn to control the device successfully, adding prosodic effects to the intended words in the majority of cases with practice. Results are likely to improve further as buffering issues are resolved. Secondly, listening tests show that the prosodic effects currently implemented significantly increase perceived naturalness, and in some cases are able to alter the semantic perception of a sentence in an intended way. Alongside this paper, a demonstration video of the project may be found on the accompanying CD, or online at http://tinyurl.com/msc-synthesis. The reader is advised to view this demonstration, as a way of understanding how the system functions and sounds in action

    Hesitations in Spoken Dialogue Systems

    Get PDF
    Betz S. Hesitations in Spoken Dialogue Systems. Bielefeld: Universität Bielefeld; 2020

    Akustische Phonetik und ihre multidisziplinären Aspekte

    Get PDF
    The aim of this book is to honor the multidisciplinary work of Doz. Dr. Sylvia Moosmüller† in the field of acoustic phonetics. The essays in this volume range from sociophonetics, language diagnostics, dialectology, to language technology. They thus exemplify the breadth of acoustic phonetics, which has been shaped by influences from the humanities and technical sciences since its beginnings.Ziel dieses Buches ist es, die multidisziplinäre Arbeit von Doz. Dr. Sylvia Moosmüller (†) im Bereich der akustischen Phonetik zu würdigen. Die Aufsätze in diesem Band sind in der Soziophonetik, Sprachdiagnostik, Dialektologie und Sprachtechnologie angesiedelt. Sie stellen damit exemplarisch die Breite der akustischen Phonetik dar, die seit ihrer Entstehung durch Einflüsse aus den Geisteswissenschaften und den technischen Wissenschaften geprägt war
    • …
    corecore