215 research outputs found

    Intonation in a text-to-speech conversion system

    Get PDF

    A Timing Model for Fast French

    Get PDF
    Models of speech timing are of both fundamental and applied interest. At the fundamental level, the prediction of time periods occupied by syllables and segments is required for general models of speech prosody and segmental structure. At the applied level, complete models of timing are an essential component of any speech synthesis system. Previous research has established that a large number of factors influence various levels of speech timing. Statistical analysis and modelling can identify order of importance and mutual influences between such factors. In the present study, a three-tiered model was created by a modified step-wise statistical procedure. It predicts the temporal structure of French, as produced by a single, highly fluent speaker at a fast speech rate (100 phonologically balanced sentences, hand-scored in the acoustic signal). The first tier models segmental influences due to phoneme type and contextual interactions between phoneme types. The second tier models syllable-level influences of lexical vs. grammatical status of the containing word, presence of schwa and the position within the word. The third tier models utterance-final lengthening. The complete segmental-syllabic model correlated with the original corpus of 1204 syllables at an overall r = 0.846. Residuals were normally distributed. An examination of subsets of the data set revealed some variation in the closeness of fit of the model. The results are considered to be useful for an initial timing model, particularly in a speech synthesis context. However, further research is required to extend the model to other speech rates and to examine inter-speaker variability in greater detail

    Quality evaluation of synthesized speech

    Get PDF
    Fonetische correlaten en communicatieve functies van linguïstische structuu

    On the Usability of Spoken Dialogue Systems

    Get PDF

    MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning

    Full text link
    In this paper, we present a methodology for linguistic feature extraction, focusing particularly on automatically syllabifying words in multiple languages, with a design to be compatible with a forced-alignment tool, the Montreal Forced Aligner (MFA). In both the textual and phonetic domains, our method focuses on the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification (in text and phonetic domains). The system was built with open-source components and resources. Through an ablation study, we demonstrate the efficacy of our approach in automatically syllabifying words from several languages (English, French and Spanish). Additionally, we apply the technique to the transcriptions of the CMU ARCTIC dataset, generating valuable annotations available online\footnote{\url{https://github.com/noetits/MUST_P-SRL}} that are ideal for speech representation learning, speech unit discovery, and disentanglement of speech factors in several speech-related fields.Comment: Accepted for publication at EMNLP 202

    Fast Speech in Unit Selection Speech Synthesis

    Get PDF
    Moers-Prinz D. Fast Speech in Unit Selection Speech Synthesis. Bielefeld: Universität Bielefeld; 2020.Speech synthesis is part of the everyday life of many people with severe visual disabilities. For those who are reliant on assistive speech technology the possibility to choose a fast speaking rate is reported to be essential. But also expressive speech synthesis and other spoken language interfaces may require an integration of fast speech. Architectures like formant or diphone synthesis are able to produce synthetic speech at fast speech rates, but the generated speech does not sound very natural. Unit selection synthesis systems, however, are capable of delivering more natural output. Nevertheless, fast speech has not been adequately implemented into such systems to date. Thus, the goal of the work presented here was to determine an optimal strategy for modeling fast speech in unit selection speech synthesis to provide potential users with a more natural sounding alternative for fast speech output

    Observations on the dynamic control of an articulatory synthesizer using speech production data

    Get PDF
    This dissertation explores the automatic generation of gestural score based control structures for a three-dimensional articulatory speech synthesizer. The gestural scores are optimized in an articulatory resynthesis paradigm using a dynamic programming algorithm and a cost function which measures the deviation from a gold standard in the form of natural speech production data. This data had been recorded using electromagnetic articulography, from the same speaker to which the synthesizer\u27s vocal tract model had previously been adapted. Future work to create an English voice for the synthesizer and integrate it into a text-to-speech platform is outlined.Die vorliegende Dissertation untersucht die automatische Erzeugung von gesturalpartiturbasierten Steuerdaten für ein dreidimensionales artikulatorisches Sprachsynthesesystem. Die gesturalen Partituren werden in einem artikulatorischen Resynthese-Paradigma mittels dynamischer Programmierung optimiert, unter Zuhilfenahme einer Kostenfunktion, die den Abstand zu einem "Gold Standard" in Form natürlicher Sprachproduktionsdaten mißt. Diese Daten waren mit elektromagnetischer Artikulographie am selben Sprecher aufgenommen worden, an den zuvor das Vokaltraktmodell des Synthesesystems angepaßt worden war. Weiterführende Forschung, eine englische Stimme für das Synthesesystem zu erzeugen und sie in eine Text-to-Speech-Plattform einzubetten, wird umrissen

    Further Investigation of MDS as a Tool for Evaluation of Speech Quality of Synthesized Speech

    Get PDF
    The dissertation investigates MDS as a tool for the evaluation of the quality of synthesized speech. More specifically, it investigates the relations between Weighted Euclidean Distance Scaling and Simple Euclidean Distance Scaling, and how aggregating data affects the MDS configuration. It is investigated to what extent a subset of experimental participants and/or experimental stimuli are representative of a larger test set. For that purpose an experiment was conducted on the basis of a subset of stimuli used in the Blizzard Challenge 2008. Issues in the evaluation of Speech Synthesis are discussed and an overview of the basics of multi-dimensional scaling is given to an extent that allows comprehension of methods used in the application of Multi-dimensional scaling to speech synthesis evaluation. Based on the experimental findings, further experiments are suggested with the goal in mind that testing procedures can be optimized to such an extent that the number of experimental participants can be drastically reduced
    corecore