
    An introduction to statistical parametric speech synthesis


    Development of the Slovak HMM-Based TTS System and Evaluation of Voices in Respect to the Used Vocoding Techniques

    This paper describes the development of a Slovak text-to-speech system in which speech is synthesized directly from hidden Markov models. Statistical models for Slovak speech units are trained on newly created, phonetically balanced female and male speech corpora. In addition, contextual information about phonemes, syllables, words, phrases, and utterances was determined, as well as questions for decision tree-based context clustering algorithms. Recent statistical parametric speech synthesis methods, including the conventional, STRAIGHT, and AHOcoder speech synthesis systems, are implemented and evaluated. Objective evaluation methods (mel-cepstral distortion and fundamental frequency comparison) and subjective ones (mean opinion score and a semantically unpredictable sentences test) are used to compare these systems with each other and to assess their overall quality. The result of this work is a set of text-to-speech systems for the Slovak language characterized by very good intelligibility and quite good naturalness. In the subjective intelligibility tests, the STRAIGHT-based female voice and the AHOcoder-based male voice reached the highest scores.
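
    The abstract above relies on mel-cepstral distortion (MCD) as its main objective measure. As a rough illustration, a commonly used formulation of MCD is sketched below in Python; it assumes the reference and synthesized frames are already time-aligned (e.g., via dynamic time warping), and the exact variant used in the paper may differ.

```python
# A minimal sketch of the mel-cepstral distortion (MCD) objective measure;
# frame alignment (e.g., DTW) is assumed to have been done already.
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """MCD in dB between two aligned mel-cepstral sequences.

    ref, syn: arrays of shape (n_frames, n_coeffs); the 0th (energy)
    coefficient is conventionally excluded, as done here.
    """
    diff = ref[:, 1:] - syn[:, 1:]
    dist_per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(dist_per_frame)

# Toy usage with random coefficients standing in for real extractions.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 25))
syn = ref + rng.normal(scale=0.1, size=ref.shape)
print(f"MCD: {mel_cepstral_distortion(ref, syn):.2f} dB")
```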

    HMM-based synthesis of child speech

    The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesiser from that data. Because only limited data can be collected, and the domain of that data is constrained, it is difficult to obtain the type of phonetically balanced corpus usually used in speech synthesis, which in turn makes building a synthesiser difficult. Concatenative synthesisers are not robust to corpora with many missing units (as is likely when the corpus content is not carefully designed), so we chose to build a statistical parametric synthesiser using the HMM-based system HTS. This technique has previously been shown to perform well with limited amounts of data and with data collected under imperfect conditions. We compared six different configurations of the synthesiser, using both speaker-dependent and speaker-adaptive modelling techniques and varying amounts of data. The output from these systems was evaluated alongside natural and vocoded speech in a Blizzard-style listening test.
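
    The abstract does not detail how the Blizzard-style listening test was scored; as a hedged illustration of how such results are often summarized, the sketch below computes a mean opinion score per system with a bootstrap confidence interval. The system names and ratings are invented for the example.

```python
# Illustrative only: one common way to summarize a Blizzard-style listening
# test is a mean opinion score (MOS) per system with a bootstrap confidence
# interval. The systems and scores here are made up for the example.
import numpy as np

def mos_with_ci(scores, n_boot=10000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    boots = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

ratings = {  # hypothetical 1-5 ratings from listeners
    "natural": [5, 5, 4, 5, 4, 5, 5, 4],
    "vocoded": [4, 4, 3, 4, 4, 3, 4, 4],
    "speaker-adaptive": [3, 4, 3, 3, 4, 3, 3, 4],
    "speaker-dependent": [3, 3, 2, 3, 3, 2, 3, 3],
}
for system, r in ratings.items():
    mean, lo, hi = mos_with_ci(r)
    print(f"{system:18s} MOS {mean:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```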

    Synthesis of listener vocalizations: towards interactive speech synthesis

    Spoken and multi-modal dialogue systems are starting to use listener vocalizations, such as uh-huh and mm-hm, for natural interaction. The generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: Where should a listener vocalization be synthesized? What meaning should be conveyed through the synthesized vocalization? And how can an appropriate listener vocalization with the intended meaning be realized? This thesis addresses the last question. The investigation starts by proposing a three-stage approach: (i) data collection, (ii) annotation, and (iii) realization. The first stage presents a method to collect natural listener vocalizations from German and British English professional actors in a recording studio. In the second stage, we explore a methodology for annotating listener vocalizations in terms of both meaning and behavior (form). The third stage proposes a realization strategy that uses unit selection and signal modification techniques to generate appropriate listener vocalizations upon user requests. Finally, we evaluate the naturalness and appropriateness of the synthesized vocalizations in perception studies. The work is implemented in the open-source MARY text-to-speech framework and integrated into the SEMAINE project's Sensitive Artificial Listener (SAL) demonstrator.
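
    As a loose illustration of the realization stage, the sketch below selects a vocalization unit whose annotated meaning best matches a request via a simple target cost. The annotation labels, cost function, and data are invented here; the actual MARY TTS implementation is more involved.

```python
# A minimal sketch of the realization idea described above: pick the
# listener vocalization whose annotated meaning best matches a request.
from dataclasses import dataclass

@dataclass
class VocalizationUnit:
    audio_file: str
    meaning: set[str]      # e.g. {"agreement", "interest"} (hypothetical labels)
    form: str              # behavior annotation, e.g. "mm-hm"

def target_cost(unit: VocalizationUnit, requested: set[str]) -> float:
    # Cost = 1 - Jaccard overlap between requested and annotated meanings.
    union = len(unit.meaning | requested)
    return 1.0 - (len(unit.meaning & requested) / union if union else 0.0)

def select_unit(units, requested):
    return min(units, key=lambda u: target_cost(u, requested))

units = [
    VocalizationUnit("uhhuh_01.wav", {"agreement"}, "uh-huh"),
    VocalizationUnit("mmhm_03.wav", {"agreement", "interest"}, "mm-hm"),
    VocalizationUnit("yeah_02.wav", {"agreement", "enthusiasm"}, "yeah"),
]
best = select_unit(units, {"agreement", "interest"})
print(best.audio_file)  # -> mmhm_03.wav
```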

    Statistical Parametric Methods for Articulatory-Based Foreign Accent Conversion

    Foreign accent conversion seeks to transform utterances from a non-native speaker (L2) so that they appear to have been produced by the same speaker but with a native (L1) accent. Such accent-modified utterances have been suggested to be effective in pronunciation training for adult second-language learners. Accent modification involves separating the linguistic gestures and voice-quality cues of the L1 and L2 utterances, then transposing them across the two speakers. However, because of the complex interaction between these two sources of information, their separation in the acoustic domain is not straightforward. As a result, vocoding approaches to accent conversion result in a voice that is different from both the L1 and L2 speakers. In contrast, separation in the articulatory domain is straightforward, since linguistic gestures are directly available as articulatory data. However, because articulatory data are difficult to collect, conventional synthesis techniques based on unit selection are ill-suited for accent conversion, given the small size of articulatory corpora and their inability to interpolate missing native sounds in the L2 corpus. To address these issues, this dissertation presents two statistical parametric methods for accent conversion that operate in the acoustic and articulatory domains, respectively. The acoustic method uses a cross-speaker statistical mapping to generate L2 acoustic features from the trajectories of L1 acoustic features in a reference utterance. Our results show significant reductions in perceived non-native accent compared to the corresponding L2 utterances, as well as a strong voice similarity between the accent conversions and the original L2 utterances. Our second (articulatory-based) approach consists of building a statistical parametric articulatory synthesizer for the non-native speaker, then driving the synthesizer with articulatory data from the reference L1 speaker. This statistical approach not only has low data requirements but also the flexibility to interpolate missing sounds in the L2 corpus. In a series of listening tests, articulatory accent conversions were rated as more intelligible and less accented than their L2 counterparts. In a final study, we compare the two approaches. Our results show that the articulatory approach, despite its direct access to the native linguistic gestures, is less effective at reducing perceived non-native accent than the acoustic approach.
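
    The dissertation's cross-speaker statistical mapping is not specified in this abstract; a standard choice in the voice-conversion literature is a joint-density GMM mapping, sketched below under that assumption using scikit-learn. The feature dimensions and toy data are illustrative only.

```python
# A hedged sketch of a cross-speaker statistical mapping via a joint-density
# GMM (Stylianou/Toda style); the dissertation's exact model may differ.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x, y, n_components=4, seed=0):
    """Fit a GMM on stacked [source; target] acoustic feature frames."""
    z = np.hstack([x, y])
    return GaussianMixture(n_components, covariance_type="full",
                           random_state=seed).fit(z)

def convert(gmm, x):
    """Map source frames x to the target space by conditional expectation."""
    d = x.shape[1]
    out = np.zeros_like(x)
    resp = np.zeros((x.shape[0], gmm.n_components))
    for k in range(gmm.n_components):
        mu_x = gmm.means_[k, :d]
        s_xx = gmm.covariances_[k, :d, :d]
        resp[:, k] = gmm.weights_[k] * multivariate_normal.pdf(x, mu_x, s_xx)
    resp /= resp.sum(axis=1, keepdims=True)     # posterior p(k | x)
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k, :d], gmm.means_[k, d:]
        s_xx = gmm.covariances_[k, :d, :d]
        s_yx = gmm.covariances_[k, d:, :d]
        # E[y | x, k] = mu_y + Sigma_yx Sigma_xx^{-1} (x - mu_x)
        cond_mean = mu_y + (x - mu_x) @ np.linalg.solve(s_xx, s_yx.T)
        out += resp[:, k:k + 1] * cond_mean
    return out

# Toy usage: y is a noisy linear warp of x, which the mapping should recover.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
y = 0.8 * x + 0.5 + rng.normal(scale=0.05, size=x.shape)
gmm = fit_joint_gmm(x, y)
print(np.abs(convert(gmm, x) - y).mean())  # small mean absolute error
```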

    Methods for speaking style conversion from normal speech to high vocal effort speech

    This thesis deals with vocal-effort-focused speaking style conversion (SSC). Specifically, we studied two topics on the conversion of normal speech to high-vocal-effort speech. The first topic involves the conversion of normal speech to shouted speech. We employed this conversion in a speaker recognition system with a vocal effort mismatch between test and enrollment utterances (shouted vs. normal speech). The mismatch degrades the system's speaker identification performance. As a solution, we proposed an SSC system that includes a novel spectral mapping, used together with a statistical mapping technique, to transform the mel-frequency spectral energies of normal-speech enrollment utterances towards their counterparts in shouted speech. We evaluated the proposed solution by comparing the speaker identification rates of a state-of-the-art i-vector-based speaker recognition system with and without applying SSC to the enrollment utterances. Our results showed that applying the proposed SSC pre-processing to the enrollment data considerably improves speaker identification rates. The second topic involves normal-to-Lombard speech conversion. We proposed a vocoder-based parametric SSC system to perform the conversion. The system first extracts speech features using the vocoder, then maps the features with a technique that is robust to data scarcity, and finally synthesizes the mapped features back into speech. For comparison, we used two vocoders in the conversion system: a glottal vocoder and the widely used STRAIGHT. We assessed the converted speech from the two vocoders in two subjective listening tests measuring similarity to Lombard speech and naturalness. The similarity test showed that, with both vocoders, the proposed SSC system was able to convert normal speech to Lombard speech. The naturalness test showed that the samples converted with the glottal vocoder were clearly more natural than those obtained with STRAIGHT.
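
    As a hedged sketch of the first topic, the code below extracts log mel-frequency spectral energies with librosa and applies a trivial per-band offset learned from paired data, standing in for the paper's novel spectral mapping plus statistical mapping (which this does not reproduce).

```python
# A rough illustration only: map mel-frequency spectral energies of "normal"
# speech toward "shouted" speech with a per-band log-domain offset learned
# from paired toy data. The thesis's actual mapping is more sophisticated.
import numpy as np
import librosa

def mel_energies(y, sr, n_mels=26):
    """Log mel-frequency spectral energies, shape (n_frames, n_mels)."""
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return np.log(m.T + 1e-10)

# Paired toy signals standing in for parallel normal/shouted utterances.
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
normal = 0.1 * np.sin(2 * np.pi * 150 * t)
shouted = 0.8 * np.sin(2 * np.pi * 190 * t)

e_norm, e_shout = mel_energies(normal, sr), mel_energies(shouted, sr)
n = min(len(e_norm), len(e_shout))                 # crude frame alignment
offset = (e_shout[:n] - e_norm[:n]).mean(axis=0)   # per-band mean shift

converted = e_norm + offset  # enrollment features pushed toward "shouted"
print(converted.shape)
```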

    In search of the optimal acoustic features for statistical parametric speech synthesis

    In the Statistical Parametric Speech Synthesis (SPSS) paradigm, speech is generally represented as acoustic features and the waveform is generated by a vocoder. A comprehensive summary of state-of-the-art vocoding techniques is presented, highlighting their characteristics, advantages, and drawbacks, primarily when used in SPSS. We conclude that state-of-the-art vocoding methods are suboptimal and cause a significant loss of quality, even though numerous vocoders have been proposed in the last decade. In fact, the most complicated methods seem to perform worse than simpler ones based on more robust analysis/synthesis algorithms. Typical methods, based on the source-filter or sinusoidal models, rely on excessive simplifying assumptions. They perform what we call an "extreme decomposition" of speech (e.g., source+filter or sinusoids+noise), which we believe to be a major drawback. Problems include difficulties in the estimation of components, the modelling of complex non-linear mechanisms, and a lack of ground truth. In addition, the statistical dependence between the stochastic and deterministic components of speech is not modelled.

    We start by improving just the waveform generation stage of SPSS, using standard acoustic features. We propose a new method of waveform generation tailored for SPSS, based on neither source-filter separation nor sinusoidal modelling. The proposed waveform generator avoids unnecessary assumptions and decompositions as far as possible, and uses only the fundamental frequency and spectral envelope as acoustic features. A very small speech database is used as a source of base speech signals, which are subsequently "reshaped" to match the specifications output by the acoustic model in the SPSS framework, all without any decomposition such as source+filter or harmonics+noise. A comprehensive description of the waveform generation process is presented, along with implementation issues. Two SPSS voices, a female and a male, were built to test the proposed method using a standard TTS toolkit, Merlin. In a subjective evaluation, listeners preferred the proposed waveform generator over a state-of-the-art vocoder, STRAIGHT. However, although the proposed "waveform reshaping" generator produces higher speech quality than STRAIGHT, the improvement is not large enough.

    Consequently, we propose a new acoustic representation whose implementation involves both feature extraction and waveform generation, i.e., a complete vocoder. The new representation encodes the complex spectrum derived from the Fourier transform in a way explicitly designed for SPSS, rather than for speech coding or copy-synthesis. The feature set comprises four feature streams describing the magnitude spectrum, phase spectrum, and fundamental frequency, all represented by real numbers, and it avoids heuristics and unstable methods for phase unwrapping. Because the new feature extraction does not attempt to decompose the speech structure, the "phasiness" and "buzziness" found in typical vocoders such as STRAIGHT are dramatically reduced. Our method also works at a lower frame rate than a typical vocoder. To demonstrate the proposed method, two DNN-based voices, a male and a female, were built using the Merlin toolkit. In subjective comparisons with a state-of-the-art baseline, the proposed vocoder substantially outperformed the baseline for both voices and under all configurations tested.

    Furthermore, several enhancements were made over the original design that benefit either sound quality or compatibility with other tools. Beyond its use in SPSS, the proposed vocoder is also demonstrated for join smoothing in unit-selection systems, and can be used for voice conversion or automatic speech recognition.
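
    The thesis's exact encoding is not given in this abstract. As a minimal sketch of one way to represent the complex spectrum with real numbers and no phase unwrapping, the code below stores the log-magnitude together with the cosine and sine of the phase, from which the complex spectrum is exactly recoverable; the actual feature set in the thesis is more sophisticated.

```python
# Illustrative encoding of a complex spectrum as real-valued streams,
# avoiding phase unwrapping entirely. Not the thesis's actual feature set.
import numpy as np

def encode_frame(spectrum):
    """Complex spectrum -> three real-valued streams (no unwrapping)."""
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    phase = np.angle(spectrum)
    return log_mag, np.cos(phase), np.sin(phase)

def decode_frame(log_mag, cos_p, sin_p):
    """Invert the encoding back to a complex spectrum."""
    return np.exp(log_mag) * (cos_p + 1j * sin_p)

# Round-trip check on one FFT frame of a toy signal.
frame = np.fft.rfft(np.hanning(512) * np.random.default_rng(0).normal(size=512))
rebuilt = decode_frame(*encode_frame(frame))
print(np.max(np.abs(frame - rebuilt)))  # ~1e-10: essentially lossless
```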

    A review of state-of-the-art speech modelling methods for the parameterisation of expressive synthetic speech

    This document will review a sample of available voice modelling and transformation techniques, with a view to their application in expressive unit-selection speech synthesis within the framework of the PAVOQUE project. The underlying idea is to introduce parametric modification capabilities at the level of the synthesis system, in order to compensate for the sparsity and rigidity of the databases used to define speech synthesis voices, in terms of the emotional speaking styles they provide. For this work, emotion-related parametric modifications will be restricted to the domains of voice quality and prosody, as suggested by several reviews addressing the vocal correlates of emotions (Schröder, 2001; Schröder, 2004; Roehling et al., 2006). The present report will start with a review of techniques for voice quality modelling and modification. First, it will explore techniques for glottal flow modelling. Then, it will review the domain of cross-speaker voice transformations, with a view to transposing them to cross-emotion voice transformations. This topic will be presented first from the perspective of the parametric spectral modelling of speech and then from the perspective of available spectral transformation techniques. Finally, the domain of prosodic parameterisation and modification will be reviewed.

    Effect of Contextual Speech Rate on Speech Comprehension

    Despite an extensive history of study, phonetic context is known to affect only small units of speech (e.g., formant transitions, function words). Critical aspects of speech perception, however, occur at larger scales. The series of experiments reported here investigated the effects of contextual speech rate on the perception of a larger unit of speech, namely sentences. In particular, there was an effect of relative rate (the rate of a sentence compared to the average rate of all other sentences within the same conversation-length period of speech) on sentence comprehension, such that relatively slow sentences were better comprehended than relatively fast sentences (Experiment 1); however, differences in perceptual learning between the relatively slow and the relatively fast rates accounted for this effect (Experiment 2). The results of these studies therefore do not support an effect of contextual speech rate on sentence comprehension. Finally, based on the results of a modified version of Experiment 1 in which context sentences were replaced with non-speech sounds (i.e., 1-channel noise-vocoded speech), exposure to temporal information was not sufficient for the generalization of perceptual learning (Experiment 3). These experiments are a novel investigation into both the effects of phonetic context on sentence comprehension and the efficacy of non-speech sounds for the generalization of perceptual learning.
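
    The 1-channel noise-vocoded speech used as a non-speech control in Experiment 3 can be approximated as follows: extract the broadband amplitude envelope of the speech and use it to modulate white noise. The cutoff frequency and filter order below are illustrative choices, not necessarily those of the study.

```python
# One-channel noise vocoding: envelope-modulated white noise.
# Parameters here are illustrative, not taken from the study.
import numpy as np
from scipy.signal import butter, filtfilt

def noise_vocode_1ch(speech, sr, env_cutoff_hz=30.0):
    """Replace speech fine structure with noise, keeping the envelope."""
    b, a = butter(4, env_cutoff_hz / (sr / 2), btype="low")
    envelope = filtfilt(b, a, np.abs(speech))   # smoothed rectified signal
    envelope = np.clip(envelope, 0.0, None)
    noise = np.random.default_rng(0).normal(size=len(speech))
    out = envelope * noise
    return out / (np.max(np.abs(out)) + 1e-10)  # normalize peak level

# Toy usage on a synthetic amplitude-modulated tone standing in for speech.
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
speech = np.sin(2 * np.pi * 150 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
vocoded = noise_vocode_1ch(speech, sr)
print(vocoded.shape)
```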