
    Statistical modelling of speech units in HMM-based speech synthesis for Arabic

    This paper investigates statistical parametric speech synthesis of Modern Standard Arabic (MSA). An HMM (Hidden Markov Model)-based speech synthesis system relies on a description of speech segments corresponding to phonemes, with a large set of features that represent phonetic, phonological, linguistic and contextual aspects. When applied to MSA, two specific phenomena have to be taken into account: vowel lengthening and consonant gemination. This paper studies the modelling of these phenomena thoroughly through various approaches, for example the use of different units for modelling short vs. long vowels, and of different units for modelling simple vs. geminated consonants. These approaches are compared to one which merges the short and long variants of a vowel into a single unit, and the simple and geminated variants of a consonant into a single unit (these characteristics then being handled through the features associated with the sound). Results of a subjective evaluation show no significant difference between using the same unit for simple and geminated consonants (as well as for short and long vowels) and using different units for simple vs. geminated consonants (as well as for short vs. long vowels).
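The two unit-inventory schemes compared above can be sketched in a few lines. This is a minimal illustration only; the function names and the doubled-letter label convention are invented here, not taken from the paper.

```python
# Hypothetical sketch (all names invented): two ways of labelling Arabic
# speech units for statistical parametric synthesis.

def label_differentiated(phone, long_or_geminated=False):
    """Scheme A: quantity/gemination is part of the unit identity,
    so 'a' vs. 'aa' (and 'b' vs. 'bb') are distinct modelling units."""
    return phone * 2 if long_or_geminated else phone

def label_merged(phone, long_or_geminated=False):
    """Scheme B: one unit per phoneme; the length/gemination flag
    travels as a contextual feature attached to the unit instead."""
    return phone, {"long_or_geminated": long_or_geminated}

print(label_differentiated("a", True))  # distinct unit for the long vowel
print(label_merged("a", True))          # one unit plus a contextual feature
```

Under Scheme B the unit inventory is smaller, so each unit is seen more often in training, which is one practical motivation for merging.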

    DNN-Based Speech Synthesis for Arabic: Modelling and Evaluation

    This paper investigates the use of deep neural networks (DNN) for Arabic speech synthesis. In parametric speech synthesis, whether HMM-based or DNN-based, each speech segment is described with a set of contextual features. These contextual features correspond to linguistic, phonetic and prosodic information that may affect the pronunciation of the segments. Gemination and vowel quantity (short vowel vs. long vowel) are two particular and important phenomena in the Arabic language. Hence, it is worth investigating whether those phenomena must be handled by using specific speech units, or whether their specification in the contextual features is enough. Consequently, four modelling approaches are evaluated by considering geminated consonants (respectively long vowels) either as fully-fledged phoneme units or as the same phoneme as their simple (respectively short) counterparts. Although no significant difference was observed in previous studies relying on HMM-based modelling, this paper examines these modelling variants in the framework of DNN-based speech synthesis. Listening tests are conducted to evaluate the four modelling approaches, and to assess the performance of DNN-based Arabic speech synthesis with respect to the previous HMM-based approach.
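The four modelling approaches mentioned above follow from two independent binary choices, one for vowels and one for consonants; a minimal enumeration (variable names are mine, not the paper's):

```python
from itertools import product

# The four approaches arise from choosing, independently, whether long
# vowels and geminated consonants get their own units or are merged
# with their short/simple counterparts.
approaches = list(product(["separate", "merged"],    # long vs. short vowels
                          ["separate", "merged"]))   # geminated vs. simple consonants

for vowel_units, consonant_units in approaches:
    print(f"vowel units: {vowel_units}, consonant units: {consonant_units}")
```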

    Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic

    This paper investigates the use of hidden Markov models (HMM) for Modern Standard Arabic speech synthesis. HMM-based speech synthesis systems require a description of each speech unit with a set of contextual features that specifies phonetic, phonological and linguistic aspects. To apply this method to the Arabic language, a study of its particularities was conducted to extract suitable contextual features. Two phenomena are highlighted: vowel quantity and gemination. This work focuses on how to model geminated consonants (resp. long vowels), either considering them as fully-fledged phonemes or as the same phonemes as their simple (resp. short) counterparts but with a different duration. Four modelling approaches have been proposed for this purpose. Results of subjective and objective evaluations show no significant difference between differentiating the modelling units associated with geminated consonants (resp. long vowels) from those associated with simple consonants (resp. short vowels) and merging them, as long as gemination and vowel quantity information is included in the set of features.

    Synthesis of listener vocalizations : towards interactive speech synthesis

    Spoken and multi-modal dialogue systems are starting to use listener vocalizations, such as uh-huh and mm-hm, for natural interaction. Generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: Where to synthesize a listener vocalization? What meaning should be conveyed through the synthesized vocalization? And how to realize an appropriate listener vocalization with the intended meaning? This thesis addresses the latter question. The investigation starts by proposing a three-stage approach: (i) data collection, (ii) annotation, and (iii) realization. The first stage presents a method to collect natural listener vocalizations from German and British English professional actors in a recording studio. In the second stage, we explore a methodology for annotating listener vocalizations, covering both meaning and behavior (form). The third stage proposes a realization strategy that uses unit selection and signal modification techniques to generate appropriate listener vocalizations upon user requests. Finally, we evaluate the naturalness and appropriateness of synthesized vocalizations using perception studies. The work is implemented in the open source MARY text-to-speech framework, and it is integrated into the SEMAINE project's Sensitive Artificial Listener (SAL) demonstrator.

    A comparative study of input parameters for DNN-based expressive audiovisual speech synthesis (Étude comparative des paramètres d'entrée pour la synthèse expressive audiovisuelle de la parole par DNNs)

    In the past, contextual descriptors for acoustic speech synthesis were studied for training HMM-based systems. In this work, we study the impact of these factors on DNN-based audiovisual speech synthesis. We analyse this impact for three aspects of speech: the acoustic modality, the visual modality and phoneme durations. We also study the contribution of joint versus separate training of the acoustic and visual modalities to the quality of the generated synthetic speech. Finally, we carry out a cross-validation between the synthesis results for the different emotions. This cross-validation allowed us to verify the ability of DNNs to learn characteristics specific to each emotion.
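The joint-training alternative described above can be sketched as a shared-trunk network with two output streams, one per modality. This is an untrained, minimal illustration; all dimensions and weights below are invented, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
CTX, HID, ACOUSTIC, VISUAL = 300, 256, 187, 37   # invented dimensions

# Joint training: one shared hidden representation feeds both modalities,
# so acoustic and visual targets shape the same trunk weights.
W_trunk = rng.standard_normal((CTX, HID)) * 0.01
W_acoustic = rng.standard_normal((HID, ACOUSTIC)) * 0.01
W_visual = rng.standard_normal((HID, VISUAL)) * 0.01

def joint_forward(x):
    """Map contextual features to acoustic and visual parameter streams."""
    h = np.tanh(x @ W_trunk)            # shared hidden layer
    return h @ W_acoustic, h @ W_visual

# Separate training would instead fit two independent networks, each
# mapping the same contextual features to a single modality.
x = rng.standard_normal((8, CTX))       # a batch of 8 frames
acoustic_out, visual_out = joint_forward(x)
print(acoustic_out.shape, visual_out.shape)
```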

    ACII 2009: Affective Computing and Intelligent Interaction. Proceedings of the Doctoral Consortium 2009
