26 research outputs found

    Voice source characterization for prosodic and spectral manipulation

    The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main efforts are on glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing, including speech synthesis, voice conversion and emotion detection. Thus, we study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds passes through the vocal (and nasal) tract cavities and is radiated at the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase. To validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method performs satisfactorily across a wide range of glottal configurations and at different SNR levels. Our method using the whitened residual compared favorably to this reference, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking third, but still above the acceptance threshold (Fair-Good). Next we proposed two methods for prosody modification, one for each of the residual representations explained above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. 
The second method used resampling of the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed to achieve quality levels similar to those of the reference methods. As part of this dissertation, we studied the application of our models in three different areas: voice conversion, voice quality analysis and emotion recognition. We included our speech production model in a reference voice conversion system to evaluate the impact of our parametrization on this task. The results showed that the evaluators preferred our method over the original one, giving it a higher score on the MOS scale. To study voice quality, we recorded a small database consisting of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky and falsetto), which were later also used in our study of voice quality. Comparing the results with those reported in the literature, we found them to generally agree with previous findings. Some differences existed, but they could be attributed to the difficulty of comparing voice qualities produced by different speakers. At the same time, we conducted experiments in voice quality identification, with very good results. We also evaluated the performance of an automatic emotion classifier based on GMM using glottal measures. For each emotion, we trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The results of the test were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. Detection accuracy for the individual emotions was also high, improving on the results of previously reported works using the same database. 
Overall, we can conclude that the glottal source parameters extracted using our algorithm have a positive impact on automatic emotion classification.

    Synthesis of listener vocalizations: towards interactive speech synthesis

    Spoken and multi-modal dialogue systems are beginning to use listener vocalizations, such as uh-huh and mm-hm, for natural interaction. Generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: Where should a listener vocalization be synthesized? What meaning should the synthesized vocalization convey? And how can an appropriate listener vocalization with the intended meaning be realized? This thesis addresses the last of these questions. The investigation proposes a three-stage approach: (i) data collection, (ii) annotation, and (iii) realization. The first stage presents a method to collect natural listener vocalizations from German and British English professional actors in a recording studio. The second stage explores a methodology for annotating listener vocalizations, covering both meaning and behavior (form). The third stage proposes a realization strategy that combines unit selection and signal modification techniques to generate appropriate listener vocalizations on user request. Finally, we evaluate the naturalness and appropriateness of the synthesized vocalizations using perception studies. The work is implemented in the open source MARY text-to-speech framework and is integrated into the SEMAINE project's Sensitive Artificial Listener (SAL) demonstrator.

    Tones of Lhasa Tibetan

    The author of this thesis claims that Lhasa Tibetan has more tonal contrasts than has hitherto generally been recognized. The proposed tonal classification has interesting consequences for the segmental phonology, in particular for the voicing status of initial stops and for some aspects of the phonology of stem compounds. No attempt has been made to adhere strictly to a specific school of phonology, but the presentation of the material has been influenced by classical phonemic, generative, and natural phonology theory. A special effort has been made throughout the study to give a fair amount of phonetic data in support of the proposed analysis.

    Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum


    Mechanisms of vowel devoicing in Japanese

    The processes of vowel devoicing in Standard Japanese were examined with respect to the phonetic and phonological environments and the syllable structure of Japanese, in comparison with vowel reduction processes in other languages, in most of which vowel reduction occurs optionally in fast or casual speech. This thesis examined whether Japanese vowel devoicing was a phonetic phenomenon caused by glottal assimilation between a high vowel and its adjacent voiceless consonants, or a more phonologically controlled compulsory process. Experimental results showed that Japanese high vowel devoicing must be analysed separately in two devoicing conditions, namely single and consecutive devoicing environments. Devoicing was almost compulsory regardless of the presence of proposed blocking factors such as type of preceding consonant, accentuation and position in an utterance, as long as there was no devoiceable vowel in adjacent morae (single devoicing condition). However, under consecutive devoicing conditions, blocking factors became effective and prevented some devoiceable vowels from becoming voiceless. The effect of speaking rate was also generally minimal in the single devoicing condition, but in the consecutive devoicing condition, vowels were devoiced more at faster tempi than at slower tempi, which created many examples of consecutively devoiced vowels over two morae. Durational observations found that vowel devoicing involves not only phonatory change, but also slight durational reduction. However, the shorter durations of devoiced syllables were adjusted at the word level, so that the whole duration of a word with devoiced vowels remained similar to that of the word without devoiced vowels, regardless of the number of devoiced vowels in the word. It must be noted that there was no clear-cut distinction between voiced and devoiced vowels, and the phonetic realisation of a devoiced vowel could vary from fully voiced to completely voiceless. 
A high vowel may be voiced in a typical devoicing environment, but its intensity is significantly weaker than that of vowels in a non-devoicing environment, at all speaking tempi. The mean differences in vowel intensity between these environments were generally larger at faster tempi. The results implied that even when the vowel was voiced, its production process moved in favour of devoicing. However, in consecutive devoicing conditions, this process did not always apply. When some of the devoiceable vowels were devoiced in the consecutive devoicing environment, the intensities of devoiceable vowels were not significantly lower than those of other vowels. The results of intensity measurements of voiced vowels in the devoicing and non-devoicing environments suggested that Japanese vowel devoicing was part of the overall process of complex vowel weakening, and that a completely devoiced vowel was the final state of the weakening process. Japanese vowel devoicing is primarily a process of glottal assimilation, but the results in the consecutive devoicing condition showed that this process was constrained by Japanese syllable structure.

    Perception and production of English final stops by young Brazilian EFL students

    Master's dissertation, Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Letras/Inglês e Literatura Correspondente. This research investigates the perception and production of English final stops by young Brazilian EFL students. Quantitative results of one AX discrimination test, one imitation test and one free-production test are reported. The discussion of production tendencies, as well as of the relationship between perception and production, tested the hypothesis of markedness degree in relation to voicing and place of articulation of the target phonemes. In addition, the correlation between perceptual sensitivity to the CVC syllable pattern and the ability to produce the final stops in a target-like fashion was analyzed. Twelve learners (mean age 5 years and 2 months) in their 4 ½ semesters of L2 instruction were tested. Following Koerich (2002), the six stops were investigated in terms of two variables related to consonant markedness: (1) voicing of the final stops and (2) place of articulation. In addition, the markedness of the CVC syllabic pattern and the simplification strategies applied by this sample were examined. The relationship between perception and production was assessed in terms of syllabic complexity (CVC versus CVCi). The overall results revealed that the participants do apply simplification strategies to final stops in CVC words. The voiced stops were not mispronounced more often than the voiceless targets, and the bilabials seemed to be the only ones that, if not modified by epenthesis, followed the prediction concerning place of articulation. A weak positive correlation was found only between the results of the imitation and production tests, not with the AX discrimination results.

    Articulatory features for conversational speech recognition
