Voice source characterization for prosodic and spectral manipulation
The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main effort is on glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing: speech synthesis, voice conversion, and emotion detection, among others. Thus, we will study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds passes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase.
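The inverse filtering step described above can be sketched with a plain LPC-based inverse filter: fit an all-pole vocal tract model and filter the speech with its inverse to recover a glottal residual. This is a minimal sketch of the general technique, not the dissertation's parametric source-filter decomposition; the function name and the autocorrelation method are our own illustrative choices.

```python
import numpy as np
from scipy.signal import lfilter

def inverse_filter(speech, order=18):
    """Estimate the glottal residual by removing an all-pole vocal
    tract model (LPC) from the speech signal."""
    n = len(speech)
    # Autocorrelation sequence up to the model order
    r = np.correlate(speech, speech, mode="full")[n - 1:n + order]
    # Solve the normal equations (Levinson-Durbin would be faster)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Inverse filter A(z) = 1 - sum a_k z^-k yields the residual
    return lfilter(np.concatenate(([1.0], -a)), [1.0], speech)
```

In practice the analysis would be pitch-synchronous and pre-emphasized; a single frame and the basic normal equations keep the idea visible here.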
In order to validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. Our method gives satisfactory results over a wide range of glottal configurations and at different levels of SNR. Our method using the whitened residual compared favorably to the reference, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking third, but still above the acceptance threshold (Fair-Good).
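Building such a synthetic corpus requires generating glottal pulses from model parameters. As a stand-in for the LF model (whose implementation involves solving implicit area-balance equations), the sketch below uses the simpler Rosenberg pulse; the parameter names open_quotient and speed_quotient are our own illustrative choices, not the LF parameter set.

```python
import numpy as np

def rosenberg_pulse(fs, f0, open_quotient=0.6, speed_quotient=2.0):
    """One period of a Rosenberg glottal flow pulse at sample rate fs."""
    T0 = 1.0 / f0
    n = int(round(fs * T0))
    t = np.arange(n) / fs
    Te = open_quotient * T0                            # end of the open phase
    Tp = Te * speed_quotient / (1 + speed_quotient)    # instant of peak flow
    g = np.zeros(n)
    rise = t <= Tp
    fall = (t > Tp) & (t <= Te)
    g[rise] = 0.5 * (1 - np.cos(np.pi * t[rise] / Tp))
    g[fall] = np.cos(np.pi * (t[fall] - Tp) / (2 * (Te - Tp)))
    return g  # closed phase stays at zero
```

Concatenating such pulses at varying f0 and shape parameters, plus noise at a chosen SNR, yields a controlled test signal with known ground truth.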
Next, we proposed two methods for prosody modification, one for each of the residual representations described above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. The second method resampled the residual waveform and used a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed to reach quality levels similar to those of the reference methods.
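The residual-resampling idea behind the second method can be illustrated on a single frame: resampling a period-long frame changes its fundamental period, and hence the pitch, once the frames are re-synthesized. A minimal sketch with linear interpolation (the thesis's actual frame-selection machinery is not reproduced):

```python
import numpy as np

def resample_frame(frame, pitch_factor):
    """Change a frame's fundamental period by linear-interpolation
    resampling: pitch_factor > 1 shortens the period (raises pitch)."""
    n_out = int(round(len(frame) / pitch_factor))
    x_old = np.arange(len(frame))
    x_new = np.linspace(0, len(frame) - 1, n_out)
    return np.interp(x_new, x_old, frame)
```

A real system would use band-limited resampling and overlap-add synthesis; linear interpolation keeps the core idea short.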
As part of this dissertation, we have studied the application of our models in three different areas: voice conversion, voice quality analysis, and emotion recognition. We included our speech production model in a reference voice conversion system to evaluate the impact of our parametrization on this task. The results showed that the evaluators preferred our method over the original one, rating it with a higher score on the MOS scale. To study voice quality, we recorded a small database consisting of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky, and falsetto). Comparing the results with those reported in the literature, we found them to be generally in agreement with previous findings. Some differences existed, but they could be attributed to the difficulty of comparing voice qualities produced by different speakers. At the same time, we conducted experiments in the field of voice quality identification, with very good results. We have also evaluated the performance of an automatic emotion classifier based on GMMs using glottal measures. For each emotion, we trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The test results were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. The detection accuracy for the different emotions was also high, improving on the results of previously reported works using the same database. Overall, we can conclude that the glottal source parameters extracted using our algorithm have a positive impact on the field of automatic emotion classification.
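The per-emotion modelling scheme described above can be sketched with one diagonal Gaussian per emotion, a single-component stand-in for the GMMs actually used; classification picks the emotion whose model gives the highest log-likelihood for a feature vector. Class and method names are our own.

```python
import numpy as np

class GaussianEmotionClassifier:
    """One diagonal Gaussian per emotion class; prediction selects
    the class with the highest log-likelihood."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.params_ = {}
        for c in self.classes_:
            Xc = X[np.asarray(y) == c]
            # Mean and variance per feature; small floor avoids division by zero
            self.params_[c] = (Xc.mean(0), Xc.var(0) + 1e-6)
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            mu, var = self.params_[c]
            # Diagonal-Gaussian log-likelihood, summed over features
            ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(1)
            scores.append(ll)
        return [self.classes_[i] for i in np.argmax(scores, axis=0)]
```

A full GMM would mix several such components per class; the decision rule is the same.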
Synthesis of listener vocalizations: towards interactive speech synthesis
Spoken and multi-modal dialogue systems are starting to use listener vocalizations, such as uh-huh and mm-hm, for natural interaction. Generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: Where should a listener vocalization be synthesized? What meaning should be conveyed through the synthesized vocalization? And how can an appropriate listener vocalization with the intended meaning be realized? This thesis addresses the last question. The investigation starts by proposing a three-stage approach: (i) data collection, (ii) annotation, and (iii) realization. The first stage presents a method to collect natural listener vocalizations from German and British English professional actors in a recording studio. In the second stage, we explore a methodology for annotating listener vocalizations -- meaning and behavior (form) annotation. The third stage proposes a realization strategy that uses unit selection and signal modification techniques to generate appropriate listener vocalizations upon user requests. Finally, we evaluate the naturalness and appropriateness of the synthesized vocalizations in perception studies. The work is implemented in the open-source MARY text-to-speech framework and is integrated into the SEMAINE project's Sensitive Artificial Listener (SAL) demonstrator.
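At its core, the unit-selection step in the third stage is a search for the candidate sequence minimizing summed target and join costs, typically done with a Viterbi search. A minimal sketch with a generic cost-function interface (our own, not the MARY framework's API):

```python
def select_units(targets, inventory, target_cost, join_cost):
    """Pick one candidate unit per target, minimizing summed target
    cost plus join cost between consecutive units (Viterbi search).
    inventory[i] is the candidate list for targets[i]; returns the
    chosen candidate index per target."""
    prev_cost = [target_cost(targets[0], u) for u in inventory[0]]
    back = []
    for i in range(1, len(targets)):
        cur, ptr = [], []
        for u in inventory[i]:
            # Cheapest predecessor for this candidate
            best = min(range(len(inventory[i - 1])),
                       key=lambda j: prev_cost[j] + join_cost(inventory[i - 1][j], u))
            ptr.append(best)
            cur.append(prev_cost[best]
                       + join_cost(inventory[i - 1][best], u)
                       + target_cost(targets[i], u))
        prev_cost, back = cur, back + [ptr]
    # Backtrack the cheapest path
    idx = min(range(len(prev_cost)), key=prev_cost.__getitem__)
    path = [idx]
    for ptr in reversed(back):
        idx = ptr[idx]
        path.append(idx)
    return list(reversed(path))
```

Real systems compute target and join costs from linguistic and acoustic features and prune the candidate lists; the search structure is the same.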
Tones of Lhasa Tibetan
The author of this thesis claims that Lhasa Tibetan
has more tonal contrasts than has hitherto generally been
recognized. The proposed tonal classification has interesting consequences for the segmental phonology, in particular for the voicing status of initial stops and for some
aspects of the phonology of stem compounds. No attempt has
been made to adhere strictly to a specific school of phonology;
but the presentation of the material has been influenced
by classical phonemic, generative, and natural
phonology theory. A special effort has been made throughout
the study to give a fair amount of phonetic data in
support of the analysis proposed.
Mechanisms of vowel devoicing in Japanese
The processes of vowel devoicing in Standard Japanese were examined with respect
to the phonetic and phonological environments and the syllable structure of Japanese, in
comparison with vowel reduction processes in other languages, in most of which vowel
reduction occurs optionally in fast or casual speech. This thesis examined whether
Japanese vowel devoicing was a phonetic phenomenon caused by glottal assimilation
between a high vowel and its adjacent voiceless consonants, or a more
phonologically controlled, compulsory process.
Experimental results showed that Japanese high vowel devoicing must be analysed
separately in two devoicing conditions, namely single and consecutive devoicing
environments. Devoicing was almost compulsory regardless of the presence of
proposed blocking factors such as type of preceding consonant, accentuation, position
in an utterance, as long as there was no devoiceable vowel in adjacent morae (single
devoicing condition). However, under consecutive devoicing conditions, blocking
factors became effective and prevented some devoiceable vowels from becoming
voiceless.
The effect of speaking rate was also generally minimal in the single devoicing
condition, but in the consecutive devoicing condition the vowels were devoiced
more often at faster tempi than at slower tempi, which created many examples of
consecutively devoiced vowels spanning two morae.
Durational observations found that vowel devoicing involves not only phonatory
change but also slight durational reduction. However, the shorter duration of
devoiced syllables was compensated at the word level, so that the overall
duration of a word with devoiced vowels remained similar to that of the same
word without devoiced vowels, regardless of the number of devoiced vowels in
the word.
It must be noted that there was no clear-cut distinction between voiced and
devoiced vowels, and the phonetic realisation of a devoiced vowel could vary from
fully voiced to completely voiceless. A high vowel may be voiced in a typical
devoicing environment, but its intensity is significantly weaker than that of
vowels in a non-devoicing environment, at all speaking tempi. The mean
differences in vowel intensity between these environments were generally larger
at faster tempi.
The results implied that even when the vowel was voiced, its production process
moved in favour of devoicing. However, in consecutive devoicing conditions, this
process did not always apply. When some of the devoiceable vowels were devoiced in
the consecutive devoicing environment, the intensities of devoiceable vowels were not
significantly lower than those of other vowels.
The results of intensity measurements of voiced vowels in the devoicing and
non-devoicing environments suggested that Japanese vowel devoicing was part of
an overall process of complex vowel weakening, and that a completely devoiced
vowel was the final state of the weakening process. Japanese vowel devoicing is
primarily a process of glottal assimilation, but the results in the consecutive
devoicing condition showed that this process was constrained by Japanese
syllable structure.
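The intensity comparisons reported above presuppose a per-segment intensity measure; a common choice is RMS level in dB. A minimal sketch (the thesis's exact measurement procedure may differ, and both function names are our own):

```python
import numpy as np

def rms_db(segment):
    """RMS intensity of a vowel segment in dB, relative to unit amplitude."""
    rms = np.sqrt(np.mean(np.square(segment)))
    return 20 * np.log10(rms + 1e-12)  # small floor avoids log(0)

def mean_intensity_difference(devoicing_env, non_devoicing_env):
    """Mean dB difference between vowels in non-devoicing and devoicing
    environments (positive when devoicing-environment vowels are weaker)."""
    a = np.mean([rms_db(s) for s in non_devoicing_env])
    b = np.mean([rms_db(s) for s in devoicing_env])
    return a - b
```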
Perception and production of English final stops by young Brazilian EFL students
Master's thesis - Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Letras/Inglês e Literatura correspondente. This research investigates the perception and production of English final stops by young Brazilian EFL students. The quantitative results of one AX discrimination test, one imitation test, and one free-production test are reported. The discussion of production tendencies, as well as of the relationship between perception and production, tested the hypothesis of markedness degree in relation to the voicing and place of articulation of the target phonemes. In addition, the correlation between perceptual sensitivity to the CVC syllable pattern and the ability to produce the final stops in a target-like fashion was analyzed. Twelve learners (mean age 5.2 years) in their 4½ semesters of L2 instruction were tested. Following Koerich (2002), the six stops were investigated in terms of the markedness of the consonants by: (1) voicing of the final stops, and (2) place of articulation. The markedness of the CVC syllabic pattern and the simplification strategies applied by this sample were also examined. The relationship between perception and production was assessed in terms of syllabic complexity (CVC versus CVCi). The overall results revealed that the participants do apply simplification strategies to final stops in CVC words. The voiced stops were not mispronounced more often than the voiceless targets, and the bilabials seemed to be the only ones that, if not modified by epenthesis, followed the prediction concerning place of articulation. A positive correlation was found between the results of the imitation and production tests, but not with the AX discrimination results.