6 research outputs found

    Automatic generation of fundamental frequency for text-to-speech synthesis

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997. Includes bibliographical references (p. 82-86). By Aaron Seth Cohen. M.Eng.

    A Phonetic model of English intonation

    This thesis proposes a phonetic model of English intonation, which is a system for linking the phonological and F₀ descriptions of an utterance. It is argued that such a model should take the form of a rigorously defined formal system which does not require any human intuition or expertise to operate. It is also argued that this model should be capable of both analysis (F₀ to phonology) and synthesis (phonology to F₀). Existing phonetic models are reviewed and it is shown that none meet the specification for the type of formal model required. A new phonetic model is presented that has three levels of description: the F₀ level, the intermediate level and the phonological level. The intermediate level uses the three basic elements of rise, fall and connection to model F₀ contours. A mathematical equation is specified for each of these elements so that a continuous F₀ contour can be created from a sequence of elements. The phonological system uses H and L to describe high and low pitch accents, C to describe connection elements and B to describe the rises that occur at phrase boundaries. A fully specified grammar is described which links the intermediate and F₀ levels. A grammar is specified for linking the phonological and intermediate levels, but this is only partly complete due to problems with the phonological level of description. A computer implementation of the model is described. Most of the implementation work concentrated on the relationship between the intermediate level and the F₀ level. Results are given showing that the computer analysis system labels F₀ contours quite accurately, but is significantly worse than a human labeller. It is shown that the synthesis system produces artificial F₀ contours that are very similar to naturally occurring F₀ contours. The thesis concludes with some indications of further work and ideas on how the computer implementation of the model could be of practical benefit in speech synthesis and recognition.

    Organisation of Japanese prosody

    This thesis is an experimental phonological study of pitch in Tokyo Japanese. It comprises five chapters all discussing prosodic processes and phenomena relating to accent, tone or intonation on the basis of experimental evidence. The discussion in each chapter is developed essentially in the following three steps: (i) a critical review or overview of the past work on the subject discussed in the chapter or section; (ii) presentation of new evidence mostly from instrumental experiments; (iii) a discussion of the experimental evidence in theoretical contexts. After outlining the nature and function of word accent in Chapter One, I discuss in Chapter Two the prosodic compound formation process which has traditionally been described as an accent (re)assignment process. I analyze the linguistic structures of those compounds which are not subject to the compound accent rules, and propose several factors which constrain the prosodic compound formation process, defining them as the linguistic conditions on the process. Chapters Three through Five deal with word accent in a wider context of speech, discussing its roles, behavior and phonetic realization in phrase or sentence perspective. Chapter Three discusses the phonetics and phonology of 'accentual fall,' 'accentual boost' and 'accent clash,' for each of which the fallacies underlying the impressionistic descriptions in the literature are demonstrated. Chapter Four discusses various problems relating to intonational phrases and phrasing. The first part of the chapter focuses on the definition of the two intonational phrases, 'major phrase' and 'minor phrase', while the second part of the chapter explores the linguistic conditions on 'minor phrase formation,' the intonational phrasing process whereby two or more syntactic/morphological units are combined to form one minor intonational phrase.
Chapter Five examines the linguistic structure of 'downtrend,' the phenomenon whereby pitch declines during the course of utterances. It is shown in the first part of the chapter that Poser's 'catathesis' (downstep) model is a largely adequate model of the intonational phenomenon. After confirming that the trigger of the downtrend phenomenon is largely attributable to accent, it is shown in the second part of the chapter that this accent-triggered process varies considerably depending on the syntactic structure of the phrase or sentence involved, or, in other words, that the configuration of downstep serves to disambiguate otherwise ambiguous syntactic structures. In the course of discussing the specific topics just mentioned, several more general theoretical issues are addressed, including the following four topics: the relation between syntactic structure and phonological structure; the organization of rhythmic structure; the abstractness of phonological (tonal) representation; and the nature of phonetic realization rules.

    Prosody analysis and modeling for Cantonese text-to-speech.

    Li Yu Jia. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references. Abstracts in English and Chinese.

    Chapter 1 --- Introduction
        1.1 --- TTS Technology
        1.2 --- Prosody
            1.2.1 --- What is Prosody
            1.2.2 --- Prosody from Different Perspectives
            1.2.3 --- Acoustical Parameters of Prosody
            1.2.4 --- Prosody in TTS
                1.2.4.1 --- Analysis
                1.2.4.2 --- Modeling
                1.2.4.3 --- Evaluation
        1.3 --- Thesis Objectives
        1.4 --- Thesis Outline
        Reference
    Chapter 2 --- Cantonese
        2.1 --- The Cantonese Dialect
            2.1.1 --- Phonology
                2.1.1.1 --- Initial
                2.1.1.2 --- Final
                2.1.1.3 --- Tone
            2.1.2 --- Phonological Constraints
        2.2 --- Tones in Cantonese
            2.2.1 --- Tone System
            2.2.2 --- Linguistic Significance
            2.2.3 --- Acoustical Realization
        2.3 --- Prosodic Variation in Continuous Cantonese Speech
        2.4 --- Cantonese Speech Corpus - CUProsody
        Reference
    Chapter 3 --- F0 Normalization
        3.1 --- F0 in Speech Production
        3.2 --- F0 Extraction
        3.3 --- Duration-normalized Tone Contour
        3.4 --- F0 Normalization
            3.4.1 --- Necessity and Motivation
            3.4.2 --- F0 Normalization
                3.4.2.1 --- Methodology
                3.4.2.2 --- Assumptions
                3.4.2.3 --- Estimation of Relative Tone Ratios
                3.4.2.4 --- Derivation of Phrase Curve
                3.4.2.5 --- Normalization of Absolute F0 Values
            3.4.3 --- Experiments and Discussion
        3.5 --- Conclusions
        Reference
    Chapter 4 --- Acoustical F0 Analysis
        4.1 --- Methodology of F0 Analysis
            4.1.1 --- Analysis-by-Synthesis
            4.1.2 --- Acoustical Analysis
        4.2 --- Acoustical F0 Analysis for Cantonese
            4.2.1 --- Analysis of Phrase Curves
            4.2.2 --- Analysis of Tone Contours
                4.2.2.1 --- Context-independent Single-tone Contours
                4.2.2.2 --- Contextual Variation
                4.2.2.3 --- Co-articulated Tone Contours of Disyllabic Word
                4.2.2.4 --- Cross-word Contours
                4.2.2.5 --- Phrase-initial Tone Contours
        4.3 --- Summary
        Reference
    Chapter 5 --- Prosody Modeling for Cantonese Text-to-Speech
        5.1 --- Parametric Model and Non-parametric Model
        5.2 --- Cantonese Text-to-Speech: Baseline System
            5.2.1 --- Sub-syllable Unit
            5.2.2 --- Text Analysis Module
            5.2.3 --- Acoustical Synthesis
            5.2.4 --- Prosody Module
        5.3 --- Enhanced Prosody Model
            5.3.1 --- Modeling Tone Contours
                5.3.1.1 --- Word-level F0 Contours
                5.3.1.2 --- Phrase-initial Tone Contours
                5.3.1.3 --- Tone Contours at Word Boundary
            5.3.2 --- Modeling Phrase Curves
            5.3.3 --- Generation of Continuous F0 Contours
        5.4 --- Summary
        Reference
    Chapter 6 --- Performance Evaluation
        6.1 --- Introduction to Perceptual Test
            6.1.1 --- Aspects of Evaluation
            6.1.2 --- Methods of Judgment Test
            6.1.3 --- Problems in Perceptual Test
        6.2 --- Perceptual Tests for Cantonese TTS
            6.2.1 --- Intelligibility Tests
                6.2.1.1 --- Method
                6.2.1.2 --- Results
                6.2.1.3 --- Analysis
            6.2.2 --- Naturalness Tests
                6.2.2.1 --- Word-level
                    6.2.2.1.1 --- Method
                    6.2.2.1.2 --- Results
                    6.2.2.1.3 --- Analysis
                6.2.2.2 --- Sentence-level
                    6.2.2.2.1 --- Method
                    6.2.2.2.2 --- Results
                    6.2.2.2.3 --- Analysis
        6.3 --- Conclusions
        6.4 --- Summary
        Reference
    Chapter 7 --- Conclusions and Future Work
        7.1 --- Conclusions
        7.2 --- Suggested Future Work
    Appendix
        Appendix 1 --- Linear Regression
        Appendix 2 --- 36 Templates of Cross-word Contours
        Appendix 3 --- Word List for Word-level Tests
        Appendix 4 --- Syllable Occurrence in Word List of Intelligibility Test
        Appendix 5 --- Wrongly Identified Word List
        Appendix 6 --- Confusion Matrix
        Appendix 7 --- Unintelligible Word List
        Appendix 8 --- Noisy Word List
        Appendix 9 --- Sentence List for Naturalness Test

    A Study of Accomodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology of monitoring accommodation during a human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modeling of the behaviour, in a way which is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame which circumvents strict attribution of speaker-turns, by considering both interlocutors as synchronously active. Both TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS “turn-taking” behaviour.
Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
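
The time-aligned moving average described in this abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function name `tama_frames`, the frame length, the step size and the sample data are all hypothetical, and the abstract does not say how frames are sized or whether they overlap. The sketch only shows the general idea of averaging a prosodic feature per speaker over frames that share one time grid.

```python
from statistics import mean


def tama_frames(samples, frame_len, step, end_time):
    """Average (time, value) samples over frames on a fixed, shared time grid.

    frame_len and step are assumed parameters (in seconds); choosing
    step < frame_len makes consecutive frames overlap, which gives the
    moving-average behaviour.
    """
    averages = []
    index = 0
    while index * step < end_time:
        start = index * step
        # Collect the samples that fall inside the current frame.
        values = [v for t, v in samples if start <= t < start + frame_len]
        # A frame in which the speaker was silent has no average.
        averages.append(mean(values) if values else None)
        index += 1
    return averages


# Hypothetical pitch samples (seconds, Hz) for two speakers.
speaker_a = [(1.0, 200.0), (6.0, 210.0), (12.0, 220.0), (18.0, 230.0)]
speaker_b = [(3.0, 120.0), (9.0, 130.0), (14.0, 125.0)]

# Both speakers use the same grid, so frame i covers the same time span
# for each of them and the two series can be compared index by index.
frames_a = tama_frames(speaker_a, frame_len=10.0, step=5.0, end_time=20.0)
frames_b = tama_frames(speaker_b, frame_len=10.0, step=5.0, end_time=20.0)
```

Because both series are built on the same grid, they could then be passed to a correlation or time-series analysis, which is the kind of statistical modeling the abstract refers to.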

    The form and auditory control of downward trends in intonation

    Of all the areas of intonational research, study of the tendency of the frequency of vocal fold vibration to decline during the course of an utterance - F0 declination - is likely initially to be the most fruitful in determining the interaction between perceptual and productive processes. A general introduction to the phenomenon is augmented by analysis of different methods of determining declination lines; theoretical treatments are then introduced. One particular local factor contributing to the downward trend, downstep, is discussed, and its pivotal role in the intonational phonology developed by Janet Pierrehumbert critically examined. In the light of the theoretical discussion, two competing hypotheses are presented as to the mediation of the declination effect, which is the effect that of two accented syllables in an utterance, the second has to have a lower peak F0 value than the first for them to be judged to have equal prominence. The Global Declination Hypothesis attributes this to the use by speakers and hearers of one or two abstract reference lines declining through the course of a tone-unit. The Local Declination Hypothesis attributes it to the disposition of F0 excursions surrounding the two accents as well as to the respective peak values. The Global Declination Hypothesis is tested by presenting listeners with pairs of dual-peak accented utterances with the two peaks identical in F0, without any physically present local declination, and asking them to rate the prominence of the second peak of each such utterance. No significant differences are found in the prominence ratings, so the Local Declination Hypothesis appears to be favoured. That hypothesis is itself tested through the development of a model of individual accent prominence, which incorporates terms for surrounding unaccented context. This is then used as the basis of a model of the perceptual constraints on the production of intonation in the scaling of target peaks. 
The model predicts that local slope between accents and slope of the context after the target accent, as well as other local variables, jointly determine the F0 value of a peak with a particular targeted prominence relationship with its predecessor. If the interaccentual stretch is declining, the declination effect is predicted to occur, ceteris paribus. The model is found to be initially acceptable. In addition, a global interpretation of downstep is made within the model. The mechanisms the model is suggested to represent are auditory feedback control loops of a variety of possible degrees of complexity. An experiment is devised to test for the basic existence of a feedback loop which is used to prevent local slope exceeding an arbitrary threshold value. Auditory feedback in subjects was disrupted by headphone-administration of low-pass filtered masking noise during their utterance of a sustained vowel, and a short and a long dual peak-accented sentence. The disruption was sufficient to alter the apparent mechanism controlling the production of the sustained vowel, but the Lombard effect, whereby subjects automatically raise the level of their voice in ambient noise, was found to be a vitiating factor. General conclusions are drawn on the nature of the declination phenomenon in intonation, and proposals made for future research.