21 research outputs found

    Stylisation of Thai tones using Prosogram

    The aim of this study is to establish whether stylisation of F0 contours based on d'Alessandro and Mertens's model of tonal perception can be successfully applied to lexical tones of Central Thai. The percentage of correct responses to the manipulated stimuli was found to be significantly lower than the results for natural tones reported in the literature on the subject.

    Hierarchical Representation and Estimation of Prosody using Continuous Wavelet Transform

    Prominences and boundaries are the essential constituents of prosodic structure in speech. They provide a means to chunk the speech stream into linguistically relevant units by providing them with relative saliences and demarcating them within utterance structures. Prominences and boundaries have been widely used both in basic research on prosody and in text-to-speech synthesis. However, there are no representation schemes that support both estimating and modelling them in a unified fashion. Here we present an unsupervised unified account for estimating and representing prosodic prominences and boundaries using a scale-space analysis based on continuous wavelet transform. The methods are evaluated and compared to earlier work using the Boston University Radio News corpus. The results show that the proposed method is comparable with the best published supervised annotation methods.

    Emphasis and tonal implementation in Standard Chinese

    Despite the greatly improved understanding of tonal articulation in Standard Chinese, no consensus has been reached on the most appropriate model of tonal implementation. Results showed comparable increases in syllable duration from the NoEmphasis condition to the Emphasis condition and from the latter to the MoreEmphasis condition. F0 range expansion, however, was non-gradual: while there was a substantial increase in the F0 range from the NoEmphasis to the Emphasis condition, the expansion from the Emphasis to the MoreEmphasis condition was marginal. Analyses of the F0 patterns revealed that under emphasis, lexical tones were realized with magnified F0 contours which were adapted to both the neighbouring tones and the durational increase of the tone-bearing syllables, and were therefore maximally distinguishable from each other. Implications of these findings for models of tone and focus realization are discussed.

    Explaining the PENTA model: a reply to Arvaniti and Ladd

    This paper presents an overview of the Parallel Encoding and Target Approximation (PENTA) model of speech prosody, in response to an extensive critique by Arvaniti & Ladd (2009). PENTA is a framework for conceptually and computationally linking communicative meanings to fine-grained prosodic details, based on an articulatory-functional view of speech. Target Approximation simulates the articulatory realisation of underlying pitch targets – the prosodic primitives in the framework. Parallel Encoding provides an operational scheme that enables simultaneous encoding of multiple communicative functions. We also outline how PENTA can be computationally tested with a set of software tools. With the help of one of the tools, we offer a PENTA-based hypothetical account of the Greek intonational patterns reported by Arvaniti & Ladd, showing how it is possible to predict the prosodic shapes of an utterance based on the lexical and postlexical meanings it conveys.

    Explaining the PENTA model: A reply to Arvaniti and Ladd (2009)

    This paper presents an overview of the Parallel Encoding and Target Approximation (PENTA) model of speech prosody, in response to an extensive critique by Arvaniti & Ladd (2009). PENTA is a framework for conceptually and computationally linking communicative meanings to fine-grained prosodic details, based on an articulatory-functional view of speech. Target Approximation simulates the articulatory realisation of underlying pitch targets – the prosodic primitives in the framework. Parallel Encoding provides an operational scheme that enables simultaneous encoding of multiple communicative functions. We also outline how PENTA can be computationally tested with a set of software tools. With the help of one of the tools, we offer a PENTA-based hypothetical account of the Greek intonational patterns reported by Arvaniti & Ladd, showing how it is possible to predict the prosodic shapes of an utterance based on the lexical and postlexical meanings it conveys.

    Prosody analysis and modeling for Cantonese text-to-speech.

    Li Yu Jia. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references. Abstracts in English and Chinese.
    Chapter 1. Introduction
    1.1. TTS Technology
    1.2. Prosody
    1.2.1. What is Prosody
    1.2.2. Prosody from Different Perspectives
    1.2.3. Acoustical Parameters of Prosody
    1.2.4. Prosody in TTS
    1.2.4.1 Analysis
    1.2.4.2 Modeling
    1.2.4.3 Evaluation
    1.3. Thesis Objectives
    1.4. Thesis Outline
    Reference
    Chapter 2. Cantonese
    2.1. The Cantonese Dialect
    2.1.1. Phonology
    2.1.1.1 Initial
    2.1.1.2 Final
    2.1.1.3 Tone
    2.1.2. Phonological Constraints
    2.2. Tones in Cantonese
    2.2.1. Tone System
    2.2.2. Linguistic Significance
    2.2.3. Acoustical Realization
    2.3. Prosodic Variation in Continuous Cantonese Speech
    2.4. Cantonese Speech Corpus - CUProsody
    Reference
    Chapter 3. F0 Normalization
    3.1. F0 in Speech Production
    3.2. F0 Extraction
    3.3. Duration-normalized Tone Contour
    3.4. F0 Normalization
    3.4.1. Necessity and Motivation
    3.4.2. F0 Normalization
    3.4.2.1 Methodology
    3.4.2.2 Assumptions
    3.4.2.3 Estimation of Relative Tone Ratios
    3.4.2.4 Derivation of Phrase Curve
    3.4.2.5 Normalization of Absolute F0 Values
    3.4.3. Experiments and Discussion
    3.5. Conclusions
    Reference
    Chapter 4. Acoustical F0 Analysis
    4.1. Methodology of F0 Analysis
    4.1.1. Analysis-by-Synthesis
    4.1.2. Acoustical Analysis
    4.2. Acoustical F0 Analysis for Cantonese
    4.2.1. Analysis of Phrase Curves
    4.2.2. Analysis of Tone Contours
    4.2.2.1 Context-independent Single-tone Contours
    4.2.2.2 Contextual Variation
    4.2.2.3 Co-articulated Tone Contours of Disyllabic Word
    4.2.2.4 Cross-word Contours
    4.2.2.5 Phrase-initial Tone Contours
    4.3. Summary
    Reference
    Chapter 5. Prosody Modeling for Cantonese Text-to-Speech
    5.1. Parametric Model and Non-parametric Model
    5.2. Cantonese Text-to-Speech: Baseline System
    5.2.1. Sub-syllable Unit
    5.2.2. Text Analysis Module
    5.2.3. Acoustical Synthesis
    5.2.4. Prosody Module
    5.3. Enhanced Prosody Model
    5.3.1. Modeling Tone Contours
    5.3.1.1 Word-level F0 Contours
    5.3.1.2 Phrase-initial Tone Contours
    5.3.1.3 Tone Contours at Word Boundary
    5.3.2. Modeling Phrase Curves
    5.3.3. Generation of Continuous F0 Contours
    5.4. Summary
    Reference
    Chapter 6. Performance Evaluation
    6.1. Introduction to Perceptual Test
    6.1.1. Aspects of Evaluation
    6.1.2. Methods of Judgment Test
    6.1.3. Problems in Perceptual Test
    6.2. Perceptual Tests for Cantonese TTS
    6.2.1. Intelligibility Tests
    6.2.1.1 Method
    6.2.1.2 Results
    6.2.1.3 Analysis
    6.2.2. Naturalness Tests
    6.2.2.1 Word-level
    6.2.2.1.1 Method
    6.2.2.1.2 Results
    6.2.2.1.3 Analysis
    6.2.2.2 Sentence-level
    6.2.2.2.1 Method
    6.2.2.2.2 Results
    6.2.2.2.3 Analysis
    6.3. Conclusions
    6.4. Summary
    Reference
    Chapter 7. Conclusions and Future Work
    7.1. Conclusions
    7.2. Suggested Future Work
    Appendix
    Appendix 1: Linear Regression
    Appendix 2: 36 Templates of Cross-word Contours
    Appendix 3: Word List for Word-level Tests
    Appendix 4: Syllable Occurrence in Word List of Intelligibility Test
    Appendix 5: Wrongly Identified Word List
    Appendix 6: Confusion Matrix
    Appendix 7: Unintelligible Word List
    Appendix 8: Noisy Word List
    Appendix 9: Sentence List for Naturalness Test

    Toward invariant functional representations of variable surface fundamental frequency contours: Synthesizing speech melody via model-based stochastic learning

    Variability has been one of the major challenges for both theoretical understanding and computer synthesis of speech prosody. In this paper we show that economical representation of variability is the key to effective modeling of prosody. Specifically, we report the development of PENTAtrainer, a trainable yet deterministic prosody synthesizer based on an articulatory-functional view of speech. We show with testing results on Thai, Mandarin and English that it is possible to achieve high-accuracy predictive synthesis of fundamental frequency contours with very small sets of parameters obtained through stochastic learning from real speech data. The first key component of this system is syllable-synchronized sequential target approximation, implemented as the qTA model, which is designed to simulate, for each tonal unit, a wide range of contextual variability with a single invariant target. The second key component is the automatic learning of function-specific targets through stochastic global optimization, guided by a layered pseudo-hierarchical functional annotation scheme, which requires the manual labeling of only the temporal domains of the functional units. The results in terms of synthesis accuracy demonstrate that effective modeling of the contextual variability is also the key to effective modeling of function-related variability. Additionally, we show that, being both theory-based and trainable (hence data-driven), computational systems like PENTAtrainer can serve as an effective modeling tool in basic research, with which the level of falsifiability in theory testing can be raised and a closer link between basic and applied research in speech science can be developed.

    Automatic Conversion of Emotions in Speech within a Speaker Independent Framework

    Emotions in speech are a fundamental part of a natural dialog. In everyday life, vocal interaction with people often implies emotions as an intrinsic part of the conversation to a greater or lesser extent. Thus, the inclusion of emotions in human-machine dialog systems is crucial to achieve an acceptable degree of naturalness in the communication. This thesis focuses on automatic emotion conversion of speech, a technique whose aim is to transform an utterance produced in neutral style to a certain emotion state in a speaker independent context. Conversion of emotions represents a challenge in the sense that emotions significantly affect all parts of the human vocal production system, and in the conversion process all these factors must be taken into account carefully. The techniques used in the literature are based on voice conversion approaches, with minor modifications to create the sensation of emotion. In this thesis, the idea of voice conversion systems is used as well, but the usual regression process is divided into a two-step procedure that provides additional speaker normalization to remove the intrinsic speaker dependency of this kind of system, using vocal tract length normalization as a pre-processing technique. In addition, a new method to convert the duration trend of the utterance and the intonation contour is proposed, taking into account the contextual information.