Explaining the PENTA model: a reply to Arvaniti and Ladd
This paper presents an overview of the Parallel Encoding and Target Approximation (PENTA) model of speech prosody, in response to an extensive critique by Arvaniti & Ladd (2009). PENTA is a framework for conceptually and computationally linking communicative meanings to fine-grained prosodic details, based on an articulatory-functional view of speech. Target Approximation simulates the articulatory realisation of underlying pitch targets, the prosodic primitives in the framework. Parallel Encoding provides an operational scheme that enables simultaneous encoding of multiple communicative functions. We also outline how PENTA can be computationally tested with a set of software tools. With the help of one of the tools, we offer a PENTA-based hypothetical account of the Greek intonational patterns reported by Arvaniti & Ladd, showing how it is possible to predict the prosodic shapes of an utterance based on the lexical and postlexical meanings it conveys.
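The Target Approximation component has a published quantitative formulation (qTA; Prom-on, Xu & Thipakorn, 2009) that makes "approximating underlying pitch targets" concrete. Sketched here for orientation rather than quoted from the paper: each syllable carries a linear underlying target, and surface F0 is modelled as a critically damped response that approaches it:

```latex
\begin{align}
x(t)   &= m\,t + b \\
f_0(t) &= x(t) + \left(c_1 + c_2 t + c_3 t^2\right) e^{-\lambda t}
\end{align}
```

Here $m$ and $b$ are the slope and height of the pitch target, $\lambda$ is the rate of target approximation, and $c_1, c_2, c_3$ are fixed by the F0 value, velocity, and acceleration at syllable onset, so that contours remain smooth across syllable boundaries.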
Chapter 2: The Original ToBI System and the Evolution of the ToBI Framework
In this chapter, the authors will try to identify the essential properties of a ToBI framework annotation system by describing the development and design of the original ToBI conventions. In this description, the authors will overview the general phonological theory and the specific theory of Mainstream American English intonation and prosody that they decided to incorporate in the original ToBI tags. The authors will also state the practical principles that led them to make the decisions that they did. The chapter is organised as follows. Section 2.2 briefly chronicles how the MAE_ToBI system came into being. Section 2.3 briefly describes the consensus account of English intonation and prosody on which the MAE_ToBI system is based. Section 2.4 catalogues the different components of a MAE_ToBI transcription and lists the salient rules which constrain the relationships between different components. This section also expands upon the theoretical foundations and practical consequences of adopting the general structure of multiple labelling tiers, and particularly the separation of the labels for tones from the labels for indexing prosodic boundary strength. Section 2.5 then describes some of the extensions of the basic ToBI tiers that have been adopted by some sites. This section also compares their decisions about the number of tiers and about inter-tier constraints with the analogous decisions for some of the other ToBI systems described in this book. Section 2.6 discusses the status of the symbolic labels relative to the continuous phonetic records that are also an obligatory component of the MAE_ToBI transcription. Section 2.7 then closes by listing several open research questions that the authors would like to see addressed by MAE_ToBI users and the larger ToBI community.
Nonparallel Emotional Speech Conversion
We propose a nonparallel data-driven emotional speech conversion method. It
enables the transfer of emotion-related characteristics of a speech signal
while preserving the speaker's identity and linguistic content. Most existing
approaches require parallel data and time alignment, which are not available in
most real applications. We achieve nonparallel training based on an
unsupervised style transfer technique, which learns a translation model between
two distributions instead of a deterministic one-to-one mapping between paired
examples. The conversion model consists of an encoder and a decoder for each
emotion domain. We assume that the speech signal can be decomposed into an
emotion-invariant content code and an emotion-related style code in latent
space. Emotion conversion is performed by extracting and recombining the
content code of the source speech and the style code of the target emotion. We
tested our method on a nonparallel corpus with four emotions. Both subjective
and objective evaluations show the effectiveness of our approach. Comment: Published in INTERSPEECH 2019, 5 pages, 6 figures. Simulation
available at http://www.jian-gao.org/emoga
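The core recombination idea in the abstract (decompose speech into an emotion-invariant content code and an emotion-related style code, then swap styles) can be sketched numerically. In the paper the encoders and decoder are learned networks; here, as a loudly labelled stand-in, the "content code" is the mean-removed feature trajectory and the "style code" is the per-dimension mean, which is only a crude statistic, not the authors' model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned encoders/decoder: the "content
# code" is the mean-removed feature trajectory, the "style code" is the
# per-dimension mean of the emotion domain.
def encode_content(features):
    return features - features.mean(axis=0)

def encode_style(features):
    return features.mean(axis=0)

def decode(content_code, style_code):
    return content_code + style_code

# Toy "neutral" source utterance and "angry" target-domain utterance,
# each a (frames x dims) matrix of acoustic features. Note the two
# utterances need not be parallel (different frame counts).
neutral = rng.normal(loc=0.0, scale=1.0, size=(100, 8))
angry = rng.normal(loc=2.0, scale=1.0, size=(120, 8))

# Emotion conversion: recombine the content code of the source speech
# with the style code of the target emotion.
converted = decode(encode_content(neutral), encode_style(angry))
```

In this toy, the frame-to-frame shape of the source (its "content") is preserved exactly while the feature statistics move to the target domain, mirroring the intent, though not the machinery, of the unsupervised style-transfer training described above.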
Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning
Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Besides that, whichever learning technique is used, hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims at addressing the above issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation and thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model, target approximation (TA), between text and speech that controls syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. With the goal of demonstrating that human speech movement can be considered as a dynamic process of target approximation and that the TA model is a valid F0 generation model to be used at the motor-to-acoustic stage, a TA-based pitch control experiment is conducted first to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the set of required linguistic features was reduced first to syllable level only (with DNN) and then with all positional information removed (with RNN).
Fewer linguistic features as input with a limited number of TA parameters as output led to less training data and lower model complexity, which in turn led to more efficient training and faster synthesis.
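The syllable-synchronised TA dynamic at the motor-to-acoustic stage can be sketched numerically. This follows the published quantitative TA (qTA) equations as commonly stated (surface F0 as a critically damped response to a linear pitch target); the parameter values below are illustrative, not taken from the thesis:

```python
import numpy as np

def qta_f0(t, m, b, lam, p1, p2, p3):
    """Sketch of qTA-style target approximation within one syllable.

    t          : time points within the syllable (s)
    m, b       : slope and height of the underlying pitch target x(t) = m*t + b
    lam        : rate of target approximation
    p1, p2, p3 : F0, velocity and acceleration at syllable onset (transferred
                 from the previous syllable for cross-syllable continuity)
    """
    # Transient coefficients fixed by the onset state, so the contour
    # starts exactly at (p1, p2, p3) and decays toward the target line.
    c1 = p1 - b
    c2 = p2 + c1 * lam - m
    c3 = (p3 + 2 * c2 * lam - c1 * lam ** 2) / 2
    return (m * t + b) + (c1 + c2 * t + c3 * t ** 2) * np.exp(-lam * t)

# Illustrative falling target (220 Hz, slope -40 Hz/s) approached from a
# higher onset pitch of 250 Hz over a 250 ms syllable.
t = np.linspace(0.0, 0.25, 251)
f0 = qta_f0(t, m=-40.0, b=220.0, lam=40.0, p1=250.0, p2=0.0, p3=0.0)
```

In the thesis's paradigm, a DNN/RNN would predict the per-syllable parameters (m, b, lam) from linguistic features, and this dynamic would then render the continuous F0 contour.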
Temporal Coding of Voice Pitch Contours in Mandarin Tones
Accurate perception of time-variant pitch is important for speech recognition, particularly for tonal languages with different lexical tones such as Mandarin, in which different tones convey different semantic information. Previous studies reported that the auditory nerve and cochlear nucleus can encode different pitches through phase-locked neural activities. However, little is known about how the inferior colliculus (IC) encodes the time-variant periodicity pitch of natural speech. In this study, the Mandarin syllable /ba/ pronounced with the four lexical tones (flat, rising, falling-then-rising, and falling) was used as the stimulus set. Local field potentials (LFPs) and single neuron activity were simultaneously recorded from 90 sites within the contralateral IC of six urethane-anesthetized and decerebrate guinea pigs in response to the four stimuli. Analysis of the temporal information of the LFPs showed that 93% of the LFPs exhibited robust encoding of periodicity pitch. Pitch strength of LFPs derived from the autocorrelogram was significantly (p < 0.001) stronger for rising tones than for flat and falling tones. Pitch strength also increased significantly (p < 0.05) with the characteristic frequency (CF). On the other hand, only 47% (42 of 90) of single neuron activities were significantly synchronized to the fundamental frequency of the stimulus, suggesting that the temporal spiking pattern of a single IC neuron alone may not encode the time-variant periodicity pitch of speech robustly. The difference between the number of LFPs and single neurons that encode the time-variant F0 voice pitch supports the notion of a transition at the level of the IC from direct temporal coding in the spike trains of individual neurons to other forms of neural representation.
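The "pitch strength derived from the autocorrelogram" used above can be illustrated with a simple signal-processing sketch: periodicity strength read off as the height of the largest normalized autocorrelation peak in the candidate pitch-lag range. This is a generic autocorrelation pitch measure, not the study's exact analysis pipeline:

```python
import numpy as np

def pitch_strength(signal, fs, f0_min=50.0, f0_max=500.0):
    """Periodicity strength and F0 estimate from one autocorrelation frame.

    Returns (strength, f0): strength is the height of the largest
    normalized autocorrelation peak in the pitch-lag range (near 1 for
    strongly periodic frames, near 0 for noise), f0 the lag converted
    back to frequency.
    """
    x = signal - signal.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    ac = ac / ac[0]                                    # normalize: ac[0] = 1
    lag_min = int(fs / f0_max)                         # shortest candidate period
    lag_max = int(fs / f0_min)                         # longest candidate period
    window = ac[lag_min:lag_max + 1]
    peak = int(np.argmax(window)) + lag_min
    return ac[peak], fs / peak

# A strongly periodic 200 Hz "voiced" frame, 50 ms at 16 kHz.
fs = 16000
t = np.arange(0, 0.05, 1 / fs)
tone = np.sin(2 * np.pi * 200.0 * t)
strength, f0 = pitch_strength(tone, fs)
```

Applied frame by frame along an utterance, this yields an autocorrelogram-style trajectory of pitch strength over time, the kind of measure the LFP analysis above quantifies.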
Investigating the build-up of precedence effect using reflection masking
The auditory processing level involved in the build-up of precedence [Freyman et al., J. Acoust. Soc. Am. 90, 874-884 (1991)] has been investigated here by employing reflection masked threshold (RMT) techniques. Given that RMT techniques are generally assumed to address lower levels of the auditory signal processing, such an approach represents a bottom-up approach to the buildup of precedence. Three conditioner configurations measuring a possible buildup of reflection suppression were compared to the baseline RMT for four reflection delays ranging from 2.5-15 ms. No buildup of reflection suppression was observed for any of the conditioner configurations. Buildup of template (decrease in RMT for two of the conditioners), on the other hand, was found to be delay dependent. For five of six listeners, with reflection delay = 2.5 and 15 ms, RMT decreased relative to the baseline. For 5- and 10-ms delay, no change in threshold was observed. It is concluded that the low-level auditory processing involved in RMT is not sufficient to realize a buildup of reflection suppression. This confirms suggestions that higher level processing is involved in PE buildup. The observed enhancement of reflection detection (RMT) may contribute to active suppression at higher processing levels
Prosody analysis and modeling for Cantonese text-to-speech.
Li Yu Jia. Thesis (M.Phil.), Chinese University of Hong Kong, 2003. Includes bibliographical references. Abstracts in English and Chinese.

Chapter 1: Introduction
  1.1 TTS Technology
  1.2 Prosody
    1.2.1 What is Prosody
    1.2.2 Prosody from Different Perspectives
    1.2.3 Acoustical Parameters of Prosody
    1.2.4 Prosody in TTS (Analysis; Modeling; Evaluation)
  1.3 Thesis Objectives
  1.4 Thesis Outline
Chapter 2: Cantonese
  2.1 The Cantonese Dialect
    2.1.1 Phonology (Initial; Final; Tone)
    2.1.2 Phonological Constraints
  2.2 Tones in Cantonese
    2.2.1 Tone System
    2.2.2 Linguistic Significance
    2.2.3 Acoustical Realization
  2.3 Prosodic Variation in Continuous Cantonese Speech
  2.4 Cantonese Speech Corpus - CUProsody
Chapter 3: F0 Normalization
  3.1 F0 in Speech Production
  3.2 F0 Extraction
  3.3 Duration-normalized Tone Contour
  3.4 F0 Normalization
    3.4.1 Necessity and Motivation
    3.4.2 F0 Normalization (Methodology; Assumptions; Estimation of Relative Tone Ratios; Derivation of Phrase Curve; Normalization of Absolute F0 Values)
    3.4.3 Experiments and Discussion
  3.5 Conclusions
Chapter 4: Acoustical F0 Analysis
  4.1 Methodology of F0 Analysis
    4.1.1 Analysis-by-Synthesis
    4.1.2 Acoustical Analysis
  4.2 Acoustical F0 Analysis for Cantonese
    4.2.1 Analysis of Phrase Curves
    4.2.2 Analysis of Tone Contours (Context-independent Single-tone Contours; Contextual Variation; Co-articulated Tone Contours of Disyllabic Word; Cross-word Contours; Phrase-initial Tone Contours)
  4.3 Summary
Chapter 5: Prosody Modeling for Cantonese Text-to-Speech
  5.1 Parametric Model and Non-parametric Model
  5.2 Cantonese Text-to-Speech: Baseline System (Sub-syllable Unit; Text Analysis Module; Acoustical Synthesis; Prosody Module)
  5.3 Enhanced Prosody Model
    5.3.1 Modeling Tone Contours (Word-level F0 Contours; Phrase-initial Tone Contours; Tone Contours at Word Boundary)
    5.3.2 Modeling Phrase Curves
    5.3.3 Generation of Continuous F0 Contours
  5.4 Summary
Chapter 6: Performance Evaluation
  6.1 Introduction to Perceptual Test (Aspects of Evaluation; Methods of Judgment Test; Problems in Perceptual Test)
  6.2 Perceptual Tests for Cantonese TTS
    6.2.1 Intelligibility Tests (Method; Results; Analysis)
    6.2.2 Naturalness Tests: Word-level and Sentence-level (Method; Results; Analysis for each)
  6.3 Conclusions
  6.4 Summary
Chapter 7: Conclusions and Future Work
  7.1 Conclusions
  7.2 Suggested Future Work
Appendices: Linear Regression; 36 Templates of Cross-word Contours; Word List for Word-level Tests; Syllable Occurrence in Word List of Intelligibility Test; Wrongly Identified Word List; Confusion Matrix; Unintelligible Word List; Noisy Word List; Sentence List for Naturalness Test