1,859 research outputs found
Identifying prosodic prominence patterns for English text-to-speech synthesis
This thesis proposes to improve and enrich the expressiveness of English Text-to-Speech (TTS) synthesis by identifying and generating natural patterns of prosodic
prominence.
In most state-of-the-art TTS systems, the prediction from text of prosodic prominence
relations between words in an utterance relies on features that account only very loosely
for the combined effects of syntax, semantics, word informativeness and salience
on prosodic prominence.
To improve prosodic prominence prediction, we first follow the classic approach
in which prosodic prominence patterns are flattened into binary sequences of pitch-accented
and unaccented words. We propose and motivate statistical and syntactic-dependency-based
features that are complementary to the most predictive features proposed
in previous work on automatic pitch accent prediction, and show their utility on
both read and spontaneous speech.
Different accentuation patterns can be associated with the same sentence. Such variability
raises the question of how to evaluate pitch accent predictors when more than one pattern
is allowed. We carry out a study of the variability of prosodic symbols on a speech corpus
in which different speakers read the same text, and propose an information-theoretic definition
of the optionality of symbolic prosodic events that leads to a novel evaluation metric
in which prosodic variability is incorporated as a factor affecting prediction accuracy.
We additionally propose a method to take advantage of the optionality of prosodic
events in unit-selection speech synthesis.
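As a rough illustration of what an information-theoretic notion of optionality could look like, the sketch below computes the Shannon entropy of the accent labels that several speakers assign to each word of the same text. The abstract does not state the thesis's actual definition, so the use of entropy, the function names `label_entropy` and `optionality_per_word`, and the toy labels are all assumptions made for illustration.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the labels given to one word by different speakers."""
    counts = Counter(labels)
    total = len(labels)
    # p * log2(1/p), summed over the observed labels
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def optionality_per_word(readings):
    """readings: one label sequence per speaker, all of the same length.

    Returns one entropy value per word position (assumed proxy for optionality).
    """
    return [label_entropy(column) for column in zip(*readings)]

# Hypothetical toy data: 4 speakers reading the same 3 words.
# 1 = pitch accented, 0 = unaccented.
readings = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
optionality = optionality_per_word(readings)
```

Under this assumption, a word with entropy near 0 bits is treated consistently by all speakers, while a word near 1 bit is accented by about half of them, so a predictor's choice on that word could plausibly be penalised less in an evaluation metric.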
To better account for the tight links between the prosodic prominence of a word and
the discourse/sentence context, part of this thesis goes beyond the accent/no-accent dichotomy
and is devoted to a novel task, the automatic detection of contrast, where
contrast is understood as an Information Structure relation that ties together two words that explicitly
contrast with each other. This task is mainly motivated by the fact that contrastive
words tend to be prosodically marked with particularly prominent pitch accents.
The identification of contrastive word pairs is achieved by combining lexical information,
syntactic information (which mainly aims to identify the syntactic parallelism
that often activates contrast) and semantic information (mainly drawn from the WordNet
semantic lexicon), within a Support Vector Machines classifier.
Once we have identified patterns of prosodic prominence, we propose methods to
incorporate such information in TTS synthesis and test its impact on synthetic speech
naturalness through large-scale perceptual experiments. The results of these experiments cast some doubt on the utility of a simple accent/no-accent
distinction in Hidden Markov Model based speech synthesis, while highlighting the
importance of contrastive accents.
Towards Hierarchical Prosodic Prominence Generation in TTS Synthesis
We address the problem of identification (from text) and generation of pitch accents in HMM-based English TTS synthesis. We show, through a large-scale perceptual test, that a large improvement in the binary discrimination between pitch-accented and non-accented words has no effect on the quality of the speech generated by the system. On the other hand, adding a third accent type that emphatically marks words conveying "contrastive" focus (automatically identified from text) has a beneficial effect on the synthesized speech. These results support accounts of prosodic prominence that treat the prosodic patterns of utterances as hierarchically structured, and point out the limits of flattening such structure into a simple accent/non-accent distinction. Index Terms: speech synthesis, HMM, pitch accents, focus detection
Tagging Prosody and Discourse Structure in Elicited Spontaneous Speech
This paper motivates and describes the annotation and analysis of prosody and discourse structure for several large spoken language corpora. The annotation schemas are of two types: tags for prosody and intonation, and tags for several aspects of discourse structure. The choice of the particular tagging schema in each domain is based in large part on the insights they provide in corpus-based studies of the relationship between discourse structure and the accenting of referring expressions in American English. We first describe these results and show that the same models account for the accenting of pronouns in an extended passage from one of the Speech Warehouse hotel-booking dialogues. We then turn to corpora described in Venditti [Ven00], which adapts the same models to Tokyo Japanese. Japanese is interesting to compare to English because accent is lexically specified and so cannot mark discourse focus in the same way. Analyses of these corpora show that local pitch range expansion serves the analogous focusing function in Japanese. The paper concludes with a section describing several outstanding questions in the annotation of Japanese intonation which corpus studies can help to resolve. Work reported in this paper was supported in part by a grant from the Ohio State University Office of Research, to Mary E. Beckman and co-principal investigators on the OSU Speech Warehouse project, and by an Ohio State University Presidential Fellowship to Jennifer J. Venditti.
Hierarchical Representation and Estimation of Prosody using Continuous Wavelet Transform
Prominences and boundaries are the essential constituents of prosodic structure in speech. They provide a means to chunk the speech stream into linguistically relevant units by giving them relative saliences and demarcating them within utterance structures. Prominences and boundaries have been widely used both in basic research on prosody and in text-to-speech synthesis. However, there are no representation schemes that provide for both estimating and modelling them in a unified fashion. Here we present an unsupervised, unified account for estimating and representing prosodic prominences and boundaries using a scale-space analysis based on the continuous wavelet transform. The methods are evaluated and compared to earlier work using the Boston University Radio News corpus. The results show that the proposed method is comparable with the best published supervised annotation methods. Peer reviewed
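To make the idea of a wavelet-based scale-space analysis more concrete, the following minimal sketch computes one row of a continuous wavelet transform over a toy one-dimensional contour and picks its local maxima as candidate prominences. It is not the paper's method: the Mexican-hat (Ricker) wavelet, the synthetic two-bump signal, the chosen scale, and the helper names `ricker`, `cwt_row` and `local_maxima` are all assumptions chosen for illustration.

```python
import math

def ricker(t, scale):
    """Mexican-hat (Ricker) wavelet evaluated at offset t for a given scale."""
    x = t / scale
    return (1.0 - x * x) * math.exp(-0.5 * x * x)

def cwt_row(signal, scale):
    """One scale of a continuous wavelet transform, by direct correlation."""
    n = len(signal)
    half = int(4 * scale)  # truncate the wavelet support at +/- 4 scales
    row = []
    for i in range(n):
        acc = 0.0
        for k in range(-half, half + 1):
            j = i + k
            if 0 <= j < n:  # samples outside the signal count as zero
                acc += signal[j] * ricker(k, scale)
        row.append(acc / math.sqrt(scale))
    return row

def local_maxima(row):
    """Indices where the response is higher than its left neighbour and not lower than its right."""
    return [i for i in range(1, len(row) - 1)
            if row[i] > row[i - 1] and row[i] >= row[i + 1]]

# Hypothetical contour: a strong bump at frame 20 and a weaker one at frame 60.
signal = [
    1.0 * math.exp(-0.5 * ((i - 20) / 3.0) ** 2)
    + 0.4 * math.exp(-0.5 * ((i - 60) / 3.0) ** 2)
    for i in range(80)
]

row = cwt_row(signal, 4.0)
peaks = local_maxima(row)
strongest = max(range(len(row)), key=row.__getitem__)
```

In this toy setting the response at the stronger bump exceeds the response at the weaker one, which is the sense in which peak height could serve as a relative prominence estimate. Repeating the computation over several scales and linking peaks across them would give a hierarchical picture, and troughs could be treated analogously as boundary candidates, though both steps are left out here.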
A COMPREHENSIVE REVIEW OF INTONATION: PSYCHOACOUSTICS MODELING OF PROSODIC PROMINENCE
Bolinger (1978:475), one of the foremost authorities on prosody of a generation ago, said that “Intonation is a half-tamed savage. To understand the tamed or linguistically harnessed half of him, one has to make friends with the wild half.” This review provides a brief explanation of the tamed and untamed halves of intonation. It is argued here that the pitch-centered approach that has been used for several decades is the reason one half of intonation remains untamed. To tame intonation completely, a holistic acoustic approach is required that takes intensity and duration as seriously as it does pitch. Speech is a three-dimensional physical entity in which all three correlates work independently and interdependently. Consequently, a methodology that addresses intonation comprehensively is more likely to yield better results. Psychoacoustics seems to be well positioned for this task. Nearly 100 years of experimentation have led to the discovery of Just Noticeable Difference (JND) thresholds that can be summoned to help tame intonation completely. The framework discussed here expands the analytical resources and facilitates an optimal description of intonation. It calculates and ranks the relative functional load (RFL) of pitch, intensity, and duration, and uses the results to compute the melodicity score of utterances. The findings replicate, based on JNDs, how the naked ear perceives intonation on a four-point Likert melodicity scale.
Prosodic Annotation in a Thai Text-to-speech System
PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 200
Chapter 2: The Original ToBI System and the Evolution of the ToBI Framework
In this chapter, the authors try to identify the essential properties of a ToBI framework annotation system by describing the development and design of the original ToBI conventions. In this description, they give an overview of the general phonological theory and the specific theory of Mainstream American English intonation and prosody that they decided to incorporate in the original ToBI tags. They also state the practical principles that guided their decisions. The chapter is organised as follows. Section 2.2 briefly chronicles how the MAE_ToBI system came into being. Section 2.3 briefly describes the consensus account of English intonation and prosody on which the MAE_ToBI system is based. Section 2.4 catalogues the different components of a MAE_ToBI transcription and lists the salient rules which constrain the relationships between different components. This section also expands upon the theoretical foundations and practical consequences of adopting the general structure of multiple labelling tiers, and particularly the separation of the labels for tones from the labels for indexing prosodic boundary strength. Section 2.5 then describes some of the extensions of the basic ToBI tiers that have been adopted by some sites. This section also compares the authors' decisions about the number of tiers and about inter-tier constraints with the analogous decisions for some of the other ToBI systems described in this book. Section 2.6 discusses the status of the symbolic labels relative to the continuous phonetic records that are also an obligatory component of the MAE_ToBI transcription. Section 2.7 then closes by listing several open research questions that the authors would like to see addressed by MAE_ToBI users and the larger ToBI community.