
    Automatic labeling of contrastive word pairs from spontaneous spoken English

    This paper addresses the problem of automatically labeling contrast in spontaneous speech, where contrast is understood as a relation that ties two words that explicitly contrast with each other. Detection of contrast is relevant to the analysis of discourse and information structure and, because of the prosodic correlates of contrast, could also play an important role in speech applications, such as text-to-speech synthesis, that need accurate, discourse-context-aware modeling of prosody. To this end, we investigate the feasibility of automatic contrast labeling by training and evaluating on the Switchboard corpus a novel contrast tagger, based on Support Vector Machines (SVM), that combines lexical features, syntactic dependencies and WordNet semantic relations.

    Automatic Detection of Contrastive Elements in Spontaneous Speech

    In natural speech, people use different levels of prominence to signal which parts of an utterance are especially important. Contrastive elements are often produced with stronger than usual prominence, and their presence modifies the meaning of the utterance in subtle but important ways. We use a richly annotated corpus of conversational speech to study the acoustic characteristics of contrastive elements and the differences between them and words at other levels of prominence. We report our results for automatic detection of contrastive elements based on acoustic and textual features, finding that a baseline predicting nouns and adjectives as contrastive performs on par with the best combination of features. We achieve much better performance in a modified task of detecting contrastive elements among words that are predicted to bear pitch accent.

    Towards Hierarchical Prosodic Prominence Generation in TTS Synthesis

    We address the problem of identification (from text) and generation of pitch accents in HMM-based English TTS synthesis. We show, through a large-scale perceptual test, that a large improvement in the binary discrimination between pitch-accented and non-accented words has no effect on the quality of the speech generated by the system. On the other hand, adding a third accent type that emphatically marks words that convey "contrastive" focus (automatically identified from text) produces beneficial effects on the synthesized speech. These results support the accounts of prosodic prominence that consider the prosodic patterns of utterances to be hierarchically structured, and they point out the limits of flattening such structure into a simple accent/non-accent distinction. Index Terms: speech synthesis, HMM, pitch accents, focus detection

    The Role of Prosodic Stress and Speech Perturbation on the Temporal Synchronization of Speech and Deictic Gestures

    Gestures and speech converge during spoken language production. Although the temporal relationship of gestures and speech is thought to depend upon factors such as prosodic stress and word onset, the effects of controlled alterations in the speech signal upon the degree of synchrony between manual gestures and speech are uncertain. Thus, the precise nature of the interactive mechanism of speech-gesture production, or lack thereof, is not agreed upon or even frequently postulated. In Experiment 1, syllable position and contrastive stress were manipulated during sentence production to investigate the synchronization of speech and pointing gestures. An additional aim of Experiment 2 was to investigate the temporal relationship of speech and pointing gestures when speech is perturbed with delayed auditory feedback (DAF). Comparisons between the time of gesture apex and vowel midpoint (GA-VM) for each of the conditions were made for both Experiment 1 and Experiment 2. Additional comparisons of the interval from gesture launch midpoint to vowel midpoint (GLM-VM), total gesture time, gesture launch time, and gesture return time were made for Experiment 2. The results for the first experiment indicated that gestures were more synchronized with first position syllables and neutral syllables, as measured by GA-VM intervals. The first position syllable effect was also found in the second experiment. However, the results from Experiment 2 supported an effect of contrastive pitch accent. GLM-VM was shorter for first position targets and accented syllables. In addition, gesture launch times and total gesture times were longer for contrastive pitch accented syllables, especially when in the second position of words. Contrary to the predictions, significantly longer GA-VM and GLM-VM intervals were observed when individuals responded under delayed auditory feedback (DAF). Vowel and sentence durations increased both with DAF and when a contrastive accented syllable was produced. Vowels were longest for accented, second position syllables. These findings provide evidence that the timing of gesture is adjusted based upon manipulations of the speech stream. A potential mechanism of entrainment of the speech and gesture system is offered as an explanation for the observed effects.

    Identifying prosodic prominence patterns for English text-to-speech synthesis

    This thesis proposes to improve and enrich the expressiveness of English Text-to-Speech (TTS) synthesis by identifying and generating natural patterns of prosodic prominence. In most state-of-the-art TTS systems, the prediction from text of prosodic prominence relations between words in an utterance relies on features that only loosely account for the combined effects of syntax, semantics, word informativeness and salience on prosodic prominence. To improve prosodic prominence prediction, we first follow the classic approach in which prosodic prominence patterns are flattened into binary sequences of pitch-accented and pitch-unaccented words. We propose and motivate statistical and syntactic-dependency-based features that are complementary to the most predictive features proposed in previous work on automatic pitch accent prediction, and we show their utility on both read and spontaneous speech. Different accentuation patterns can be associated with the same sentence. Such variability raises the question of how to evaluate pitch accent predictors when more than one pattern is allowed. We carry out a study of the variability of prosodic symbols on a speech corpus where different speakers read the same text, and we propose an information-theoretic definition of the optionality of symbolic prosodic events that leads to a novel evaluation metric in which prosodic variability is incorporated as a factor affecting prediction accuracy. We additionally propose a method to take advantage of the optionality of prosodic events in unit-selection speech synthesis. To better account for the tight links between the prosodic prominence of a word and the discourse/sentence context, part of this thesis goes beyond the accent/no-accent dichotomy and is devoted to a novel task, the automatic detection of contrast, where contrast is understood as an Information Structure relation that ties two words that explicitly contrast with each other. This task is mainly motivated by the fact that contrastive words tend to be prosodically marked with particularly prominent pitch accents. The identification of contrastive word pairs is achieved by combining lexical information, syntactic information (which mainly aims to identify the syntactic parallelism that often activates contrast) and semantic information (mainly drawn from the WordNet semantic lexicon) within a Support Vector Machines classifier. Once we have identified patterns of prosodic prominence, we propose methods to incorporate such information in TTS synthesis and test its impact on synthetic speech naturalness through large-scale perceptual experiments. The results of these experiments cast some doubt on the utility of a simple accent/no-accent distinction in Hidden Markov Model based speech synthesis while highlighting the importance of contrastive accents.

    Acoustic Correlates of Information Structure

    This paper reports three studies aimed at addressing three questions about the acoustic correlates of information structure in English: (1) do speakers mark information structure prosodically, and, to the extent they do, (2) what are the acoustic features associated with different aspects of information structure, and (3) how well can listeners retrieve this information from the signal? The information structure of subject-verb-object sentences was manipulated via the questions preceding those sentences: elements in the target sentences were either focused (i.e., the answer to a wh-question) or given (i.e., mentioned in prior discourse); furthermore, focused elements had either an implicit or an explicit contrast set in the discourse; finally, either only the object was focused (narrow object focus) or the entire event was focused (wide focus). The results across all three experiments demonstrated that people reliably mark (1) focus location (subject, verb, or object) using greater intensity, longer duration, and higher mean and maximum F0, and (2) focus breadth, such that narrow object focus is marked with greater intensity, longer duration, and higher mean and maximum F0 on the object than wide focus. Furthermore, when participants are made aware of prosodic ambiguity present across different information structures, they reliably mark focus type, so that contrastively focused elements are produced with greater intensity, longer duration, and lower mean and maximum F0 than noncontrastively focused elements. In addition to having important theoretical consequences for accounts of semantics and prosody, these experiments demonstrate that linear residualisation successfully removes individual differences in people's productions, thereby revealing cross-speaker generalisations. Furthermore, discriminant modelling allows us to objectively determine the acoustic features that underlie meaning differences.

    Methods in prosody

    This book presents a collection of pioneering papers reflecting current methods in prosody research with a focus on Romance languages. The rapid expansion of the field of prosody research in recent decades has given rise to a proliferation of methods that has left little room for their critical assessment. The aim of this volume is to bridge this gap by bringing together original contributions in which experts in the field assess, reflect on, and discuss different methods of data gathering and analysis. The book may thus be of interest to scholars and established researchers as well as to students and young academics who wish to explore the topic of prosody, an expanding and promising area of study.

    Consistency of prosodic transcriptions: labelling experiments with trained and untrained transcribers
