The Importance of Sub-Utterance Prosody in Predicting Level of Certainty
We present an experiment aimed at understanding how to optimally use acoustic and prosodic information to predict a speaker's level of certainty. Using a corpus of utterances in which we can isolate a single word or phrase that is responsible for the speaker's level of certainty, we use different sets of sub-utterance prosodic features to train models for predicting an utterance's perceived level of certainty. Our results suggest that using prosodic features of the word or phrase responsible for the level of certainty, together with those of its surrounding context, improves prediction accuracy without increasing the total number of features, compared to using only features taken from the utterance as a whole.
Identifying Uncertain Words within an Utterance via Prosodic Features
We describe an experiment that investigates whether sub-utterance prosodic features can be used to detect uncertainty at the word level. That is, given an utterance that is classified as uncertain, we want to determine which word or phrase the speaker is uncertain about. We have a corpus of utterances spoken under varying degrees of certainty. Using combinations of sub-utterance prosodic features, we train models to predict the level of certainty of an utterance. On a set of utterances that were perceived to be uncertain, we compare the predictions of our models for two candidate target word segmentations: (a) one with the actual word causing uncertainty as the proposed target word, and (b) one with a control word as the proposed target word. Our best model correctly identifies the word causing the uncertainty rather than the control word 91% of the time.
Controlling for Confounders in Multimodal Emotion Classification via Adversarial Learning
Various psychological factors affect how individuals express emotions. Yet,
when we collect data intended for use in building emotion recognition systems,
we often try to do so by creating paradigms that are designed just with a focus
on eliciting emotional behavior. Algorithms trained with these types of data
are unlikely to function outside of controlled environments because our
emotions naturally change as a function of these other factors. In this work,
we study how the multimodal expressions of emotion change when an individual is
under varying levels of stress. We hypothesize that stress produces modulations
that can hide the true underlying emotions of individuals and that we can make
emotion recognition algorithms more generalizable by controlling for variations
in stress. To this end, we use adversarial networks to decorrelate stress
modulations from emotion representations. We study how stress alters acoustic
and lexical emotional predictions, paying special attention to how modulations
due to stress affect the transferability of learned emotion recognition models
across domains. Our results show that stress is indeed encoded in trained
emotion classifiers and that this encoding varies across levels of emotions and
across the lexical and acoustic modalities. Our results also show that emotion
recognition models that control for stress during training have better
generalizability when applied to new domains, compared to models that do not
control for stress during training. We conclude that it is necessary to
consider the effect of extraneous psychological factors when building and
testing emotion recognition models.
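The adversarial decorrelation described above is commonly realized with a gradient-reversal setup: an emotion head descends its loss while the shared encoder ascends the loss of a stress-predicting adversary. As a rough illustration only (this is not the authors' multimodal neural model; all data, dimensions, and hyperparameters below are hypothetical), a minimal NumPy sketch of the idea on synthetic features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D features: one dimension correlates with emotion,
# the other with stress (the nuisance factor to remove).
n = 400
emotion = rng.integers(0, 2, n)
stress = rng.integers(0, 2, n)
x = np.stack([emotion + 0.3 * rng.standard_normal(n),
              stress + 0.3 * rng.standard_normal(n)], axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shared linear encoder h = x @ W, plus two logistic heads.
W = rng.standard_normal((2, 2)) * 0.1   # encoder weights
we = rng.standard_normal(2) * 0.1       # emotion head
ws = rng.standard_normal(2) * 0.1       # stress (adversary) head
lr, lam = 0.05, 0.5                     # lam scales the reversed gradient

for _ in range(500):
    h = x @ W
    pe = sigmoid(h @ we)                # emotion prediction
    ps = sigmoid(h @ ws)                # adversary's stress prediction

    ge = (pe - emotion) / n             # d(cross-entropy)/d(logit), emotion
    gs = (ps - stress) / n              # same for the adversary

    # Each head follows its own gradient.
    we -= lr * (h.T @ ge)
    ws -= lr * (h.T @ gs)

    # Gradient reversal: the encoder DESCENDS the emotion loss but
    # ASCENDS the stress loss, pushing stress out of the representation.
    grad_W = x.T @ (ge[:, None] * we[None, :]) \
             - lam * x.T @ (gs[:, None] * ws[None, :])
    W -= lr * grad_W

acc_emotion = np.mean((sigmoid((x @ W) @ we) > 0.5) == emotion)
```

The sign flip in `grad_W` is the whole trick: the same backward pass that trains the adversary to find stress information trains the encoder to destroy it, while the emotion head keeps the representation useful for its task.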
Unusual Prosodic Descriptors in Young, Verbal Children with Autism Spectrum Disorders
This study aimed to determine which prosodic descriptors best characterized the speech of children with autism spectrum disorders (ASD) and whether these descriptors (e.g., sing-song and monotone) are acoustically different. Two listeners' auditory perceptions of the speech of the children with ASD and the pitch of the speech samples were analyzed. The results suggest that individual children are characterized by a variety of prosodic descriptors. Some thought groups were described as both sing-song and monotone; however, most children appear to be either more monotone or more sing-song. Furthermore, the subjective and acoustic data suggest a strong relationship between atypical intonation and sing-song perceptions, as well as between atypical rhythm and monotone perceptions. Implications for an earlier diagnosis of ASD and for the development of therapy tasks to target these deficits are discussed.
Inferring Speaker Affect in Spoken Natural Language Communication
The field of spoken language processing is concerned with creating computer programs that can understand human speech and produce human-like speech. Regarding the problem of understanding human speech, there is currently growing interest in moving beyond speech recognition (the task of transcribing the words in an audio stream) and towards machine listening—interpreting the full spectrum of information in an audio stream. One part of machine listening, the problem that this thesis focuses on, is the task of using information in the speech signal to infer a person’s emotional or mental state. In this dissertation, our approach is to assess the utility of prosody, or manner of speaking, in classifying speaker affect. Prosody refers to the acoustic features of natural speech: rhythm, stress, intonation, and energy. Affect refers to a person’s emotions and attitudes such as happiness, frustration, or uncertainty. We focus on one specific dimension of affect: level of certainty. Our goal is to automatically infer whether a person is confident or uncertain based on the prosody of his or her speech. Potential applications include conversational dialogue systems (e.g., in educational technology) and voice search (e.g., smartphone personal assistants). There are three main contributions of this thesis. The first contribution is a method for eliciting uncertain speech that binds a speaker’s uncertainty to a single phrase within the larger utterance, allowing us to compare the utility of contextually-based prosodic features. Second, we devise a technique for computing prosodic features from utterance segments that both improves uncertainty classification and can be used to determine which phrase a speaker is uncertain about. The level of certainty classifier achieves an accuracy of 75%. 
Third, we examine the differences between perceived, self-reported, and internal level of certainty, concluding that perceived certainty is aligned with internal certainty for some but not all speakers and that self-reports are a good proxy for internal certainty.
Prepositional Phrase Attachment Ambiguities in Declarative and Interrogative Contexts: Oral Reading Data
Certain English sentences containing multiple prepositional phrases (e.g., She had planned to cram the paperwork in the drawer into her briefcase) have been reported to be prone to mis-parsing of a kind that is standardly called a “garden path.” The mis-parse stems from the temporary ambiguity of the first prepositional phrase (PP1: in the drawer), which tends to be interpreted initially as the goal argument of the verb cram. If the sentence ended there, that would be correct. But that analysis is overridden when the second prepositional phrase (PP2: into her briefcase) is encountered, since the into phrase can only be interpreted as the goal argument of the verb. Thus, PP2 necessarily supplants PP1’s initially assigned position as goal, and PP1 must be reanalyzed as a modifier of the object NP (the paperwork).
Interrogative versions of the same sentence structure (Had she planned to cram the paperwork in the drawer into her briefcase?) may have a different profile. They have been informally judged to be easier to process than their declarative counterparts, because they are less susceptible to the initial garden path analysis. The study presented here represents an attempt to find a behavioral correlate of this intuitive difference in processing difficulty.
The experiment employs the Double Reading Paradigm (Fodor, Macaulay, Ronkos, Callahan, and Peckenpaugh, 2019). Participants were asked to read aloud a visually presented sentence twice, first without taking any time at all to preview the sentence content (Reading 1), and then again after unlimited preview (Reading 2). The experimental items were created in a 2 x 2 design with one factor being Speech Act (declarative vs. interrogative) and the other being PP2 Status: PP2 could only be an argument of the verb (Arg), as above, or else PP2 could be interpreted as a modifier (Mod) of the NP within the preceding PP, as in She had / Had she planned to cram the paperwork in the drawer of her filing cabinet(?).
Participants’ recordings of Reading 1 and Reading 2 were subjected to prosodic coding by a linguist who was naive to the research question. Distributions of prosodic boundaries were statistically analyzed to extract any significant differences in prosodic boundary patterns as a function of Speech Act, Reading, or PP2 Status. Logistic mixed effect regression models indicated, as anticipated, a significant effect of PP2 Status across all analyses of prosodic phrasing, and a significant effect of Reading for both analyses of prosodic phrasing that included boundary strength. Speech Act was a significant predictor in one of the prosodic phrasing analyses, but the hypothesized interaction (between Speech Act and PP2 Status) was not significant in any model.
Another analysis concerned the amount of time a participant spent silently studying a sentence after Reading 1 to be confident they had understood it before reading it aloud again (Reading 2). The time between readings is referred to as the inter-reading time (IRT). It was assumed that a longer IRT signifies greater processing difficulty of the sentence. Thus, IRT was hypothesized to provide a behavioral correlate of the intuitive judgement that the interrogative garden paths are easier to process than the declarative ones. If a correlate had been found, it would have taken the form of an interaction between the two factors (Speech Act and PP2 Status) such that the IRT difference between Arg and Mod sentence versions was smaller for interrogatives than for declaratives. Ultimately, however, no statistically significant interaction between Speech Act and PP2 Status was found.
Further studies seeking behavioral evidence of the informal intuition motivating this research are proposed. Also offered are possible explanations for why the intuition is apparently so strong for some English speakers, and why, if so, it is not reflected in IRT. Significant ancillary findings are that interrogatives are in general more difficult to process than corresponding declaratives, and that inter-reading time (IRT) in the Double Reading Paradigm is confirmed as a useful measure of sentence processing difficulty, given that within the declarative sentences, the garden-path (Arg) versions showed significantly longer IRTs than the non-garden-path (Mod) versions.
Speech monitoring and phonologically-mediated eye gaze in language perception and production: a comparison using printed word eye-tracking
The Perceptual Loop Theory of speech monitoring assumes that speakers routinely inspect their inner speech. In contrast, Huettig and Hartsuiker (2010) observed that listening to one's own speech during language production drives eye-movements to phonologically related printed words with a similar time-course as listening to someone else's speech does in speech perception experiments. This suggests that speakers use their speech perception system to listen to their own overt speech, but not to their inner speech. However, a direct comparison between production and perception with the same stimuli and participants has so far been lacking. The current printed word eye-tracking experiment therefore used a within-subjects design, combining production and perception. Displays showed four words, of which one, the target, either had to be named or was presented auditorily. Accompanying words were phonologically related, semantically related, or unrelated to the target. There were small increases in looks to phonological competitors with a similar time-course in both production and perception. Phonological effects in perception, however, lasted longer and had a much larger magnitude. We conjecture that this difference is related to a difference in predictability of one's own and someone else's speech, which in turn has consequences for lexical competition in other-perception and possibly suppression of activation in self-perception.
The development of children's ability to track and predict turn structure in conversation
Children begin developing turn-taking skills in infancy but take several years to fluidly integrate their growing knowledge of language into their turn-taking behavior. In two eye-tracking experiments, we measured children’s anticipatory gaze to upcoming responders while controlling linguistic cues to turn structure. In Experiment 1, we showed English and non-English conversations to English-speaking adults and children. In Experiment 2, we phonetically controlled lexicosyntactic and prosodic cues in English-only speech. Children spontaneously made anticipatory gaze switches by age two and continued improving through age six. In both experiments, children and adults made more anticipatory switches after hearing questions. Consistent with prior findings on adult turn prediction, prosodic information alone did not increase children’s anticipatory gaze shifts. But, unlike prior work with adults, lexical information alone was not sufficient either—children’s performance was best overall with lexicosyntax and prosody together. Our findings support an account in which turn tracking and turn prediction emerge in infancy and then gradually become integrated with children’s online linguistic processing.