
    Paralinguistic event detection in children's speech

    Paralinguistic events are useful indicators of the affective state of a speaker. In children's speech, these cues help form social bonds with caregivers. They have also been found useful for very early detection of developmental disorders such as autism spectrum disorder (ASD). Prior work on children's speech has relied on small numbers of subjects, without sufficient diversity in the types of vocalizations produced. In addition, the features needed to characterize the production of paralinguistic events are not fully understood. Because no off-the-shelf solution exists for detecting instances of laughter and crying in children's speech, this thesis investigates and develops signal processing algorithms to extract acoustic features and applies machine learning algorithms to various corpora. Results obtained with baseline spectral and prosodic features indicate that a combination of spectral, prosodic, and dysphonation-related features is needed to detect laughter and whining in toddlers' speech across different age groups and recording environments. Long-term features were found to capture the periodic properties of laughter in adults' and children's speech and detected instances of laughter with high accuracy. Finally, the thesis turns to multi-modal information, combining acoustic features with computer vision-based smile-related features to detect instances of laughter and to reduce false positives in adults' and children's speech. Fusing the two feature sets improved accuracy and recall rates over using either modality on its own. Ph.D. thesis.
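
    A minimal sketch of the kind of pipeline the abstract describes: spectral (MFCC) and prosodic (pitch, energy) features summarized per clip and fed to a standard classifier. It assumes librosa and scikit-learn; the synthetic "clips", label set, and parameter values are placeholders for illustration, not the thesis's actual features or corpora.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def clip_features(y, sr):
    """Summarize one clip with spectral (MFCC) and prosodic (f0, energy) statistics."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # spectral shape
    f0 = librosa.yin(y, fmin=80, fmax=600, sr=sr)            # pitch contour
    rms = librosa.feature.rms(y=y)                           # frame energy
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [f0.mean(), f0.std(), rms.mean(), rms.std()]])

# Placeholder "clips": synthetic tones standing in for labelled child vocalizations.
sr = 16000
rng = np.random.default_rng(0)
def fake_clip(hz):
    t = np.arange(0, 1.0, 1 / sr)
    return np.sin(2 * np.pi * hz * t) + 0.05 * rng.normal(size=t.size)

clips = [fake_clip(300), fake_clip(320), fake_clip(150), fake_clip(140)]
labels = ["laugh", "laugh", "other", "other"]                 # hypothetical annotations

X = np.vstack([clip_features(y, sr) for y in clips])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, labels)
print(clf.predict([clip_features(fake_clip(310), sr)]))
```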

    c

    In this article, we describe and interpret a set of acoustic and linguistic features that characterise emotional/emotion-related user states – confined to the one database processed: four classes in a German corpus of children interacting with a pet robot. To this end, we collected a very large feature vector consisting of more than 4000 features extracted at different sites. We performed extensive feature selection (Sequential Forward Floating Search) for seven acoustic and four linguistic types of features, ending up with a small number of 'most important' features which we try to interpret by discussing the impact of different feature and extraction types. We establish different measures of impact and discuss the mutual influence of acoustics and linguistics.
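
    As a concrete illustration of the selection step named above, the sketch below runs Sequential Forward Floating Search on synthetic data using mlxtend's implementation; the estimator, dimensionality, and target number of features are assumptions for illustration, not the article's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))              # stand-in for the acoustic/linguistic feature vector
y = (X[:, 3] + X[:, 17] > 0).astype(int)    # toy two-class "user state" labels

sffs = SFS(LogisticRegression(max_iter=1000),
           k_features=5,            # keep a small number of 'most important' features
           forward=True,
           floating=True,           # floating step: conditionally drop earlier picks
           scoring="accuracy",
           cv=5)
sffs = sffs.fit(X, y)
print("selected feature indices:", sffs.k_feature_idx_)
```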

    A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology for monitoring accommodation during a human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modeling of the behaviour, in a way which is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS turn-taking behaviour. Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
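
    A rough sketch of the TAMA idea as described here: each speaker's feature track (pitch in this example) is averaged within overlapping frames laid out on a common timeline, and the resulting per-frame series are compared across speakers. The frame length, step size, use of an unweighted mean, and the toy measurements are all assumptions for illustration, not values from the thesis.

```python
import numpy as np

def tama_series(times, values, total_dur, frame_len=20.0, step=10.0):
    """Average a speaker's feature (e.g., f0) within overlapping, time-aligned frames."""
    starts = np.arange(0.0, max(total_dur - frame_len, 0.0) + 1e-9, step)
    series = []
    for s in starts:
        mask = (times >= s) & (times < s + frame_len)
        series.append(np.mean(values[mask]) if mask.any() else np.nan)
    return np.array(series)

# Hypothetical per-speaker pitch measurements: (timestamp in seconds, f0 in Hz).
tA, f0A = np.array([1.0, 5.0, 12.0, 31.0]), np.array([210.0, 220.0, 190.0, 205.0])
tB, f0B = np.array([3.0, 9.0, 25.0, 33.0]), np.array([120.0, 115.0, 130.0, 126.0])

sA = tama_series(tA, f0A, total_dur=40.0)
sB = tama_series(tB, f0B, total_dur=40.0)
ok = ~np.isnan(sA) & ~np.isnan(sB)
print("frame-wise correlation:", np.corrcoef(sA[ok], sB[ok])[0, 1])
```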

    Multi‐speaker experimental designs: Methodological considerations

    Research on language use has become increasingly interested in the multimodal and interactional aspects of language – theoretical models of dialogue, such as the Communication Accommodation Theory and the Interactive Alignment Model, are examples of this. In addition, researchers have started to give more consideration to the relationship between physiological processes and language use. This article aims to contribute to the advancement of studies of physiological and/or multimodal language use in naturalistic settings. It does so by providing methodological recommendations for such multi-speaker experimental designs. It covers the topics of (a) speaker preparation and logistics, (b) experimental tasks and (c) data synchronisation and post-processing. The types of data that will be considered in further detail include audio and video, electroencephalography, respiratory data and electromagnetic articulography. This overview and its recommendations are based on the answers to a questionnaire sent to members of the Horizon 2020 research network 'Conversational Brains' and to several researchers in the field, as well as on interviews with three additional experts. H2020 Marie Skłodowska‐Curie Actions (http://dx.doi.org/10.13039/100010665). Peer reviewed.
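
    One illustrative post-processing step of the kind covered under (c), not taken from the article itself: estimating the offset between two devices' recordings of the same session by cross-correlating a shared sync event (e.g., a clap) captured by both, assuming equal sampling rates. All signal values below are simulated.

```python
import numpy as np

def estimate_offset(sig_a, sig_b, sr):
    """Lag (seconds) by which the shared event occurs later in sig_a than in sig_b."""
    corr = np.correlate(sig_a - sig_a.mean(), sig_b - sig_b.mean(), mode="full")
    lag_samples = np.argmax(corr) - (len(sig_b) - 1)
    return lag_samples / sr

rng = np.random.default_rng(3)
sr = 1000
t = np.arange(0, 2.0, 1 / sr)
pulse = np.exp(-((t - 0.5) ** 2) / 1e-4)                                  # sync pulse at 0.5 s on device A
sig_a = pulse + 0.01 * rng.normal(size=t.size)
sig_b = np.roll(pulse, int(0.12 * sr)) + 0.01 * rng.normal(size=t.size)   # same pulse, 120 ms later

print(f"device B trails device A by ~{estimate_offset(sig_b, sig_a, sr):.3f} s")
```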

    Automatic vocal recognition of a child's perceived emotional state within the Speechome corpus

    Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 137-149). With over 230,000 hours of audio/video recordings of a child growing up in the home setting from birth to the age of three, the Human Speechome Project has pioneered a comprehensive, ecologically valid observational dataset that introduces far-reaching new possibilities for the study of child development. By offering in vivo observation of a child's daily life experience at ultra-dense, longitudinal time scales, the Speechome corpus holds great potential for discovering developmental insights that have thus far eluded observation. The work of this thesis aspires to enable the use of the Speechome corpus for empirical study of emotional factors in early child development. To fully harness the benefits of Speechome for this purpose, an automated mechanism must be created to perceive the child's emotional state within this medium. Due to the latent nature of emotion, we sought objective, directly measurable correlates of the child's perceived emotional state within the Speechome corpus, focusing exclusively on acoustic features of the child's vocalizations and surrounding caretaker speech. Using Partial Least Squares regression, we applied these features to build a model that simulates human perceptual heuristics for determining a child's emotional state. We evaluated the perceptual accuracy of models built across child-only, adult-only, and combined feature sets within the overall sampled dataset, as well as controlling for social situations, vocalization behaviors (e.g. crying, laughing, babble), individual caretakers, and developmental age between 9 and 24 months. Child and combined models consistently demonstrated high perceptual accuracy, with overall adjusted R-squared values of 0.54 and 0.58, respectively, and an average of 0.59 and 0.67 per month. Comparative analysis across longitudinal and socio-behavioral contexts yielded several notable developmental and dyadic insights. In the process, we have developed a data mining and analysis methodology for modeling perceived child emotion and quantifying caretaker intersubjectivity that we hope to extend to future datasets across multiple children, as new deployments of the Speechome recording technology are established. Such large-scale comparative studies promise an unprecedented view into the nature of emotional processes in early childhood and potentially enlightening discoveries about autism and other developmental disorders. by Sophia Yuditskaya. S.M.
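
    A toy sketch of the modeling step described above: Partial Least Squares regression from acoustic features to a perceived-emotion rating, scored with adjusted R-squared. It assumes scikit-learn and uses synthetic data; the feature dimensionality and number of PLS components are arbitrary placeholders, not the thesis's settings.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
n, p = 300, 25                                        # utterances x acoustic features (placeholder sizes)
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n)   # stand-in for perceived-emotion ratings

pls = PLSRegression(n_components=5).fit(X, y)
y_hat = pls.predict(X).ravel()

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)         # adjusted R-squared, as reported in the abstract
print(f"R^2 = {r2:.2f}, adjusted R^2 = {adj_r2:.2f}")
```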

    Neural and perceptual processing of emotional speech prosody in typically developed children and in children with autism spectrum disorder

    Human voice conveys information about a speaker's emotions and intention via speech prosody: changes in the speaker's intonation, stress, rhythm, and tone of voice. The ability to recognize and interpret speech prosody is essential for successful social communication and has been associated with social competence throughout childhood. Autism spectrum disorders (ASD) are characterized by impaired social communication and social interaction skills as well as repetitive patterns of behavior. Semantic-pragmatic language deficits, including deficient speech prosody production and comprehension, are often found in individuals with ASD. The present thesis investigated processing of words and speech prosody in school-aged typically developed children and two groups of children with ASD: those with accompanying language impairments (children with ASD [LI]) and those with no accompanying language impairments (children with ASD [no LI]). To this end, auditory event-related potentials (ERPs) were recorded for a Finnish word uttered with different emotional connotations (neutral, scornful, sad, and commanding). Two of the thesis studies included a behavioral prosody discrimination task, and in one study, facial electromyographic (facial EMG) reactions were recorded for the above-mentioned speech stimuli. In typically developed children, changes in speech prosody elicited mismatch negativity (MMN) / late discriminative negativity (LDN) responses, demonstrating that the auditory system of school-aged children automatically detects prosodic changes in speech. Further, these prosodic changes in speech activated an involuntary attention mechanism in typically developed children, as reflected by a P3a component. However, no reliable facial EMG reactions were found for these non-attended prosodic changes in speech. Both groups of children with ASD had diminished ERP responses to words, suggesting that initial stages of sound encoding were deficient in children with ASD, but these processes were more impaired in children with ASD (LI) than in children with ASD (no LI). In both groups of children with ASD, MMN/LDN responses to the scornful stimulus were diminished, suggesting abnormal auditory discrimination mechanisms in children with ASD. In addition, P3a responses were diminished and atypically distributed in children with ASD, suggesting that these children have difficulties in orienting to speech sound changes. Finally, children with ASD (no LI) were slower in behaviorally discriminating prosodic changes in speech compared to the typically developed children. Taken together, these and the ERP results show that processing of natural speech prosody is impaired in children with ASD at various information processing levels, including aberrant discrimination of and orienting to, as well as sluggish responding to, prosodic changes. These speech processing deficits might contribute to the observed difficulties in comprehending another person's emotional state based on his/her tone of voice in individuals with ASD.
    Information about a speaker's emotional state is conveyed through speech prosody, that is, through variations in intonation, stress, rhythm, and tone of voice. Perceiving and understanding prosody are essential for social interaction, and these skills have been found to be associated with social competence throughout childhood. Autism spectrum disorders are developmental neurobiological disorders whose core features are difficulties in social interaction and repetitive, stereotyped patterns of behavior. The autism spectrum is associated with semantic and pragmatic language difficulties, including difficulties in perceiving and producing speech prosody. This dissertation investigated the perception of prosodic features in typically developed school-aged children and in children on the autism spectrum. The study included both a group of children on the autism spectrum who had language development difficulties and a group of children on the autism spectrum without language delay or difficulties. Prosody perception was examined by recording the brain's auditory event-related potentials to natural word stimuli uttered with different emotional connotations (neutral, sad, scornful, and commanding). Mismatch negativity (MMN) / late discriminative negativity (LDN) responses were used as a measure of pre-attentive sound discrimination, and involuntary orienting of attention to sound changes was measured with the P3a response. Two of the sub-studies included a sound discrimination task, and in typically developed children, facial muscle reactions (facial EMG recordings) to the above-mentioned word stimuli were also examined. According to the results, typically developed children discriminate changes in the speaker's tone of voice at the pre-attentive level of auditory processing, and their involuntary attention orients to these sound changes. However, no clear facial muscle reactions to these word stimuli were found. In children on the autism spectrum, abnormalities were observed in the basic processing of sound features; these difficulties were more pronounced in those children who had language development difficulties in addition to autism spectrum disorder. The results also indicate that, compared with typically developed children, discrimination of speech prosody is weakened in children on the autism spectrum. The diminished MMN/LDN responses to prosodic changes in speech show that, at the pre-attentive level of auditory processing, children on the autism spectrum detect changes in tone of voice less well than typically developed children. The diminished and atypically distributed P3a responses further show that involuntary attention in children on the autism spectrum did not orient to prosodic changes as well as in typically developed children. The results point to difficulty and slowness in attending to vocal changes that reflect the speaker's emotional state in autism spectrum disorder. The speech-processing difficulties revealed in this study may underlie the difficulties, observed in autism spectrum disorder, in inferring a speaker's emotional state from his or her tone of voice.
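
    A schematic sketch, not the study's analysis pipeline, of how an MMN-type response is typically quantified: average the epochs recorded for standard and deviant (e.g., scornful) words, subtract to obtain a difference wave, and take its mean amplitude in a latency window. The sampling rate, latency window, and simulated epochs below are assumptions for illustration.

```python
import numpy as np

sr = 500                                  # Hz, assumed EEG sampling rate
t = np.arange(-0.1, 0.6, 1 / sr)          # epoch time axis in seconds
rng = np.random.default_rng(2)

# Simulated single-trial epochs at one electrode: trials x samples.
standard = rng.normal(0, 2, size=(200, t.size))
deviant = rng.normal(0, 2, size=(80, t.size))
deviant += -3 * np.exp(-((t - 0.20) ** 2) / 0.002)   # add a negativity around 200 ms

erp_std, erp_dev = standard.mean(axis=0), deviant.mean(axis=0)
difference_wave = erp_dev - erp_std                  # MMN/LDN estimate

window = (t >= 0.15) & (t <= 0.25)                   # example MMN latency window
print(f"mean difference-wave amplitude 150-250 ms: {difference_wave[window].mean():.2f} uV")
```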

    A description of the rhythm of Barunga Kriol using rhythm metrics and an analysis of vowel reduction

    Kriol is an English-lexifier creole language spoken by over 20,000 children and adults in the northern parts of Australia, yet much about the prosody of this language remains unknown. This thesis provides a preliminary description of the rhythm and patterns of vowel reduction of Barunga Kriol – a variety of Kriol local to Barunga Community, NT – and compares it to a relatively standard variety of Australian English. The thesis is divided into two studies. Study 1, the Rhythm Metric Study, describes the rhythm of Barunga Kriol and Australian English using rhythm metrics. Study 2, the Vowel Reduction Study, compares patterns of vowel reduction in Barunga Kriol and Australian English. This thesis contributes the first in-depth studies of vowel reduction patterns and of rhythm using rhythm metrics in any variety of Kriol or Australian English. The research also sets an adult baseline for rhythm metric results and patterns of vowel reduction for Barunga Kriol and Australian English, useful for future studies of child speech in these varieties. As rhythm is a major contributor to intelligibility, the findings of this thesis have the potential to inform teaching practice in English as a Second Language.
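
    For concreteness, the sketch below computes two rhythm metrics of the kind the thesis relies on: %V (the proportion of utterance duration that is vocalic) and the normalised Pairwise Variability Index (nPVI) over vocalic intervals. Which specific metrics the thesis reports is not stated in the abstract, and the interval durations below are made up.

```python
import numpy as np

def percent_v(vowel_intervals, consonant_intervals):
    """%V: share of utterance duration taken up by vocalic intervals."""
    v, c = np.sum(vowel_intervals), np.sum(consonant_intervals)
    return 100.0 * v / (v + c)

def npvi(durations):
    """nPVI: mean normalised difference between successive interval durations."""
    d = np.asarray(durations, dtype=float)
    pair_diffs = np.abs(d[1:] - d[:-1]) / ((d[1:] + d[:-1]) / 2.0)
    return 100.0 * pair_diffs.mean()

# Hypothetical hand-segmented interval durations (seconds) for one utterance.
vowels = [0.09, 0.14, 0.07, 0.18, 0.11]
consonants = [0.06, 0.08, 0.05, 0.07]

print(f"%V   = {percent_v(vowels, consonants):.1f}")
print(f"nPVI = {npvi(vowels):.1f}")
```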