
    The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism

    The INTERSPEECH 2013 Computational Paralinguistics Challenge provides, for the first time, a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and picks up on autism and its manifestations in speech. Finally, emotion is revisited as a task, albeit with a broader range of twelve emotional states overall. In this paper, we describe these four Sub-Challenges, the Challenge conditions, the baselines, and a new feature set from the openSMILE toolkit that is provided to the participants. Authors: Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, Hugues Salamin, Anna Polychroniou, Fabio Valente, Samuel Kim

    An Analysis of Rhythmic Staccato-Vocalization Based on Frequency Demodulation for Laughter Detection in Conversational Meetings

    Human laughter can convey various kinds of meaning in human communication. There are various kinds of laughter signal, for example vocalized and non-vocalized laughter. According to psychological theories, among all vocalized laughter types, rhythmic staccato vocalization significantly evokes positive responses in interactions. In this paper we attempt to exploit this observation to detect occurrences of human laughter in multiparty conversations from the AMI meeting corpus. First, we separate the high-energy frames from speech, leaving out the low-energy frames, through power spectral density estimation. We then borrow a rhythm detection algorithm from music analysis and apply it to the high-energy frames. Finally, we detect rhythmic laughter frames by analyzing the candidate rhythmic frames using statistics. This novel approach to the detection of 'positive' rhythmic human laughter performs better than the standard laughter classification baseline. Comment: 5 pages, 1 figure, conference paper
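The first step of the abstract above (keeping only high-energy frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names power spectral density estimation, whereas this sketch swaps in plain time-domain short-time energy as a simpler stand-in. The frame length, hop size, threshold rule, and all function names are assumptions made for illustration.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sequence of samples into overlapping frames (ragged tail dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def high_energy_frames(samples, frame_len=400, hop=160, ratio=1.0):
    """Return indices of frames whose energy exceeds ratio * mean frame energy.

    The mean-energy threshold is an assumed rule, not taken from the paper.
    """
    frames = frame_signal(samples, frame_len, hop)
    energies = [frame_energy(f) for f in frames]
    if not energies:
        return []
    threshold = ratio * sum(energies) / len(energies)
    return [i for i, e in enumerate(energies) if e > threshold]

# Toy signal at 16 kHz: 1 s of near-silence followed by 1 s of a loud 200 Hz tone.
sr = 16000
quiet = [0.001 * math.sin(2 * math.pi * 200 * n / sr) for n in range(sr)]
loud = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(sr)]
kept = high_energy_frames(quiet + loud)
```

On this toy signal, the retained frame indices all fall in the loud second half; a rhythm detector would then run only on those candidate frames.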

    The SSPNet-Mobile Corpus: from the detection of non-verbal cues to the inference of social behaviour during mobile phone conversations

    Mobile phones are one of the main channels of communication in contemporary society. However, the effect of the mobile phone on both the process of conversations mediated by this technology and the non-verbal behaviours used during them remains poorly understood. This thesis investigates the role of the phone in the negotiation process, as well as the automatic analysis of non-verbal behavioural cues during conversations over mobile telephones, following the Social Signal Processing approach. The work in this thesis includes the collection of a corpus of 60 mobile phone conversations involving 120 subjects; the development of methods for the detection of non-verbal behavioural events (laughter, fillers, speech and silence) and the inference of characteristics influencing social interactions (personality traits and conflict handling style) from speech and movements while using the mobile telephone; and the analysis of several factors that influence the outcome of decision-making processes while using mobile phones (gender, age, personality, conflict handling style and caller versus receiver role). The findings show that it is possible to recognise behavioural events at levels well above chance by employing statistical language models, and that personality traits and conflict handling styles can be partially recognised. Among the factors analysed, participant role (caller versus receiver) was the most important in determining the outcome of negotiation processes in the case of disagreement between parties. Finally, the corpus collected for the experiments (the SSPNet-Mobile Corpus) has been used in an international benchmarking campaign and constitutes a valuable resource for future research in Social Signal Processing and, more generally, in the area of human-human communication.

    The DAIS-C: a small, specialised, spoken, schizophrenia corpus

    This paper describes the design and development of the DAIS-C (Discussing Abstract Ideas in Schizophrenia Corpus), a small, specialised corpus of spoken language in which speakers with a diagnosis of schizophrenia and those with no self-reported psychiatric or neuroleptic history were interviewed on the same topics. The corpus was constructed to allow for comparative analyses of speech behaviour in relation to linguistic creativity and formal thought disorder (FTD), but additional steps were taken to ensure that the corpus could be of use to other researchers and research questions. The present paper covers design decisions relevant to the construction of clinical corpora, alongside information about the corpus that may be useful to researchers interested in working with it.

    Does emotion shape language? Studies on the influence of affective state on interactive language production

    Public summary of the doctoral thesis of Charlotte Out. Humans are emotional beings. By talking with a friend about our feelings, for example, we can use language to express these emotions. But even when we do not (explicitly) name our feelings, emotions influence the way we communicate. Earlier research has shown, for example, that sad people, compared with happy people, generally speak with a softer voice and focus more on their conversation partner. Not much is yet known about the influence of emotions on our language behaviour, although there appears to be a special and essential connection between emotions and spoken language. To investigate this further, we conducted four experiments for this thesis, in which we examined the influence of emotions on communication between conversation partners in a dialogue. These four experiments were inspired by earlier research: on the one hand we try to confirm the results of that research (replication), and on the other hand we build on it. Because most of our experiments took place in more natural situations, such as spontaneous conversations between two participants, our results generalise better to everyday life than most of the research on which our thesis is based. For our experiments we studied the language behaviour of students, and in one experiment we also examined a sample of people with autism. The findings of this thesis show that there is an important, but sometimes subtle, relationship between emotion and spoken language. For instance, we found that people who describe objects to each other adapt slightly more to each other's word choice when they experience disgust than when they feel amused. We also found that emotions influence how people communicate with each other about sensitive topics such as bullying: people in a negative mood often ask their conversation partner questions in a more indirect way than people in a positive mood. However, we found this effect only in a group of people with autism and not in people without autism. Based on the results of our experiments, we can conclude that feelings influence how people communicate (with each other), with both our social behaviour and our language use being influenced by our emotions.

    Paralinguistic event detection in children's speech

    Paralinguistic events are useful indicators of the affective state of a speaker. These cues, in children's speech, are used to form social bonds with their caregivers. They have also been found to be useful in the very early detection of developmental disorders such as autism spectrum disorder (ASD) in children's speech. Prior work on children's speech has focused on a limited number of subjects, who do not show sufficient diversity in the types of vocalizations produced. Also, the features necessary to understand the production of paralinguistic events are not fully understood. To account for the lack of an off-the-shelf solution to detect instances of laughter and crying in children's speech, the focus of the thesis is to investigate and develop signal processing algorithms to extract acoustic features and to use machine learning algorithms on various corpora. Results obtained using baseline spectral and prosodic features indicate that a combination of spectral, prosodic, and dysphonation-related features is needed to detect laughter and whining in toddlers' speech across different age groups and recording environments. Long-term features were found to be useful for capturing the periodic properties of laughter in adults' and children's speech, and detected instances of laughter with a high degree of accuracy. Finally, the thesis focuses on the use of multi-modal information, combining acoustic features and computer vision-based smile-related features, to detect instances of laughter and to reduce false positives in adults' and children's speech. The fusion of the features improved accuracy and recall compared with using either of the two modalities on its own. Ph.D.

    Recognizing emotions in spoken dialogue with acoustic and lexical cues

    Automatic emotion recognition has long been a focus of Affective Computing. It has become increasingly apparent that awareness of human emotions in Human-Computer Interaction (HCI) is crucial for advancing related technologies, such as dialogue systems. However, performance of current automatic emotion recognition is disappointing compared to human performance. Current research on emotion recognition in spoken dialogue focuses on identifying better feature representations and recognition models from a data-driven point of view. The goal of this thesis is to explore how incorporating prior knowledge of human emotion recognition in the automatic model can improve state-of-the-art performance of automatic emotion recognition in spoken dialogue. Specifically, we study this by proposing knowledge-inspired features representing occurrences of disfluency and non-verbal vocalisation in speech, and by building a multimodal recognition model that combines acoustic and lexical features in a knowledge-inspired hierarchical structure. In our study, emotions are represented with the Arousal, Expectancy, Power, and Valence emotion dimensions. We build unimodal and multimodal emotion recognition models to study the proposed features and modelling approach, and perform emotion recognition on both spontaneous and acted dialogue. Psycholinguistic studies have suggested that DISfluency and Non-verbal Vocalisation (DIS-NV) in dialogue is related to emotions. However, these affective cues in spoken dialogue are overlooked by current automatic emotion recognition research. Thus, we propose features for recognizing emotions in spoken dialogue which describe five types of DIS-NV in utterances, namely filled pause, filler, stutter, laughter, and audible breath. Our experiments show that this small set of features is predictive of emotions. 
Our DIS-NV features achieve better performance than benchmark acoustic and lexical features for recognizing all emotion dimensions in spontaneous dialogue. Consistent with Psycholinguistic studies, the DIS-NV features are especially predictive of the Expectancy dimension of emotion, which relates to speaker uncertainty. Our study illustrates the relationship between DIS-NVs and emotions in dialogue, which contributes to Psycholinguistic understanding of them as well. Note that our DIS-NV features are based on manual annotations, yet our long-term goal is to apply our emotion recognition model to HCI systems. Thus, we conduct preliminary experiments on automatic detection of DIS-NVs, and on using automatically detected DIS-NV features for emotion recognition. Our results show that DIS-NVs can be automatically detected from speech with stable accuracy, and auto-detected DIS-NV features remain predictive of emotions in spontaneous dialogue. This suggests that our emotion recognition model can be applied to a fully automatic system in the future, and holds the potential to improve the quality of emotional interaction in current HCI systems. To study the robustness of the DIS-NV features, we conduct cross-corpora experiments on both spontaneous and acted dialogue. We identify how dialogue type influences the performance of DIS-NV features and emotion recognition models. DIS-NVs contain additional information beyond acoustic characteristics or lexical contents. Thus, we study the gain of modality fusion for emotion recognition with the DIS-NV features. Previous work combines different feature sets by fusing modalities at the same level using two types of fusion strategies: Feature-Level (FL) fusion, which concatenates feature sets before recognition; and Decision-Level (DL) fusion, which makes the final decision based on outputs of all unimodal models. However, features from different modalities may describe data at different time scales or levels of abstraction. 
Moreover, Cognitive Science research indicates that when perceiving emotions, humans make use of information from different modalities at different cognitive levels and time steps. Therefore, we propose a HierarchicaL (HL) fusion strategy for multimodal emotion recognition, which incorporates features that describe data at a longer time interval or which are more abstract at higher levels of its knowledge-inspired hierarchy. Compared to FL and DL fusion, HL fusion incorporates both inter- and intra-modality differences. Our experiments show that HL fusion consistently outperforms FL and DL fusion on multimodal emotion recognition in both spontaneous and acted dialogue. The HL model combining our DIS-NV features with benchmark acoustic and lexical features improves on current performance of multimodal emotion recognition in spoken dialogue. To study how other emotion-related tasks of spoken dialogue can benefit from the proposed approaches, we apply the DIS-NV features and the HL fusion strategy to recognize movie-induced emotions. Our experiments show that, although designed for recognizing emotions in spoken dialogue, DIS-NV features and HL fusion remain effective for recognizing movie-induced emotions. This suggests that other emotion-related tasks can also benefit from the proposed features and model structure.
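The two baseline fusion strategies named in the abstract above, Feature-Level (FL) and Decision-Level (DL) fusion, can be contrasted in a minimal sketch. It is an illustration only and not the thesis's code: the function names, the toy feature values, and the choice of a weighted average as the DL combination rule are all assumptions. The abstract says only that DL fusion decides "based on outputs of all unimodal models".

```python
def feature_level_fusion(acoustic, lexical):
    """FL fusion: concatenate per-utterance feature vectors before recognition."""
    if len(acoustic) != len(lexical):
        raise ValueError("both modalities must cover the same utterances")
    return [list(a) + list(l) for a, l in zip(acoustic, lexical)]

def decision_level_fusion(unimodal_predictions, weights=None):
    """DL fusion: combine the outputs of the unimodal models.

    A weighted average is an assumed combination rule; other rules are possible.
    """
    if weights is None:
        weights = [1.0] * len(unimodal_predictions)
    total = sum(weights)
    n_utterances = len(unimodal_predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, unimodal_predictions)) / total
        for i in range(n_utterances)
    ]

# Hypothetical toy data: two utterances, two acoustic features, one lexical feature.
acoustic_feats = [[0.2, 0.7], [0.9, 0.1]]
lexical_feats = [[1.0], [0.0]]
fused_feats = feature_level_fusion(acoustic_feats, lexical_feats)

# Hypothetical per-utterance predictions for one emotion dimension (e.g. Arousal).
acoustic_pred = [0.4, -0.2]
lexical_pred = [0.8, 0.2]
fused_pred = decision_level_fusion([acoustic_pred, lexical_pred])
```

With FL fusion a single model sees the concatenated vector, whereas with DL fusion each modality keeps its own model and only the outputs are merged. The HL strategy described above differs from both by placing features at different levels of a hierarchy, which this sketch does not attempt to reproduce.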