
    Sensing, interpreting, and anticipating human social behaviour in the real world

    Low-level nonverbal social signals such as glances, utterances, facial expressions, and body language are central to human communication and have been shown to be connected to important high-level constructs such as emotions, turn-taking, rapport, or leadership. A prerequisite for creating social machines that can support humans in, for example, education, psychotherapy, or human resources is the ability to automatically sense, interpret, and anticipate human nonverbal behaviour. While promising results have been achieved in controlled settings, automatically analysing unconstrained situations, e.g. in daily-life settings, remains challenging. Furthermore, the anticipation of nonverbal behaviour in social situations is still largely unexplored. The goal of this thesis is to move closer to the vision of social machines in the real world. It makes fundamental contributions along the three dimensions of sensing, interpreting, and anticipating nonverbal behaviour in social interactions. First, robust recognition of low-level nonverbal behaviour lays the groundwork for all further analysis steps. Advancing human visual behaviour sensing is especially relevant because the current state of the art is still not satisfactory in many daily-life situations. While many social interactions take place in groups, current methods for unsupervised eye contact detection can only handle dyadic interactions. We propose a novel unsupervised method for multi-person eye contact detection that exploits the connection between gaze and speaking turns. Furthermore, we make use of mobile device engagement to address the calibration drift that occurs during daily-life use of mobile eye trackers. Second, we improve the interpretation of social signals in terms of higher-level social behaviours. In particular, we propose the first dataset and method for emotion recognition from the bodily expressions of freely moving, unaugmented dyads. Furthermore, we are the first to study low rapport detection in group interactions and to investigate a cross-dataset evaluation setting for the emergent leadership detection task. Third, human visual behaviour is special because it functions as a social signal and also determines what a person is seeing at a given moment in time. Being able to anticipate human gaze opens up the possibility for machines to share attention with humans more seamlessly, or to intervene in a timely manner if humans are about to overlook important aspects of the environment. We are the first to propose methods for the anticipation of eye contact in dyadic conversations, as well as in the context of mobile device interactions during daily life, thereby paving the way for interfaces that can proactively intervene and support interacting humans.
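
To illustrate the general idea of using speaking turns as a weak supervisory signal for eye contact detection (only the intuition behind the unsupervised approach above, not its actual algorithm), the following Python sketch clusters a person's gaze features and labels as eye contact the cluster that co-occurs most with the partner's speaking activity; the feature representation, the clustering step, and all parameters are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def weakly_supervised_eye_contact(gaze_feats, partner_speaking, n_clusters=4):
    """Label eye contact without annotations by linking gaze clusters to speaking turns.

    gaze_feats: (frames, d) gaze-direction features of the observer
    partner_speaking: (frames,) 1 if the interaction partner is speaking, else 0
    """
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(gaze_feats)
    # Pick the gaze cluster that co-occurs most strongly with the partner speaking,
    # following the intuition that listeners tend to look at the current speaker.
    rates = [partner_speaking[clusters == c].mean() for c in range(n_clusters)]
    contact_cluster = int(np.argmax(rates))
    return (clusters == contact_cluster).astype(int)  # per-frame eye-contact labels

# Toy usage with random data standing in for real gaze features and voice activity.
rng = np.random.default_rng(0)
labels = weakly_supervised_eye_contact(rng.normal(size=(500, 2)), rng.integers(0, 2, 500))
```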

    Multimodal Data Analysis of Dyadic Interactions for an Automated Feedback System Supporting Parent Implementation of Pivotal Response Treatment

    Parents fulfill a pivotal role in the early childhood development of social and communication skills. In children with autism, the development of these skills can be delayed. Applied behavior analysis (ABA) techniques have been created to aid in skill acquisition. Among these, pivotal response treatment (PRT) has been empirically shown to foster improvements. Research into PRT implementation has also shown that parents can be trained to be effective interventionists for their children. The current difficulty in PRT training is how to disseminate training to parents who need it, and how to support and motivate practitioners after training. Evaluation of the parents' fidelity of implementation is often undertaken using video probes that depict the dyadic interaction occurring between the parent and the child during PRT sessions. These videos are time-consuming for clinicians to process and often result in only minimal feedback for the parents. Current trends in technology could be utilized to alleviate the manual cost of extracting data from the videos, affording greater opportunities for clinician-created feedback as well as automated assessments. The naturalistic context of the video probes, along with the dependence on ubiquitous recording devices, creates a difficult scenario for classification tasks. The domain of the PRT video probes can be expected to exhibit high levels of both aleatoric and epistemic uncertainty. Addressing these challenges requires examination of the multimodal data along with implementation and evaluation of classification algorithms. This is explored through the use of a new dataset of PRT videos. The relationship between the parent and the clinician is important: the clinician can provide support and help build self-efficacy in addition to providing knowledge and modeling of treatment procedures. Facilitating this relationship along with automated feedback not only provides the opportunity to present expert feedback to the parent, but also allows the clinician to aid in personalizing the classification models. By utilizing a human-in-the-loop framework, clinicians can help address the uncertainty in the classification models by providing additional labeled samples. This allows the system to improve classification and provides a person-centered approach to extracting multimodal data from PRT video probes.
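
The human-in-the-loop idea can be pictured as a simple uncertainty-sampling loop in which the clinician labels the clips the model is least certain about. The Python sketch below is a generic illustration under that assumption, not the system described in the dissertation; the logistic-regression classifier and the oracle callable standing in for the clinician are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predictive_entropy(probs):
    """Entropy of the predicted class distribution; a simple proxy for uncertainty."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def human_in_the_loop(X_lab, y_lab, X_pool, oracle, rounds=5, budget=10):
    """Repeatedly ask a human expert (the oracle) to label the most uncertain clips."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        uncertain = np.argsort(-predictive_entropy(model.predict_proba(X_pool)))[:budget]
        X_lab = np.vstack([X_lab, X_pool[uncertain]])
        y_lab = np.concatenate([y_lab, oracle(X_pool[uncertain])])  # expert-provided labels
        X_pool = np.delete(X_pool, uncertain, axis=0)
    return model.fit(X_lab, y_lab)

# Toy usage: a random 'oracle' stands in for the clinician.
rng = np.random.default_rng(0)
X0, y0 = rng.normal(size=(20, 8)), rng.integers(0, 2, 20)
model = human_in_the_loop(X0, y0, rng.normal(size=(200, 8)), lambda X: rng.integers(0, 2, len(X)))
```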

    Deep Emotion Recognition in Textual Conversations: A Survey

    While Emotion Recognition in Conversations (ERC) has seen tremendous advances in the last few years, new applications and implementation scenarios present novel challenges and opportunities. These range from leveraging the conversational context and modelling speaker and emotion dynamics, to interpreting common-sense expressions, informal language, and sarcasm, addressing the challenges of real-time ERC, recognizing emotion causes, handling differing taxonomies across datasets, multilingual ERC, and interpretability. This survey starts by introducing ERC and elaborating on the challenges and opportunities pertaining to this task. It proceeds with a description of emotion taxonomies and a variety of ERC benchmark datasets employing such taxonomies. This is followed by descriptions of the most prominent works in ERC with explanations of the deep learning architectures employed. It then provides advisable ERC practices towards better frameworks, elaborating on methods for dealing with subjectivity in annotations and modelling, and with the typically unbalanced ERC datasets. Finally, it presents systematic review tables comparing several works with respect to the methods used and their performance. The survey highlights the advantage of leveraging techniques to address unbalanced data, the exploration of mixed emotions, and the benefits of incorporating annotation subjectivity in the learning phase.
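
One common way to counter the unbalanced class distributions mentioned above is inverse-frequency class weighting in the training loss. The PyTorch sketch below is a generic illustration rather than a technique attributed to any specific surveyed work, and the emotion label counts are made up.

```python
import torch
import torch.nn as nn

# Hypothetical emotion label counts from an imbalanced ERC training set
# (e.g. neutral, fear, disgust, anger, surprise, joy).
label_counts = torch.tensor([5200., 300., 450., 800., 150., 600.])

# Inverse-frequency weights, normalised so the average weight is 1.
weights = label_counts.sum() / (len(label_counts) * label_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

# logits: (batch, num_classes) utterance-level predictions; targets: (batch,) gold labels.
logits = torch.randn(4, 6)
targets = torch.tensor([0, 1, 4, 2])
loss = criterion(logits, targets)  # rare classes contribute more to the loss
```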

    Towards Video Transformers for Automatic Human Analysis

    With the aim of creating artificial systems capable of mirroring the nuanced understanding and interpretative powers inherent to human cognition, this thesis embarks on an exploration of the intersection between human analysis and Video Transformers. The objective is to harness the potential of Transformers, a promising architectural paradigm, to comprehend the intricacies of human interaction, thus paving the way for the development of empathetic and context-aware intelligent systems. To do so, we explore the whole Computer Vision pipeline, from data gathering to deeply analyzing recent developments, through model design and experimentation. Central to this study is the creation of UDIVA, an expansive multi-modal, multi-view dataset capturing dyadic face-to-face human interactions. Comprising 147 participants across 188 sessions, UDIVA integrates audio-visual recordings, heart-rate measurements, personality assessments, sociodemographic metadata, and conversational transcripts, establishing itself as the largest dataset for dyadic human interaction analysis to date. This dataset provides a rich context for probing the capabilities of Transformers within complex environments. In order to validate its utility, as well as to elucidate Transformers' ability to assimilate diverse contextual cues, we focus on addressing the challenge of personality regression within interaction scenarios. We first adapt an existing Video Transformer to handle multiple contextual sources and conduct rigorous experimentation. We empirically observe a progressive enhancement in model performance as more context is added, reinforcing the potential of Transformers to decode intricate human dynamics. Building upon these findings, the Dyadformer emerges as a novel architecture adept at long-range modeling of dyadic interactions. By jointly modeling both participants in the interaction, as well as embedding multi-modal integration into the model itself, the Dyadformer surpasses the baseline and other concurrent approaches, underscoring Transformers' aptitude for deciphering multifaceted, noisy, and challenging tasks such as the analysis of human personality in interaction. Nonetheless, these experiments unveil challenges that are ubiquitous when training Transformers, particularly in managing overfitting due to their demand for extensive datasets. Consequently, we conclude this thesis with a comprehensive investigation into Video Transformers, analyzing topics ranging from architectural designs and training strategies to input embedding and tokenization, traversing multi-modality and specific applications. Across these, we highlight trends that optimally harness spatio-temporal representations to handle video redundancy and high dimensionality. A culminating performance comparison is conducted in the realm of video action classification, spotlighting strategies that exhibit superior efficacy, even compared to traditional CNN-based methods.
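
As a rough, hypothetical illustration of joint dyadic modelling with cross-participant attention (not the actual Dyadformer implementation), the PyTorch sketch below lets each participant's token sequence attend to the partner's before pooling into a personality-regression head; all layer sizes and names are invented.

```python
import torch
import torch.nn as nn

class DyadicCrossAttention(nn.Module):
    """Toy dyadic model: each participant's features attend to the partner's."""

    def __init__(self, dim=256, heads=4, num_traits=5):
        super().__init__()
        self.cross_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, num_traits)  # e.g. Big Five regression scores

    def forward(self, tokens_a, tokens_b):
        # tokens_*: (batch, time, dim) spatio-temporal video tokens per participant.
        a_ctx, _ = self.cross_a(tokens_a, tokens_b, tokens_b)  # A attends to B
        b_ctx, _ = self.cross_b(tokens_b, tokens_a, tokens_a)  # B attends to A
        pooled = torch.cat([a_ctx.mean(dim=1), b_ctx.mean(dim=1)], dim=-1)
        return self.head(pooled)

model = DyadicCrossAttention()
scores = model(torch.randn(2, 64, 256), torch.randn(2, 64, 256))  # shape (2, 5)
```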

    Music in the brain

    Music is ubiquitous across human cultures — as a source of affective and pleasurable experience, moving us both physically and emotionally — and learning to play music shapes both brain structure and brain function. Music processing in the brain — namely, the perception of melody, harmony and rhythm — has traditionally been studied as an auditory phenomenon using passive listening paradigms. However, when listening to music, we actively generate predictions about what is likely to happen next. This enactive aspect has led to a more comprehensive understanding of music processing involving brain structures implicated in action, emotion and learning. Here we review the cognitive neuroscience literature of music perception. We show that music perception, action, emotion and learning all rest on the human brain’s fundamental capacity for prediction — as formulated by the predictive coding of music model. This Review elucidates how this formulation of music perception and expertise in individuals can be extended to account for the dynamics and underlying brain mechanisms of collective music making. This in turn has important implications for human creativity as evinced by music improvisation. These recent advances shed new light on what makes music meaningful from a neuroscientific perspective
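
The brain's fundamental capacity for prediction can be made concrete with a toy expectancy computation: the Python sketch below scores each note of a melody by its surprisal under a simple add-alpha bigram model. This is only a schematic stand-in for predictive-coding-style expectation, not the predictive coding of music model itself.

```python
from collections import Counter, defaultdict
import math

def bigram_surprisal(training_melodies, test_melody, alpha=1.0):
    """Surprisal -log2 p(note | previous note) under an add-alpha bigram model."""
    counts = defaultdict(Counter)
    vocab = set()
    for melody in training_melodies:
        vocab.update(melody)
        for prev, nxt in zip(melody, melody[1:]):
            counts[prev][nxt] += 1
    v = len(vocab)
    surprisals = []
    for prev, nxt in zip(test_melody, test_melody[1:]):
        total = sum(counts[prev].values())
        p = (counts[prev][nxt] + alpha) / (total + alpha * v)
        surprisals.append(-math.log2(p))  # high value = strongly violated expectation
    return surprisals

# Toy usage: scale degrees of simple melodies; the final leap to 7 is the most surprising.
train = [[1, 2, 3, 4, 5, 4, 3, 2, 1], [1, 3, 5, 3, 1]]
print(bigram_surprisal(train, [1, 2, 3, 7]))
```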

    Multimodal sentiment analysis in real-life videos

    This thesis extends the emerging field of multimodal sentiment analysis of real-life videos, taking two components into consideration: the emotion and the emotion's target. The emotion component of media is traditionally represented as a segment-based intensity model over emotion classes. This representation is replaced here by a value- and time-continuous view. Adjacent research fields, such as affective computing, have largely neglected the linguistic information available from automatic transcripts of audio-video material. As is demonstrated here, this text modality is well suited for time- and value-continuous prediction. Moreover, source-specific problems, such as trustworthiness, have been largely unexplored so far. This work examines the perceived trustworthiness of the source, and its quantification, in user-generated video data and presents a possible modelling path. Furthermore, the transfer between continuous and discrete emotion representations is explored in order to summarise the emotional context at a segment level. The other component deals with the target of the emotion, for example, the topic the speaker is addressing. Emotion targets in a video dataset can, as is shown here, be coherently extracted based on automatic transcripts without limiting a priori parameters, such as the expected number of targets. Furthermore, alternatives to purely linguistic approaches for predicting targets, such as knowledge bases and multimodal systems, are investigated. A new dataset is designed for this investigation, and, in conjunction with proposed novel deep neural networks, extensive experiments are conducted to explore the components described above. The developed systems show robust prediction results and demonstrate the strengths of the respective modalities, feature sets, and modelling techniques. Finally, foundations are laid for cross-modal information prediction systems with applications to the correction of corrupted in-the-wild signals from real-life videos.
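
The transfer from a time-continuous emotion representation to segment-level discrete labels can be pictured with a small helper that averages a continuous valence trace over fixed-length segments and thresholds the mean. The Python sketch below is a generic illustration with invented thresholds, not the mapping used in the thesis.

```python
import numpy as np

def segment_valence_labels(valence, timestamps, segment_len=5.0, neg_thr=-0.1, pos_thr=0.1):
    """Map a time-continuous valence signal in [-1, 1] to per-segment labels."""
    labels = []
    t, t_end = 0.0, timestamps[-1]
    while t < t_end:
        mask = (timestamps >= t) & (timestamps < t + segment_len)
        mean_v = float(valence[mask].mean()) if mask.any() else 0.0
        if mean_v <= neg_thr:
            labels.append("negative")
        elif mean_v >= pos_thr:
            labels.append("positive")
        else:
            labels.append("neutral")
        t += segment_len
    return labels

# Toy usage: 20 s of continuous valence predictions sampled at 2 Hz.
ts = np.arange(0, 20, 0.5)
vals = np.sin(ts / 3.0) * 0.5
print(segment_valence_labels(vals, ts))
```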

    Learning from Audio, Vision and Language Modalities for Affect Recognition Tasks

    The world around us, as well as our responses to worldly events, is multimodal in nature. For intelligent machines to integrate seamlessly into our world, it is imperative that they can process and derive useful information from multimodal signals. Such capabilities can be provided to machines by employing multimodal learning algorithms that consider both the individual characteristics of unimodal signals and the complementarity provided by multimodal signals. Based on the number of modalities available during the training and testing phases, learning algorithms fall into three categories: unimodal trained and unimodal tested, multimodal trained and multimodal tested, and multimodal trained and unimodal tested algorithms. This thesis provides three contributions, one for each category, and focuses on three modalities that are important for human-human and human-machine communication, namely audio (paralinguistic speech), vision (facial expressions), and language (linguistic speech) signals. For several applications, either due to hardware limitations or deployment specifications, unimodal trained and tested systems suffice. Our first contribution, for the unimodal trained and unimodal tested category, is an end-to-end deep neural network that uses raw speech signals as input for a computational paralinguistic task, namely verbal conflict intensity estimation. Our model, which uses a convolutional recurrent architecture equipped with an attention mechanism to focus on task-relevant instances of the input speech signal, eliminates the need for task-specific metadata or domain-knowledge-based manual refinement of hand-crafted generic features. The second contribution, for the multimodal trained and multimodal tested category, is a multimodal fusion framework that exploits both cross- (inter) and intra-modal interactions for categorical emotion recognition from audiovisual clips. We explore the effectiveness of two types of attention mechanisms, namely intra- and cross-modal attention, by creating two versions of our fusion framework. In many applications, multimodal signals may be available during the model training phase, yet we cannot expect all modality signals to be available during the testing phase. Our third contribution addresses this situation: we propose a framework for cross-modal learning in which paired audio-visual instances are used during training to develop stand-alone unimodal models for test time.
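
A minimal sketch of the kind of attention-equipped convolutional recurrent model described for the first contribution is shown below: it operates on raw waveforms, scores the task relevance of each frame, and pools accordingly. The layer sizes, the single-channel front end, and the regression head are all invented, so this should be read as an illustration rather than the thesis's architecture.

```python
import torch
import torch.nn as nn

class ConvRecurrentAttention(nn.Module):
    """Toy end-to-end model: 1-D conv front end, GRU, and attention pooling over time."""

    def __init__(self, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=400, stride=160),  # ~25 ms frames at 16 kHz
            nn.ReLU(),
        )
        self.rnn = nn.GRU(32, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)   # scores the task relevance of each frame
        self.out = nn.Linear(hidden, 1)    # e.g. conflict-intensity regression

    def forward(self, wave):
        # wave: (batch, samples) raw speech
        h = self.conv(wave.unsqueeze(1)).transpose(1, 2)  # (batch, frames, 32)
        h, _ = self.rnn(h)                                # (batch, frames, hidden)
        w = torch.softmax(self.attn(h), dim=1)            # attention weights over frames
        pooled = (w * h).sum(dim=1)                       # weighted temporal pooling
        return self.out(pooled).squeeze(-1)

model = ConvRecurrentAttention()
print(model(torch.randn(2, 16000)).shape)  # torch.Size([2])
```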

    深層学習に基づく感情会話分析に関する研究 (A Study on Emotional Conversation Analysis Based on Deep Learning)

    A chatbot's ability to express specific emotions during a conversation is a key part of artificial intelligence and has an intuitive and quantifiable impact on the chatbot's usability and user satisfaction. Enabling machines to recognize emotions in conversation is challenging, mainly because human dialogue conveys emotion through long-term experience, abundant knowledge, context, and intricate patterns among affective states. Recently, many studies on neural emotional conversational models have been conducted. However, enabling a chatbot to control which emotion to respond with, according to its own character, is still underexplored. At this stage, people are no longer satisfied with using a dialogue system to solve specific tasks and are more eager to achieve emotional communication. If, during a chat, the system can perceive the user's emotions and process them accurately, it can greatly enrich the dialogue and foster empathy. In emotional dialogue, our ultimate goal is to make the machine understand human emotions and give matching responses. Based on these two points, this thesis explores in depth the emotion recognition in conversation task and the emotional dialogue generation task. Although considerable progress has been made in emotion research for dialogue in the past few years, difficulties and challenges remain due to the complex nature of human emotions. The key contributions of this thesis are summarized as follows: (1) Researchers have recently paid increasing attention to enhancing natural language models with knowledge graphs, which encode large amounts of systematic knowledge, and many studies have shown that introducing external commonsense knowledge helps enrich feature representations. We address the task of emotion recognition in conversations by using external knowledge to enhance semantics, employing the external knowledge graph ATOMIC as the knowledge source. We propose KES, a new framework that incorporates different elements of external knowledge and conversational semantic role labeling, and builds upon them to learn the interactions between the interlocutors participating in a conversation. A conversation is a sequence of coherent, ordered utterances, and capturing long-range context information is a weakness of traditional recurrent models. We therefore adopt the Transformer, a structure composed of self-attention and feed-forward layers, instead of a traditional RNN, to capture long-range context. We design a self-attention layer specialized for semantic text features enhanced with external commonsense knowledge; two LSTM-based networks are then responsible for tracking the individual internal state and the external context state. The proposed model is evaluated on three emotion-detection-in-conversation datasets, and the experimental results show that it outperforms state-of-the-art approaches on most of them.
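
The following PyTorch sketch illustrates the general shape of such a design: utterance features fused with commonsense-knowledge features, a self-attention layer for long-range interactions across turns, and two LSTMs tracking the internal and external states. It is a schematic reconstruction rather than the KES implementation, and all dimensions, names, and the single-sequence simplification are assumptions.

```python
import torch
import torch.nn as nn

class KnowledgeEnhancedERC(nn.Module):
    """Schematic ERC model: self-attention over knowledge-enhanced utterance features,
    plus separate LSTMs for speaker-internal state and conversational context."""

    def __init__(self, dim=300, know_dim=100, hidden=128, num_emotions=7):
        super().__init__()
        self.fuse = nn.Linear(dim + know_dim, hidden)  # fuse utterance + knowledge features
        self.self_attn = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.internal_lstm = nn.LSTM(hidden, hidden, batch_first=True)  # speaker-internal state
        self.context_lstm = nn.LSTM(hidden, hidden, batch_first=True)   # external context state
        self.classifier = nn.Linear(2 * hidden, num_emotions)

    def forward(self, utt_feats, know_feats):
        # utt_feats: (batch, turns, dim) utterance embeddings
        # know_feats: (batch, turns, know_dim) e.g. commonsense features derived from ATOMIC
        x = torch.relu(self.fuse(torch.cat([utt_feats, know_feats], dim=-1)))
        x = self.self_attn(x)                     # long-range interactions across turns
        internal, _ = self.internal_lstm(x)
        context, _ = self.context_lstm(x)
        return self.classifier(torch.cat([internal, context], dim=-1))  # per-turn emotion logits

model = KnowledgeEnhancedERC()
logits = model(torch.randn(2, 10, 300), torch.randn(2, 10, 100))  # shape (2, 10, 7)
```
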
(2) We propose an emotional dialogue model based on Seq2Seq, improved in three respects: the model input, the encoder structure, and the decoder structure, so that the model can generate responses that are emotionally rich, diverse, and context-aware. At the input, emotion information and positional information are added to the word vectors. On the encoder side, the proposed model first encodes the current input together with the sentence sentiment to generate a semantic vector, and additionally encodes the context together with the sentence sentiment to generate a context vector, adding contextual information while preserving the independence of the current input. On the decoder side, attention is used to compute weights over the two semantic vectors separately before decoding, so as to fully integrate the local and global emotional semantic information. We used seven objective evaluation metrics to assess the model's generation results, context similarity, response diversity, and emotional responses. Experimental results show that the model can generate diverse responses with rich sentiment and contextual associations.
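
As a minimal illustration of the input-side change described above, adding emotion and positional information to the word vectors, the PyTorch sketch below sums word, position, and emotion-category embeddings before the encoder; it is a generic toy rather than the proposed model, and all sizes are invented.

```python
import torch
import torch.nn as nn

class EmotionAwareEncoderInput(nn.Module):
    """Builds encoder inputs as word + positional + emotion-category embeddings."""

    def __init__(self, vocab=10000, dim=256, max_len=64, num_emotions=6):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.emo_emb = nn.Embedding(num_emotions, dim)

    def forward(self, token_ids, emotion_id):
        # token_ids: (batch, seq); emotion_id: (batch,) target emotion for the response
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.word_emb(token_ids) + self.pos_emb(positions)   # (batch, seq, dim)
        return x + self.emo_emb(emotion_id).unsqueeze(1)          # broadcast emotion over the sequence

embedder = EmotionAwareEncoderInput()
enc_in = embedder(torch.randint(0, 10000, (2, 12)), torch.tensor([3, 1]))  # (2, 12, 256)
```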