
    Survey on Emotional Body Gesture Recognition

    Automatic emotion recognition has become a trending research topic in the past decade. While works based on facial expressions or speech abound, recognizing affect from body gestures remains a less explored topic. We present a new comprehensive survey hoping to boost research in the field. We first introduce emotional body gestures as a component of what is commonly known as "body language" and comment on general aspects such as gender differences and culture dependence. We then define a complete framework for automatic emotional body gesture recognition. We introduce person detection and discuss static and dynamic body pose estimation methods in both RGB and 3D. We then review the recent literature on representation learning and emotion recognition from images of emotionally expressive gestures. We also discuss multi-modal approaches that combine speech or face with body gestures for improved emotion recognition. While pre-processing methodologies (e.g., human detection and pose estimation) are nowadays mature technologies fully developed for robust large-scale analysis, we show that for emotion recognition the quantity of labelled data is scarce. There is no agreement on clearly defined output spaces, and the representations are shallow and largely based on naive geometrical features.
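    To make the "naive geometrical representations" concrete, the following is a minimal sketch of a geometric-feature baseline for emotional body gesture recognition, assuming 2D pose keypoints are already available from an off-the-shelf pose estimator. The joint names, feature choices, and classifier are illustrative assumptions, not the survey's reference implementation.

```python
# Sketch: simple geometric features from 2D pose keypoints for gesture-based
# emotion recognition. Joint set and features are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC  # used in the commented training step below

JOINTS = ["head", "neck", "l_shoulder", "r_shoulder",
          "l_elbow", "r_elbow", "l_wrist", "r_wrist", "pelvis"]

def geometric_features(pose):
    """pose: dict joint -> (x, y). Returns simple normalised distances/angles."""
    p = {j: np.asarray(pose[j], dtype=float) for j in JOINTS}
    def dist(a, b):
        return np.linalg.norm(p[a] - p[b])
    def angle(a, b, c):  # angle at joint b formed by segments b-a and b-c
        v1, v2 = p[a] - p[b], p[c] - p[b]
        cosang = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        return np.arccos(np.clip(cosang, -1.0, 1.0))
    scale = dist("neck", "pelvis") + 1e-8        # normalise by torso length
    return np.array([
        dist("l_wrist", "r_wrist") / scale,      # hand openness
        dist("head", "l_wrist") / scale,         # hand-to-head distance
        dist("head", "r_wrist") / scale,
        angle("l_shoulder", "l_elbow", "l_wrist"),  # elbow flexion angles
        angle("r_shoulder", "r_elbow", "r_wrist"),
    ])

# Given labelled poses (lists `poses` and `labels`), training would then be:
# clf = SVC().fit(np.stack([geometric_features(p) for p in poses]), labels)
```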

    Analysis and automatic identification of spontaneous emotions in speech from human-human and human-machine communication

    This research mainly focuses on improving our understanding of human-human and human-machine interactions by analysing participants' emotional status. For this purpose, we have developed and enhanced Speech Emotion Recognition (SER) systems for both types of interaction in real-life scenarios, with an explicit emphasis on the Spanish language. In this framework, we have conducted an in-depth analysis of how humans express emotions using speech when communicating with other persons or machines in actual situations. Thus, we have analysed and studied the way in which emotional information is expressed in a variety of true-to-life environments, which is a crucial aspect for the development of SER systems. This study aimed to comprehensively understand the challenge we wanted to address: identifying emotional information in speech using machine learning technologies. Neural networks have been demonstrated to be adequate tools for identifying events in speech and language. Most of them aimed to make local comparisons between some specific aspects; thus, the experimental conditions were tailored to each particular analysis. The experiments across the different articles (from P1 to P19) are hardly comparable, owing to our continuous learning of how to deal with the difficult task of identifying emotions in speech. In order to make a fair comparison, additional unpublished results are presented in the Appendix. These experiments were carried out under identical and rigorous conditions. This general comparison offers an overview of the advantages and disadvantages of the different methodologies for the automatic recognition of emotions in speech.
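    As a point of reference for what such an SER pipeline involves, here is a minimal sketch, not the thesis's actual system: frame-level MFCCs are pooled into a fixed-length utterance vector and classified with a small neural network. The emotion label set and model size are assumptions for illustration.

```python
# Sketch of a basic speech emotion recognition (SER) pipeline:
# MFCC extraction -> utterance-level pooling -> small neural classifier.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["neutral", "happy", "sad", "angry"]   # assumed output space

def utterance_features(wav_path, sr=16000, n_mfcc=20):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    # Mean and std over time give a fixed-length utterance descriptor.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Given lists `wav_paths` and `labels`:
# X = np.stack([utterance_features(p) for p in wav_paths])
# y = np.array([EMOTIONS.index(lab) for lab in labels])
# clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, y)
# print(EMOTIONS[clf.predict(X[:1])[0]])
```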

    Video Analysis for Understanding Human Actions and Interactions

    Each time that we act, our actions are not conditioned only by spatial information, e.g., the objects, people, and scene around us. They are also conditioned temporally by the actions we have performed previously. Indeed, we live in an evolving and dynamic world. To understand what a person is doing, we reason jointly over spatial and temporal information. Intelligent systems that interact with people and perform useful tasks will also require this ability. In light of this need, video analysis has become, in recent years, an essential field in computer vision, providing the community with a wide range of tasks to solve. In this thesis, we make several contributions to the literature of video analysis, exploring different tasks that aim to understand human actions and interactions. We begin by considering the challenging problem of human action anticipation. In this task, we seek to predict a person's action as early as possible, before it is completed. This task is critical for applications where machines have to react to human actions. We introduce a novel approach that forecasts the most plausible future human motion by hallucinating motion representations. Then, we address the challenging problem of temporal moment localization. It consists of finding the temporal location of a natural-language query in a long untrimmed video. Although the queries could describe anything happening within the video, the vast majority of them describe human actions. In contrast with propose-and-rank approaches, where methods create or use predefined clips as candidates, we introduce a proposal-free approach that localizes the query by looking at the whole video at once. We also account for the subjectivity of the temporal annotations and propose a soft-labelling scheme based on a categorical distribution centred on the annotated start and end. Equipped with a proposal-free architecture, we tackle temporal moment localization by introducing a spatial-temporal graph. We found that one of the limitations of existing methods is the lack of spatial cues involved in the video and the query, i.e., objects and people. We create six semantically meaningful nodes: three are fed with visual features of people, objects, and activities, and the other three capture the relationships at the language level between "subject-object," "subject-verb," and "verb-object." We use a language-conditional message-passing algorithm to capture the relationships between nodes and create an improved representation of the activity. A temporal graph uses this new representation to determine the start and end of the query. Last, we study the problem of fine-grained opinion mining in video reviews using a multi-modal setting. There is increasing use of video as a source of information for guidance in the shopping process. People use video reviews as a guide to answering what, why, and where to buy something. We tackle this problem using the three different modalities inherently present in a video (audio, frames, and transcripts) to determine the most relevant aspect of the product under review and the reviewer's sentiment polarity towards that aspect. We propose an early fusion mechanism for the three modalities. In this approach, we fuse the three different modalities at the sentence level. It is a general framework that does not impose any strict constraints on the individual encodings of the audio, video frames, and transcripts.
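    The soft-labelling idea for temporal moment localization can be illustrated with a short sketch: instead of one-hot targets at the annotated start and end indices, a categorical distribution over time steps is centred on each annotation. The Gaussian-shaped weighting and temperature below are assumptions for illustration, not the thesis's exact formulation.

```python
# Sketch: soft targets for the start/end heads of a temporal localization model.
import numpy as np

def soft_label(center_idx, num_steps, sigma=2.0):
    """Categorical distribution over time steps, peaked at center_idx."""
    t = np.arange(num_steps)
    logits = -((t - center_idx) ** 2) / (2.0 * sigma ** 2)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Example: a 64-step video with the moment annotated between steps 12 and 30.
start_target = soft_label(12, 64)   # supervises the "start" prediction head
end_target = soft_label(30, 64)     # supervises the "end" prediction head
# Training would minimise the cross-entropy between the predicted start/end
# distributions and these soft targets rather than hard one-hot labels.
```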

    Gaze-Based Human-Robot Interaction by the Brunswick Model

    We present a new paradigm for human-robot interaction based on social signal processing, and in particular on the Brunswick model. Originally, the Brunswick model deals with face-to-face dyadic interaction, assuming that the interactants communicate through a continuous exchange of non-verbal social signals in addition to the spoken messages. Social signals have to be interpreted through a proper recognition phase that considers visual and audio information. The Brunswick model makes it possible to quantitatively evaluate the quality of the interaction using statistical tools that measure how effective the recognition phase is. In this paper we cast this theory in the setting where one of the interactants is a robot; in this case, the recognition phases performed by the robot and by the human have to be revised with respect to the original model. The model is applied to Berrick, a recent open-source, low-cost robotic head platform, where gaze is the social signal under consideration.
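    As a hedged illustration of how this lens-model view lets interaction quality be scored statistically (toy data and variable names are assumptions, not the paper's evaluation), one can correlate the sender's internal state with the emitted gaze cue, and the cue with the receiver's judgement:

```python
# Toy lens-model-style correlations: externalisation, attribution, achievement.
import numpy as np

rng = np.random.default_rng(0)
engagement = rng.normal(size=200)                          # sender's true state
gaze_cue = 0.8 * engagement + 0.3 * rng.normal(size=200)   # emitted social signal
judged = 0.7 * gaze_cue + 0.4 * rng.normal(size=200)       # robot's recognition

externalisation = np.corrcoef(engagement, gaze_cue)[0, 1]  # cue validity
attribution = np.corrcoef(gaze_cue, judged)[0, 1]          # recognition accuracy
achievement = np.corrcoef(engagement, judged)[0, 1]        # overall success
print(externalisation, attribution, achievement)
```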

    Proceedings of the 3rd International Mobile Brain/Body Imaging Conference: Berlin, July 12th to July 14th, 2018

    The 3rd International Mobile Brain/Body Imaging (MoBI) conference in Berlin 2018 brought together researchers from various disciplines interested in understanding the human brain in its natural environment and during active behavior. MoBI is a new imaging modality, employing mobile brain imaging methods like the electroencephalogram (EEG) or near-infrared spectroscopy (NIRS), synchronized to motion capture and other data streams, to investigate brain activity while participants actively move in and interact with their environment. Mobile Brain/Body Imaging makes it possible to investigate the brain dynamics accompanying more natural cognitive and affective processes, as it allows humans to interact with the environment without restrictions on physical movement. By overcoming the movement restrictions of established imaging modalities like functional magnetic resonance imaging (fMRI), MoBI can provide new insights into human brain function in mobile participants. This imaging approach will lead to new insights into the brain functions underlying active behavior and into the impact of behavior on brain dynamics and vice versa; it can also be used for the development of more robust human-machine interfaces as well as for state assessment in mobile humans. DFG, GR2627/10-1, 3rd International MoBI Conference 201
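    A core practical step in MoBI, synchronizing brain and body data streams recorded on their own clocks, can be sketched as follows. This is an illustrative alignment by timestamp interpolation, not a method from the proceedings; sampling rates and channel counts are assumptions.

```python
# Sketch: resample a motion-capture stream onto EEG time stamps so that each
# EEG sample has a matching body-movement sample.
import numpy as np

def align_streams(eeg, eeg_t, mocap, mocap_t):
    """eeg: (n_eeg, n_channels), eeg_t: (n_eeg,) seconds;
    mocap: (n_mocap, n_markers), mocap_t: (n_mocap,) seconds."""
    mocap_on_eeg = np.column_stack([
        np.interp(eeg_t, mocap_t, mocap[:, k]) for k in range(mocap.shape[1])
    ])
    return np.hstack([eeg, mocap_on_eeg])   # one synchronised row per EEG sample

# Example: 500 Hz EEG (64 channels) and 120 Hz motion capture (10 markers).
eeg_t = np.arange(0, 10, 1 / 500.0); eeg = np.random.randn(eeg_t.size, 64)
mocap_t = np.arange(0, 10, 1 / 120.0); mocap = np.random.randn(mocap_t.size, 10)
combined = align_streams(eeg, eeg_t, mocap, mocap_t)   # shape (5000, 74)
```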

    The analysis of breathing and rhythm in speech

    Speech rhythm can be described as the temporal patterning by which speech events, such as vocalic onsets, occur. Despite efforts to quantify and model speech rhythm across languages, it remains a scientifically enigmatic aspect of prosody. For instance, one challenge lies in determining how best to quantify and analyse speech rhythm. Techniques range from manual phonetic annotation to the automatic extraction of acoustic features. It is currently unclear how closely these differing approaches correspond to one another. Moreover, the primary means of speech rhythm research has been analysis of the acoustic signal alone. Investigations of speech rhythm may instead benefit from a range of complementary measures, including physiological recordings such as respiratory effort. This thesis therefore combines acoustic recording with inductive plethysmography (breath belts) to capture the temporal characteristics of speech and speech-breathing rhythms. The first part examines the performance of existing phonetic and algorithmic techniques for acoustic prosodic analysis in a new corpus of rhythmically diverse English and Mandarin speech. The second part addresses the need for an automatic speech-breathing annotation technique by developing a novel function that is robust to the noisy plethysmography typical of spontaneous, naturalistic speech production. These methods are then applied in the following section to the analysis of English speech and speech breathing in a second, larger corpus. Finally, behavioural experiments were conducted to investigate listeners' perception of speech breathing using a novel gap detection task. The thesis establishes the feasibility, as well as the limits, of automatic methods in comparison to manual annotation. In the speech-breathing corpus analysis, they help show that speakers maintain a normative, yet contextually adaptive, breathing style during speech. The perception experiments in turn demonstrate that listeners are sensitive to violations of these speech-breathing norms, even if unconsciously so. The thesis concludes by underscoring breathing as a necessary, yet often overlooked, component of speech rhythm planning and production.
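    To give a sense of what automatic annotation of a breath-belt signal involves, here is a minimal illustrative baseline, not the thesis's novel function: the belt trace is smoothed with a low-pass filter and inhalation maxima are located by prominence-based peak picking. The sampling rate, cut-off, and prominence threshold are assumptions.

```python
# Sketch: locate inhalation peaks in a noisy plethysmography (breath-belt) trace.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def inhalation_peaks(belt, fs=100.0, cutoff_hz=1.0, min_prominence=0.1):
    """Return sample indices of inhalation maxima in a breath-belt signal."""
    b, a = butter(2, cutoff_hz / (fs / 2.0), btype="low")  # smooth sensor noise
    smooth = filtfilt(b, a, belt)
    # Breaths are slow: enforce a minimum spacing of ~1 s between detected peaks.
    peaks, _ = find_peaks(smooth, distance=int(fs), prominence=min_prominence)
    return peaks

# Example on a synthetic noisy breathing trace (~0.3 Hz breathing rate):
t = np.arange(0, 30, 1 / 100.0)
belt = np.sin(2 * np.pi * 0.3 * t) + 0.2 * np.random.randn(t.size)
print(inhalation_peaks(belt))   # roughly one peak every ~3.3 seconds
```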