Automatic Emotion Recognition: Quantifying Dynamics and Structure in Human Behavior.

Abstract

Emotion is a central part of human interaction, one that has a huge influence on its overall tone and outcome. Today's human-centered interactive technology can greatly benefit from automatic emotion recognition, as the extracted affective information can be used to measure, transmit, and respond to user needs. However, developing such systems is challenging due to the complexity of emotional expressions and their dynamics in terms of the inherent multimodality between audio and visual expressions, as well as the mixed factors of modulation that arise when a person speaks. To overcome these challenges, this thesis presents data-driven approaches that can quantify the underlying dynamics in audio-visual affective behavior. The first set of studies lay the foundation and central motivation of this thesis. We discover that it is crucial to model complex non-linear interactions between audio and visual emotion expressions, and that dynamic emotion patterns can be used in emotion recognition. Next, the understanding of the complex characteristics of emotion from the first set of studies leads us to examine multiple sources of modulation in audio-visual affective behavior. Specifically, we focus on how speech modulates facial displays of emotion. We develop a framework that uses speech signals which alter the temporal dynamics of individual facial regions to temporally segment and classify facial displays of emotion. Finally, we present methods to discover regions of emotionally salient events in a given audio-visual data. We demonstrate that different modalities, such as the upper face, lower face, and speech, express emotion with different timings and time scales, varying for each emotion type. We further extend this idea into another aspect of human behavior: human action events in videos. We show how transition patterns between events can be used for automatically segmenting and classifying action events. Our experimental results on audio-visual datasets show that the proposed systems not only improve performance, but also provide descriptions of how affective behaviors change over time. We conclude this dissertation with the future directions that will innovate three main research topics: machine adaptation for personalized technology, human-human interaction assistant systems, and human-centered multimedia content analysis.PhDElectrical Engineering: SystemsUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/133459/1/yelinkim_1.pd

    Similar works

    Full text

    thumbnail-image