5 research outputs found

    ICMI'12:Proceedings of the ACM SIGCHI 14th International Conference on Multimodal Interaction

    Get PDF

    Investigating multi-modal features for continuous affect recognition using visual sensing

    Get PDF
    Emotion plays an essential role in human cognition, perception and rational decisionmaking. In the information age, people spend more time then ever before interacting with computers, however current technologies such as Artificial Intelligence (AI) and Human-Computer Interaction (HCI) have largely ignored the implicit information of a user’s emotional state leading to an often frustrating and cold user experience. To bridge this gap between human and computer, the field of affective computing has become a popular research topic. Affective computing is an interdisciplinary field encompassing computer, social, cognitive, psychology and neural science. This thesis focuses on human affect recognition, which is one of the most commonly investigated areas in affective computing. Although from a psychology point of view, emotion is usually defined differently from affect, for this thesis the terms emotion, affect, emotional state and affective state are used interchangeably. Both visual and vocal cues have been used in previous research to recognise a human’s affective states. For visual cues, information from the face is often used. Although these systems achieved good performance under laboratory settings, it has proved a challenging task to translate these to unconstrained environments due to variations in head pose and lighting conditions. Since a human face is a threedimensional (3D) object whose 2D projection is sensitive to the aforementioned variations, recent trends have shifted towards using 3D facial information to improve the accuracy and robustness of the systems. However these systems are still focused on recognising deliberately displayed affective states, mainly prototypical expressions of six basic emotions (happiness, sadness, fear, anger, surprise and disgust). To our best knowledge, no research has been conducted towards continuous recognition of spontaneous affective states using 3D facial information. The main goal of this thesis is to investigate the use of 2D (colour) and 3D (depth) facial information to recognise spontaneous affective states continuously. Due to a lack of an existing continuous annotated spontaneous data set, which contains both colour and depth information, such a data set was created. To better understand the processes in affect recognition and to compare results of the proposed methods, a baseline system was implemented. Then the use of colour and depth information for affect recognition were examined separately. For colour information, an investigation was carried out to explore the performance of various state-of-art 2D facial features using different publicly available data sets as well as the captured data set. Experiments were also carried out to study if it is possible to predict a human’s affective state using 2D features extracted from individual facial parts (E.g. eyes and mouth). For depth information, a number of histogram based features were used and their performance was evaluated. Finally a multi-modal affect recognition framework utilising both colour and depth information is proposed and its performance was evaluated using the captured data set

    Affective Speech Recognition

    Get PDF
    Speech, as a medium of interaction, carries two different streams of information. Whereas one stream carries explicit messages, the other one contains implicit information about speakers themselves. Affective speech recognition is a set of theories and tools that intend to automate unfolding the part of the implicit stream that has to do with humans emotion. Application of affective speech recognition is to human computer interaction; a machine that is able to recognize humans emotion could engage the user in a more effective interaction. This thesis proposes a set of analyses and methodologies that advance automatic recognition of affect from speech. The proposed solution spans two dimensions of the problem: speech signal processing, and statistical learning. At the speech signal processing dimension, extraction of speech low-level descriptors is dis- cussed, and a set of descriptors that exploit the spectrum of the signal are proposed, which have shown to be particularly practical for capturing affective qualities of speech. Moreover, consider- ing the non-stationary property of the speech signal, further proposed is a measure of dynamicity that captures that property of speech by quantifying changes of the signal over time. Furthermore, based on the proposed set of low-level descriptors, it is shown that individual human beings are different in conveying emotions, and that parts of the spectrum that hold the affective information are different from one person to another. Therefore, the concept of emotion profile is proposed that formalizes those differences by taking into account different factors such as cultural and gender-specific differences, as well as those distinctions that have to do with individual human beings. At the statistical learning dimension, variable selection is performed to identify speech features that are most imperative to extracting affective information. In doing so, low-level descriptors are distinguished from statistical functionals, therefore, effectiveness of each of the two are studied dependently and independently. The major importance of variable selection as a standalone component of a solution is to real-time application of affective speech recognition. Although thousands of speech features are commonly used to tackle this problem in theory, extracting that many features in a real-time manner is unrealistic, especially for mobile applications. Results of the conducted investigations show that the required number of speech features is far less than the number that is commonly used in the literature of the problem. At the core of an affective speech recognition solution is a statistical model that uses speech features to recognize emotions. Such a model comes with a set of parameters that are estimated through a learning process. Proposed in this thesis is a learning algorithm, developed based on the notion of Hilbert-Schmidt independence criterion and named max-dependence regression, that maximizes the dependence between predicted and actual values of affective qualities. Pearson’s correlation coefficient is commonly used as the measure of goodness of a fit in the literature of affective computing, therefore max-dependence regression is proposed to make the learning and hypothesis testing criteria consistent with one another. Results of this research show that doing so results in higher prediction accuracy. Lastly, sparse representation for affective speech datasets is considered in this thesis. For this purpose, the application of a dictionary learning algorithm based on Hilbert-Schmidt independence criterion is proposed. Dictionary learning is used to identify the most important bases of the data in order to improve the generalization capability of the proposed solution to affective speech recognition. Based on the dictionary learning approach of choice, fusion of feature vectors is proposed. It is shown that sparse representation leads to higher generalization capability for affective speech recognition
    corecore