4 research outputs found

    Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in AVEC 2017

    Full text link
    © 2017 Association for Computing Machinery. Predicting emotion intensity and severity of depression are both challenging and important problems within the broader field of affective computing. As part of the AVEC 2017, we developed a number of systems to accomplish these tasks. In particular, word affect features, which derive human affect ratings (e.g. arousal and valence) from transcripts, were investigated for predicting depression severity and liking, showing great promise. A simple system based on the word affect features achieved an RMSE of 6.02 on the test set, yielding a relative improvement of 13.6% over the baseline. For the emotion prediction sub-challenge, we investigated multimodal fusion, which incorporated a measure of uncertainty associated with each prediction within an Output-Associative fusion framework for arousal and valence prediction, whilst liking prediction systems mainly focused on text-based features. Our best emotion prediction systems provided significant relative improvements over the baseline on the test set of 39.5%, 17.6%, and 29.3% for arousal, valence, and liking. Of particular note is that consistent improvements were observed when incorporating prediction uncertainty across various system configurations for predicting arousal and valence, suggesting the importance of taking into consideration prediction uncertainty for fusion and more broadly the advantages of probabilistic predictions

    An Ordinal Approach to Affective Computing

    Full text link
    Both depression prediction and emotion recognition systems are often based on ordinal ground truth due to subjectively annotated datasets. Yet, both have so far been posed as classification or regression problems. These naive approaches have fundamental issues because they are not focused on ordering, unlike ordinal regression, which is the most appropriate for truly ordinal ground truth. Ordinal regression to date offers comparatively fewer, more limited methods when compared with other branches in machine learning, and its usage has been limited to specific research domains. Accordingly, this thesis presents investigations into ordinal approaches for affective computing by describing a consistent framework to understand all ordinal system designs, proposing ordinal systems for large datasets, and introducing tools and principles to select suitable system designs and evaluation methods. First, three learning approaches are compared using the support vector framework to establish the empirical advantages of ordinal regression, which is lacking from the current literature. Results on depression and emotion corpora indicate that ordinal regression with proper tuning can improve existing depression and emotion systems. Ordinal logistic regression (OLR), which is an extension of logistic regression for ordinal scales, contributes to a number of model structures, from which the best structure must be chosen. Exploiting the newly proposed computationally efficient greedy algorithm for model structure selection (GREP), OLR outperformed or was comparable with state-of-the-art depression systems on two benchmark depression speech datasets. Deep learning has dominated many affective computing fields, and hence ordinal deep learning is an attractive prospect. However, it is under-studied even in the machine learning literature, which motivates an in-depth analysis of appropriate network architectures and loss functions. One of the significant outcomes of this analysis is the introduction of RankCNet, a novel ordinal network which utilises a surrogate loss function of rank correlation. Not only the modelling algorithm but the choice of evaluation measure depends on the nature of the ground truth. Rank correlation measures, which are sensitive to ordering, are more apt for ordinal problems than common classification or regression measures that ignore ordering information. Although rank-based evaluation for ordinal problems is not new, so far in affective computing, ordinality of the ground truth has been widely ignored during evaluation. Hence, a systematic analysis in the affective computing context is presented, to provide clarity and encourage careful choice of evaluation measures. Another contribution is a neural network framework with a novel multi-term loss function to assess the ordinality of ordinally-annotated datasets, which can guide the selection of suitable learning and evaluation methods. Experiments on multiple synthetic and affective speech datasets reveal that the proposed system can offer reliable and meaningful predictions about the ordinality of a given dataset. Overall, the novel contributions and findings presented in this thesis not only improve prediction accuracy but also encourage future research towards ordinal affective computing: a different paradigm, but often the most appropriate

    Improving the Generalizability of Speech Emotion Recognition: Methods for Handling Data and Label Variability

    Full text link
    Emotion is an essential component in our interaction with others. It transmits information that helps us interpret the content of what others say. Therefore, detecting emotion from speech is an important step towards enabling machine understanding of human behaviors and intentions. Researchers have demonstrated the potential of emotion recognition in areas such as interactive systems in smart homes and mobile devices, computer games, and computational medical assistants. However, emotion communication is variable: individuals may express emotion in a manner that is uniquely their own; different speech content and environments may shape how emotion is expressed and recorded; individuals may perceive emotional messages differently. Practically, this variability is reflected in both the audio-visual data and the labels used to create speech emotion recognition (SER) systems. SER systems must be robust and generalizable to handle the variability effectively. The focus of this dissertation is on the development of speech emotion recognition systems that handle variability in emotion communications. We break the dissertation into three parts, according to the type of variability we address: (I) in the data, (II) in the labels, and (III) in both the data and the labels. Part I: The first part of this dissertation focuses on handling variability present in data. We approximate variations in environmental properties and expression styles by corpus and gender of the speakers. We find that training on multiple corpora and controlling for the variability in gender and corpus using multi-task learning result in more generalizable models, compared to the traditional single-task models that do not take corpus and gender variability into account. Another source of variability present in the recordings used in SER is the phonetic modulation of acoustics. On the other hand, phonemes also provide information about the emotion expressed in speech content. We discover that we can make more accurate predictions of emotion by explicitly considering both roles of phonemes. Part II: The second part of this dissertation addresses variability present in emotion labels, including the differences between emotion expression and perception, and the variations in emotion perception. We discover that it is beneficial to jointly model both the perception of others and how one perceives one’s own expression, compared to focusing on either one. Further, we show that the variability in emotion perception is a modelable signal and can be captured using probability distributions that describe how groups of evaluators perceive emotional messages. Part III: The last part of this dissertation presents methods that handle variability in both data and labels. We reduce the data variability due to non-emotional factors using deep metric learning and model the variability in emotion perception using soft labels. We propose a family of loss functions and show that by pairing examples that potentially vary in expression styles and lexical content and preserving the real-valued emotional similarity between them, we develop systems that generalize better across datasets and are more robust to over-training. These works demonstrate the importance of considering data and label variability in the creation of robust and generalizable emotion recognition systems. We conclude this dissertation with the following future directions: (1) the development of real-time SER systems; (2) the personalization of general SER systems.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147639/1/didizbq_1.pd

    Continuous Emotion Prediction from Speech: Modelling Ambiguity in Emotion

    Full text link
    There is growing interest in emotion research to model perceived emotion labelled as intensities along the affect dimensions such as arousal and valence. These labels are typically obtained from multiple annotators who would have their individualistic perceptions of emotional speech. Consequently, emotion prediction models that incorporate variation in individual perceptions as ambiguity in the emotional state would be more realistic. This thesis develops the modelling framework necessary to achieve continuous prediction of ambiguous emotional states from speech. Besides, emotion labels, feature space distribution and encoding are an integral part of the prediction system. The first part of this thesis examines the limitations of current low-level feature distributions and their minimalistic statistical descriptions. Specifically, front-end paralinguistic acoustic features are reflective of speech production mechanisms. However, discriminatively learnt features have frequently outperformed acoustic features in emotion prediction tasks, but provide no insights into the physical significance of these features. One of the contributions of this thesis is the development of a framework that can modify the acoustic feature representation based on emotion label information. Another investigation in this thesis indicates that emotion perception is language-dependent and in turn, helped develop a framework for cross-language emotion prediction. Furthermore, this investigation supported the hypothesis that emotion perception is highly individualistic and is better modelled as a distribution rather than a point estimate to encode information about the ambiguity in the perceived emotion. Following this observation, the thesis proposes measures to quantify the appropriateness of distribution types in modelling ambiguity in dimensional emotion labels which are then employed to compare well-known bounded parametric distributions. These analyses led to the conclusion that the beta distribution was the most appropriate parametric model of ambiguity in emotion labels. Finally, the thesis focuses on developing a deep learning framework for continuous emotion prediction as a temporal series of beta distributions, examining various parameterizations of the beta distributions as well as loss functions. Furthermore, distribution over the parameter spaces is examined and priors from kernel density estimation are employed to shape the posteriors over the parameter space which significantly improved valence ambiguity predictions. The proposed frameworks and methods have been extensively evaluated on multiple state of-the-art databases and the results demonstrate both the viability of predicting ambiguous emotion states and the validity of the proposed systems
    corecore