3,242 research outputs found

    Speech-based recognition of self-reported and observed emotion in a dimensional space

    Get PDF
    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus that contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and that includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results of these recognizers show that the self-reported emotion is much harder to recognize than the observed emotion, and that averaging ratings from multiple observers improves performance

    Multimodal assessment of emotional responses by physiological monitoring: novel auditory and visual elicitation strategies in traditional and virtual reality environments

    Get PDF
    This doctoral thesis explores novel strategies to quantify emotions and listening effort through monitoring of physiological signals. Emotions are a complex aspect of the human experience, playing a crucial role in our survival and adaptation to the environment. The study of emotions fosters important applications, such as Human-Computer and Human-Robot interaction or clinical assessment and treatment of mental health conditions such as depression, anxiety, stress, chronic anger, and mood disorders. Listening effort is also an important area of study, as it provides insight into the listeners’ challenges that are usually not identified by traditional audiometric measures. The research is divided into three lines of work, each with a unique emphasis on the methods of emotion elicitation and the stimuli that are most effective in producing emotional responses, with a specific focus on auditory stimuli. The research fostered the creation of three experimental protocols, as well as the use of an available online protocol for studying emotional responses including monitoring of both peripheral and central physiological signals, such as skin conductance, respiration, pupil dilation, electrocardiogram, blood volume pulse, and electroencephalography. An emotional protocol was created for the study of listening effort using a speech-in-noise test designed to be short and not induce fatigue. The results revealed that the listening effort is a complex problem that cannot be studied with a univariate approach, thus necessitating the use of multiple physiological markers to study different physiological dimensions. Specifically, the findings demonstrate a strong association between the level of auditory exertion, the amount of attention and involvement directed towards stimuli that are readily comprehensible compared to those that demand greater exertion. Continuing with the auditory domain, peripheral physiological signals were studied in order to discriminate four emotions elicited in a subject who listened to music for 21 days, using a previously designed and publicly available protocol. Surprisingly, the processed physiological signals were able to clearly separate the four emotions at the physiological level, demonstrating that music, which is not typically studied extensively in the literature, can be an effective stimulus for eliciting emotions. Following these results, a flat-screen protocol was created to compare physiological responses to purely visual, purely auditory, and combined audiovisual emotional stimuli. The results show that auditory stimuli are more effective in separating emotions at the physiological level. The subjects were found to be much more attentive during the audio-only phase. In order to overcome the limitations of emotional protocols carried out in a laboratory environment, which may elicit fewer emotions due to being an unnatural setting for the subjects under study, a final emotional elicitation protocol was created using virtual reality. Scenes similar to reality were created to elicit four distinct emotions. At the physiological level, it was noted that this environment is more effective in eliciting emotions. To our knowledge, this is the first protocol specifically designed for virtual reality that elicits diverse emotions. Furthermore, even in terms of classification, the use of virtual reality has been shown to be superior to traditional flat-screen protocols, opening the doors to virtual reality for the study of conditions related to emotional control

    Recognition of Human Emotion using Radial Basis Function Neural Networks with Inverse Fisher Transformed Physiological Signals

    Get PDF
    Emotion is a complex state of human mind influenced by body physiological changes and interdependent external events thus making an automatic recognition of emotional state a challenging task. A number of recognition methods have been applied in recent years to recognize human emotion. The motivation for this study is therefore to discover a combination of emotion features and recognition method that will produce the best result in building an efficient emotion recognizer in an affective system. We introduced a shifted tanh normalization scheme to realize the inverse Fisher transformation applied to the DEAP physiological dataset and consequently performed series of experiments using the Radial Basis Function Artificial Neural Networks (RBFANN). In our experiments, we have compared the performances of digital image based feature extraction techniques such as the Histogram of Oriented Gradient (HOG), Local Binary Pattern (LBP) and the Histogram of Images (HIM). These feature extraction techniques were utilized to extract discriminatory features from the multimodal DEAP dataset of physiological signals. Experimental results obtained indicate that the best recognition accuracy was achieved with the EEG modality data using the HIM features extraction technique and classification done along the dominance emotion dimension. The result is very remarkable when compared with existing results in the literature including deep learning studies that have utilized the DEAP corpus and also applicable to diverse fields of engineering studies

    Can Spontaneous Emotions be Detected from Speech on TV Political Debates?

    Get PDF
    Accepted paperDecoding emotional states from multimodal signals is an increasingly active domain, within the framework of affective computing, which aims to a better understanding of Human-Human Communication as well as to improve Human- Computer Interaction. But the automatic recognition of sponta- neous emotions from speech is a very complex task due to the lack of a certainty of the speaker states as well as to the difficulty to identify a variety of emotions in real scenarios. In this work we explore the extent to which emotional states can be decoded from speech signals extracted from TV political debates. The labelling procedure was supported by perception experiments where only a small set of emotions has been identified. In addition, some scaled judgements of valence, arousal and dominance were also provided. In this framework the paper shows meaningful comparisons between both, the dimensional and the categorical models of emotions, which is a new con- tribution when dealing with spontaneous emotions. To this end Support Vector Machines (SVM) as well as Feedforward Neural Networks (FNN) have been proposed to develop classifiers and predictors. The experimental evaluation over a Spanish corpus has shown the ability of both models to be identified in speech segments by the proposed artificial systems.This work has been partially funded by the Spanish Government under grant TIN2017-85854-C4-3-R (AEI/FEDER,UE) and conducted in the project EMPATHIC (Grant n769872) funded by the European Union’s H2020 research andinnovation program

    MoralStrength: Exploiting a Moral Lexicon and Embedding Similarity for Moral Foundations Prediction

    Get PDF
    Moral rhetoric plays a fundamental role in how we perceive and interpret the information we receive, greatly influencing our decision-making process. Especially when it comes to controversial social and political issues, our opinions and attitudes are hardly ever based on evidence alone. The Moral Foundations Dictionary (MFD) was developed to operationalize moral values in the text. In this study, we present MoralStrength, a lexicon of approximately 1,000 lemmas, obtained as an extension of the Moral Foundations Dictionary, based on WordNet synsets. Moreover, for each lemma it provides with a crowdsourced numeric assessment of Moral Valence, indicating the strength with which a lemma is expressing the specific value. We evaluated the predictive potentials of this moral lexicon, defining three utilization approaches of increased complexity, ranging from lemmas' statistical properties to a deep learning approach of word embeddings based on semantic similarity. Logistic regression models trained on the features extracted from MoralStrength, significantly outperformed the current state-of-the-art, reaching an F1-score of 87.6% over the previous 62.4% (p-value<0.01), and an average F1-Score of 86.25% over six different datasets. Such findings pave the way for further research, allowing for an in-depth understanding of moral narratives in text for a wide range of social issues

    Improve Affective Learning with EEG Approach

    Get PDF
    With the development of computer science, cognitive science and psychology, a new paradigm, affective learning, has emerged into e-learning domain. Although scientists and researchers have achieved fruitful outcomes in exploring the ways of detecting and understanding learners affect, e.g. eyes motion, facial expression etc., it sounds still necessary to deepen the recognition of learners affect in learning procedure with innovative methodologies. Our research focused on using bio-signals based methodology to explore learner's affect and the study was primarily made on Electroencephalography (EEG). After the EEG signals were collected from EEG equipment, we tidied the EEG data with signal processing algorithms and then extracted some features. We applied k-Nearest-Neighbor classifier and Naive Bayes classifier to these features to find out a combination, which may mostly contribute to reflect learners' affect, for example, Attention. In the classification algorithm, we presented a different way of using the Self-Assessment Manikin (SAM) model to classify and analyze learners attention, although the SAM was normally used for classifying emotions, for example, happiness etc. For the purpose of evaluating our findings, we also developed an affective learning prototype based on university e-learning web site. A real time EEG feedback window and an attention report were integrated into the system. The result of the experiment was encouraging and further discussion was also included in this paper

    Towards Understanding Confusion and Affective States Under Communication Failures in Voice-Based Human-Machine Interaction

    Full text link
    We present a series of two studies conducted to understand user's affective states during voice-based human-machine interactions. Emphasis is placed on the cases of communication errors or failures. In particular, we are interested in understanding "confusion" in relation with other affective states. The studies consist of two types of tasks: (1) related to communication with a voice-based virtual agent: speaking to the machine and understanding what the machine says, (2) non-communication related, problem-solving tasks where the participants solve puzzles and riddles but are asked to verbally explain the answers to the machine. We collected audio-visual data and self-reports of affective states of the participants. We report results of two studies and analysis of the collected data. The first study was analyzed based on the annotator's observation, and the second study was analyzed based on the self-report

    EEG-based multi-modal emotion recognition using bag of deep features: An optimal feature selection approach

    Get PDF
    Much attention has been paid to the recognition of human emotions with the help of electroencephalogram (EEG) signals based on machine learning technology. Recognizing emotions is a challenging task due to the non-linear property of the EEG signal. This paper presents an advanced signal processing method using the deep neural network (DNN) for emotion recognition based on EEG signals. The spectral and temporal components of the raw EEG signal are first retained in the 2D Spectrogram before the extraction of features. The pre-trained AlexNet model is used to extract the raw features from the 2D Spectrogram for each channel. To reduce the feature dimensionality, spatial, and temporal based, bag of deep features (BoDF) model is proposed. A series of vocabularies consisting of 10 cluster centers of each class is calculated using the k-means cluster algorithm. Lastly, the emotion of each subject is represented using the histogram of the vocabulary set collected from the raw-feature of a single channel. Features extracted from the proposed BoDF model have considerably smaller dimensions. The proposed model achieves better classification accuracy compared to the recently reported work when validated on SJTU SEED and DEAP data sets. For optimal classification performance, we use a support vector machine (SVM) and k-nearest neighbor (k-NN) to classify the extracted features for the different emotional states of the two data sets. The BoDF model achieves 93.8% accuracy in the SEED data set and 77.4% accuracy in the DEAP data set, which is more accurate compared to other state-of-the-art methods of human emotion recognition. - 2019 by the authors. Licensee MDPI, Basel, Switzerland.Funding: This research was funded by Higher Education Commission (HEC): Tdf/67/2017.Scopu

    Enhancing the Prediction of Emotional Experience in Movies using Deep Neural Networks: The Significance of Audio and Language

    Full text link
    Our paper focuses on making use of deep neural network models to accurately predict the range of human emotions experienced during watching movies. In this certain setup, there exist three clear-cut input modalities that considerably influence the experienced emotions: visual cues derived from RGB video frames, auditory components encompassing sounds, speech, and music, and linguistic elements encompassing actors' dialogues. Emotions are commonly described using a two-factor model including valence (ranging from happy to sad) and arousal (indicating the intensity of the emotion). In this regard, a Plethora of works have presented a multitude of models aiming to predict valence and arousal from video content. However, non of these models contain all three modalities, with language being consistently eliminated across all of them. In this study, we comprehensively combine all modalities and conduct an analysis to ascertain the importance of each in predicting valence and arousal. Making use of pre-trained neural networks, we represent each input modality in our study. In order to process visual input, we employ pre-trained convolutional neural networks to recognize scenes[1], objects[2], and actions[3,4]. For audio processing, we utilize a specialized neural network designed for handling sound-related tasks, namely SoundNet[5]. Finally, Bidirectional Encoder Representations from Transformers (BERT) models are used to extract linguistic features[6] in our analysis. We report results on the COGNIMUSE dataset[7], where our proposed model outperforms the current state-of-the-art approaches. Surprisingly, our findings reveal that language significantly influences the experienced arousal, while sound emerges as the primary determinant for predicting valence. In contrast, the visual modality exhibits the least impact among all modalities in predicting emotions
    corecore