Emotion Recognition: An Integration of Different Perspectives
Automatic emotion recognition describes the computational task of predicting emotion from various inputs including visual information, speech, and language. This task is rooted in principles from psychology, such as the model used to categorize emotions and the definition of what constitutes an emotional expression. In both psychology and computer science, there is a plethora of different perspectives on emotion. The goal of this work is to investigate some of these perspectives on emotion recognition and discuss how they can be integrated to create better emotion recognition systems. To accomplish this, we first discuss psychological concepts, including emotion theories, emotion models, and emotion perception, and how these can be used when creating automatic emotion recognition systems. We also perform emotion recognition on text, visual, and speech data from different datasets to show that emotional information can be expressed in different modalities.
ASR-based Features for Emotion Recognition: A Transfer Learning Approach
During the last decade, the applications of signal processing have drastically improved with deep learning. However, areas of affective computing such as emotional speech synthesis or emotion recognition from spoken language remain challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) system as a feature extractor for emotion recognition. We show that these features outperform the eGeMAPS feature set in predicting the valence and arousal emotional dimensions, which means that the audio-to-text mapping learned by the ASR system contains information related to the emotional dimensions in spontaneous speech. We also examine the relationship between the first layers (closer to speech) and last layers (closer to text) of the ASR and valence/arousal.
Comment: Accepted to be published in the First Workshop on Computational Modeling of Human Multimodal Language - ACL 201
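Below is a minimal sketch of the general idea this abstract describes: time-pooled activations from an ASR encoder compared against an eGeMAPS-style baseline as inputs to a valence/arousal regressor. The feature dimensions, the Ridge regressor, and the random placeholder data are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch: compare pooled ASR-layer activations against an eGeMAPS-style
# baseline for valence/arousal regression. All data below is random placeholder
# data; in practice the activations would come from an ASR encoder and the
# labels from an annotated corpus.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_utterances = 200

# Placeholder features: 88-dim eGeMAPS vectors and 256-dim ASR hidden states
# (one frame sequence per utterance, mean-pooled over time).
egemaps = rng.normal(size=(n_utterances, 88))
asr_frames = [rng.normal(size=(rng.integers(50, 300), 256)) for _ in range(n_utterances)]
asr_pooled = np.stack([f.mean(axis=0) for f in asr_frames])

# Placeholder continuous labels in [-1, 1] for the two emotional dimensions.
valence = rng.uniform(-1, 1, size=n_utterances)
arousal = rng.uniform(-1, 1, size=n_utterances)

for name, X in [("eGeMAPS", egemaps), ("ASR features", asr_pooled)]:
    for dim_name, y in [("valence", valence), ("arousal", arousal)]:
        # Ridge regression stands in for whatever regressor the paper used.
        score = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2").mean()
        print(f"{name} -> {dim_name}: mean CV R^2 = {score:.3f}")
```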
Spoken affect classification : algorithms and experimental implementation : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science at Massey University, Palmerston North, New Zealand
Machine-based emotional intelligence is a requirement for natural interaction between humans and computer interfaces, and a basic level of accurate emotion perception is needed for computer systems to respond adequately to human emotion. Humans convey emotional information both intentionally and unintentionally via speech patterns. These vocal patterns are perceived and understood by listeners during conversation. This research aims to improve the automatic perception of vocal emotion in two ways. First, we compare two emotional speech data sources: natural, spontaneous emotional speech and acted or portrayed emotional speech. This comparison demonstrates the advantages and disadvantages of both acquisition methods and how these methods affect the end application of vocal emotion recognition. Second, we look at two classification methods which have gone unexplored in this field: stacked generalisation and unweighted vote. We show how these techniques can yield an improvement over traditional classification methods.
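As a rough illustration of the two ensemble schemes this abstract names, the following scikit-learn sketch fits a stacked-generalisation classifier and an unweighted (hard-vote) ensemble on synthetic data. The base learners, meta-learner, and data are assumptions, not the configuration used in the thesis.

```python
# Minimal sketch of the two ensemble schemes named in the abstract, using
# scikit-learn. The base learners, meta-learner, and synthetic data are
# placeholders, not the thesis's actual setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for acoustic feature vectors with emotion-class labels.
X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           n_classes=4, random_state=0)

base_learners = [
    ("svm", SVC(probability=True, random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
]

# Stacked generalisation: a meta-learner is trained on the base learners' outputs.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))

# Unweighted vote: each base learner gets one equal vote (hard majority voting).
vote = VotingClassifier(estimators=base_learners, voting="hard")

for name, clf in [("stacking", stack), ("unweighted vote", vote)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```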
Towards Indonesian Speech-Emotion Automatic Recognition (I-SpEAR)
Even though speech-emotion recognition (SER) has been receiving much attention as a research topic, there are still disputes about which vocal features can identify certain emotions. Emotion expression is also known to differ according to cultural background, which makes it important to study SER specific to the culture to which the language belongs. Furthermore, only a few studies address SER in Indonesian, which is what this study attempts to explore. In this study, we extract simple features from 3420 voice recordings gathered from 38 participants. The features are compared by means of a linear mixed-effects model, which shows that emotional and non-emotional speech can be differentiated by speech duration. Using an SVM with speech duration as the input feature, we achieve 76.84% average accuracy in classifying emotional and non-emotional speech.
Comment: 4 pages, 3 tables, published in the 4th International Conference on New Media (Conmedia), 8-10 Nov. 2017 (http://conmedia.umn.ac.id/) [in print as of Sept. 17, 2017]
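A minimal sketch of the final classification step described above: an SVM fit on speech duration alone to separate emotional from non-emotional utterances. The synthetic durations (and the assumption that emotional speech runs longer on average) are placeholders, not the study's data.

```python
# Minimal sketch: classify emotional vs. non-emotional speech using only the
# utterance duration, as in the abstract. The synthetic durations below are
# illustrative placeholders, not the study's actual data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_class = 500

# One feature per utterance: duration in seconds (emotional speech assumed longer here).
dur_neutral = rng.normal(loc=2.0, scale=0.6, size=n_per_class)
dur_emotional = rng.normal(loc=2.8, scale=0.8, size=n_per_class)

X = np.concatenate([dur_neutral, dur_emotional]).reshape(-1, 1)
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])  # 0 = non-emotional, 1 = emotional

# Scale the single feature, then fit an RBF-kernel SVM (kernel choice is an assumption).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```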
Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition
Automatic emotion recognition from speech, which is an important and challenging task in the field of affective computing, heavily relies on the effectiveness of the speech features used for classification. Previous approaches to emotion recognition have mostly focused on the extraction of carefully hand-crafted features. How to model spatio-temporal dynamics for speech emotion recognition effectively is still under active investigation. In this paper, we propose a method to tackle the problem of emotion-relevant feature extraction from speech by leveraging attention-based bidirectional Long Short-Term Memory recurrent neural networks together with fully convolutional networks in order to automatically learn the best spatio-temporal representations of speech signals. The learned high-level features are then fed into a deep neural network (DNN) to predict the final emotion. Experimental results on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpora show that our method provides more accurate predictions than other existing emotion recognition algorithms.
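The following PyTorch sketch illustrates the kind of architecture this abstract outlines: a fully convolutional front-end over a log-mel spectrogram, a bidirectional LSTM, attention pooling over time, and a DNN classifier on the pooled utterance embedding. All layer sizes, kernel sizes, and the input format are assumptions rather than the paper's exact configuration.

```python
# Sketch of an attention-based BiLSTM + FCN speech emotion classifier.
# Layer sizes, kernels, and input format are illustrative assumptions.
import torch
import torch.nn as nn

class AttnBLSTMFCN(nn.Module):
    def __init__(self, n_mels=64, n_classes=4, hidden=128):
        super().__init__()
        # Fully convolutional front-end: treats the log-mel spectrogram as a 1-channel image.
        self.fcn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                       # pool frequency, keep time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        feat_dim = 64 * (n_mels // 4)                   # channels * remaining mel bins
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)            # one attention score per frame
        self.dnn = nn.Sequential(
            nn.Linear(2 * hidden, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

    def forward(self, spec):                            # spec: (batch, n_mels, time)
        x = self.fcn(spec.unsqueeze(1))                 # (batch, 64, n_mels//4, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time, feat_dim)
        h, _ = self.blstm(x)                            # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)          # attention weights over time
        utterance = (w * h).sum(dim=1)                  # weighted sum -> utterance embedding
        return self.dnn(utterance)                      # emotion logits

# Usage on a dummy batch of 8 spectrograms with 64 mel bins and 300 frames.
logits = AttnBLSTMFCN()(torch.randn(8, 64, 300))
print(logits.shape)  # torch.Size([8, 4])
```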