
    CNN and LSTM-Based Emotion Charting Using Physiological Signals

    Novel trends in affective computing are based on reliable sources of physiological signals such as the Electroencephalogram (EEG), Electrocardiogram (ECG), and Galvanic Skin Response (GSR). Using these signals poses the challenge of improving performance over a broader set of emotion classes in less constrained, real-world environments. To address this challenge, we propose a computational framework consisting of a 2D Convolutional Neural Network (CNN) architecture operating on an arrangement of the 14 EEG channels, and a combination of Long Short-Term Memory (LSTM) and 1D-CNN architectures for ECG and GSR. Our approach is subject-independent and uses two publicly available datasets, DREAMER and AMIGOS, recorded with low-cost wearable sensors suitable for real-world environments. The results outperform state-of-the-art approaches for classification into four classes: High Valence-High Arousal, High Valence-Low Arousal, Low Valence-High Arousal, and Low Valence-Low Arousal. On AMIGOS, an average emotion elicitation accuracy of 98.73% is achieved with the right-channel ECG modality, 76.65% with EEG, and 63.67% with GSR. With multi-modal fusion, the overall highest accuracy reaches 99.0% on AMIGOS and 90.8% on DREAMER. A strong correlation between spectral- and hidden-layer feature analyses and classification performance suggests that the proposed method extracts significant features and extends emotion elicitation performance to broader, less constrained environments.
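
    For concreteness, the two branch types described in this abstract can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: all layer counts, kernel sizes, and widths are assumptions, and only the overall structure (a 2D CNN over the 14-channel EEG arrangement, and a 1D CNN followed by an LSTM for ECG or GSR) follows the abstract.

```python
# Hedged sketch of the two branch types; layer sizes are illustrative
# assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class EEGBranch(nn.Module):
    """2D CNN over a (channels x time) arrangement of 14 EEG channels."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(3, 7), padding=(1, 3)), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
            nn.Conv2d(16, 32, kernel_size=(3, 7), padding=(1, 3)), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                      # x: (batch, 1, 14, time)
        return self.head(self.features(x).flatten(1))

class ECGBranch(nn.Module):
    """1D CNN followed by an LSTM for ECG or GSR sequences."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                      # x: (batch, 1, time)
        z = self.conv(x).transpose(1, 2)       # (batch, time', 16)
        _, (h, _) = self.lstm(z)
        return self.head(h[-1])                # last layer's hidden state
```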

    MIMAMO Net: Integrating Micro- and Macro-motion for Video Emotion Recognition

    Spatial-temporal feature learning is of vital importance for video emotion recognition. Previous deep network structures have often focused on macro-motion, which extends over long time scales, e.g., on the order of seconds. We believe that integrating structures capturing both micro- and macro-motion will benefit emotion prediction, because humans perceive both micro- and macro-expressions. In this paper, we propose to combine micro- and macro-motion features to improve video emotion recognition with a two-stream recurrent network, named MIMAMO (Micro-Macro-Motion) Net. Specifically, smaller and shorter micro-motions are analyzed by a two-stream network, while larger and more sustained macro-motions are captured by a subsequent recurrent network. Assigning specific interpretations to the roles of the network's parts lets us choose parameters based on prior knowledge, choices that turn out to be optimal. One important innovation in our model is the use of interframe phase differences, rather than optical flow, as input to the temporal stream. Compared with optical flow, phase differences require less computation and are more robust to illumination changes. Our proposed network achieves state-of-the-art performance on two video emotion datasets, the OMG emotion dataset and the Aff-Wild dataset, with the most significant gains in arousal prediction, for which motion information is intuitively more informative. Source code is available at https://github.com/wtomin/MIMAMO-Net. (Accepted by AAAI 2020.)
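
    As a rough illustration of the phase-difference input mentioned above, the sketch below computes a per-frequency interframe phase difference with a plain 2D FFT. This is a deliberate simplification: the paper's actual phase extraction pipeline is not reproduced here, and the function name and preprocessing are assumptions. It is meant only to show why phase differences are cheap to compute and robust to illumination changes.

```python
# Minimal NumPy sketch of an interframe phase difference, assuming
# grayscale float frames of equal size. Illustrative only.
import numpy as np

def phase_difference(prev_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """Per-frequency phase change between two frames, wrapped to (-pi, pi].

    Unlike optical flow, this needs no iterative estimation, and the
    phase is unchanged if a frame is rescaled by a positive constant
    (a simple global illumination change).
    """
    delta = np.angle(np.fft.fft2(next_frame)) - np.angle(np.fft.fft2(prev_frame))
    return np.angle(np.exp(1j * delta))  # wrap into (-pi, pi]
```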

    Multimodal sentiment analysis in real-life videos

    This thesis extends the emerging field of multimodal sentiment analysis of real-life videos, taking two components into consideration: the emotion and the emotion's target. The emotion component of media is traditionally represented as a segment-based intensity model of emotion classes. This representation is replaced here by a value- and time-continuous view. Adjacent research fields, such as affective computing, have largely neglected the linguistic information available from automatic transcripts of audio-video material. As is demonstrated here, this text modality is well suited for time- and value-continuous prediction. Moreover, source-specific problems, such as trustworthiness, have been largely unexplored so far. This work examines perceived trustworthiness of the source, and its quantification, in user-generated video data and presents a possible modelling path. Furthermore, the transfer between the continuous and discrete emotion representations is explored in order to summarise the emotional context at a segment level. The other component deals with the target of the emotion, for example, the topic the speaker is addressing. Emotion targets in a video dataset can, as is shown here, be coherently extracted based on automatic transcripts without limiting a priori parameters, such as the expected number of targets. Furthermore, alternatives to a purely linguistic approach to target prediction, such as knowledge bases and multimodal systems, are investigated. A new dataset is designed for this investigation, and, in conjunction with proposed novel deep neural networks, extensive experiments are conducted to explore the components described above. The developed systems show robust prediction results and demonstrate strengths of the respective modalities, feature sets, and modelling techniques. Finally, foundations are laid for cross-modal information prediction systems with applications to the correction of corrupted in-the-wild signals from real-life videos.
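
    A brief note on scoring the time- and value-continuous predictions described above: work in this area commonly evaluates continuous valence or arousal traces with the concordance correlation coefficient (CCC). The sketch below is a standard CCC implementation; whether the thesis uses exactly this measure is not stated in the abstract.

```python
# Standard concordance correlation coefficient (CCC), a common score
# for value- and time-continuous emotion traces.
import numpy as np

def ccc(pred: np.ndarray, gold: np.ndarray) -> float:
    """CCC between a predicted and a reference trace (e.g., per-frame
    valence). Penalizes both low correlation and mean/scale offsets."""
    pm, gm = pred.mean(), gold.mean()
    covariance = ((pred - pm) * (gold - gm)).mean()
    return 2 * covariance / (pred.var() + gold.var() + (pm - gm) ** 2)
```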

    Cross-modal interaction in deep neural networks for audio-visual event classification and localization

    The automatic understanding of the surrounding world has a wide range of applications, including surveillance and security, human-computer interaction, robotics, health care, etc. This understanding can be expressed through several tasks, such as event classification and localization in space. Living beings exploit as much of the available information as possible to understand their surroundings; inspired by this behaviour, artificial neural networks should likewise jointly use several modalities, such as vision and hearing. First, audio-visual models for classification and localization must be evaluated objectively, so we recorded a new audio-visual dataset to fill a gap among the currently available ones. Since no audio-visual model for joint classification and localization existed, only the audio part of the dataset is evaluated, with a model from the literature. Second, we focus on the core of the thesis: how to jointly use visual and audio information to solve a specific task, event recognition. The brain does not perform a "simple" fusion but exhibits multiple interactions between the two modalities, with strong coupling between the processing of visual and auditory information. Neural networks offer the possibility of creating interactions between the modalities in addition to fusion. In this thesis, we explore several strategies for fusing the visual and audio modalities and for creating interactions between them. These techniques outperformed state-of-the-art architectures at the time of publication, demonstrating the usefulness of audio-visual fusion and, above all, the importance of interactions between modalities. To conclude, we propose a reference network for audio-visual event classification and localization, tested on the new dataset; the earlier classification models are modified to handle localization in space in addition to classification.
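
    To make the fusion-versus-interaction distinction concrete, the sketch below contrasts plain late fusion (concatenation of modality embeddings) with one simple form of cross-modal interaction, in which audio features gate the visual ones before fusion. The gating mechanism is an illustrative choice, not one of the specific interaction strategies studied in the thesis.

```python
# Hedged PyTorch sketch: concatenation-only fusion vs. fusion with a
# simple multiplicative cross-modal interaction.
import torch
import torch.nn as nn

class FusionOnly(nn.Module):
    """Late fusion: concatenate modality embeddings, then classify."""
    def __init__(self, d_audio, d_video, n_classes):
        super().__init__()
        self.head = nn.Linear(d_audio + d_video, n_classes)

    def forward(self, a, v):
        return self.head(torch.cat([a, v], dim=-1))

class FusionWithInteraction(nn.Module):
    """Adds an interaction: audio features gate the visual ones before
    fusion (one of many possible cross-modal couplings)."""
    def __init__(self, d_audio, d_video, n_classes):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d_audio, d_video), nn.Sigmoid())
        self.head = nn.Linear(d_audio + d_video, n_classes)

    def forward(self, a, v):
        v = v * self.gate(a)                   # audio modulates vision
        return self.head(torch.cat([a, v], dim=-1))
```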

    Deep learning-based EEG emotion recognition: Current trends and future perspectives

    Automatic electroencephalogram (EEG) emotion recognition is a challenging component of human–computer interaction (HCI). Inspired by the powerful feature learning ability of recently emerged deep learning techniques, various advanced deep learning models have increasingly been employed to learn high-level feature representations for EEG emotion recognition. This paper provides an up-to-date and comprehensive survey of EEG emotion recognition, with emphasis on the deep learning techniques used in this area. We present the preliminaries and basic knowledge from the literature, briefly review the EEG emotion recognition benchmark datasets, review deep learning techniques in detail, including deep belief networks, convolutional neural networks, and recurrent neural networks, and describe state-of-the-art applications of these techniques to EEG emotion recognition. Finally, we analyze the challenges and opportunities in this field and point out future directions.

    Automated Classification for Electrophysiological Data: Machine Learning Approaches for Disease Detection and Emotion Recognition

    Smart healthcare is a health service system that utilizes technologies such as artificial intelligence and big data to alleviate pressure on healthcare systems. Much recent research has focused on automatic disease diagnosis and recognition; this thesis addresses automatic classification of electrophysiological signals, which are measurements of electrical activity. Specifically, for electrocardiogram (ECG) and electroencephalogram (EEG) data, a series of algorithms is developed for automatic cardiovascular disease (CVD) classification, emotion recognition, and seizure detection. For ECG signals obtained from wearable devices, novel signal processing and machine learning methods are developed for continuous monitoring of heart conditions; compared with traditional methods based on devices in clinical settings, the methods developed in this thesis are much more convenient to use. To identify arrhythmia patterns in the noisy ECG signals obtained through wearable devices, CNN and LSTM models are used, and a wavelet-based CNN is proposed to enhance performance. An emotion recognition method using a single-channel ECG is developed, in which a novel exploitative and explorative GWO-SVM (grey wolf optimization support vector machine) algorithm is proposed to achieve high-performance emotion classification. Notably, the proposed algorithm learns the SVM hyperparameters automatically and avoids falling into local optima, thereby achieving better performance than existing algorithms. Finally, a novel EEG-based seizure detector is developed, in which the EEG signals are transformed to the spectral-temporal domain, so that the dimension of the input features to the CNN can be significantly reduced while the detector still achieves superior detection performance.
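
    As an illustration of automatic hyperparameter learning with grey wolf optimization (GWO), the sketch below tunes an SVM's C and gamma by cross-validated accuracy using the plain GWO update rule. The exploitative/explorative refinements proposed in the thesis are not reproduced here, and the wolf count, iteration budget, and log-space bounds are arbitrary choices for the sketch.

```python
# Hedged sketch: plain GWO searching SVM hyperparameters (C, gamma)
# by cross-validated accuracy. Not the thesis's refined variant.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(pos, X, y):
    C, gamma = 10.0 ** pos                     # search in log10 space
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

def gwo_svm(X, y, n_wolves=8, n_iters=20, lo=-3.0, hi=3.0):
    rng = np.random.default_rng(0)
    wolves = rng.uniform(lo, hi, size=(n_wolves, 2))   # [log10 C, log10 gamma]
    scores = np.array([fitness(w, X, y) for w in wolves])
    for t in range(n_iters):
        a = 2.0 * (1 - t / n_iters)            # shifts exploration -> exploitation
        leaders = wolves[np.argsort(scores)[-3:]]      # three best wolves
        for i in range(n_wolves):
            steps = []
            for leader in leaders:             # move toward each leader
                r1, r2 = rng.random(2), rng.random(2)
                A, C_coef = 2 * a * r1 - a, 2 * r2
                steps.append(leader - A * np.abs(C_coef * leader - wolves[i]))
            wolves[i] = np.clip(np.mean(steps, axis=0), lo, hi)
            scores[i] = fitness(wolves[i], X, y)
    best = wolves[np.argmax(scores)]
    return 10.0 ** best, float(scores.max())
```

    On a labelled dataset, `(C, gamma), acc = gwo_svm(X, y)` returns the best hyperparameters found and their cross-validated accuracy; the averaging over the three leading wolves is what balances exploration against exploitation as the coefficient `a` decays.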