377 research outputs found
CNN and LSTM-Based Emotion Charting Using Physiological Signals
Novel trends in affective computing are based on reliable sources of physiological signals such as Electroencephalogram (EEG), Electrocardiogram (ECG), and Galvanic Skin Response (GSR). Using these signals poses the challenge of improving performance across a broader set of emotion classes in a less constrained real-world environment. To overcome this challenge, we propose a computational framework consisting of a 2D Convolutional Neural Network (CNN) architecture for the arrangement of 14 channels of EEG, and a combination of Long Short-Term Memory (LSTM) and 1D-CNN architecture for ECG and GSR. Our approach is subject-independent and uses two publicly available datasets, DREAMER and AMIGOS, recorded with low-cost, wearable sensors that extract physiological signals suitable for real-world environments. The results outperform state-of-the-art approaches for classification into four classes, namely High Valence–High Arousal, High Valence–Low Arousal, Low Valence–High Arousal, and Low Valence–Low Arousal. An average emotion elicitation accuracy of 98.73% is achieved with the ECG right-channel modality, 76.65% with the EEG modality, and 63.67% with the GSR modality for AMIGOS. The overall highest accuracy of 99.0% for the AMIGOS dataset and 90.8% for the DREAMER dataset is achieved with multi-modal fusion. A strong correlation between spectral- and hidden-layer feature analysis and classification performance suggests the efficacy of the proposed method for significant feature extraction and higher emotion elicitation performance in a broader context for less constrained environments. Peer reviewed
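The four target classes named in this abstract are the quadrants of the valence-arousal plane. A minimal sketch of how self-reported ratings could be mapped to those classes is shown here; the function name `quadrant_label` and the midpoint threshold of 3.0 (the centre of a 1-5 rating scale) are assumptions made for illustration, since the abstract does not state how the ratings were binarised.

```python
def quadrant_label(valence, arousal, midpoint=3.0):
    """Map a (valence, arousal) rating pair to one of four quadrant classes.

    Assumption: ratings above `midpoint` count as "High", all others as "Low".
    The threshold is hypothetical and not taken from the abstract.
    """
    valence_level = "High" if valence > midpoint else "Low"
    arousal_level = "High" if arousal > midpoint else "Low"
    return f"{valence_level} Valence-{arousal_level} Arousal"


# Example ratings on an assumed 1-5 scale.
ratings = [(4.5, 4.0), (4.0, 2.0), (1.5, 4.5), (2.0, 1.0)]
labels = [quadrant_label(v, a) for v, a in ratings]
```

Each of the four example pairs falls in a different quadrant, so `labels` contains all four class names once.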
MIMAMO Net: Integrating Micro- and Macro-motion for Video Emotion Recognition
Spatial-temporal feature learning is of vital importance for video emotion
recognition. Previous deep network structures often focused on macro-motion
which extends over long time scales, e.g., on the order of seconds. We believe
integrating structures capturing information about both micro- and macro-motion
will benefit emotion prediction, because humans perceive both micro- and
macro-expressions. In this paper, we propose to combine micro- and macro-motion
features to improve video emotion recognition with a two-stream recurrent
network, named MIMAMO (Micro-Macro-Motion) Net. Specifically, smaller and
shorter micro-motions are analyzed by a two-stream network, while larger and
more sustained macro-motions can be well captured by a subsequent recurrent
network. Assigning specific interpretations to the roles of different parts of
the network enables us to choose parameters based on prior knowledge:
choices that turn out to be optimal. One of the important innovations in our
model is the use of interframe phase differences rather than optical flow as
input to the temporal stream. Compared with the optical flow, phase differences
require less computation and are more robust to illumination changes. Our
proposed network achieves state-of-the-art performance on two video emotion
datasets, the OMG emotion dataset and the Aff-Wild dataset. The most
significant gains are for arousal prediction, for which motion information is
intuitively more informative. Source code is available at
https://github.com/wtomin/MIMAMO-Net
Comment: Accepted by AAAI 202
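The claim that interframe phase differences encode motion and are robust to illumination changes can be illustrated on a toy 1D signal. The sketch below uses a plain discrete Fourier transform as a stand-in; the abstract does not say which decomposition MIMAMO Net uses, so the helper names, the synthetic "frames", and the brightness gain of 1.8 are all assumptions for illustration only.

```python
import cmath
import math


def dft_coefficient(signal, k):
    """Return the k-th discrete Fourier coefficient of a real 1D signal."""
    n = len(signal)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))


def phase_difference(frame_a, frame_b, k):
    """Phase of frame_b relative to frame_a at frequency index k, in (-pi, pi]."""
    a = dft_coefficient(frame_a, k)
    b = dft_coefficient(frame_b, k)
    return cmath.phase(b * a.conjugate())


n, k, shift = 32, 2, 1
# Synthetic "frame": a row of pixel intensities with two spatial frequencies.
frame_a = [math.cos(2 * math.pi * 2 * i / n) + 0.5 * math.cos(2 * math.pi * 3 * i / n)
           for i in range(n)]
# Next frame: the same pattern moved right by `shift` pixels (circularly).
frame_b = [frame_a[(i - shift) % n] for i in range(n)]
# Same motion, but under a global brightness gain.
frame_b_brighter = [1.8 * x for x in frame_b]

delta_phi = phase_difference(frame_a, frame_b, k)
delta_phi_brighter = phase_difference(frame_a, frame_b_brighter, k)
# A shift of d pixels changes the phase at index k by -2*pi*k*d/n.
estimated_shift = -delta_phi * n / (2 * math.pi * k)
```

In this toy setting the phase difference recovers the one-pixel shift and is unchanged by the brightness gain, because a positive scale factor alters only the magnitude of each Fourier coefficient, not its phase.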
Multimodal sentiment analysis in real-life videos
This thesis extends the emerging field of multimodal sentiment analysis of real-life videos, taking two components into consideration: the emotion and the emotion's target.
The emotion component of media is traditionally represented as a segment-based intensity model of emotion classes. This representation is replaced here by a value- and time-continuous view. Adjacent research fields, such as affective computing, have largely neglected the linguistic information available from automatic transcripts of audio-video material. As is demonstrated here, this text modality is well-suited for time- and value-continuous prediction. Moreover, source-specific problems, such as trustworthiness, have been largely unexplored so far.
This work examines perceived trustworthiness of the source, and its quantification, in user-generated video data and presents a possible modelling path. Furthermore, the transfer between the continuous and discrete emotion representations is explored in order to summarise the emotional context at a segment level.
The other component deals with the target of the emotion, for example, the topic the speaker is addressing. Emotion targets in a video dataset can, as is shown here, be coherently extracted based on automatic transcripts without limiting a priori parameters, such as the expected number of targets. Furthermore, alternatives to purely linguistic investigation in predicting targets, such as knowledge-bases and multimodal systems, are investigated.
A new dataset is designed for this investigation, and, in conjunction with proposed novel deep neural networks, extensive experiments are conducted to explore the components described above.
The developed systems show robust prediction results and demonstrate strengths of the respective modalities, feature sets, and modelling techniques. Finally, foundations are laid for cross-modal information prediction systems with applications to the correction of corrupted in-the-wild signals from real-life videos.
Cross-modal interaction in deep neural networks for audio-visual event classification and localization (Interaction intermodale dans les réseaux neuronaux profonds pour la classification et la localisation d'évènements audiovisuels)
The automatic understanding of the surrounding world has a wide range of applications, including surveillance and security, human-computer interaction, robotics, and health care. This understanding can be expressed through several tasks, such as event classification and localization in space. Living beings exploit as much of the available information as possible to understand their surroundings. Artificial neural networks should build on this behavior and jointly use several modalities, such as vision and hearing.
First, audio-visual models for classification and localization must be evaluated objectively. We therefore recorded a new audio-visual dataset to fill a gap in the currently available datasets. Because we were not able to find audio-visual models for classification and localization, only the audio part of the dataset is evaluated, using a state-of-the-art model from the literature.
Secondly, we focus on the main challenge of the thesis: how to jointly use visual and audio information to solve a specific task, event recognition. The brain does not perform a "simple" fusion but has multiple interactions between the two modalities, which creates a strong coupling between the processing of visual and audio information. Neural networks offer the possibility of creating interactions between the two modalities in addition to fusion. We explore several strategies to fuse the audio and visual modalities and to create interactions between them. These techniques achieved the best performance compared to state-of-the-art architectures at the time of publication. They show the usefulness of audio-visual fusion but, above all, the importance of the interactions between modalities.
To conclude, we propose a benchmark network for audio-visual event classification and localization, tested on the new dataset. Previous audio-visual classification models are modified to address localization in space in addition to classification.
Deep learning-based EEG emotion recognition: Current trends and future perspectives
Automatic electroencephalogram (EEG) emotion recognition is a challenging component of human–computer interaction (HCI). Inspired by the powerful feature learning ability of recently emerged deep learning techniques, various advanced deep learning models have been employed increasingly to learn high-level feature representations for EEG emotion recognition. This paper aims to provide an up-to-date and comprehensive survey of EEG emotion recognition, especially for various deep learning techniques in this area. We provide the preliminaries and basic knowledge in the literature. We review EEG emotion recognition benchmark data sets briefly. We review deep learning techniques in detail, including deep belief networks, convolutional neural networks, and recurrent neural networks. We describe the state-of-the-art applications of deep learning techniques for EEG emotion recognition in detail. We analyze the challenges and opportunities in this field and point out its future directions.
Automated Classification for Electrophysiological Data: Machine Learning Approaches for Disease Detection and Emotion Recognition
Smart healthcare is a health service system that utilizes technologies, e.g., artificial intelligence and
big data, to alleviate the pressures on healthcare systems. Much recent research has focused on
automatic disease diagnosis and recognition; our research focuses on the automatic
classification of electrophysiological signals, which are measurements of electrical activity.
Specifically, for electrocardiogram (ECG) and electroencephalogram (EEG) data, we develop a
series of algorithms for automatic cardiovascular disease (CVD) classification, emotion recognition
and seizure detection.
With the ECG signals obtained from wearable devices, the candidate developed novel signal
processing and machine learning methods for continuous monitoring of heart conditions. Compared to
traditional methods based on devices in clinical settings, the method developed in this thesis
is much more convenient to use. To identify arrhythmia patterns from the noisy ECG signals obtained
through the wearable devices, CNN and LSTM are used, and a wavelet-based CNN is proposed to
enhance the performance.
An emotion recognition method with a single channel ECG is developed, where a novel exploitative
and explorative GWO-SVM algorithm is proposed to achieve high-performance emotion
classification. Notably, the proposed algorithm can learn the SVM
hyperparameters automatically and can avoid becoming trapped in local optima,
thereby achieving better performance than existing algorithms.
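To make the idea of letting a grey wolf optimiser (GWO) search for SVM hyperparameters concrete, the sketch below implements the standard GWO update, not the thesis's exploitative and explorative variant, which the abstract does not describe. The objective `toy_validation_error` is a hypothetical stand-in for a cross-validated SVM error over (log10 C, log10 gamma); the bounds, population size, iteration count, and seed are likewise assumptions.

```python
import random


def grey_wolf_optimize(objective, bounds, n_wolves=15, n_iters=100, seed=0):
    """Minimise `objective` over box `bounds` with the standard GWO update."""
    rng = random.Random(seed)
    dim = len(bounds)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    best_pos, best_cost = None, float("inf")
    for it in range(n_iters):
        # The three best wolves (alpha, beta, delta) lead the pack.
        ranked = sorted(wolves, key=objective)
        leaders = [list(w) for w in ranked[:3]]
        if objective(leaders[0]) < best_cost:
            best_pos, best_cost = leaders[0], objective(leaders[0])
        # `a` decays linearly from 2 to 0: exploration first, exploitation later.
        a = 2.0 * (1.0 - it / n_iters)
        for wolf in wolves:
            for d in range(dim):
                candidates = []
                for leader in leaders:
                    coeff_a = 2.0 * a * rng.random() - a
                    coeff_c = 2.0 * rng.random()
                    distance = abs(coeff_c * leader[d] - wolf[d])
                    candidates.append(leader[d] - coeff_a * distance)
                lo, hi = bounds[d]
                # Move to the mean of the three leader-guided positions, clipped to bounds.
                wolf[d] = min(max(sum(candidates) / 3.0, lo), hi)
    final = min(wolves, key=objective)
    if objective(final) < best_cost:
        best_pos, best_cost = list(final), objective(final)
    return best_pos, best_cost


def toy_validation_error(params):
    """Hypothetical stand-in for SVM validation error; minimum at C=10, gamma=0.01."""
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2


search_bounds = [(-3.0, 3.0), (-5.0, 1.0)]  # assumed ranges for log10 C and log10 gamma
best_params, best_error = grey_wolf_optimize(toy_validation_error, search_bounds)
```

In practice the stand-in objective would be replaced by a function that trains an SVM with the candidate hyperparameters and returns its validation error; the optimiser itself needs no gradient information, only objective values.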
A novel EEG-based seizure detector is developed, where the EEG signals are transformed to
the spectral-temporal domain, so that the dimension of the input features to the CNN can be
significantly reduced, while the detector can still achieve superior detection performance.