Multimodal sentiment analysis in real-life videos
This thesis extends the emerging field of multimodal sentiment analysis of real-life videos, taking two components into consideration: the emotion and the emotion's target.
The emotion component of media is traditionally represented as a segment-based intensity model of emotion classes. This representation is replaced here by a value- and time-continuous view. Adjacent research fields, such as affective computing, have largely neglected the linguistic information available from automatic transcripts of audio-video material. As is demonstrated here, this text modality is well-suited for time- and value-continuous prediction. Moreover, source-specific problems, such as trustworthiness, have been largely unexplored so far.
This work examines perceived trustworthiness of the source, and its quantification, in user-generated video data and presents a possible modelling path. Furthermore, the transfer between the continuous and discrete emotion representations is explored in order to summarise the emotional context at a segment level.
The other component deals with the target of the emotion, for example, the topic the speaker is addressing. Emotion targets in a video dataset can, as is shown here, be coherently extracted based on automatic transcripts without limiting a priori parameters, such as the expected number of targets. Furthermore, alternatives to purely linguistic investigation in predicting targets, such as knowledge-bases and multimodal systems, are investigated.
A new dataset is designed for this investigation, and, in conjunction with proposed novel deep neural networks, extensive experiments are conducted to explore the components described above.
The developed systems show robust prediction results and demonstrate strengths of the respective modalities, feature sets, and modelling techniques. Finally, foundations are laid for cross-modal information prediction systems with applications to the correction of corrupted in-the-wild signals from real-life videos.
The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements
Truly real-life data presents a strong but exciting challenge for sentiment and emotion research. The high variety of possible 'in-the-wild' properties makes large datasets such as these indispensable for building robust machine learning models. Until now, no dataset in this context has offered a sufficient quantity of data covering enough variety in the challenges of each modality to enable exploratory analysis of the interplay of all modalities. In this contribution, we present MuSe-CaR, a first-of-its-kind multimodal dataset. The data is publicly available, having recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge, which focused on the tasks of emotion, emotion-target engagement, and trustworthiness recognition by comprehensively integrating the audio-visual and language modalities. Furthermore, we give a thorough overview of the dataset in terms of collection and annotation, including annotation tiers not used in this year's MuSe 2020. In addition, for one of the sub-challenges - predicting the level of trustworthiness - no participant outperformed the baseline model, and so we propose a simple but highly efficient Multi-Head-Attention network that, using multimodal fusion, exceeds the baseline by around 0.2 CCC (almost 50 % improvement).
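The abstract above reports improvements in CCC. For reference, the Concordance Correlation Coefficient can be sketched in a few lines; this is a minimal NumPy implementation of Lin's concordance measure, with the function name being ours:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (Lin, 1989).

    Combines correlation with a penalty for mean and variance
    mismatch; 1.0 means perfect agreement, 0 means no concordance.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

Unlike Pearson correlation, CCC is not invariant to shifts or rescaling of the prediction, which is why it is preferred for continuous emotion regression.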
A physiologically-adapted gold standard for arousal during stress
Emotion is an inherently subjective psychophysiological human state, and producing an agreed-upon representation (gold standard) for continuous emotion requires a time-consuming and costly training procedure involving multiple human annotators. There is strong evidence in the literature that physiological signals are sufficient objective markers for states of emotion, particularly arousal. In this contribution, we utilise a dataset which includes continuous emotion and physiological signals - Heartbeats per Minute (BPM), Electrodermal Activity (EDA), and Respiration-rate - captured during a stress-inducing scenario (Trier Social Stress Test). We utilise a Long Short-Term Memory Recurrent Neural Network to explore the benefit of fusing these physiological signals with arousal as the target, learning from various audio, video, and text-based features. We utilise the state-of-the-art MuSe-Toolbox to consider both annotation delay and inter-rater agreement weighting when fusing the target signals. An improvement in Concordance Correlation Coefficient (CCC) is seen across feature sets when fusing EDA with arousal, compared to the arousal-only gold standard results. Additionally, BERT-based textual features' results improved for arousal plus all physiological signals, obtaining up to .3344 CCC compared to .2118 CCC for arousal only. Multimodal fusion also improves overall CCC, with audio plus video features obtaining up to .6157 CCC for recognising arousal plus EDA and BPM.
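The fusion of annotation and physiological traces into a single target can be sketched as a weighted average of z-normalised signals. This is an illustrative simplification, not the MuSe-Toolbox implementation, and the function names and the `weight_eda` parameter are our own:

```python
import numpy as np

def znorm(x):
    """Zero-mean, unit-variance normalisation of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-8)

def fuse_gold_standard(arousal, eda, weight_eda=0.5):
    """Blend an annotation trace with a physiological trace.

    Both signals are z-normalised first so that neither scale
    dominates; weight_eda controls the physiological contribution.
    """
    return (1 - weight_eda) * znorm(arousal) + weight_eda * znorm(eda)
```

The toolbox additionally compensates for annotation delay and weights annotators by inter-rater agreement, which this sketch omits.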
Embracing and exploiting annotator emotional subjectivity: an affective rater ensemble model
Automated recognition of continuous emotions in audio-visual data is a growing area of study that aids in understanding human-machine interaction. Training such systems presupposes human annotation of the data. The annotation process, however, is laborious and expensive given that several human ratings are required for every data sample to compensate for the subjectivity of emotion perception. As a consequence, labelled data for emotion recognition are rare and the existing corpora are limited when compared to other state-of-the-art deep learning datasets. In this study, we explore different ways in which existing emotion annotations can be utilised more effectively to exploit available labelled information to the fullest. To reach this objective, we exploit individual raters' opinions by employing an ensemble of rater-specific models, one for each annotator, thereby reducing the loss of information which is a byproduct of annotation aggregation; we find that individual models can indeed infer subjective opinions. Furthermore, we explore the fusion of such ensemble predictions using different fusion techniques. Our ensemble model with only two annotators outperforms the regular Arousal baseline on the test set of the MuSe-CaR corpus. While no considerable improvements on valence could be obtained, using all annotators increases the prediction performance of arousal by up to .07 Concordance Correlation Coefficient absolute improvement on test - solely trained on rater-specific models and fused by an attention-enhanced Long Short-Term Memory Recurrent Neural Network.
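The rater-ensemble idea can be sketched as weighting each rater-specific model's prediction before combining them. The following minimal version uses softmax-normalised per-rater scores as weights; it is illustrative only and stands in for the paper's attention-enhanced LSTM fusion:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_rater_predictions(preds, rater_scores):
    """Combine rater-specific model outputs into one prediction.

    preds        : array of shape (n_raters, T), one trace per rater model
    rater_scores : per-rater weighting scores (e.g. agreement with others)
    """
    preds = np.asarray(preds, dtype=float)
    w = softmax(np.asarray(rater_scores, dtype=float))
    return w @ preds  # weighted average over raters, shape (T,)
```

With equal scores this reduces to plain averaging; a learned fusion, as in the paper, lets the weights depend on the input instead of being fixed.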
A hierarchical attention network-based approach for depression detection from transcribed clinical interviews
The high prevalence of depression in society has given rise to a need for new digital tools that can aid its early detection. Among other effects, depression impacts the use of language. Seeking to exploit this, this work focuses on the detection of depressed and non-depressed individuals through the analysis of linguistic information extracted from transcripts of clinical interviews with a virtual agent. Specifically, we investigated the advantages of employing hierarchical attention-based networks for this task. Using Global Vectors (GloVe) pretrained word embedding models to extract low-level representations of the words, we compared hierarchical local-global attention networks and hierarchical contextual attention networks. We performed our experiments on the Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WoZ) dataset, which contains audio, visual, and linguistic information acquired from participants during a clinical session. Our results using the DAIC-WoZ test set indicate that hierarchical contextual attention networks are the most suitable configuration to detect depression from transcripts. The configuration achieves an Unweighted Average Recall (UAR) of .66 using the test set, surpassing our baseline, a Recurrent Neural Network that does not use attention.
Funding by EU-sustAGE (826506), EU-RADAR-CNS (115902), Key Program of the Natural Science Foundation of Tianjin, CHINA (18JCZDJC36300) and BMW Group Research
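The reported metric, Unweighted Average Recall (UAR), is simply the mean of per-class recalls, which avoids rewarding a classifier for exploiting class imbalance. A minimal sketch:

```python
import numpy as np

def uar(y_true, y_pred):
    """Unweighted Average Recall: mean of per-class recalls.

    Each class contributes equally regardless of how many
    samples it has, unlike plain accuracy.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c
        recalls.append((y_pred[mask] == c).mean())
    return float(np.mean(recalls))
```

For a balanced two-class problem UAR equals accuracy; with imbalance (common in clinical data such as DAIC-WoZ), the two diverge.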
Pages 221-225
https://www.isca-speech.org/archive/Interspeech_2019/index.htm
Domain Adaptation with Joint Learning for Generic, Optical Car Part Recognition and Detection Systems (Go-CaRD)
Systems for the automatic recognition and detection of automotive parts are crucial in several emerging research areas in the development of intelligent vehicles. They enable, for example, the detection and modelling of interactions between humans and the vehicle. In this paper, we quantitatively and qualitatively explore the efficacy of deep learning architectures for the classification and localisation of 29 interior and exterior vehicle regions on three novel datasets. Furthermore, we experiment with joint and transfer learning approaches across datasets and point out potential applications of our systems. Our best network architecture achieves an F1 score of 93.67 % for recognition, while our best localisation approach, utilising state-of-the-art backbone networks, achieves a mAP of 63.01 % for detection. The MuSe-CAR-Part dataset, which is based on a large variety of human-car interactions in videos, the weights of the best models, and the code are publicly available to academic parties for benchmarking and future research.
Demonstration and instructions to obtain data and models:
https://github.com/lstappen/GoCar
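The recognition result above is reported as an F1 score. For reference, a minimal sketch of the metric computed from raw true-positive, false-positive, and false-negative counts (the function name is ours):

```python
def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall.

    tp, fp, fn are counts of true positives, false positives,
    and false negatives for one class.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For multi-class recognition, as in the paper, per-class F1 scores are typically averaged (macro F1) or computed over pooled counts (micro F1).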
From speech to facial activity: towards cross-modal sequence-to-sequence attention networks
Abstract
Multimodal data sources offer the possibility to capture and model interactions between modalities, leading to an improved understanding of underlying relationships. In this regard, the work presented in this paper explores the relationship between facial muscle movements and speech signals. Specifically, we explore the efficacy of different sequence-to-sequence neural network architectures for the task of predicting Facial Action Coding System Action Units (AUs) from one of two acoustic feature representations extracted from speech signals, namely the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) or the Interspeech Computational Paralinguistics Challenge feature set (ComParE). Furthermore, these architectures were enhanced by two different attention mechanisms (intra- and inter-attention) and various state-of-the-art network settings to improve prediction performance. Results indicate that a sequence-to-sequence model with inter-attention can achieve on average an Unweighted Average Recall (UAR) of 65.9 % for AU onset, 67.8 % for AU apex (both eGeMAPS), 79.7 % for AU offset and 65.3 % for AU occurrence (both ComParE) detection over all AUs.
2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP)
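The inter-attention mechanism mentioned above lets each decoder step attend over the encoder's acoustic time steps. A minimal dot-product sketch in NumPy, illustrative only and not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_attention(decoder_states, encoder_states):
    """Dot-product attention from decoder steps to encoder steps.

    decoder_states : (T_dec, d) query states
    encoder_states : (T_enc, d) acoustic key/value states
    Returns one context vector per decoder step, shape (T_dec, d).
    """
    scores = decoder_states @ encoder_states.T   # (T_dec, T_enc)
    weights = softmax(scores, axis=-1)           # rows sum to 1
    return weights @ encoder_states
```

Intra-attention (self-attention) is the same computation with the decoder attending over its own states, i.e. both arguments drawn from the same sequence.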
DOI: 10.1109/MMSP46350.2019
Funding: BMW Group Research