Controlling for Confounders in Multimodal Emotion Classification via Adversarial Learning
Various psychological factors affect how individuals express emotions. Yet,
when we collect data intended for use in building emotion recognition systems,
we often do so with elicitation paradigms designed solely to evoke emotional
behavior. Algorithms trained on these types of data
are unlikely to function outside of controlled environments because our
emotions naturally change as a function of these other factors. In this work,
we study how the multimodal expressions of emotion change when an individual is
under varying levels of stress. We hypothesize that stress produces modulations
that can hide the true underlying emotions of individuals and that we can make
emotion recognition algorithms more generalizable by controlling for variations
in stress. To this end, we use adversarial networks to decorrelate stress
modulations from emotion representations. We study how stress alters acoustic
and lexical emotional predictions, paying special attention to how modulations
due to stress affect the transferability of learned emotion recognition models
across domains. Our results show that stress is indeed encoded in trained
emotion classifiers and that this encoding varies across levels of emotions and
across the lexical and acoustic modalities. Our results also show that emotion
recognition models that control for stress during training have better
generalizability when applied to new domains, compared to models that do not
control for stress during training. We conclude that it is necessary to
consider the effect of extraneous psychological factors when building and
testing emotion recognition models. Comment: 10 pages, ICMI 201
Privacy Enhanced Multimodal Neural Representations for Emotion Recognition
Many mobile applications and virtual conversational agents now aim to
recognize and adapt to emotions. To enable this, data are transmitted from
users' devices and stored on central servers. Yet, these data contain sensitive
information that could be used by mobile applications without users' consent
or, maliciously, by an eavesdropping adversary. In this work, we show how
multimodal representations trained for a primary task, here emotion
recognition, can unintentionally leak demographic information, overriding any
opt-out option the user has selected. We analyze how this leakage
differs in representations obtained from textual, acoustic, and multimodal
data. We use an adversarial learning paradigm to unlearn the private
information present in a representation and investigate the effect of varying
the strength of the adversarial component on the primary task and on the
privacy metric, defined here as the inability of an attacker to predict
specific demographic information. We evaluate this paradigm on multiple
datasets and show that we can improve the privacy metric while not
significantly impacting the performance on the primary task. To the best of our
knowledge, this is the first work to analyze how the privacy metric differs
across modalities and how multiple privacy concerns can be tackled while still
maintaining performance on emotion recognition. Comment: 8 page
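The abstract defines the privacy metric only verbally, as an attacker's inability to predict demographic attributes. One common formalization (an assumption here, not the paper's exact definition) normalizes the attacker's accuracy against chance level:

```python
def privacy_metric(attacker_acc: float, n_classes: int) -> float:
    """Map an attacker's accuracy to [0, 1]:
    1.0 = attacker at chance (full privacy),
    0.0 = attacker perfect (no privacy)."""
    chance = 1.0 / n_classes
    return 1.0 - max(0.0, attacker_acc - chance) / (1.0 - chance)

# An attacker guessing a binary attribute at chance level implies full privacy.
print(privacy_metric(0.5, 2))   # 1.0
print(privacy_metric(1.0, 2))   # 0.0
```

Under this reading, "improving the privacy metric" means driving the attacker's accuracy on the learned representation down toward the chance floor.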
Personalized Prediction of Recurrent Stress Events Using Self-Supervised Learning on Multimodal Time-Series Data
Chronic stress can significantly affect physical and mental health. The
advent of wearable technology allows for the tracking of physiological signals,
potentially leading to innovative stress prediction and intervention methods.
However, challenges such as label scarcity and data heterogeneity render stress
prediction difficult in practice. To counter these issues, we have developed a
multimodal personalized stress prediction system using wearable biosignal data.
We employ self-supervised learning (SSL) to pre-train the models on each
subject's data, allowing the models to learn the baseline dynamics of the
participant's biosignals prior to fine-tuning on the stress prediction task. We
test our model on the Wearable Stress and Affect Detection (WESAD) dataset,
demonstrating that our SSL models outperform non-SSL models while utilizing
less than 5% of the annotations. These results suggest that our approach can
personalize stress prediction to each user with minimal annotations. This
paradigm has the potential to enable personalized prediction of a variety of
recurring health events using complex multimodal data streams.
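The two-stage recipe, unsupervised learning of a subject's baseline followed by fine-tuning on a small label budget, can be illustrated on toy data. The synthetic heart-rate windows, the baseline-statistics stand-in for the SSL pretext task, and the threshold "fine-tuning" below are all illustrative assumptions, not the paper's WESAD pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-subject heart-rate windows: stress raises the mean rate.
calm   = rng.normal(70.0, 3.0, size=(190, 60))
stress = rng.normal(82.0, 3.0, size=(10, 60))
windows = np.vstack([calm, stress])
labels = np.array([0] * 190 + [1] * 10)

# "Pre-training": learn the subject's baseline statistics from ALL windows,
# without labels (a crude stand-in for the SSL pretext task).
mu, sd = windows.mean(), windows.std()
feats = (windows.mean(axis=1) - mu) / sd   # normalized per-window mean

# "Fine-tuning": choose a decision threshold from only 5% of the labels.
idx = rng.choice(len(labels), size=int(0.05 * len(labels)), replace=False)
f, y = feats[idx], labels[idx]
if (y == 0).any() and (y == 1).any():
    thr = (f[y == 0].mean() + f[y == 1].mean()) / 2  # midpoint of classes
else:
    thr = f.mean() + 3 * f.std()   # only one class seen: outlier threshold

acc = ((feats > thr).astype(int) == labels).mean()
print(acc)
```

The point of the sketch is the division of labor: the unlabeled stage absorbs subject-specific baseline dynamics, so the labeled stage needs very few annotations, which mirrors the paper's "less than 5% of the annotations" claim.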
Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems
Speech emotion recognition is an important component of any human-centered
system. However, the speech characteristics a person produces and perceives can
be influenced by a multitude of factors, both desirable, such as emotion, and
undesirable, such as noise. Training robust emotion recognition models requires
a large yet realistic data distribution, but emotion datasets are often small
and hence are augmented with noise. Noise augmentation often rests on one
important assumption: that the prediction label remains the same in the
presence or absence of noise. This holds for automatic speech recognition but
not necessarily for perception-based tasks. In this paper we make
three novel contributions. We validate through crowdsourcing that the presence
of noise does change the annotation label and hence may alter the original
ground truth label. We then show how disregarding this knowledge and assuming
consistent ground truth labels propagates errors to downstream evaluation of ML
models, in both performance evaluation and robustness testing. We end the
paper with a set of recommendations for noise augmentations in speech emotion
recognition datasets.
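The standard augmentation step the paper scrutinizes is mixing noise into clean speech at a target signal-to-noise ratio. A minimal numpy sketch of that step (the waveforms and SNR value are illustrative, not from the paper):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray,
               snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested speech-to-noise
    ratio in dB, then add it to `speech` (equal lengths assumed)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000)
speech = np.sin(2 * np.pi * 220 * t)   # toy stand-in for a speech clip
noise = rng.normal(size=t.shape)
noisy = mix_at_snr(speech, noise, snr_db=5.0)
# Per the paper's finding, the emotion label of `noisy` should be
# re-annotated rather than copied over from the clean clip.
```

The code only guarantees the acoustic SNR; the paper's point is that nothing guarantees the perceived emotion label survives the mixing.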
Multimodal sentiment analysis in real-life videos
This thesis extends the emerging field of multimodal sentiment analysis of real-life videos, taking two components into consideration: the emotion and the emotion's target.
The emotion component of media is traditionally represented as a segment-based intensity model of emotion classes. This representation is replaced here by a value- and time-continuous view. Adjacent research fields, such as affective computing, have largely neglected the linguistic information available from automatic transcripts of audio-video material. As is demonstrated here, this text modality is well-suited for time- and value-continuous prediction. Moreover, source-specific problems, such as trustworthiness, have been largely unexplored so far.
This work examines perceived trustworthiness of the source, and its quantification, in user-generated video data and presents a possible modelling path. Furthermore, the transfer between the continuous and discrete emotion representations is explored in order to summarise the emotional context at a segment level.
The other component deals with the target of the emotion, for example, the topic the speaker is addressing. Emotion targets in a video dataset can, as is shown here, be coherently extracted based on automatic transcripts without limiting a priori parameters, such as the expected number of targets. Furthermore, alternatives to purely linguistic investigation in predicting targets, such as knowledge-bases and multimodal systems, are investigated.
A new dataset is designed for this investigation, and, in conjunction with proposed novel deep neural networks, extensive experiments are conducted to explore the components described above.
The developed systems show robust prediction results and demonstrate strengths of the respective modalities, feature sets, and modelling techniques. Finally, foundations are laid for cross-modal information prediction systems with applications to the correction of corrupted in-the-wild signals from real-life videos.
Multimodal machine learning in medical screenings
The healthcare industry, with its high demand and standards, has long been considered a crucial area for technology-based innovation. However, the medical field often relies on experience-based evaluation. Limited resources, overloading capacity, and a lack of accessibility can hinder timely medical care and diagnosis delivery. In light of these challenges, automated medical screening as a decision-making aid is highly recommended. With the increasing availability of data and the need to explore the complementary effect among modalities, multimodal machine learning has emerged as a potential area of technology. Its impact has been witnessed across a wide range of domains, prompting the question of how far machine learning can be leveraged to automate processes in even more complex and high-risk sectors.
This paper delves into multimodal machine learning for automated medical screening and evaluates its potential for mental disorder detection, a highly important area of healthcare. First, we conduct a scoping review targeted at high-impact papers to highlight the trends and directions of multimodal machine learning in screening prevalent mental disorders such as depression, stress, and bipolar disorder. The review provides a comprehensive list of popular datasets and extensively studied modalities, and proposes an end-to-end pipeline for multimodal machine learning applications, covering essential steps from preprocessing, representation, and fusion to modelling and evaluation. While cross-modality interaction is considered a promising way to strengthen fusion across modalities, few existing multimodal fusion methods employ this mechanism. This study therefore investigates multimodal fusion in more detail through the proposal of Autofusion, an autoencoder-infused fusion technique that harnesses the cross-modality interaction among different modalities. The technique is evaluated on DementiaBank’s Pitt corpus for Alzheimer’s disease detection, achieving a promising performance of 79.89% accuracy, 83.85% recall, 81.72% precision, and 82.47% F1. It consistently outperforms all unimodal methods by an average of 5.24% across all metrics, as well as both early fusion and late fusion; against the late-fusion hard-voting technique in particular, it outperforms by an average of 20% across all metrics. Further, empirical results show that the cross-modality interaction term improves model performance by 2-3% across metrics.
This research highlights the promising impact of cross-modality interaction in multimodal machine learning and calls for further research to unlock its full potential.
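The abstract does not spell out Autofusion's architecture, so the sketch below shows only the general idea it names: feeding an autoencoder a concatenation of modality embeddings augmented with an explicit cross-modality interaction term. The embedding sizes, the element-wise product as the interaction, and the function name are all assumptions:

```python
import numpy as np

def fuse_with_interaction(audio: np.ndarray, text: np.ndarray) -> np.ndarray:
    """Concatenate two modality embeddings together with an explicit
    cross-modality interaction term (element-wise product here; the
    actual Autofusion design may differ)."""
    assert audio.shape == text.shape
    return np.concatenate([audio, text, audio * text])

rng = np.random.default_rng(0)
d = 64
audio_emb = rng.normal(size=d)   # e.g. an acoustic encoder output
text_emb = rng.normal(size=d)    # e.g. a transcript encoder output
fused = fuse_with_interaction(audio_emb, text_emb)
# An autoencoder would then compress `fused` through a bottleneck and be
# trained to reconstruct it; the bottleneck activations serve as the
# fused representation passed to the classifier.
print(fused.shape)  # (192,)
```

The interaction term is what distinguishes this from plain early fusion: the product channel lets the downstream model see co-activations of the two modalities directly, which is the "cross-modality interaction" the abstract credits with the 2-3% gain.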
Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020
On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.
Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018: 10-12 December 2018, Torino
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.