Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems
Speech emotion recognition is an important component of any human-centered
system. However, the speech characteristics a person produces and perceives can
be influenced by a multitude of factors, some desirable, such as emotion, and
some undesirable, such as noise. To train robust emotion recognition models, we
need a large yet realistic data distribution, but emotion datasets are often
small and are therefore augmented with noise. Noise augmentation often makes
one important assumption: that the prediction label should remain the same
whether or not noise is present. This holds for automatic speech recognition
but not necessarily for perception-based tasks. In this paper we make
three novel contributions. We validate through crowdsourcing that the presence
of noise does change the annotation label and hence may alter the original
ground truth label. We then show how disregarding this knowledge and assuming
consistency in ground truth labels propagates to downstream evaluation of ML
models, both for performance evaluation and robustness testing. We end the
paper with a set of recommendations for noise augmentation in speech emotion
recognition datasets.
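
A minimal sketch of the conventional additive-noise augmentation whose
label-preservation assumption the paper questions. It mixes a noise clip into a
speech clip at a target signal-to-noise ratio; the function name, array inputs,
and SNR parameter are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np

def augment_with_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into a speech clip at a target SNR (in dB).

    Conventional pipelines reuse the clean clip's emotion label for the
    mixed clip -- the assumption this paper challenges.
    """
    # Loop, then trim, the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

A pipeline would then typically attach the clean clip's emotion label to
augment_with_noise(clip, babble, snr_db=5.0); the crowdsourcing result above
suggests that reuse is not always safe for perception-based labels.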
Privacy Enhanced Multimodal Neural Representations for Emotion Recognition
Many mobile applications and virtual conversational agents now aim to
recognize and adapt to emotions. To enable this, data are transmitted from
users' devices and stored on central servers. Yet, these data contain sensitive
information that could be used by mobile applications without the user's consent
or, maliciously, by an eavesdropping adversary. In this work, we show how
multimodal representations trained for a primary task, here emotion
recognition, can unintentionally leak demographic information, which could
override an opt-out option selected by the user. We analyze how this leakage
differs in representations obtained from textual, acoustic, and multimodal
data. We use an adversarial learning paradigm to unlearn the private
information present in a representation and investigate the effect of varying
the strength of the adversarial component on the primary task and on the
privacy metric, defined here as the inability of an attacker to predict
specific demographic information. We evaluate this paradigm on multiple
datasets and show that we can improve the privacy metric while not
significantly impacting the performance on the primary task. To the best of our
knowledge, this is the first work to analyze how the privacy metric differs
across modalities and how multiple privacy concerns can be tackled while still
maintaining performance on emotion recognition.
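
A minimal PyTorch sketch of the adversarial unlearning idea described above: a
gradient-reversal layer lets an adversary head learn to predict demographic
attributes while the shared encoder is pushed to remove that information from
the representation. The class names, the architecture, and the lambda_adv
parameter (standing in for "the strength of the adversarial component") are
illustrative assumptions, not the authors' exact model.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class PrivateEmotionModel(nn.Module):
    def __init__(self, feat_dim, hidden, n_emotions, n_demographics, lambda_adv):
        super().__init__()
        self.lambda_adv = lambda_adv
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(hidden, n_emotions)        # primary task
        self.adversary_head = nn.Linear(hidden, n_demographics)  # privacy probe

    def forward(self, x):
        z = self.encoder(x)
        emo_logits = self.emotion_head(z)
        # Reversed gradients train the adversary head normally but push the
        # shared representation z to *unlearn* the demographic signal.
        adv_logits = self.adversary_head(GradReverse.apply(z, self.lambda_adv))
        return emo_logits, adv_logits

def joint_loss(emo_logits, adv_logits, emo_y, dem_y):
    """Minimize emotion error; via the reversal, the encoder simultaneously
    maximizes the adversary's error, more strongly as lambda_adv grows."""
    ce = nn.CrossEntropyLoss()
    return ce(emo_logits, emo_y) + ce(adv_logits, dem_y)
```

In this framing, the privacy metric corresponds to the (in)accuracy of a
post-hoc attacker trained on the frozen representation z, and sweeping
lambda_adv traces the trade-off between that metric and emotion accuracy.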