Progressive Neural Networks for Transfer Learning in Emotion Recognition
Many paralinguistic tasks are closely related and thus representations
learned in one domain can be leveraged for another. In this paper, we
investigate how knowledge can be transferred between three paralinguistic
tasks: speaker, emotion, and gender recognition. Further, we extend this
problem to cross-dataset tasks, asking how knowledge captured in one emotion
dataset can be transferred to another. We focus on progressive neural networks
and compare these networks to the conventional deep learning method of
pre-training and fine-tuning. Progressive neural networks provide a way to
transfer knowledge and avoid the forgetting effect present when pre-training
neural networks on different tasks. Our experiments demonstrate that: (1)
emotion recognition can benefit from using representations originally learned
for different paralinguistic tasks and (2) transfer learning can effectively
leverage additional datasets to improve the performance of emotion recognition
systems.
Comment: 5 pages, 4 figures, to appear in the proceedings of Interspeech 201
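A minimal sketch of the progressive-network idea described above, under assumed layer sizes: a column trained on a source task (e.g. speaker recognition) is frozen, and a new target-task column reuses its hidden activations through a lateral adapter, so the source knowledge is neither fine-tuned away nor forgotten. All class names and dimensions here are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class SourceColumn(nn.Module):
    """Column assumed pre-trained on a source paralinguistic task (e.g. speaker ID)."""
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):
        return self.out(torch.relu(self.fc1(x)))

class TargetColumn(nn.Module):
    """New column for the target task; the frozen source column feeds it laterally."""
    def __init__(self, in_dim, hidden, n_classes, source):
        super().__init__()
        self.source = source
        for p in self.source.parameters():
            p.requires_grad = False              # freeze source: no forgetting
        self.fc1 = nn.Linear(in_dim, hidden)
        self.lateral = nn.Linear(hidden, hidden)  # adapter from source hidden layer
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):
        with torch.no_grad():
            src_h = torch.relu(self.source.fc1(x))  # reuse source representations
        h = torch.relu(self.fc1(x) + self.lateral(src_h))
        return self.out(h)
```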
Variational Autoencoders for Learning Latent Representations of Speech Emotion: A Preliminary Study
Learning the latent representation of data in an unsupervised fashion provides
relevant features for enhancing the performance of a classifier. For speech
emotion recognition tasks, generating effective features is crucial. Currently,
handcrafted features are mostly used for speech emotion recognition; however,
features learned automatically using
deep learning have shown strong success in many problems, especially in image
processing. In particular, deep generative models such as Variational
Autoencoders (VAEs) have gained enormous success for generating features for
natural images. Inspired by this, we propose VAEs for deriving the latent
representation of speech signals and use this representation to classify
emotions. To the best of our knowledge, we are the first to propose VAEs for
speech emotion classification. Evaluations on the IEMOCAP dataset demonstrate
that features learned by VAEs can produce state-of-the-art results for speech
emotion classification.
Comment: Proc. Interspeech 201
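A minimal VAE sketch along the lines the abstract describes, assuming frame-level acoustic feature vectors as input; all layer sizes are illustrative. The encoder mean would serve as the latent feature for a downstream emotion classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechVAE(nn.Module):
    def __init__(self, in_dim=40, hidden=256, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, hidden)
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # reconstruction error plus KL divergence to the unit Gaussian prior
    rec = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```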
The PRIORI Emotion Dataset: Linking Mood to Emotion Detected In-the-Wild
Bipolar Disorder is a chronic psychiatric illness characterized by
pathological mood swings associated with severe disruptions in emotion
regulation. Clinical monitoring of mood is key to the care of these dynamic and
incapacitating mood states. Frequent and detailed monitoring improves clinical
sensitivity to detect mood state changes, but typically requires costly and
limited resources. Speech characteristics change during both depressed and
manic states, suggesting that automatic methods applied to the speech signal can be
effectively used to monitor mood state changes. However, speech is modulated by
many factors, which renders mood state prediction challenging. We hypothesize
that emotion can be used as an intermediary step to improve mood state
prediction. This paper presents critical steps in developing this pipeline,
including (1) a new in-the-wild emotion dataset, the PRIORI Emotion Dataset,
collected from everyday smartphone conversational speech recordings, (2)
activation/valence emotion recognition baselines on this dataset (PCC of 0.71
and 0.41, respectively), and (3) significant correlation between predicted
emotion and mood state for individuals with bipolar disorder. This provides
evidence and a working baseline for the use of emotion as a meta-feature for
mood state monitoring.
Comment: Interspeech 201
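For reference, the activation/valence baselines above are reported as Pearson correlation coefficients (PCC); a minimal sketch of that metric, with placeholder arrays rather than PRIORI data:

```python
import numpy as np

def pearson_cc(pred, target):
    """Pearson correlation between predicted and reference emotion ratings."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    p, t = pred - pred.mean(), target - target.mean()
    return float((p * t).sum() / (np.sqrt((p ** 2).sum()) * np.sqrt((t ** 2).sum())))

print(pearson_cc([0.1, 0.4, 0.8], [0.2, 0.5, 0.7]))  # close to 1.0 for aligned ratings
```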
On the Robustness of Speech Emotion Recognition for Human-Robot Interaction with Deep Neural Networks
Speech emotion recognition (SER) is an important aspect of effective
human-robot collaboration and has received a lot of attention from the research
community. For example, many neural network-based architectures have recently
been proposed and have pushed performance to a new level. However, the applicability
of such neural SER models trained only on in-domain data to noisy conditions is
currently under-researched. In this work, we evaluate the robustness of
state-of-the-art neural acoustic emotion recognition models in human-robot
interaction scenarios. We hypothesize that a robot's ego noise, room
conditions, and various acoustic events that can occur in a home environment
can significantly affect the performance of a model. We conduct several
experiments on the iCub robot platform and propose several novel ways to reduce
the gap between the model's performance during training and testing in
real-world conditions. Furthermore, we observe large improvements in the model
performance on the robot and demonstrate the necessity of introducing several
data augmentation techniques like overlaying background noise and loudness
variations to improve the robustness of the neural approaches.
Comment: Submitted to IROS'18, Madrid, Spain
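A sketch of the two augmentations the abstract names, background-noise overlay and loudness variation, applied to raw waveforms; the SNR and gain ranges are assumptions, not values from the paper.

```python
import numpy as np

def overlay_noise(speech, noise, snr_db):
    """Mix a noise clip into speech at a target signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)        # loop/trim noise to speech length
    p_s = np.mean(speech ** 2) + 1e-12
    p_n = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

def random_loudness(speech, low_db=-6.0, high_db=6.0):
    """Apply a random gain to simulate loudness variation."""
    gain_db = np.random.uniform(low_db, high_db)
    return speech * 10 ** (gain_db / 20)
```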
Detecting Escalation Level from Speech with Transfer Learning and Acoustic-Lexical Information Fusion
Textual escalation detection has been widely applied to e-commerce companies'
customer service systems to pre-alert and prevent potential conflicts.
Similarly, in public areas such as airports and train stations, where many
impersonal conversations frequently take place, acoustic-based escalation
detection systems are also useful to enhance passengers' safety and maintain
public order. To this end, we introduce a system based on acoustic-lexical
features to detect escalation from speech; Voice Activity Detection (VAD) and
label smoothing are adopted to further enhance performance in our
experiments. Given the small set of training and development data, we also
employ transfer learning on several well-known emotion detection datasets,
i.e., RAVDESS and CREMA-D, to learn advanced emotional representations that are then
applied to the conversational escalation detection task. On the development
set, our proposed system achieves 81.5% unweighted average recall (UAR), which
significantly outperforms the baseline of 72.2% UAR.
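Of the techniques named above, label smoothing is easy to make concrete; a minimal sketch for the binary escalation task, with an assumed smoothing factor of 0.1:

```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits, targets, n_classes=2, eps=0.1):
    """Cross-entropy against soft targets: the true class gets 1 - eps plus its
    share of eps, and the remaining eps mass is spread uniformly over classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    soft = torch.full_like(log_probs, eps / n_classes)
    soft.scatter_(1, targets.unsqueeze(1), 1.0 - eps + eps / n_classes)
    return -(soft * log_probs).sum(dim=-1).mean()
```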
Transfer Learning for Improving Speech Emotion Classification Accuracy
The majority of existing speech emotion recognition research focuses on
automatic emotion detection using training and testing data from the same
corpus, collected under the same conditions. The performance of such systems
has been shown to drop significantly in cross-corpus and cross-language
scenarios. To address the problem, this paper exploits a transfer learning
technique, applied in novel cross-language and cross-corpus settings, to
improve the performance of speech emotion recognition systems. Evaluations on
five different corpora in three different languages show that Deep Belief
Networks (DBNs) offer better accuracy than previous approaches on cross-corpus
emotion recognition, relative to a Sparse Autoencoder and SVM baseline system.
Results also suggest that training on a large number of languages, and
including a small fraction of the target data in training, can significantly
boost accuracy over the baseline, even for corpora with limited training examples.
Comment: Proc. Interspeech 201
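A sketch of the pre-train-then-adapt setup the results describe: fine-tune a source-trained model on a small fraction of the target corpus. A plain PyTorch classifier stands in for the paper's DBN here, and the 10% fraction and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

def fine_tune(model, target_x, target_y, frac=0.1, epochs=5, lr=1e-4):
    """Adapt a source-trained model on a small labelled slice of the target corpus."""
    n = max(1, int(frac * len(target_x)))
    x, y = target_x[:n], target_y[:n]          # small fraction of target data
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model

# stand-in classifier with illustrative dimensions, assumed already pre-trained
# on the pooled source-language corpora before this call:
model = nn.Sequential(nn.Linear(384, 128), nn.ReLU(), nn.Linear(128, 4))
model = fine_tune(model, torch.randn(200, 384), torch.randint(0, 4, (200,)))
```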
Training Strategies to Handle Missing Modalities for Audio-Visual Expression Recognition
Automatic audio-visual expression recognition can play an important role in
communication services such as tele-health, VOIP calls and human-machine
interaction. Accuracy of audio-visual expression recognition could benefit from
the interplay between the two modalities. However, most audio-visual expression
recognition systems, trained in ideal conditions, fail to generalize in real
world scenarios where either the audio or visual modality could be missing due
to a number of reasons such as limited bandwidth, interactors' orientation, or
caller-initiated muting. This paper studies the performance of a
state-of-the-art transformer when one of the modalities is missing. We conduct ablation
studies to evaluate the model in the absence of either modality. Further, we
propose a strategy to randomly ablate visual inputs during training at the clip
or frame level to mimic real-world scenarios. Experiments conducted on
in-the-wild data indicate significantly better generalization in the proposed
models trained with missing cues, with gains of up to 17% for frame-level
ablations, showing that these training strategies cope better with the loss of
input modalities.
Comment: ICMI 2020 workshop on "MODELING SOCIO-EMOTIONAL AND COGNITIVE PROCESSES FROM MULTIMODAL DATA IN THE WILD"
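A sketch of the random visual-ablation strategy described above, at clip or frame level; the ablation probabilities are assumptions, not values from the paper.

```python
import torch

def ablate_visual(visual, level="frame", p_clip=0.5, p_frame=0.3):
    """visual: (batch, time, feat) tensor; zeroed entries mimic a missing modality."""
    if level == "clip":
        # drop the visual stream for entire clips
        keep = (torch.rand(visual.size(0), 1, 1) > p_clip).float()
    else:
        # drop individual frames (time steps) independently
        keep = (torch.rand(visual.size(0), visual.size(1), 1) > p_frame).float()
    return visual * keep
```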
Convolutional Neural Network-based Speech Enhancement for Cochlear Implant Recipients
Attempts to develop speech enhancement algorithms with improved speech
intelligibility for cochlear implant (CI) users have met with limited success.
To improve speech enhancement methods for CI users, we propose to perform
speech enhancement in a cochlear filter-bank feature space, a feature-set
specifically designed for CI users based on CI auditory stimuli. We leverage a
convolutional neural network (CNN) to extract both stationary and
non-stationary components of environmental acoustics and speech. We propose
three CNN architectures: (1) vanilla CNN that directly generates the enhanced
signal; (2) spectral-subtraction-style CNN (SS-CNN) that first predicts noise
and then generates the enhanced signal by subtracting noise from the noisy
signal; (3) Wiener-style CNN (Wiener-CNN) that generates an optimal mask for
suppressing noise. An important limitation of the proposed networks is that they
introduce considerable delays, which limit their real-time application for CI
users. To address this, the study also considers causal variations of these
networks. Our experiments show that the proposed networks (both causal and
non-causal forms) achieve significant improvement over existing baseline
systems. We also found that causal Wiener-CNN outperforms other networks, and
leads to the best overall envelope coefficient measure (ECM). The proposed
algorithms represent a viable option for implementation on the CCi-MOBILE
research platform as a pre-processor for CI users in naturalistic environments.
Comment: Interspeech 201
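A sketch of the Wiener-style masking idea behind Wiener-CNN: a small causal 1-D CNN (left-only padding, so no look-ahead delay) predicts a [0, 1] per-band mask applied to the noisy filter-bank features. Layer sizes are assumptions, and the paper's cochlear filter-bank front end is not reproduced.

```python
import torch
import torch.nn as nn

class CausalWienerCNN(nn.Module):
    def __init__(self, n_bands=22, hidden=64, kernel=5):
        super().__init__()
        self.pad = kernel - 1                        # left-only padding => causal
        self.conv1 = nn.Conv1d(n_bands, hidden, kernel)
        self.conv2 = nn.Conv1d(hidden, n_bands, kernel)

    def forward(self, noisy):                        # noisy: (batch, bands, time)
        x = nn.functional.pad(noisy, (self.pad, 0))  # pad past only, never future
        x = torch.relu(self.conv1(x))
        x = nn.functional.pad(x, (self.pad, 0))
        mask = torch.sigmoid(self.conv2(x))          # per-band suppression mask
        return mask * noisy                          # enhanced features
```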
Incorporating End-to-End Speech Recognition Models for Sentiment Analysis
Previous work on emotion recognition demonstrated a synergistic effect of
combining several modalities such as auditory, visual, and transcribed text to
estimate the affective state of a speaker. Among these, the linguistic modality
is crucial for the evaluation of an expressed emotion. However, manually
transcribed spoken text cannot practically be given as input to a system. We
argue that using ground-truth transcriptions during training and evaluation
phases leads to a significant discrepancy in performance compared to real-world
conditions, as the spoken text has to be recognized on the fly and can contain
speech recognition mistakes. In this paper, we propose a method of integrating
an automatic speech recognition (ASR) output with a character-level recurrent
neural network for sentiment recognition. In addition, we conduct several
experiments investigating sentiment recognition for human-robot interaction in
a noise-realistic scenario which is challenging for the ASR systems. We
quantify the improvement compared to using only the acoustic modality in
sentiment recognition. We demonstrate the effectiveness of this approach on the
Multimodal Corpus of Sentiment Intensity (MOSI) by achieving 73.6% accuracy in
a binary sentiment classification task, exceeding previously reported results
that use only acoustic input. In addition, we set a new state-of-the-art
performance on the MOSI dataset (80.4% accuracy, 2% absolute improvement).
Comment: Accepted at the 2019 International Conference on Robotics and Automation (ICRA), held May 20-24, 2019 in Montreal, Canada
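A sketch of a character-level recurrent classifier over ASR hypotheses, as the abstract describes; the vocabulary, GRU size, and binary head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharSentimentRNN(nn.Module):
    def __init__(self, n_chars=128, emb=16, hidden=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, char_ids):                 # char_ids: (batch, seq_len) int64
        _, h = self.rnn(self.emb(char_ids))
        return self.out(h[-1])                   # classify from final hidden state

# usage on a (possibly error-laden) ASR hypothesis:
hyp = "i really liked the movie"
ids = torch.tensor([[min(ord(c), 127) for c in hyp]])
logits = CharSentimentRNN()(ids)
```

Because the model reads characters rather than words, recognition errors degrade the input gracefully instead of producing out-of-vocabulary tokens, which is one plausible reason this pairing suits noisy ASR output.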
Speaker-invariant Affective Representation Learning via Adversarial Training
Representation learning for speech emotion recognition is challenging due to
the sparsity of labeled data and the lack of gold-standard references. In
addition, there is much variability arising from the input speech signals,
humans' subjective perception of those signals, and ambiguity in emotion
labels. In this paper, we
propose a machine learning framework to obtain speech emotion representations
by limiting the effect of speaker variability in the speech signals.
Specifically, we propose to disentangle the speaker characteristics from
emotion through an adversarial training network in order to better represent
emotion. Our method combines the gradient reversal technique with an entropy
loss function to remove such speaker information. Our approach is evaluated on
both IEMOCAP and CMU-MOSEI datasets. We show that our method improves speech
emotion classification and increases generalization to unseen speakers.
Comment: Accepted by ICASSP 2020; 5 pages
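A sketch of the two ingredients the abstract names, gradient reversal and an entropy loss on the speaker posterior: the reversal layer is the identity on the forward pass and flips gradients on the backward pass, pushing the encoder toward speaker-invariant emotion features. The usage names at the end are hypothetical.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scaled, sign-flipped gradient on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None    # no gradient w.r.t. lambd

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

def speaker_entropy_loss(speaker_logits):
    # minimizing sum(p * log p) maximizes entropy, i.e. pushes the speaker
    # posterior toward uniform, stripping speaker cues from the representation
    p = torch.softmax(speaker_logits, dim=-1)
    return (p * torch.log(p + 1e-8)).sum(dim=-1).mean()

# usage (hypothetical names): speaker_logits = speaker_head(grad_reverse(emo_embedding))
```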