Towards Speech Emotion Recognition "in the wild" using Aggregated Corpora and Deep Multi-Task Learning
One of the challenges in Speech Emotion Recognition (SER) "in the wild" is
the large mismatch between training and test data (e.g. speakers and tasks). In
order to improve the generalisation capabilities of the emotion models, we
propose to use Multi-Task Learning (MTL) and use gender and naturalness as
auxiliary tasks in deep neural networks. This method was evaluated in
within-corpus and various cross-corpus classification experiments that simulate
conditions "in the wild". In comparison to Single-Task Learning (STL) based
state-of-the-art methods, we found that our proposed MTL method improved
performance significantly. In particular, models using both gender and
naturalness achieved larger gains than those using either gender or naturalness
separately. This benefit was also found in the high-level representations of
the feature space obtained from our proposed method, where discriminative
emotional clusters could be observed.
Comment: Published in the proceedings of INTERSPEECH, Stockholm, September,
201
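The multi-task idea in the abstract above can be illustrated with a minimal sketch of a combined training loss. This is not the authors' implementation: the function names, the example logits, and the `aux_weight` parameter are assumptions made for illustration, and the abstract does not say how the auxiliary tasks were weighted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class index."""
    return -math.log(softmax(logits)[target])


def multitask_loss(emotion, gender, naturalness, aux_weight=0.5):
    """Primary emotion loss plus weighted auxiliary losses.

    Each argument is a (logits, target_index) pair produced by one output
    head of a shared network. aux_weight is a hypothetical hyperparameter.
    """
    primary = cross_entropy(*emotion)
    auxiliary = cross_entropy(*gender) + cross_entropy(*naturalness)
    return primary + aux_weight * auxiliary


# Illustrative values only: four emotion classes, two gender classes,
# two naturalness classes (e.g. acted vs. natural).
loss = multitask_loss(
    emotion=([2.0, 0.5, -1.0, 0.1], 0),
    gender=([0.3, 1.2], 1),
    naturalness=([1.5, -0.5], 0),
)
print(f"combined loss: {loss:.4f}")
```

Setting `aux_weight=0.0` reduces the sketch to single-task learning, which makes it easy to compare the two regimes in the same code path.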
Planning Based System for Child-Robot Interaction in Dynamic Play Environments
This paper describes the initial steps towards the design of a robotic system
that intends to perform actions autonomously in a naturalistic play
environment. At the same time, it aims for social human-robot interaction (HRI),
focusing on children. We draw on existing theories of child development and on
dimensional models of emotions to explore the design of a dynamic interaction
framework for natural child-robot interaction. In this dynamic setting, the
social HRI is defined by the ability of the system to take into consideration
the socio-emotional state of the user and to plan appropriately by selecting
appropriate strategies for execution. The robot needs a temporal planning
system, which combines features of task-oriented actions and principles of
social human-robot interaction. We present initial results of an empirical
study for the evaluation of the proposed framework in the context of a
collaborative sorting game.
Learning spectro-temporal features with 3D CNNs for speech emotion recognition
In this paper, we propose to use deep 3-dimensional convolutional networks
(3D CNNs) in order to address the challenge of modelling spectro-temporal
dynamics for speech emotion recognition (SER). Compared to a hybrid of
Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our
proposed 3D CNNs simultaneously extract short-term and long-term spectral
features with a moderate number of parameters. We evaluated our proposed and
other state-of-the-art methods in a speaker-independent manner using aggregated
corpora that give a large and diverse set of speakers. We found that 1) shallow
temporal and moderately deep spectral kernels of a homogeneous architecture are
optimal for the task; and 2) our 3D CNNs are more effective for
spectro-temporal feature learning than other methods. Finally, we
visualised the feature space obtained with our proposed method using
t-distributed stochastic neighbour embedding (t-SNE) and could observe distinct
clusters of emotions.
Comment: ACII, 2017, San Antoni
Speaking of Trust -- Speech as a Measure of Trust
Since trust measures in human-robot interaction are often subjective or not
possible to implement in real time, we propose to use speech cues (what, when
and how the user talks) as an objective real-time measure of trust. This could
be implemented in the robot to calibrate towards appropriate trust. However, we
would like to open the discussion on how to deal with the ethical implications
surrounding this trust measure.
Comment: in TRAITS Workshop Proceedings (arXiv:2103.12679) held in conjunction
with Companion of the 2021 ACM/IEEE International Conference on Human-Robot
Interaction, March 2021, Pages 709-71
Does your robot know? Enhancing children's information retrieval through spoken conversation with responsible robots
In this paper, we identify challenges in children's current information
retrieval process, and propose conversational robots as an opportunity to ease
this process in a responsible way. Tools children currently use in this
process, such as search engines on a computer or voice agents, do not always
meet their specific needs. The conversational robot we propose maintains
context, asks clarifying questions, and gives suggestions in order to better
meet children's needs. Since children are often too trusting of robots, we
propose to have the robot measure, monitor and adapt to the trust the child has
in the robot. This way, we hope to induce a critical attitude in the children
during their information retrieval process.
Comment: IR4Children'21 workshop at SIGIR 2021 - http://www.fab4.science/IR4C
Ladder Networks for Emotion Recognition: Using Unsupervised Auxiliary Tasks to Improve Predictions of Emotional Attributes
Recognizing emotions using a few attribute dimensions such as arousal, valence
and dominance provides the flexibility to effectively represent a complex range
of emotional behaviors. Conventional methods to learn these emotional
descriptors primarily focus on separate models to recognize each of these
attributes. Recent work has shown that learning these attributes together
regularizes the models, leading to better feature representations. This study
explores new forms of regularization by adding unsupervised auxiliary tasks to
reconstruct hidden layer representations. This auxiliary task requires the
denoising of hidden representations at every layer of an auto-encoder. The
framework relies on ladder networks that utilize skip connections between
encoder and decoder layers to learn powerful representations of emotional
dimensions. The results show that ladder networks improve the performance of
the system compared to baselines that individually learn each attribute, and
to conventional denoising autoencoders. Furthermore, the unsupervised auxiliary
tasks have promising potential to be used in a semi-supervised setting, where
few labeled sentences are available.
Comment: Submitted to Interspeech 201
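The layer-wise denoising objective described in the abstract above can be sketched as a supervised cost plus a weighted reconstruction cost per layer. This is a simplified illustration, not the authors' code: the function names, the mean squared error, and the per-layer weights are assumptions, and the encoder, decoder, and skip connections that would produce the activations are omitted.

```python
def squared_error(clean, reconstructed):
    """Mean squared difference between a clean and a denoised activation vector."""
    if len(clean) != len(reconstructed):
        raise ValueError("activation vectors must have the same length")
    return sum((c - r) ** 2 for c, r in zip(clean, reconstructed)) / len(clean)


def ladder_cost(supervised_cost, clean_layers, denoised_layers, layer_weights):
    """Supervised cost plus weighted denoising cost at every layer.

    clean_layers: activations from a noise-free encoder pass, one list per layer.
    denoised_layers: the decoder's reconstructions of those activations.
    layer_weights: hypothetical per-layer importance weights.
    """
    if not (len(clean_layers) == len(denoised_layers) == len(layer_weights)):
        raise ValueError("one entry per layer is required for each argument")
    reconstruction = sum(
        w * squared_error(z, z_hat)
        for w, z, z_hat in zip(layer_weights, clean_layers, denoised_layers)
    )
    return supervised_cost + reconstruction


# Illustrative values only: two layers with 2 and 1 hidden units.
total = ladder_cost(
    supervised_cost=0.3,
    clean_layers=[[1.0, 2.0], [0.5]],
    denoised_layers=[[1.0, 1.0], [0.0]],
    layer_weights=[1.0, 0.1],
)
print(f"total cost: {total:.3f}")
```

In a semi-supervised setting of the kind the abstract mentions, one plausible use of this sketch is to pass `supervised_cost=0.0` for unlabeled sentences, so that they contribute only through the reconstruction term.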