Speech Emotion Recognition Using Multi-hop Attention Mechanism
In this paper, we are interested in exploiting textual and acoustic data of
an utterance for the speech emotion classification task. The baseline approach
models the information from audio and text independently using two deep neural
networks (DNNs). The outputs from both the DNNs are then fused for
classification. As opposed to using knowledge from both the modalities
separately, we propose a framework to exploit acoustic information in tandem
with lexical data. The proposed framework uses two bi-directional long
short-term memory (BLSTM) networks for obtaining hidden representations of the
utterance. Furthermore, we propose an attention mechanism, referred to as
multi-hop attention, which is trained to automatically infer the correlation
between the modalities. The multi-hop attention first computes the segments of
the textual data most relevant to the audio signal. The relevant textual data
are then used to attend to parts of the audio signal. To evaluate the
performance of the proposed system, experiments are performed on the IEMOCAP dataset.
Experimental results show that the proposed technique outperforms the
state-of-the-art system by a 6.5% relative improvement in weighted accuracy.

Comment: 5 pages; accepted as a conference paper at ICASSP 2019 (oral presentation).
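The two attention hops described in the abstract can be sketched as follows. This is only an illustrative sketch: random placeholder vectors stand in for the learned BLSTM hidden states, and simple dot-product attention stands in for the paper's trained attention; the classifier on the fused representation is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: weight each key by its similarity to the query."""
    weights = softmax(keys @ query)   # (n,) attention distribution
    context = weights @ keys          # (d,) weighted sum of the keys
    return context, weights

rng = np.random.default_rng(0)
d = 8
text_states = rng.normal(size=(12, d))   # stand-in for BLSTM states of the transcript
audio_states = rng.normal(size=(50, d))  # stand-in for BLSTM states of the audio frames

# Hop 1: a summary of the audio queries the text to find relevant segments.
audio_summary = audio_states.mean(axis=0)
text_context, text_w = attend(audio_summary, text_states)

# Hop 2: the attended textual representation then attends to the audio signal.
audio_context, audio_w = attend(text_context, audio_states)

# The two contexts are fused for emotion classification (classifier omitted).
fused = np.concatenate([text_context, audio_context])
print(fused.shape)  # (16,)
```

Each hop produces a distribution over one modality conditioned on the other, which is how the mechanism infers cross-modal correlation.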
Speech-based recognition of self-reported and observed emotion in a dimensional space
The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and examining how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus that contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and that includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results of these recognizers show that the self-reported emotion is much harder to recognize than the observed emotion, and that averaging ratings from multiple observers improves performance.
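The regression setup described above can be sketched minimally: one Support Vector Regression model per dimension, whose outputs together form a point in the 2-dimensional arousal-valence space. Random features and ratings stand in for the corpus's acoustic and textual features; in the paper's setting the targets would be (possibly observer-averaged) annotation scores.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))            # placeholder acoustic + textual features
arousal = rng.uniform(-1, 1, size=100)    # placeholder continuous arousal ratings
valence = rng.uniform(-1, 1, size=100)    # placeholder continuous valence ratings

# One SVR per dimension; together they predict a point in the
# 2-dimensional arousal-valence space.
svr_arousal = SVR(kernel="rbf").fit(X, arousal)
svr_valence = SVR(kernel="rbf").fit(X, valence)

point = np.array([svr_arousal.predict(X[:1])[0],
                  svr_valence.predict(X[:1])[0]])
print(point.shape)  # (2,)
```

Averaging ratings from multiple observers, as the paper reports, would simply replace the per-observer targets with their mean before fitting.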
Recognizing emotions in spoken dialogue with acoustic and lexical cues
Automatic emotion recognition has long been a focus of Affective Computing. It has
become increasingly apparent that awareness of human emotions in Human-Computer
Interaction (HCI) is crucial for advancing related technologies, such as dialogue
systems. However, performance of current automatic emotion recognition is
disappointing compared to human performance. Current research on emotion
recognition in spoken dialogue focuses on identifying better feature representations
and recognition models from a data-driven point of view. The goal of this thesis
is to explore how incorporating prior knowledge of human emotion recognition
in the automatic model can improve state-of-the-art performance of automatic
emotion recognition in spoken dialogue. Specifically, we study this by proposing
knowledge-inspired features representing occurrences of disfluency and non-verbal
vocalisation in speech, and by building a multimodal recognition model that combines
acoustic and lexical features in a knowledge-inspired hierarchical structure. In our
study, emotions are represented with the Arousal, Expectancy, Power, and Valence
emotion dimensions. We build unimodal and multimodal emotion recognition
models to study the proposed features and modelling approach, and perform emotion
recognition on both spontaneous and acted dialogue.
Psycholinguistic studies have suggested that DISfluency and Non-verbal
Vocalisation (DIS-NV) in dialogue are related to emotions. However, these affective
cues in spoken dialogue are overlooked by current automatic emotion recognition
research. Thus, we propose features for recognizing emotions in spoken dialogue
which describe five types of DIS-NV in utterances, namely filled pause, filler, stutter,
laughter, and audible breath. Our experiments show that this small set of features
is predictive of emotions. Our DIS-NV features achieve better performance than
benchmark acoustic and lexical features for recognizing all emotion dimensions in
spontaneous dialogue. Consistent with Psycholinguistic studies, the DIS-NV features
are especially predictive of the Expectancy dimension of emotion, which relates to
speaker uncertainty. Our study illustrates the relationship between DIS-NVs and
emotions in dialogue, and thereby also contributes to the Psycholinguistic
understanding of these cues. Note that our DIS-NV features are based on manual annotations, yet our
long-term goal is to apply our emotion recognition model to HCI systems. Thus, we
conduct preliminary experiments on automatic detection of DIS-NVs, and on using
automatically detected DIS-NV features for emotion recognition. Our results show
that DIS-NVs can be automatically detected from speech with stable accuracy, and
auto-detected DIS-NV features remain predictive of emotions in spontaneous dialogue.
This suggests that our emotion recognition model can be applied to a fully automatic
system in the future, and holds the potential to improve the quality of emotional
interaction in current HCI systems.
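The DIS-NV features named above can be sketched as simple normalised counts over an annotated utterance. The annotation tags below are hypothetical placeholders: the thesis relies on manual annotations whose exact markup is not given here, and normalising by word count is one plausible choice, not necessarily the thesis's.

```python
# Hypothetical tags for the five DIS-NV types; the real corpus markup may differ.
DIS_NV_TAGS = ["<filled_pause>", "<filler>", "<stutter>", "<laughter>", "<breath>"]

def dis_nv_features(utterance):
    """Count occurrences of each DIS-NV type, normalised by utterance length."""
    words = max(len(utterance.split()), 1)  # avoid division by zero
    return [utterance.count(tag) / words for tag in DIS_NV_TAGS]

utt = "well <filled_pause> I I <stutter> think <laughter> that was close"
print(dis_nv_features(utt))  # [0.1, 0.0, 0.1, 0.1, 0.0]
```

This small fixed-length vector is what makes the feature set cheap to compute and, as the thesis's detection experiments suggest, plausible to extract automatically from speech.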
To study the robustness of the DIS-NV features, we conduct cross-corpora
experiments on both spontaneous and acted dialogue. We identify how dialogue
type influences the performance of DIS-NV features and emotion recognition models.
DIS-NVs contain additional information beyond acoustic characteristics or lexical
contents. Thus, we study the gain of modality fusion for emotion recognition with the
DIS-NV features. Previous work combines different feature sets by fusing modalities
at the same level using two types of fusion strategies: Feature-Level (FL) fusion,
which concatenates feature sets before recognition; and Decision-Level (DL) fusion,
which makes the final decision based on outputs of all unimodal models. However,
features from different modalities may describe data at different time scales or levels
of abstraction. Moreover, Cognitive Science research indicates that when perceiving
emotions, humans make use of information from different modalities at different
cognitive levels and time steps. Therefore, we propose a HierarchicaL (HL) fusion
strategy for multimodal emotion recognition, which incorporates features that
describe data over longer time intervals, or that are more abstract, at higher
levels of its knowledge-inspired hierarchy. Compared to FL and DL fusion, HL fusion incorporates
both inter- and intra-modality differences. Our experiments show that HL fusion
consistently outperforms FL and DL fusion on multimodal emotion recognition in both
spontaneous and acted dialogue. The HL model combining our DIS-NV features with
benchmark acoustic and lexical features improves current performance of multimodal
emotion recognition in spoken dialogue.
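The contrast between the three fusion strategies can be sketched as follows. Fixed random projections stand in for the trained unimodal and fused models, and the level ordering (acoustic first, then lexical, then utterance-level DIS-NVs) is one reading of the knowledge-inspired hierarchy described above, not the thesis's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
acoustic = rng.normal(size=16)   # frame-level features, shortest time scale
lexical = rng.normal(size=8)     # word-level features
dis_nv = rng.normal(size=5)      # utterance-level features, most abstract

def predict(features, out_dim=4):
    """Stand-in for a trained model: a fixed random projection to 4 outputs."""
    w = np.random.default_rng(len(features)).normal(size=(out_dim, len(features)))
    return w @ features

# Feature-Level (FL) fusion: concatenate all feature sets, one joint model.
fl = predict(np.concatenate([acoustic, lexical, dis_nv]))

# Decision-Level (DL) fusion: average the outputs of per-modality models.
dl = np.mean([predict(acoustic), predict(lexical), predict(dis_nv)], axis=0)

# HierarchicaL (HL) fusion: lower levels are summarised first; features at
# longer time scales or higher abstraction enter at higher levels.
level1 = predict(acoustic)                            # fastest time scale
level2 = predict(np.concatenate([level1, lexical]))   # add word-level cues
hl = predict(np.concatenate([level2, dis_nv]))        # add utterance-level DIS-NVs

print(fl.shape, dl.shape, hl.shape)  # (4,) (4,) (4,)
```

The structural point is that HL fusion, unlike FL or DL, lets each level re-weight both the lower level's output (intra-modality) and the newly introduced modality (inter-modality) before the final decision.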
To study how other emotion-related tasks of spoken dialogue can benefit from the
proposed approaches, we apply the DIS-NV features and the HL fusion strategy to
recognize movie-induced emotions. Our experiments show that although designed
for recognizing emotions in spoken dialogue, DIS-NV features and HL fusion
remain effective for recognizing movie-induced emotions. This suggests that other
emotion-related tasks can also benefit from the proposed features and model structure.
…