Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition
Automatic emotion recognition from speech, an important and challenging task in affective computing, relies heavily on the effectiveness of the speech features used for classification. Previous approaches to emotion recognition have mostly focused on the extraction of carefully hand-crafted features. How to model spatio-temporal dynamics for speech emotion recognition effectively is still under active investigation. In this paper, we propose a method to tackle the problem of emotionally relevant feature extraction from speech by combining Attention-based Bidirectional Long Short-Term Memory Recurrent Neural Networks with fully convolutional networks in order to automatically learn the best spatio-temporal representations of speech signals. The learned high-level features are then fed into a deep neural network (DNN) to predict the final emotion. Experimental results on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpora show that our method provides more accurate predictions than other existing emotion recognition algorithms.
Speech Emotion Recognition Using Multi-hop Attention Mechanism
In this paper, we are interested in exploiting textual and acoustic data of
an utterance for the speech emotion classification task. The baseline approach
models the information from audio and text independently using two deep neural
networks (DNNs). The outputs from both the DNNs are then fused for
classification. As opposed to using knowledge from both the modalities
separately, we propose a framework to exploit acoustic information in tandem
with lexical data. The proposed framework uses two bi-directional long
short-term memory (BLSTM) networks to obtain hidden representations of the
utterance. Furthermore, we propose an attention mechanism, referred to as the
multi-hop, which is trained to automatically infer the correlation between the
modalities. The multi-hop attention first computes the relevant segments of the
textual data corresponding to the audio signal. The relevant textual data is
then used to attend to parts of the audio signal. To evaluate the performance
of the proposed system, experiments are performed on the IEMOCAP dataset.
Experimental results show that the proposed technique outperforms the
state-of-the-art system by a relative 6.5% in weighted accuracy.
Comment: 5 pages, accepted as a conference paper at ICASSP 2019 (oral
presentation)
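
The two attention hops described in this abstract can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the dot-product scoring, the use of the last audio hidden state as the first query, and the names `attend` and `multi_hop` are all hypothetical, and the BLSTM hidden states are replaced by small hand-written vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, states):
    """Dot-product attention (an assumed scoring function):
    weight each state by its similarity to the query."""
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return context, weights

def multi_hop(audio_states, text_states):
    # Hop 1: the last audio hidden state (assumed query) selects relevant text segments.
    text_context, text_weights = attend(audio_states[-1], text_states)
    # Hop 2: the attended text summary selects relevant parts of the audio.
    audio_context, audio_weights = attend(text_context, audio_states)
    # Concatenate both summaries as input to a downstream classifier.
    return text_context + audio_context, text_weights, audio_weights

# Stand-ins for BLSTM hidden states: 3 audio steps and 2 text tokens, 2 values each.
audio = [[0.1, 0.3], [0.7, 0.2], [0.4, 0.9]]
text = [[0.5, 0.1], [0.2, 0.8]]
features, tw, aw = multi_hop(audio, text)
```

Here `features` holds the text summary followed by the audio summary, while `tw` and `aw` are the attention weights over text tokens and audio steps respectively.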
Convolutional RNN: an Enhanced Model for Extracting Features from Sequential Data
Traditional convolutional layers extract features from patches of data by
applying a non-linearity on an affine function of the input. We propose a model
that enhances this feature extraction process for the case of sequential data,
by feeding patches of the data into a recurrent neural network and using the
outputs or hidden states of the recurrent units to compute the extracted
features. By doing so, we exploit the fact that a window containing a few
frames of the sequential data is a sequence itself and this additional
structure might encapsulate valuable information. In addition, we allow for
more steps of computation in the feature extraction process, which is
potentially beneficial as an affine function followed by a non-linearity can
result in too simple features. Using our convolutional recurrent layers we
obtain an improvement in performance in two audio classification tasks,
compared to traditional convolutional layers. TensorFlow code for the
convolutional recurrent layers is publicly available at
https://github.com/cruvadom/Convolutional-RNN
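
The idea of extracting one feature vector per window by running a recurrent network over the frames in that window can be sketched as follows. This is a minimal stdlib-only illustration, not the code from the repository above: the plain tanh recurrence, the choice of the final hidden state as the feature, and the names `rnn_last_state` and `conv_rnn_layer` are assumptions made for the example.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def rnn_last_state(window, w_in, w_rec, bias):
    """Run a plain tanh RNN over the frames of one window; return the final hidden state."""
    hidden = [0.0] * len(bias)
    for frame in window:
        pre = [a + b + c for a, b, c in
               zip(matvec(w_in, frame), matvec(w_rec, hidden), bias)]
        hidden = [math.tanh(p) for p in pre]
    return hidden

def conv_rnn_layer(frames, window_size, stride, w_in, w_rec, bias):
    """Slide a window over the sequence and extract one feature vector per window,
    sharing the same RNN weights across all windows."""
    features = []
    for start in range(0, len(frames) - window_size + 1, stride):
        window = frames[start:start + window_size]
        features.append(rnn_last_state(window, w_in, w_rec, bias))
    return features

# Hypothetical toy input: 6 frames with 2 values each, hidden size 3.
frames = [[0.1 * t, -0.05 * t] for t in range(6)]
w_in = [[0.5, -0.3], [0.2, 0.4], [-0.1, 0.6]]
w_rec = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
bias = [0.0, 0.0, 0.0]
out = conv_rnn_layer(frames, window_size=3, stride=1, w_in=w_in, w_rec=w_rec, bias=bias)
```

As in an ordinary convolutional layer, the same weights are applied at every window position; the difference is that each window is processed step by step rather than by a single affine map followed by a non-linearity.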
Multimodal Speech Emotion Recognition Using Audio and Text
Speech emotion recognition is a challenging task, and extensive reliance has
been placed on models that use audio features in building well-performing
classifiers. In this paper, we propose a novel deep dual recurrent encoder
model that utilizes text data and audio signals simultaneously to obtain a
better understanding of speech data. As emotional dialogue is composed of sound
and spoken content, our model encodes the information from audio and text
sequences using dual recurrent neural networks (RNNs) and then combines the
information from these sources to predict the emotion class. This architecture
analyzes speech data from the signal level to the language level, and it thus
utilizes the information within the data more comprehensively than models that
focus on audio features. Extensive experiments are conducted to investigate the
efficacy and properties of the proposed model. Our proposed model outperforms
previous state-of-the-art methods in assigning data to one of four emotion
categories (i.e., angry, happy, sad and neutral) when the model is applied to
the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
Comment: 7 pages, Accepted as a conference paper at IEEE SLT 201
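
The final step described in this abstract, combining an audio encoding and a text encoding to predict one of four emotion classes, might look roughly like the sketch below. It assumes the two recurrent encoders have already produced fixed-length vectors, and it uses simple concatenation followed by a linear layer and softmax; the function name `fuse_and_classify`, the weights, and the fusion choice are illustrative assumptions, not details taken from the paper.

```python
import math

EMOTIONS = ["angry", "happy", "sad", "neutral"]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(audio_vec, text_vec, weights, bias):
    """Concatenate the two encodings (assumed fusion), apply a linear layer, then softmax."""
    fused = audio_vec + text_vec
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs

# Hypothetical encoder outputs and an identity weight matrix, chosen only for illustration.
audio_vec = [0.2, -0.1]
text_vec = [0.7, 0.3]
weights = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
bias = [0.0, 0.0, 0.0, 0.0]
label, probs = fuse_and_classify(audio_vec, text_vec, weights, bias)
```

In a trained model the weights would be learned jointly with both encoders; the identity matrix here only makes the data flow easy to follow.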
Advanced LSTM: A Study about Better Time Dependency Modeling in Emotion Recognition
Long short-term memory (LSTM) is normally used in recurrent neural networks
(RNNs) as the basic recurrent unit. However, conventional LSTM assumes that the
state at the current time step depends on the previous time step. This
assumption constrains the time dependency modeling capability. In this study, we propose a new
variation of LSTM, advanced LSTM (A-LSTM), for better temporal context
modeling. We employ A-LSTM in weighted pooling RNN for emotion recognition. The
A-LSTM outperforms the conventional LSTM by a relative 5.5%. The A-LSTM-based
weighted pooling RNN can also complement the state-of-the-art emotion
classification framework. This shows the advantage of A-LSTM.
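
To make the single-step dependency concrete, the sketch below contrasts it with a recurrence whose "previous state" is a weighted mix of several earlier states. This is not the A-LSTM formulation, which the abstract does not spell out: it is a scalar tanh recurrence without gates, and the lag set, the softmax weighting, and the name `run_multi_lag` are assumptions made purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def run_multi_lag(inputs, lags, lag_scores, w_x=0.8, w_h=0.5):
    """Scalar tanh recurrence in which the recurrent input is a softmax-weighted
    combination of the states at the given lags (an assumed mechanism)."""
    weights = softmax(lag_scores)
    states = []
    for t, x in enumerate(inputs):
        # States before the start of the sequence are treated as zero.
        combined = sum(w * (states[t - lag] if t - lag >= 0 else 0.0)
                       for w, lag in zip(weights, lags))
        states.append(math.tanh(w_x * x + w_h * combined))
    return states

inputs = [0.5, -0.2, 0.1, 0.9]
single = run_multi_lag(inputs, lags=[1], lag_scores=[0.0])          # conventional: previous step only
multi = run_multi_lag(inputs, lags=[1, 3], lag_scores=[0.0, 0.0])   # mixes states from 1 and 3 steps back
```

With `lags=[1]` the recurrence reduces to the conventional case where each state depends only on the immediately preceding one; adding further lags lets earlier states influence the current step directly.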