7,444 research outputs found
An Attention Pooling based Representation Learning Method for Speech Emotion Recognition
This paper proposes an attention pooling based representation learning method for speech emotion recognition (SER). The emotional representation is learned in an end-to-end fashion by applying a deep convolutional neural network (CNN) directly to spectrograms extracted from speech utterances. Motivated by the success of GoogleNet, two groups of filters with different shapes are designed to capture both temporal and frequency domain context information from the input spectrogram. The learned features are concatenated and fed into the subsequent convolutional layers. To learn the final emotional representation, a novel attention pooling method is further proposed.
Compared with the existing pooling methods, such as max-pooling and average-pooling, the proposed attention pooling can effectively incorporate class-agnostic bottom-up, and class-specific top-down, attention maps. We conduct extensive evaluations on benchmark IEMOCAP data to assess the effectiveness of the proposed representation. Results demonstrate a recognition performance of 71.8% weighted accuracy (WA) and 68% unweighted accuracy (UA) over four emotions, which outperforms the state-of-the-art method by about 3% absolute for WA and 4% for UA
Advanced LSTM: A Study about Better Time Dependency Modeling in Emotion Recognition
Long short-term memory (LSTM) is normally used in recurrent neural network
(RNN) as basic recurrent unit. However,conventional LSTM assumes that the state
at current time step depends on previous time step. This assumption constraints
the time dependency modeling capability. In this study, we propose a new
variation of LSTM, advanced LSTM (A-LSTM), for better temporal context
modeling. We employ A-LSTM in weighted pooling RNN for emotion recognition. The
A-LSTM outperforms the conventional LSTM by 5.5% relatively. The A-LSTM based
weighted pooling RNN can also complement the state-of-the-art emotion
classification framework. This shows the advantage of A-LSTM
Robust Speaker Recognition Using Speech Enhancement And Attention Model
In this paper, a novel architecture for speaker recognition is proposed by
cascading speech enhancement and speaker processing. Its aim is to improve
speaker recognition performance when speech signals are corrupted by noise.
Instead of individually processing speech enhancement and speaker recognition,
the two modules are integrated into one framework by a joint optimisation using
deep neural networks. Furthermore, to increase robustness against noise, a
multi-stage attention mechanism is employed to highlight the speaker related
features learned from context information in time and frequency domain. To
evaluate speaker identification and verification performance of the proposed
approach, we test it on the dataset of VoxCeleb1, one of mostly used benchmark
datasets. Moreover, the robustness of our proposed approach is also tested on
VoxCeleb1 data when being corrupted by three types of interferences, general
noise, music, and babble, at different signal-to-noise ratio (SNR) levels. The
obtained results show that the proposed approach using speech enhancement and
multi-stage attention models outperforms two strong baselines not using them in
most acoustic conditions in our experiments.Comment: Acceptted by Odyssey 202
- …