
    Auditory-Inspired End-to-End Speech Emotion Recognition Using 3D Convolutional Recurrent Neural Networks Based on Spectral-Temporal Representation

    The human auditory system has far superior emotion recognition abilities compared with recent speech emotion recognition systems, so research has focused on designing emotion recognition systems that mimic the human auditory system. Psychoacoustic and physiological studies indicate that the human auditory system decomposes speech signals into acoustic and modulation frequency components and further extracts temporal modulation cues. Speech emotional states are perceived from these temporal modulation cues through the spectral and temporal receptive fields of auditory neurons. This paper proposes an end-to-end emotion recognition system using three-dimensional convolutional recurrent neural networks (3D-CRNNs) based on temporal modulation cues. The temporal modulation cues, organized as four-dimensional spectral-temporal (ST) integration representations, serve directly as the input to the 3D-CRNN. The convolutional layers extract high-level multiscale ST representations, and the recurrent layers capture long-term temporal dependencies for emotion recognition. The proposed method was verified on the IEMOCAP database. The results show that the proposed method exceeds the recognition accuracy of state-of-the-art systems.
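    The architecture described above (3-D convolutions over a spectral-temporal input followed by a recurrent layer) can be sketched as below. This is a hypothetical illustration, not the paper's exact configuration: the input dimensions (32 acoustic-frequency by 32 modulation-frequency bins), layer sizes, and the 4-class output are all assumptions.

    ```python
    # Hedged sketch of a 3D-CRNN for speech emotion recognition, loosely
    # following the abstract's description: a 4-D spectral-temporal input
    # (channel x acoustic frequency x modulation frequency x time), 3-D
    # convolutions for multiscale ST features, and an LSTM for long-term
    # temporal dependency. All sizes here are illustrative assumptions.
    import torch
    import torch.nn as nn

    class CRNN3D(nn.Module):
        def __init__(self, n_classes=4, hidden=64):
            super().__init__()
            # 3-D convolutions extract multiscale spectral-temporal features
            self.conv = nn.Sequential(
                nn.Conv3d(1, 8, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool3d((2, 2, 1)),          # pool frequency axes, keep time
                nn.Conv3d(8, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool3d((2, 2, 1)),
            )
            # Recurrent layer models long-term temporal dependency
            self.rnn = nn.LSTM(input_size=16 * 8 * 8, hidden_size=hidden,
                               batch_first=True)
            self.fc = nn.Linear(hidden, n_classes)

        def forward(self, x):
            # x: (batch, 1, acoustic_freq=32, modulation_freq=32, time)
            f = self.conv(x)                       # -> (batch, 16, 8, 8, time)
            b, c, fa, fm, t = f.shape
            # flatten the spectral axes so each time step becomes one feature vector
            seq = f.permute(0, 4, 1, 2, 3).reshape(b, t, c * fa * fm)
            out, _ = self.rnn(seq)                 # (batch, time, hidden)
            return self.fc(out[:, -1])             # last step -> emotion logits

    model = CRNN3D()
    logits = model(torch.randn(2, 1, 32, 32, 50))
    print(logits.shape)  # torch.Size([2, 4])
    ```

    The time axis is deliberately left unpooled so the recurrent layer sees the full temporal sequence; a real implementation would also need the ST feature extraction front-end and IEMOCAP-specific training details, which the abstract does not specify.
    
    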