Multi-modal Attention for Speech Emotion Recognition
Emotion represents an essential aspect of human speech that is manifested in
speech prosody. Speech, visual, and textual cues are complementary in human
communication. In this paper, we study a hybrid fusion method, referred to as
multi-modal attention network (MMAN), to make use of visual and textual cues in
speech emotion recognition. We propose a novel multi-modal attention mechanism,
cLSTM-MMA, which facilitates attention across the three modalities and
selectively fuses the information. cLSTM-MMA is fused with other uni-modal
sub-networks in the late fusion. The experiments show that speech emotion
recognition benefits significantly from visual and textual cues, and the
proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of
accuracy, but with a much more compact network structure. The proposed hybrid
network MMAN achieves state-of-the-art performance on the IEMOCAP database for
emotion recognition.

Comment: Accepted by Interspeech202
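The abstract describes attention that scores the three modalities and selectively fuses them. As a minimal illustrative sketch only (not the paper's cLSTM-MMA; all function names, the dot-product scoring, and the toy embeddings are hypothetical assumptions), cross-modal attention can be thought of as computing a softmax weight per modality and taking a weighted sum of the modality features:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(features, query):
    """Toy attention fusion (illustrative, not the paper's method).

    features: dict mapping modality name -> feature vector (equal-length lists)
    query:    vector used to score each modality via a dot product
    Returns the attention-weighted fused vector and the per-modality weights.
    """
    names = list(features)
    # Score each modality against the query, then normalize with softmax.
    scores = [sum(q * f for q, f in zip(query, features[n])) for n in names]
    weights = softmax(scores)
    # Fuse: attention-weighted sum of the modality feature vectors.
    dim = len(query)
    fused = [sum(w * features[n][i] for w, n in zip(weights, names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))

# Toy embeddings for the three modalities used in the paper's setting.
feats = {
    "audio":  [0.9, 0.1, 0.0],
    "visual": [0.2, 0.8, 0.1],
    "text":   [0.1, 0.3, 0.7],
}
fused, weights = attention_fuse(feats, query=[1.0, 0.0, 0.0])
```

In the paper's actual architecture the scoring is learned inside the cLSTM-MMA network and its output is then combined with uni-modal sub-networks in a late-fusion step; this sketch only illustrates the general idea of selective, weight-based fusion across modalities.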