Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text
Speech emotion recognition is a challenging task and an important step towards more
natural human-computer interaction (HCI). A popular approach is multimodal
emotion recognition based on model-level fusion: the signal from each modality
is encoded into an embedding, and the embeddings are concatenated for the final
classification. However, due to noise and other factors, the modalities do not
always point to the same emotional category, which hurts the generalization of
the model. In this paper, we propose a novel regularization method based on
contrastive learning for multimodal emotion recognition using audio and text.
By introducing a discriminator that distinguishes pairs with the same emotion
from pairs with different emotions, we explicitly constrain the latent code of
each modality to carry the same emotional information, reducing noise
interference and yielding more discriminative representations. Experiments are
performed on the
standard IEMOCAP dataset for 4-class emotion recognition. The results show
significant improvements of 1.44% in weighted accuracy (WA) and 1.53% in
unweighted accuracy (UA) over the baseline system.

Comment: Completed in October 2020 and submitted to ICASSP 2021
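
As a rough illustration of the idea described above (not the authors' implementation), the PyTorch sketch below assumes pre-pooled audio and text embeddings; the class name FusionWithContrastiveReg, all dimensions, the negative-pair construction via in-batch shuffling, and the regularization weight are hypothetical. A discriminator scores whether an (audio, text) latent pair expresses the same emotion, and a binary cross-entropy loss over same/different pairs acts as the contrastive regularizer added to the usual classification loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionWithContrastiveReg(nn.Module):
    """Model-level fusion of audio and text embeddings with a
    discriminator-based contrastive regularizer (illustrative sketch)."""

    def __init__(self, audio_dim, text_dim, hidden_dim=256, num_classes=4):
        super().__init__()
        # Project each modality into a shared latent space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Emotion classifier over the concatenated latent codes.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)
        # Discriminator: given an (audio, text) latent pair, predict
        # whether the two codes express the same emotion.
        self.discriminator = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, audio_emb, text_emb, labels=None):
        a = self.audio_proj(audio_emb)   # (B, hidden_dim)
        t = self.text_proj(text_emb)     # (B, hidden_dim)
        logits = self.classifier(torch.cat([a, t], dim=-1))

        reg_loss = torch.tensor(0.0, device=a.device)
        if labels is not None:
            # Positive pairs: audio and text codes from the same utterance
            # (same emotion by construction). Negative pairs: audio codes
            # matched with text codes shuffled within the batch, kept only
            # where the shuffled labels actually differ.
            perm = torch.randperm(t.size(0), device=t.device)
            pos = self.discriminator(torch.cat([a, t], dim=-1))
            neg = self.discriminator(torch.cat([a, t[perm]], dim=-1))
            neg_mask = (labels != labels[perm]).float().unsqueeze(-1)
            pos_loss = F.binary_cross_entropy_with_logits(
                pos, torch.ones_like(pos))
            neg_loss = (F.binary_cross_entropy_with_logits(
                neg, torch.zeros_like(neg), reduction="none")
                * neg_mask).sum() / neg_mask.sum().clamp(min=1.0)
            reg_loss = pos_loss + neg_loss
        return logits, reg_loss

# Hypothetical usage with dummy features; the 0.1 weight is an assumption,
# not a value from the paper.
audio = torch.randn(8, 128)            # e.g. pooled acoustic features
text = torch.randn(8, 768)             # e.g. pooled text encoder outputs
labels = torch.randint(0, 4, (8,))     # 4 emotion classes (IEMOCAP setup)
model = FusionWithContrastiveReg(audio_dim=128, text_dim=768)
logits, reg = model(audio, text, labels)
loss = F.cross_entropy(logits, labels) + 0.1 * reg
```

The design choice worth noting is that the regularizer never changes the classifier directly: it shapes the per-modality latent codes so that both encoders are pushed toward emotion-consistent representations before fusion, which is the stated goal of reducing cross-modal disagreement caused by noise.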