Emotions are crucial in understanding the human
psychological state and can influence social interactions and
decision-making. Integrating emotion processing into computational systems enables the creation of more natural and adaptive interfaces. This research focuses on Speech
Emotion Recognition (SER), which aims to identify human
emotions through voice signal analysis. Many previous studies
have been limited to unimodal approaches, and only a few have explored multimodal approaches that combine voice and text simultaneously. Furthermore, little research has directly integrated Automatic Speech Recognition (ASR) for text together with voice feature extraction in a single step while also applying data augmentation to improve model generalization. This research
contributes to improving the accuracy of emotion recognition on the IEMOCAP dataset, comprising a total of 5797 voice and text samples labeled angry, sad, happy, and neutral, by incorporating the Wav2Vec model as a multimodal feature extractor (voice and text) and by applying SpecAugment to enrich data variety. Structurally, our proposed architecture
consists of two branches: a voice branch and a text branch.
Features extracted by Wav2Vec are fed into the voice branch, which uses an ECAPA-TDNN model, and into the text branch, which uses a BERT model. The outputs of these two branches are then combined using fully
connected layers for final classification. Experiments show that
this multimodal approach can achieve high performance,
namely a weighted accuracy of 90.28% and an unweighted accuracy
of 90.62%, without requiring special fine-tuning of the text model.
These results indicate that combining a multimodal architecture with pretrained models and data augmentation can significantly improve the performance of SER systems.
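To illustrate the two-branch fusion described above, the following is a minimal sketch, not the authors' implementation: it assumes precomputed utterance embeddings from the ECAPA-TDNN voice branch and [CLS] embeddings from the BERT text branch, and all module names, embedding sizes, and layer widths are illustrative assumptions.

```python
# Minimal sketch of a two-branch late-fusion classifier for SER.
# Assumptions: Wav2Vec features feed an ECAPA-TDNN voice branch and ASR
# transcripts feed a BERT text branch; the 192/768 embedding sizes and the
# 256-unit hidden layer are placeholders, not the paper's actual values.
import torch
import torch.nn as nn

class MultimodalSER(nn.Module):
    def __init__(self, voice_dim=192, text_dim=768, num_classes=4):
        super().__init__()
        # Fusion head: concatenate branch embeddings, then fully connected layers.
        self.classifier = nn.Sequential(
            nn.Linear(voice_dim + text_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),  # angry, sad, happy, neutral
        )

    def forward(self, voice_emb, text_emb):
        # voice_emb: (batch, voice_dim) utterance embedding from the voice branch
        # text_emb:  (batch, text_dim) sentence embedding from the text branch
        fused = torch.cat([voice_emb, text_emb], dim=-1)
        return self.classifier(fused)

# Usage with random placeholder embeddings standing in for the branch outputs.
model = MultimodalSER()
logits = model(torch.randn(8, 192), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```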