MULTIMODAL EMOTION ANALYSIS WITH FOCUSED ATTENTION

Abstract

Emotion analysis, a subset of sentiment analysis, studies a wide array of emotional indicators. Whereas sentiment analysis restricts its focus to positive and negative polarity, emotion analysis extends to a diverse spectrum of emotional cues. Contemporary work in emotion analysis leans toward multimodal approaches that leverage audio, visual, and text modalities. However, multimodal strategies introduce their own challenges: model complexity and parameter counts grow, which in turn demands larger volumes of training data. This thesis responds to this challenge by proposing a robust model for emotion recognition that relies only on audio and text. Our approach uses the Audio Spectrogram Transformer (AST) and the BERT language model to extract distinctive features from the auditory and textual modalities, followed by feature fusion. Despite omitting the visual modality employed by state-of-the-art (SOTA) methods, our model achieves comparable performance, reaching an F1 score of 0.67 when benchmarked against existing standards on the IEMOCAP dataset [1], which consists of roughly 12 hours of audio recordings broken down into 5255 scripted and 4784 spontaneous turns, each labeled with emotions such as anger, neutral, frustration, happy, and sad. In essence, we propose a fully attention-focused multimodal approach to emotion analysis that performs well on relatively small datasets by leveraging lightweight data sources such as audio and text. For reproducibility, the code is available at 2AI Lab’s GitHub repository: https://github.com/2ai-lab/multimodal-emotion
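
To make the fusion step described above concrete, the following is a minimal sketch, not the thesis implementation: it assumes audio embeddings shaped like AST encoder outputs and text embeddings shaped like BERT outputs, and fuses them with cross-modal attention before a small emotion classifier. The class name `FusionClassifier`, the hidden size, and the pooling choice are illustrative assumptions.

import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, hidden_dim=768, num_classes=5):
        super().__init__()
        # Project both modalities into a shared space before attention.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Cross-modal attention: text tokens attend over audio frames.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            # e.g. anger, neutral, frustration, happy, sad
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, audio_frames, audio_dim) from an AST encoder
        # text_feats:  (batch, text_tokens, text_dim) from a BERT encoder
        a = self.audio_proj(audio_feats)
        t = self.text_proj(text_feats)
        fused, _ = self.cross_attn(query=t, key=a, value=a)
        # Mean-pool the fused token sequence and classify.
        return self.classifier(fused.mean(dim=1))

# Example with random stand-in tensors shaped like AST / BERT outputs.
model = FusionClassifier()
logits = model(torch.randn(2, 1214, 768), torch.randn(2, 32, 768))
print(logits.shape)  # torch.Size([2, 5])

In practice the two encoders would be pretrained AST and BERT models whose token-level outputs are fed into such a fusion head; the sketch only illustrates the attention-based fusion and classification stage.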

    Similar works