Affective computing is a field of study that focuses on developing systems
and technologies that can understand, interpret, and respond to human emotions.
Speech Emotion Recognition (SER), in particular, has received considerable
attention from researchers in recent years. However, the publicly available
datasets used for training and evaluation are often scarce and imbalanced
across emotion labels. In this work, we focused on building a balanced corpus
by combining several publicly available datasets and applying various speech
data augmentation techniques. Furthermore, we
experimented with different architectures for speech emotion recognition. Our
best system, a multi-modal speech- and text-based model, achieves a combined
UA (Unweighted Accuracy) + WA (Weighted Accuracy) score of 157.57, compared
to 119.66 for the baseline.
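The abstract does not enumerate the specific augmentation techniques used, but
a minimal sketch of common waveform-level speech augmentations (noise
injection, pitch shifting, time stretching), assuming librosa and NumPy, might
look as follows; the file path and parameter values below are illustrative
placeholders, not the paper's settings:

import numpy as np
import librosa

def add_noise(y, noise_level=0.005):
    # Inject Gaussian noise scaled by noise_level to simulate recording noise.
    return y + noise_level * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2.0):
    # Shift pitch by n_steps semitones while preserving duration.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def stretch_time(y, rate=1.1):
    # Change speaking rate (rate > 1 speeds up) while preserving pitch.
    return librosa.effects.time_stretch(y, rate=rate)

if __name__ == "__main__":
    # "sample.wav" is a placeholder path; augmented copies of a clip can be
    # used to oversample under-represented emotion classes.
    y, sr = librosa.load("sample.wav", sr=16000)
    augmented = [add_noise(y), shift_pitch(y, sr), stretch_time(y)]

Applied to clips from minority emotion classes, such transformations yield
additional training examples and are one common way to balance a corpus like
the one described above.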