Exploring Self-Supervised Learning for Speech Emotion Recognition: Feature Analysis, Dimensional Enhancement and Emotion Classification

Abstract

Speech Emotion Recognition (SER) aims to identify emotional states from speech signals by analyzing acoustic properties that reflect affective expression. Traditional SER approaches often rely on handcrafted acoustic features such as Mel-Frequency Cepstral Coefficients (MFCCs) and prosodic descriptors, which may lack the capacity to capture context-sensitive or subtle emotional variations. Recent advances in self-supervised learning (SSL) have enabled the development of models trained on large-scale unlabeled speech data, producing general-purpose speech embeddings that enhance emotion recognition without task-specific fine-tuning. This thesis investigates the effectiveness of SSL-derived acoustic embeddings in both dimensional and categorical SER tasks, with a particular focus on dimensional SER (DSER). The study addresses three key objectives: (1) systematically compare traditional handcrafted features with SSL embeddings across three benchmark datasets for DSER; (2) enhance temporal modeling of emotional dynamics using transformer-based encoders with a two-step sequence reduction strategy; and (3) explore strategies to improve categorical SER (CSER) by leveraging DSER outputs through integration, regression-informed mapping, and multi-task learning (MTL). Empirical results demonstrate that pre-trained SSL models such as WavLM and UniSpeech-SAT outperform traditional baselines in DSER, with the greatest improvements observed for valence, followed by dominance and arousal. Transformer-based architectures with sequence reduction further enhance valence prediction. Integrating DSER into CSER frameworks yields consistent performance gains, particularly via MTL and SSL-enhanced mappings. This work contributes to building more generalizable, flexible, and context-aware SER systems.
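To make the pipeline the abstract describes concrete, the sketch below shows how frame-level embeddings from a pre-trained WavLM checkpoint can feed a joint dimensional/categorical head in an MTL setup. This is a minimal illustration, not the thesis's exact architecture: the checkpoint name (`microsoft/wavlm-base-plus`), the head sizes, and the mean pooling are assumptions, whereas the thesis uses transformer-based encoders with a two-step sequence reduction rather than plain pooling.

```python
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, WavLMModel

# Hypothetical checkpoint choice; the abstract names WavLM but not a variant.
CHECKPOINT = "microsoft/wavlm-base-plus"

extractor = AutoFeatureExtractor.from_pretrained(CHECKPOINT)
ssl_model = WavLMModel.from_pretrained(CHECKPOINT)
ssl_model.eval()


class MultiTaskHead(nn.Module):
    """Illustrative MTL head: one pooled SSL embedding feeds both a
    dimensional regressor (arousal, valence, dominance) and a
    categorical emotion classifier."""

    def __init__(self, hidden_size: int = 768, num_emotions: int = 4):
        super().__init__()
        self.dimensional = nn.Linear(hidden_size, 3)  # arousal, valence, dominance
        self.categorical = nn.Linear(hidden_size, num_emotions)

    def forward(self, pooled: torch.Tensor):
        return self.dimensional(pooled), self.categorical(pooled)


head = MultiTaskHead(hidden_size=ssl_model.config.hidden_size)

# 3 seconds of 16 kHz audio; random noise stands in for real speech here.
waveform = torch.randn(16000 * 3)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # Frame-level SSL embeddings: (batch, frames, hidden_size).
    frames = ssl_model(**inputs).last_hidden_state
    # Crude utterance-level pooling; the thesis instead applies transformer
    # encoders with a two-step sequence reduction before prediction.
    pooled = frames.mean(dim=1)
    dims, logits = head(pooled)

print(dims.shape, logits.shape)  # torch.Size([1, 3]) torch.Size([1, 4])
```

In an MTL setup like the one the abstract reports gains from, the two heads would be trained jointly, e.g., with a weighted sum of a regression loss over the dimensional outputs and cross-entropy over the categorical logits, so that dimensional supervision informs the categorical task.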


This thesis was published in the Sydney eScholarship repository.
