
    Multitask Learning from Augmented Auxiliary Data for Improving Speech Emotion Recognition

    Despite the recent progress in speech emotion recognition (SER), state-of-the-art systems lack generalisation across different conditions. A key underlying reason for poor generalisation is the scarcity of emotion datasets, which is a significant roadblock to designing robust machine learning (ML) models. Recent works in SER focus on utilising multitask learning (MTL) methods to improve generalisation by learning shared representations. However, most of these studies propose MTL solutions that require meta labels for the auxiliary tasks, which limits the training of SER systems. This paper proposes an MTL framework (MTL-AUG) that learns generalised representations from augmented data. We utilise augmentation-type classification and unsupervised reconstruction as auxiliary tasks, which allow training SER systems on augmented data without requiring any meta labels for the auxiliary tasks. The semi-supervised nature of MTL-AUG allows for the exploitation of abundant unlabelled data to further boost the performance of SER. We comprehensively evaluate the proposed framework in the following settings: (1) within corpus, (2) cross-corpus and cross-language, (3) noisy speech, and (4) adversarial attacks. Our evaluations using the widely used IEMOCAP, MSP-IMPROV, and EMODB datasets show improved results compared to existing state-of-the-art methods. Comment: Under review at IEEE Transactions on Affective Computing
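    Below is a minimal PyTorch sketch of the multitask idea described in the abstract: a shared encoder feeding a supervised emotion head plus two label-free auxiliary heads (augmentation-type classification and reconstruction). The module names, layer sizes, loss weights and reconstruction target are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of a multitask SER setup in the spirit of MTL-AUG: a shared
# encoder feeds an emotion head (supervised) plus an augmentation-type head
# and a reconstruction decoder (both label-free auxiliary tasks).
# All module names, sizes and loss weights are illustrative assumptions.
import torch
import torch.nn as nn

class MTLAugSER(nn.Module):
    def __init__(self, n_emotions=4, n_aug_types=3, feat_dim=128):
        super().__init__()
        # Shared encoder over log-mel spectrogram patches (B, 1, T, F)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(feat_dim, n_emotions)   # main task
        self.aug_head = nn.Linear(feat_dim, n_aug_types)      # auxiliary: which augmentation was applied?
        self.decoder = nn.Sequential(                         # auxiliary: reconstruct an input summary
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 80)
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.emotion_head(z), self.aug_head(z), self.decoder(z)

model = MTLAugSER()
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
x = torch.randn(8, 1, 128, 80)        # dummy spectrogram batch
y_emo = torch.randint(0, 4, (8,))     # emotion labels (labelled subset only)
y_aug = torch.randint(0, 3, (8,))     # augmentation type is known "for free" from the augmentation pipeline
target = x.mean(dim=2).squeeze(1)     # crude reconstruction target, shape (B, 80)

emo_logits, aug_logits, recon = model(x)
loss = ce(emo_logits, y_emo) + 0.5 * ce(aug_logits, y_aug) + 0.5 * mse(recon, target)
loss.backward()
```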

    Transfer Learning for Personality Perception via Speech Emotion Recognition

    Holistic perception of affective attributes is an important human perceptual ability. However, this ability is far from being realized in current affective computing, as not all of the attributes are well studied and their interrelationships are poorly understood. In this work, we investigate the relationship between two affective attributes, personality and emotion, from a transfer learning perspective. Specifically, we transfer Transformer-based and wav2vec2-based emotion recognition models to perceive personality from speech across corpora. Compared with previous studies, our results show that transferring emotion recognition models is effective for personality perception. Moreover, this allows for better use and exploration of small personality corpora. We also provide novel findings on the relationship between personality and emotion that will aid future research on holistic affect recognition. Comment: Accepted to INTERSPEECH 202
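    A hedged sketch of the transfer-learning setup the abstract describes: reuse a wav2vec2-based encoder (here loaded via Hugging Face Transformers) and train a new head for personality traits on a small corpus. The checkpoint name, frozen-encoder choice, head shape and five-trait output are assumptions rather than the authors' exact configuration.

```python
# Hedged sketch: transfer a wav2vec2-based encoder to personality perception by
# freezing it and training a new Big-Five regression head. In practice the
# encoder would be a checkpoint already fine-tuned for emotion recognition.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # placeholder checkpoint, illustrative only
for p in encoder.parameters():
    p.requires_grad = False  # keep the pretrained representation fixed for small personality corpora

personality_head = nn.Sequential(
    nn.Linear(encoder.config.hidden_size, 128), nn.ReLU(),
    nn.Linear(128, 5),  # OCEAN / Big-Five trait scores (assumed target)
)

waveform = torch.randn(2, 16000)  # two dummy 1-second utterances at 16 kHz
with torch.no_grad():
    hidden = encoder(input_values=waveform).last_hidden_state  # (B, T', hidden)
utterance_emb = hidden.mean(dim=1)          # temporal average pooling
traits = personality_head(utterance_emb)    # (B, 5) predicted trait scores
loss = nn.MSELoss()(traits, torch.rand(2, 5))
loss.backward()
```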

    Survey of deep representation learning for speech emotion recognition

    Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features produced through feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important future areas of research. Our survey bridges a gap in the literature, since existing surveys either focus on SER with hand-engineered features or on representation learning in the general setting without a focus on SER.
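    As a small illustration of the contrast the survey draws, the sketch below computes a hand-engineered feature (MFCCs via librosa) alongside a representation produced by a small convolutional encoder whose parameters would be learned from data. The encoder architecture and sizes are purely illustrative assumptions.

```python
# Minimal illustration: a hand-engineered acoustic feature (MFCCs) versus a
# representation learned end-to-end by a small convolutional encoder.
import numpy as np
import librosa
import torch
import torch.nn as nn

y = np.random.randn(16000).astype(np.float32)   # dummy 1 s waveform at 16 kHz

# Handcrafted route: fixed, designed features
mfcc = librosa.feature.mfcc(y=y, sr=16000, n_mfcc=13)            # (13, frames)

# Representation-learning route: the features themselves are learned parameters
mel = librosa.feature.melspectrogram(y=y, sr=16000, n_mels=64)   # (64, frames)
encoder = nn.Sequential(
    nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
)
learned_repr = encoder(torch.from_numpy(mel).float().unsqueeze(0))  # (1, 128), trained jointly with the SER objective
```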

    Predicting emotion in speech: a Deep Learning approach using Attention mechanisms

    Speech Emotion Recognition (SER) has recently become a popular field of research because of its implications for human-computer interaction. In this study, the emotional state of the speaker is successfully predicted by using deep convolutional neural networks to automatically extract features from the spectrogram of a speech signal. Starting from a baseline model that uses statistical pooling, an alternative method is proposed that incorporates attention mechanisms as the pooling strategy. Additionally, multi-task learning is explored as an improvement over the baseline model by assigning language recognition as an auxiliary task. The final results show a remarkable improvement in classification accuracy with respect to previous, more conventional techniques, in particular Gaussian Mixture Models and i-vectors, as well as a notable improvement in the performance of the proposed attention mechanisms over statistical pooling.
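    The sketch below contrasts the two pooling strategies discussed above, statistical pooling versus learned attention pooling over frame-level CNN features, and adds an auxiliary language-recognition head in the multi-task spirit of the study. Feature dimensions and class counts are assumptions, not the thesis' exact setup.

```python
# Hedged sketch: statistical pooling (mean + std over time) versus learned
# attention pooling over frame-level CNN features, plus an auxiliary
# language-recognition head. Dimensions are illustrative.
import torch
import torch.nn as nn

frames = torch.randn(8, 200, 256)   # (batch, time, feature) from a CNN over spectrograms

# Baseline: statistical pooling
stat_pooled = torch.cat([frames.mean(dim=1), frames.std(dim=1)], dim=-1)  # (8, 512)

# Alternative: attention pooling with learned frame weights
class AttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # one scalar relevance score per frame

    def forward(self, x):                         # x: (B, T, D)
        w = torch.softmax(self.score(x), dim=1)   # (B, T, 1), weights sum to 1 over time
        return (w * x).sum(dim=1)                 # (B, D) weighted summary

attn_pooled = AttentionPool(256)(frames)          # (8, 256)

# Multi-task extension: main emotion head plus an auxiliary language head
emotion_logits = nn.Linear(256, 6)(attn_pooled)
language_logits = nn.Linear(256, 3)(attn_pooled)
```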

    Multi-head attention-based long short-term memory for depression detection from speech.

    Depression is a mental disorder that threatens the health and normal life of people, so it is essential to provide an effective way to detect it. However, research on depression detection has mainly focused on combining parallel features from audio, video, and text for performance gains, without making full use of the information inherent in speech. To focus on the more emotionally salient regions of depressed speech, we propose a multi-head time-dimension attention-based long short-term memory (LSTM) model. We first extract frame-level features that preserve the original temporal structure of a speech sequence and analyze how they differ between depressed and healthy speech. We then study the performance of various features and use a modified feature set as the input to the LSTM layer. Instead of using only the output of a conventional LSTM, multi-head time-dimension attention is employed to capture the time steps most relevant to depression detection by projecting the output into different subspaces. The experimental results show that the proposed model yields improvements of 2.3% and 10.3% over the LSTM baseline on the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) and the Multi-modal Open Dataset for Mental-disorder Analysis (MODMA) corpus, respectively.
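    A minimal sketch of the pipeline described above: frame-level features, an LSTM, multi-head attention over the time dimension, then a depression classifier, using PyTorch's nn.MultiheadAttention. The feature dimension, number of heads, and mean pooling over the attended frames are assumptions, not the authors' exact settings.

```python
# Hedged sketch: frame-level acoustic features -> LSTM -> multi-head attention
# over the time dimension -> depression classifier.
import torch
import torch.nn as nn

class AttnLSTMDepression(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, heads=4, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Multi-head attention projects the LSTM outputs into several subspaces
        # and re-weights time steps, instead of relying only on the last output.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (B, T, feat_dim) frame-level features
        h, _ = self.lstm(x)               # (B, T, hidden)
        attended, _ = self.attn(h, h, h)  # self-attention over the time dimension
        pooled = attended.mean(dim=1)     # summarize the attended time steps
        return self.classifier(pooled)    # (B, n_classes) logits

model = AttnLSTMDepression()
logits = model(torch.randn(4, 300, 40))  # e.g. 300 frames of 40-dim features
```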