A Weakly Supervised Approach to Emotion-change Prediction and Improved Mood Inference
Whilst a majority of affective computing research focuses on inferring
emotions, examining mood or understanding the \textit{mood-emotion interplay}
has received significantly less attention. Building on prior work, we (a)
deduce and incorporate emotion-change ($\Delta$) information for inferring
mood, without resorting to annotated labels, and (b) attempt mood prediction
for long-duration video clips, in alignment with the characterisation of mood.
We generate the emotion-change ($\Delta$) labels via metric learning from a
pre-trained Siamese Network, and use these in addition to mood labels for mood
classification. Experiments evaluating \textit{unimodal} (training only using
mood labels) vs \textit{multimodal} (training using mood plus $\Delta$ labels)
models show that mood prediction benefits from the incorporation of
emotion-change information, emphasising the importance of modelling the
mood-emotion interplay for effective mood inference.
Comment: 9 pages, 3 figures, 6 tables, published in IEEE International Conference on Affective Computing and Intelligent Interaction
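The abstract does not detail how the $\Delta$ labels are derived from the Siamese embeddings; the sketch below illustrates one plausible reading, thresholding pairwise distances in the learned metric space. The encoder architecture, feature dimensions, and threshold are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): weak emotion-change (delta) labels
# from a pre-trained Siamese encoder. All names and sizes are assumptions.
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """Shared encoder applied to the feature vectors of two video clips."""
    def __init__(self, in_dim=512, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def delta_labels(encoder, clip_a, clip_b, threshold=1.0):
    """Weakly label emotion change: a distance in the learned metric space
    above `threshold` (an assumed hyperparameter) counts as a change."""
    with torch.no_grad():
        d = torch.norm(encoder(clip_a) - encoder(clip_b), dim=-1)
    return (d > threshold).long()  # 1 = emotion changed, 0 = unchanged

# Usage: pseudo-label consecutive clip pairs, then train the mood classifier
# on mood labels plus these delta labels (the paper's "multimodal" setting).
encoder = SiameseEncoder()
a, b = torch.randn(8, 512), torch.randn(8, 512)
print(delta_labels(encoder, a, b))
```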
Audio self-supervised learning: a survey
Inspired by humans' cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) aims to discover general representations
from large-scale data without requiring human annotations, which are
expensive and time-consuming to produce. Its success in the fields of computer vision
and natural language processing has prompted its recent adoption in the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. Herein, we also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, and the existing
benchmarks suitable for evaluating the power of SSL in the computer-audition
domain. Finally, we discuss some open problems and point out future
directions for the development of audio SSL.
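As a concrete illustration of one SSL family such a survey covers, the snippet below sketches a SimCLR-style contrastive (NT-Xent) objective over two augmented views of the same audio clips; it is illustrative only and not tied to any specific method reviewed in the paper.

```python
# Illustrative example: contrastive SSL pulls together embeddings of two
# augmented "views" of the same clip and pushes apart other clips in the
# batch (an NT-Xent / SimCLR-style loss).
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same clips."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D)
    sim = z @ z.t() / temperature                        # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim.masked_fill_(mask, float('-inf'))                # drop self-similarity
    # the positive for sample i is its other view at index (i + n) mod 2N
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# e.g. z1/z2 from an audio encoder over two augmentations of the same clips
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(nt_xent(z1, z2))
```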
Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition
Speech Emotion Recognition (SER) is a challenging task due to limited data
and blurred boundaries of certain emotions. In this paper, we present a
comprehensive approach to improving SER performance throughout the model
lifecycle, including pre-training, fine-tuning, and inference stages. To
address the data scarcity issue, we utilize a pre-trained model, wav2vec2.0.
During fine-tuning, we propose a novel loss function that combines
cross-entropy loss with supervised contrastive learning loss to improve the
model's discriminative ability. This approach increases the inter-class
distances and decreases the intra-class distances, mitigating the issue of
blurred boundaries. Finally, to leverage the improved distances, we propose an
interpolation method at the inference stage that combines the model prediction
with the output from a k-nearest neighbors model. Our experiments on IEMOCAP
demonstrate that our proposed methods outperform current state-of-the-art
results.
Comment: Accepted by Interspeech 2023, poster
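The abstract gives the recipe but not the exact formulation. Below is a minimal sketch of the two ideas, assuming a Khosla-style supervised contrastive term weighted by a hyperparameter `alpha`, and a linear interpolation weight `lam` for the kNN blend; both weights, and all tensor shapes, are assumptions rather than values from the paper.

```python
# Sketch of (1) fine-tuning loss: cross-entropy + supervised contrastive,
# and (2) inference: interpolating softmax output with a kNN estimate.
import torch
import torch.nn.functional as F

def sup_con(features, labels, temperature=0.1):
    """Supervised contrastive loss (Khosla et al. style) over one batch."""
    f = F.normalize(features, dim=1)
    sim = f @ f.t() / temperature
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))      # exclude anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    has_pos = pos.any(dim=1)                             # anchors with positives
    mean_pos = (log_prob.masked_fill(~pos, 0.0).sum(1)[has_pos]
                / pos.sum(1)[has_pos])
    return -mean_pos.mean()

def combined_loss(logits, features, labels, alpha=0.5):
    # alpha is an assumed mixing weight, not specified in the abstract
    return F.cross_entropy(logits, labels) + alpha * sup_con(features, labels)

def knn_interpolate(logits, query, bank_feats, bank_labels, n_classes,
                    k=5, lam=0.7):
    """Blend the model's softmax output with a kNN vote over a feature bank."""
    d = torch.cdist(query, bank_feats)                          # (B, bank)
    nn_labels = bank_labels[d.topk(k, largest=False).indices]  # (B, k)
    p_knn = F.one_hot(nn_labels, n_classes).float().mean(dim=1)
    return lam * logits.softmax(dim=1) + (1 - lam) * p_knn
```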
VoxCeleb2: Deep Speaker Recognition
The objective of this paper is speaker recognition under noisy and
unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale
audio-visual speaker recognition dataset collected from open-source media.
Using a fully automated pipeline, we curate VoxCeleb2, which contains over a
million utterances from over 6,000 speakers. This is several times larger than
any publicly available speaker recognition dataset.
Second, we develop and compare Convolutional Neural Network (CNN) models and
training strategies that can effectively recognise identities from voice under
various conditions. The models trained on the VoxCeleb2 dataset surpass the
performance of previous works on a benchmark dataset by a significant margin.
Comment: To appear in Interspeech 2018. The audio-visual dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 . 1806.05622v2: minor fixes; 5 pages
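The paper's models are ResNet-style CNNs over spectrograms; the deliberately simplified stand-in below shows the overall recipe (train an embedding network with an identity classifier, then score verification trials by cosine similarity between embeddings). All layer sizes here are illustrative, not the paper's architecture.

```python
# Simplified sketch of the CNN speaker-recognition recipe; sizes are
# illustrative only, the actual models are deeper ResNet variants.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerCNN(nn.Module):
    def __init__(self, n_speakers=6000, emb_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                # pool over time/frequency
        )
        self.embed = nn.Linear(64, emb_dim)
        self.classify = nn.Linear(emb_dim, n_speakers)

    def forward(self, spec):                        # spec: (B, 1, mels, frames)
        e = self.embed(self.conv(spec).flatten(1))  # speaker embedding
        return e, self.classify(e)                  # identity logits (training)

model = SpeakerCNN()
e1, _ = model(torch.randn(1, 1, 64, 300))
e2, _ = model(torch.randn(1, 1, 64, 300))
print(F.cosine_similarity(e1, e2))                  # verification score
```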