A Weakly Supervised Approach to Emotion-change Prediction and Improved Mood Inference
Whilst a majority of affective computing research focuses on inferring
emotions, examining mood or understanding the \textit{mood-emotion interplay}
has received significantly less attention. Building on prior work, we (a)
deduce and incorporate emotion-change ($\Delta$) information for inferring
mood, without resorting to annotated labels, and (b) attempt mood prediction
for long-duration video clips, in alignment with the characterisation of mood.
We generate the emotion-change ($\Delta$) labels via metric learning from a
pre-trained Siamese Network, and use these in addition to mood labels for mood
classification. Experiments evaluating \textit{unimodal} (training only using
mood labels) vs \textit{multimodal} (training using mood plus $\Delta$ labels)
models show that mood prediction benefits from the incorporation of
emotion-change information, emphasising the importance of modelling the
mood-emotion interplay for effective mood inference.
Comment: 9 pages, 3 figures, 6 tables, published in IEEE International Conference on Affective Computing and Intelligent Interaction
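The abstract does not detail how the $\Delta$ labels are derived from the Siamese embeddings; the sketch below illustrates one plausible reading, thresholding pairwise distances in the learned metric space. The encoder architecture, feature dimensions, and threshold are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): weak emotion-change (delta) labels
# from a pre-trained Siamese encoder. All names and sizes are assumptions.
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """Shared encoder applied to the feature vectors of two video clips."""
    def __init__(self, in_dim=512, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def delta_labels(encoder, clip_a, clip_b, threshold=1.0):
    """Weakly label emotion change: a distance in the learned metric space
    above `threshold` (an assumed hyperparameter) counts as a change."""
    with torch.no_grad():
        d = torch.norm(encoder(clip_a) - encoder(clip_b), dim=-1)
    return (d > threshold).long()  # 1 = emotion changed, 0 = unchanged

# Usage: pseudo-label consecutive clip pairs, then train the mood classifier
# on mood labels plus these delta labels (the paper's "multimodal" setting).
encoder = SiameseEncoder()
a, b = torch.randn(8, 512), torch.randn(8, 512)
print(delta_labels(encoder, a, b))
```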
Audio self-supervised learning: a survey
Inspired by humans' cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) aims to discover general representations
from large-scale data without requiring human annotations, which are
expensive and time-consuming to produce. Its success in the fields of computer vision
and natural language processing has prompted its recent adoption in the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. Herein, we also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, and the existing
benchmarks suitable for evaluating the power of SSL in the computer-audition
domain. Finally, we discuss some open problems and point out future
directions for the development of audio SSL.
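As a concrete illustration of one SSL family such a survey covers, the snippet below sketches a SimCLR-style contrastive (NT-Xent) objective over two augmented views of the same audio clips; it is illustrative only and not tied to any specific method reviewed in the paper.

```python
# Illustrative example: contrastive SSL pulls together embeddings of two
# augmented "views" of the same clip and pushes apart other clips in the
# batch (an NT-Xent / SimCLR-style loss).
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same clips."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D)
    sim = z @ z.t() / temperature                        # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim.masked_fill_(mask, float('-inf'))                # drop self-similarity
    # the positive for sample i is its other view at index (i + n) mod 2N
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# e.g. z1/z2 from an audio encoder over two augmentations of the same clips
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(nt_xent(z1, z2))
```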
Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition
Speech Emotion Recognition (SER) is a challenging task due to limited data
and blurred boundaries of certain emotions. In this paper, we present a
comprehensive approach to improving SER performance throughout the model
lifecycle, including pre-training, fine-tuning, and inference stages. To
address the data scarcity issue, we utilize a pre-trained model, wav2vec2.0.
During fine-tuning, we propose a novel loss function that combines
cross-entropy loss with supervised contrastive learning loss to improve the
model's discriminative ability. This approach increases the inter-class
distances and decreases the intra-class distances, mitigating the issue of
blurred boundaries. Finally, to leverage the improved distances, we propose an
interpolation method at the inference stage that combines the model prediction
with the output from a k-nearest neighbors model. Our experiments on IEMOCAP
demonstrate that our proposed methods outperform current state-of-the-art
results.
Comment: Accepted by Interspeech 2023, poster
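The abstract gives the recipe but not the exact formulation. Below is a minimal sketch of the two ideas, assuming a Khosla-style supervised contrastive term weighted by a hyperparameter `alpha`, and a linear interpolation weight `lam` for the kNN blend; both weights, and all tensor shapes, are assumptions rather than values from the paper.

```python
# Sketch of (1) fine-tuning loss: cross-entropy + supervised contrastive,
# and (2) inference: interpolating softmax output with a kNN estimate.
import torch
import torch.nn.functional as F

def sup_con(features, labels, temperature=0.1):
    """Supervised contrastive loss (Khosla et al. style) over one batch."""
    f = F.normalize(features, dim=1)
    sim = f @ f.t() / temperature
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))      # exclude anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    has_pos = pos.any(dim=1)                             # anchors with positives
    mean_pos = (log_prob.masked_fill(~pos, 0.0).sum(1)[has_pos]
                / pos.sum(1)[has_pos])
    return -mean_pos.mean()

def combined_loss(logits, features, labels, alpha=0.5):
    # alpha is an assumed mixing weight, not specified in the abstract
    return F.cross_entropy(logits, labels) + alpha * sup_con(features, labels)

def knn_interpolate(logits, query, bank_feats, bank_labels, n_classes,
                    k=5, lam=0.7):
    """Blend the model's softmax output with a kNN vote over a feature bank."""
    d = torch.cdist(query, bank_feats)                          # (B, bank)
    nn_labels = bank_labels[d.topk(k, largest=False).indices]  # (B, k)
    p_knn = F.one_hot(nn_labels, n_classes).float().mean(dim=1)
    return lam * logits.softmax(dim=1) + (1 - lam) * p_knn
```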
VoxCeleb2: Deep Speaker Recognition
The objective of this paper is speaker recognition under noisy and
unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale
audio-visual speaker recognition dataset collected from open-source media.
Using a fully automated pipeline, we curate VoxCeleb2, which contains over a
million utterances from over 6,000 speakers. This is several times larger than
any publicly available speaker recognition dataset.
Second, we develop and compare Convolutional Neural Network (CNN) models and
training strategies that can effectively recognise identities from voice under
various conditions. The models trained on the VoxCeleb2 dataset surpass the
performance of previous works on a benchmark dataset by a significant margin.
Comment: To appear in Interspeech 2018. The audio-visual dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 . 1806.05622v2: minor fixes; 5 pages
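The paper's models are ResNet-style CNNs over spectrograms; the deliberately simplified stand-in below shows the overall recipe (train an embedding network with an identity classifier, then score verification trials by cosine similarity between embeddings). All layer sizes here are illustrative, not the paper's architecture.

```python
# Simplified sketch of the CNN speaker-recognition recipe; sizes are
# illustrative only, the actual models are deeper ResNet variants.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerCNN(nn.Module):
    def __init__(self, n_speakers=6000, emb_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                # pool over time/frequency
        )
        self.embed = nn.Linear(64, emb_dim)
        self.classify = nn.Linear(emb_dim, n_speakers)

    def forward(self, spec):                        # spec: (B, 1, mels, frames)
        e = self.embed(self.conv(spec).flatten(1))  # speaker embedding
        return e, self.classify(e)                  # identity logits (training)

model = SpeakerCNN()
e1, _ = model(torch.randn(1, 1, 64, 300))
e2, _ = model(torch.randn(1, 1, 64, 300))
print(F.cosine_similarity(e1, e2))                  # verification score
```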