14 research outputs found
Unsupervised Learning of Semantic Audio Representations
Even in the absence of any explicit semantic annotation, vast collections of
audio recordings provide valuable information for learning the categorical
structure of sounds. We consider several class-agnostic semantic constraints
that apply to unlabeled nonspeech audio: (i) noise and translations in time do
not change the underlying sound category, (ii) a mixture of two sound events
inherits the categories of the constituents, and (iii) the categories of events
in close temporal proximity are likely to be the same or related. Without
labels to ground them, these constraints are incompatible with classification
loss functions. However, they may still be leveraged to identify geometric
inequalities needed for triplet loss-based training of convolutional neural
networks. The result is low-dimensional embeddings of the input spectrograms
that recover 41% and 84% of the performance of their fully-supervised
counterparts when applied to downstream query-by-example sound retrieval and
sound event classification tasks, respectively. Moreover, in
limited-supervision settings, our unsupervised embeddings double the
state-of-the-art classification performance.Comment: Submitted to ICASSP 201
CNN Architectures for Large-Scale Audio Classification
Convolutional Neural Networks (CNNs) have proven very effective in image
classification and show promise for audio. We use various CNN architectures to
classify the soundtracks of a dataset of 70M training videos (5.24 million
hours) with 30,871 video-level labels. We examine fully connected Deep Neural
Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We
investigate varying the size of both training set and label vocabulary, finding
that analogs of the CNNs used in image classification do well on our audio
classification task, and larger training and label sets help up to a point. A
model using embeddings from these classifiers does much better than raw
features on the Audio Set [5] Acoustic Event Detection (AED) classification
task.Comment: Accepted for publication at ICASSP 2017 Changes: Added definitions of
mAP, AUC, and d-prime. Updated mAP/AUC/d-prime numbers for Audio Set based on
changes of latest Audio Set revision. Changed wording to fit 4 page limit
with new addition
Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events
We tackle the task of environmental event classification by drawing
inspiration from the transformer neural network architecture used in machine
translation. We modify this attention-based feedforward structure in such a way
that allows the resulting model to use audio as well as video to compute sound
event predictions. We perform extensive experiments with these adapted
transformers on an audiovisual data set, obtained by appending relevant visual
information to an existing large-scale weakly labeled audio collection. The
employed multi-label data contains clip-level annotation indicating the
presence or absence of 17 classes of environmental sounds, and does not include
temporal information. We show that the proposed modified transformers strongly
improve upon previously introduced models and in fact achieve state-of-the-art
results. We also make a compelling case for devoting more attention to research
in multimodal audiovisual classification by proving the usefulness of visual
information for the task at hand,namely audio event recognition. In addition,
we visualize internal attention patterns of the audiovisual transformers and in
doing so demonstrate their potential for performing multimodal synchronization
Self-supervised learning from automatically separated sound scenes
Comunicació presentada a 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), celebrat del 17 al 20 d'octubre de 2021 a New Paltz, Estats Units.Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark
Speech Emotion Recognition among Elderly Individuals using Multimodal Fusion and Transfer Learning
Recognizing the emotions of the elderly is important as it could give an insight into their mental health. Emotion recognition systems that work well on the elderly could be used to assess their emotions in places such as nursing homes and could inform the development of various activities and interventions to improve their mental health. However, several emotion recognition systems are developed using data from younger adults. In this work, we train machine learning models to recognize the emotions of elderly individuals via performing a 3-class classification of valence and arousal as part of the INTERSPEECH 2020 Computational Paralinguistics Challenge (COMPARE). We used speech data from 87 participants who gave spontaneous personal narratives. We leveraged a transfer learning approach in which we used pretrained CNN and BERT models to extract acoustic and linguistic features respectively and fed them into separate machine learning models. Also, we fused these two modalities in a multimodal approach. Our best model used a linguistic approach and outperformed the official competition of unweighted average recall (UAR) baseline for valence by 8.8% and the mean of valence and arousal by 3.2%. We also showed that feature engineering is not necessary as transfer learning without fine-tuning performs as well or better and could be leveraged for the task of recognizing the emotions of elderly individuals. This work is a step towards better recognition of the emotions of the elderly which could eventually inform the development of interventions to manage their mental health
Speech emotion recognition among couples using the peak-end rule and transfer learning
Extensive couples? literature shows that how couples feel after a conflict is predicted by certain emotional aspects of that conversation. Understanding the emotions of couples leads to a better understanding of partners? mental well-being and consequently their relationships. Hence, automatic emotion recognition among couples could potentially guide interventions to help couples improve their emotional well-being and their relationships. It has been shown that people's global emotional judgment after an experience is strongly influenced by the emotional extremes and ending of that experience, known as the peak-end rule. In this work, we leveraged this theory and used machine learning to investigate, which audio segments can be used to best predict the end-of-conversation emotions of couples. We used speech data collected from 101 Dutch-speaking couples in Belgium who engaged in 10-minute long conversations in the lab. We extracted acoustic features from (1) the audio segments with the most extreme positive and negative ratings, and (2) the ending of the audio. We used transfer learning in which we extracted these acoustic features with a pre-trained convolutional neural network (YAMNet). We then used these features to train machine learning models - support vector machines - to predict the end-of-conversation valence ratings (positive vs negative) of each partner. The results of this work could inform how to best recognize the emotions of couples after conversation-sessions and eventually, lead to a better understanding of couples? relationships either in therapy or in everyday life