Search CORE

14 research outputs found

Unsupervised Learning of Semantic Audio Representations

Author: Ellis Daniel P. W.
Hershey Shawn
Jansen Aren
Liu Jiayang
Moore R. Channing
Pandya Ratheet
Plakal Manoj
Saurous Rif A.
Publication venue
Publication date: 06/11/2017
Field of study

Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.Comment: Submitted to ICASSP 201

arXiv.org e-Print Archive

Crossref

CNN Architectures for Large-Scale Audio Classification

Author: Chaudhuri Sourish
Ellis Daniel P. W.
Gemmeke Jort F.
Hershey Shawn
Jansen Aren
Moore R. Channing
Plakal Manoj
Platt Devin
Saurous Rif A.
Seybold Bryan
Slaney Malcolm
Weiss Ron J.
Wilson Kevin
Publication venue
Publication date: 10/01/2017
Field of study

Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.Comment: Accepted for publication at ICASSP 2017 Changes: Added definitions of mAP, AUC, and d-prime. Updated mAP/AUC/d-prime numbers for Audio Set based on changes of latest Audio Set revision. Changed wording to fit 4 page limit with new addition

arXiv.org e-Print Archive

Crossref

Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events

Author: Abadi Mart'in
Gehring Jonas
Hershey Shawn
Iqbal Turab
Mesaros Annamaria
Mesaros Annamaria
Parekh Sanjeel
Plumbley Mark D.
Simonyan Karen
Virtanen Tuomas
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 02/12/2019
Field of study

We tackle the task of environmental event classification by drawing inspiration from the transformer neural network architecture used in machine translation. We modify this attention-based feedforward structure in such a way that allows the resulting model to use audio as well as video to compute sound event predictions. We perform extensive experiments with these adapted transformers on an audiovisual data set, obtained by appending relevant visual information to an existing large-scale weakly labeled audio collection. The employed multi-label data contains clip-level annotation indicating the presence or absence of 17 classes of environmental sounds, and does not include temporal information. We show that the proposed modified transformers strongly improve upon previously introduced models and in fact achieve state-of-the-art results. We also make a compelling case for devoting more attention to research in multimodal audiovisual classification by proving the usefulness of visual information for the task at hand,namely audio event recognition. In addition, we visualize internal attention patterns of the audiovisual transformers and in doing so demonstrate their potential for performing multimodal synchronization

arXiv.org e-Print Archive

Crossref

Self-supervised learning from automatically separated sound scenes

Author: Ellis Daniel P. W.
Fonseca Eduardo
Hershey John R.
Hershey Shawn
Jansen Aren
Moore R. Channing
Plakal Manoj
Serra Xavier
Tagliasacchi Marco
Wisdom Scott
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2021
Field of study

Comunicació presentada a 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), celebrat del 17 al 20 d'octubre de 2021 a New Paltz, Estats Units.Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark

UPF Digital Repository

Speech Emotion Recognition among Elderly Individuals using Multimodal Fusion and Transfer Learning

Author: Busso Carlos
Chollet Francc
Devlin Jacob
Hershey Shawn
Hochreiter Sepp
Howard Andrew G
Howard Jeremy
Onu Charles C.
Pedregosa Fabian
Reimers Nils
Risch Julian
Sahoo Sourav
Schuller Bjorn W.
Srivastava Nitish
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/10/2020
Field of study

Recognizing the emotions of the elderly is important as it could give an insight into their mental health. Emotion recognition systems that work well on the elderly could be used to assess their emotions in places such as nursing homes and could inform the development of various activities and interventions to improve their mental health. However, several emotion recognition systems are developed using data from younger adults. In this work, we train machine learning models to recognize the emotions of elderly individuals via performing a 3-class classification of valence and arousal as part of the INTERSPEECH 2020 Computational Paralinguistics Challenge (COMPARE). We used speech data from 87 participants who gave spontaneous personal narratives. We leveraged a transfer learning approach in which we used pretrained CNN and BERT models to extract acoustic and linguistic features respectively and fed them into separate machine learning models. Also, we fused these two modalities in a multimodal approach. Our best model used a linguistic approach and outperformed the official competition of unweighted average recall (UAR) baseline for valence by 8.8% and the mean of valence and arousal by 3.2%. We also showed that feature engineering is not necessary as transfer learning without fine-tuning performs as well or better and could be leveraged for the task of recognizing the emotions of elderly individuals. This work is a step towards better recognition of the emotions of the elderly which could eventually inform the development of interventions to manage their mental health

Repository for Publications and Research Data

Crossref

Speech emotion recognition among couples using the peak-end rule and transfer learning

Author: Black Matthew P
Boateng George
Busso Carlos
Carstensen Laura L
Dejonckheere Egon
Gibson James
Gottman John M
Gottman John Mordechai
Gottman John Mordechai
Hershey Shawn
Heyman Richard E
Howard Andrew G
Lee Chi-Chun
Metallinou Angeliki
Pedregosa Fabian
Ruef Anna Marie
Sahoo Sourav
Sels Laura
West Tessa V
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2020
Field of study

Extensive couples? literature shows that how couples feel after a conflict is predicted by certain emotional aspects of that conversation. Understanding the emotions of couples leads to a better understanding of partners? mental well-being and consequently their relationships. Hence, automatic emotion recognition among couples could potentially guide interventions to help couples improve their emotional well-being and their relationships. It has been shown that people's global emotional judgment after an experience is strongly influenced by the emotional extremes and ending of that experience, known as the peak-end rule. In this work, we leveraged this theory and used machine learning to investigate, which audio segments can be used to best predict the end-of-conversation emotions of couples. We used speech data collected from 101 Dutch-speaking couples in Belgium who engaged in 10-minute long conversations in the lab. We extracted acoustic features from (1) the audio segments with the most extreme positive and negative ratings, and (2) the ending of the audio. We used transfer learning in which we extracted these acoustic features with a pre-trained convolutional neural network (YAMNet). We then used these features to train machine learning models - support vector machines - to predict the end-of-conversation valence ratings (positive vs negative) of each partner. The results of this work could inform how to best recognize the emotions of couples after conversation-sessions and eventually, lead to a better understanding of couples? relationships either in therapy or in everyday life

Crossref

Ghent University Academic Bibliography

Serveur académique lausannois

White matter integrity and executive abilities following treatment with tetrahydrobiopterin (BH4) in individuals with phenylketonuria

Author: Anderson
Antenor-Dorsey
Brumm
Christ
Cleary
de Groot
DeRoche
Desirée A. White
Dezortova
Dorothy K. Grange
Gassio
Jerrel Rutlin
Jo Ann V. Antenor-Dorsey
Joshua S. Shimony
Kono
Leuzzi
Moyle
Mukherjee
Paine
Peng
Robert C. McKinstry
Scarabino
Shawn E. Christ
Tamara Hershey
van Spronsen
Vermathen
White
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref