Unsupervised Learning of Semantic Audio Representations
Even in the absence of any explicit semantic annotation, vast collections of
audio recordings provide valuable information for learning the categorical
structure of sounds. We consider several class-agnostic semantic constraints
that apply to unlabeled nonspeech audio: (i) noise and translations in time do
not change the underlying sound category, (ii) a mixture of two sound events
inherits the categories of the constituents, and (iii) the categories of events
in close temporal proximity are likely to be the same or related. Without
labels to ground them, these constraints are incompatible with classification
loss functions. However, they may still be leveraged to identify geometric
inequalities needed for triplet loss-based training of convolutional neural
networks. The result is low-dimensional embeddings of the input spectrograms
that recover 41% and 84% of the performance of their fully-supervised
counterparts when applied to downstream query-by-example sound retrieval and
sound event classification tasks, respectively. Moreover, in
limited-supervision settings, our unsupervised embeddings double the
state-of-the-art classification performance.
Comment: Submitted to ICASSP 201
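The triplet formulation described above can be sketched in a few lines. This is an illustrative minimal version only: the embedding dimensionality, margin, and the use of a small additive perturbation to stand in for constraint (i)'s noise/time-shift positives are assumptions, not the paper's exact setup.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive embedding toward the
    anchor and push the negative at least `margin` farther away."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

# Constraint (i): a noisy / time-shifted copy of a clip yields a positive
# embedding; a randomly drawn clip serves as the negative.
rng = np.random.default_rng(0)
anchor = rng.normal(size=(4, 128))                    # batch of 4 embeddings
positive = anchor + 0.05 * rng.normal(size=(4, 128))  # perturbed copies
negative = rng.normal(size=(4, 128))                  # unrelated clips

loss = triplet_loss(anchor, positive, negative)
print(loss.shape)  # one loss per triplet in the batch
```

In the actual system these embeddings would come from a convolutional network applied to spectrograms, with the loss backpropagated through it; the numpy version only shows the geometric inequality being enforced.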
Audio-Based Semantic Concept Classification for Consumer Video
This paper presents a novel method for automatically classifying consumer video clips based on their soundtracks. We use a set of 25 overlapping semantic classes, chosen for their usefulness to users, the viability of automatic detection and annotator labeling, and sufficient representation in available video collections. A set of 1873 videos from real users has been annotated with these concepts. Starting with a basic representation of each video clip as a sequence of mel-frequency cepstral coefficient (MFCC) frames, we experiment with three clip-level representations: single Gaussian modeling, Gaussian mixture modeling, and probabilistic latent semantic analysis of a Gaussian component histogram. Using such summary features, we produce support vector machine (SVM) classifiers based on the Kullback-Leibler, Bhattacharyya, or Mahalanobis distance measures. Quantitative evaluation shows that our approaches are effective for detecting interesting concepts in a large collection of real-world consumer video clips.
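The single-Gaussian clip representation and the divergence-based SVM kernel can be sketched as follows. This is a hedged illustration, not the paper's implementation: the diagonal-covariance assumption, the variance floor, the gamma value, and the synthetic stand-in MFCC frames are all choices made for the example.

```python
import numpy as np

def clip_gaussian(mfcc_frames):
    """Summarize a clip's MFCC frame sequence (T x D) as a single
    diagonal-covariance Gaussian: per-dimension mean and variance."""
    mu = mfcc_frames.mean(axis=0)
    var = mfcc_frames.var(axis=0) + 1e-6  # floor keeps the KL finite
    return mu, var

def symmetric_kl(p, q):
    """Symmetrized KL divergence between two diagonal Gaussians."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    def kl(m1, v1, m2, v2):
        return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    return kl(mu_p, var_p, mu_q, var_q) + kl(mu_q, var_q, mu_p, var_p)

def kl_kernel(p, q, gamma=0.1):
    """Turn the divergence into an SVM kernel value via exp(-gamma * D)."""
    return np.exp(-gamma * symmetric_kl(p, q))

# Synthetic stand-ins for 13-dimensional MFCC frames from two clips.
rng = np.random.default_rng(1)
clip_a = clip_gaussian(rng.normal(0.0, 1.0, size=(200, 13)))
clip_b = clip_gaussian(rng.normal(0.5, 1.2, size=(180, 13)))
print(kl_kernel(clip_a, clip_b))  # larger when the clips sound more alike
```

An off-the-shelf SVM that accepts a precomputed kernel matrix could then be trained on pairwise `kl_kernel` values between all clips.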
Recognizing and Classifying Environmental Sounds
Prof. Ellis presents a summary of LabROSA's new work, with a focus on recognizing environmental sounds, particularly for video classification by soundtrack.
Automatic social role recognition and its application in structuring multiparty interactions
Automatic processing of multiparty interactions is a research domain with important applications in content browsing, summarization, and information retrieval. In recent years, several works have been devoted to finding regular patterns that speakers exhibit in multiparty interactions, also known as social roles. Most of the research in the literature has focused on recognition of scenario-specific formal roles. More recently, role coding schemes based on informal social roles have been proposed, defining roles based on the behavior speakers exhibit in the functioning of a small-group interaction. Informal social roles represent a flexible classification scheme that can generalize across different scenarios of multiparty interaction. In this thesis, we focus on automatic recognition of informal social roles and exploit the influence of informal social roles on speaker behavior for structuring multiparty interactions. To model speaker behavior, we systematically explore various verbal and nonverbal cues extracted from turn-taking patterns, vocal expression, and linguistic style. The influence of social roles on the behavior cues exhibited by a speaker is modeled using a discriminative approach based on conditional random fields. Experiments performed on several hours of meeting data reveal that classification using conditional random fields improves the role recognition performance. We demonstrate the effectiveness of our approach by evaluating it on previously unseen scenarios of multiparty interaction. Furthermore, we also consider whether formal roles and informal roles can be predicted by the same verbal and nonverbal features. We exploit the influence of social roles on turn-taking patterns to improve speaker diarization under the distant-microphone condition.
Our work extends the hidden Markov model / Gaussian mixture model (HMM-GMM) speaker diarization system, and is based on jointly estimating both the speaker segmentation and social roles in an audio recording. We modify the minimum duration constraint in the HMM-GMM diarization system by using role information to model the expected duration of a speaker's turn. We also use social role n-grams as prior information to model speaker interaction patterns. Finally, we demonstrate the application of social roles to the problem of topic segmentation in meetings. We exploit our finding that social roles can change dynamically in conversations, and use this information to predict topic changes in meetings. We also present an unsupervised method for topic segmentation which combines social roles and lexical cohesion. Experimental results show that social roles improve the performance of both speaker diarization and topic segmentation.
Pervasive Sound Sensing: A Weakly Supervised Training Approach
Modern smartphones present an ideal device for pervasive sensing of human behaviour. Microphones have the potential to reveal key information about a person's behaviour. However, they have been utilized to a significantly lesser extent than other smartphone sensors in the context of human behaviour sensing. We postulate that, in order for microphones to be useful in behaviour sensing applications, the analysis techniques must be flexible and allow easy modification of the types of sounds to be sensed. A simplification of the training data collection process could allow a more flexible sound classification framework. We hypothesize that detailed training, a prerequisite for the majority of sound sensing techniques, is not necessary, and that a significantly less detailed and less time-consuming data collection process can be carried out, allowing even a non-expert to conduct the collection, labeling, and training process. To test this hypothesis, we implement a diverse density-based multiple instance learning framework to identify a target sound, and a bag trimming algorithm which, using the target sound, automatically segments weakly labeled sound clips to construct an accurate training set. Experiments reveal that our hypothesis is a valid one, and results show that classifiers trained using the automatically segmented training sets were able to accurately classify unseen sound samples with accuracies comparable to supervised classifiers, achieving average F-measures of 0.969 and 0.87 on two weakly supervised datasets.
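The diverse density idea behind this framework can be illustrated with a toy numpy sketch: a candidate target point scores highly when every positive (weakly labeled) bag contains an instance near it and no negative bag does. The Gaussian-style instance likelihood and the 2-D toy data below are assumptions for illustration, not the paper's feature space.

```python
import numpy as np

def diverse_density(target, pos_bags, neg_bags):
    """Diverse density of a candidate target point: high when every
    positive bag has an instance close to it and no negative bag does.
    Instance likelihood is the common exp(-||x - t||^2) form."""
    def bag_prob(bag):
        return np.max(np.exp(-np.sum((bag - target) ** 2, axis=1)))
    p = 1.0
    for bag in pos_bags:
        p *= bag_prob(bag)        # some instance in the bag matches the target
    for bag in neg_bags:
        p *= 1.0 - bag_prob(bag)  # no instance in the bag matches the target
    return p

# Toy 2-D instances: both positive bags share a point near (1, 1),
# standing in for the target sound present in every weakly labeled clip.
pos_bags = [np.array([[1.0, 1.0], [5.0, 5.0]]),
            np.array([[0.9, 1.1], [8.0, 2.0]])]
neg_bags = [np.array([[5.0, 5.0], [7.0, 1.0]])]

on_target = diverse_density(np.array([1.0, 1.0]), pos_bags, neg_bags)
off_target = diverse_density(np.array([5.0, 5.0]), pos_bags, neg_bags)
assert on_target > off_target  # the shared sound stands out
```

Bag trimming would then keep, from each weakly labeled clip, the segment whose instances score highest against the located target concept.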
Automatic Speech Recognition System to Analyze Autism Spectrum Disorder in Young Children
It is possible to learn a great deal about a person just by listening to their voice. When trying to construct an abstract concept of a speaker, it is essential to extract significant features from audio signals that are modulation-insensitive. This research assessed how individuals with autism spectrum disorder (ASD) recognize and recall voice identity. Both the ASD group and the control group performed equally well in a task in which they were asked to choose the name of a newly learned speaker based on his or her voice. However, the ASD group outperformed the control group in a subsequent familiarity test in which they were asked to differentiate between previously trained voices and untrained voices. Persons with ASD classified voices numerically according to exact acoustic characteristics, whereas non-autistic individuals classified voices qualitatively depending on the acoustic patterns associated with the speakers' physical and psychological traits. Child vocalizations show potential as an objective marker of developmental problems such as autism. In typical detection systems, hand-crafted acoustic features are input into a discriminative classifier, but their accuracy and resilience are limited by the amount of training data. This research addresses the use of CNN-learned feature representations to classify the speech of children with developmental problems. On the Child Pathological and Emotional Speech database, we compare several acoustic feature sets. CNN-based approaches perform comparably to conventional paradigms in terms of unweighted average recall.
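The CNN-learned feature extraction described above can be sketched at its simplest: convolve a spectrogram with a bank of filters, apply a nonlinearity, and pool to a fixed-length vector for a downstream classifier. The filter count, kernel size, random (untrained) filters, and global average pooling here are illustrative assumptions; a real system would learn the filters end-to-end.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive valid-mode 2-D convolution (strictly, cross-correlation,
    as in most CNN libraries)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def cnn_features(spectrogram, kernels):
    """One conv -> ReLU -> global-average-pool pass per kernel, giving a
    fixed-length feature vector for a variable-length clip."""
    feats = []
    for k in kernels:
        fmap = np.maximum(conv2d_valid(spectrogram, k), 0.0)  # ReLU
        feats.append(fmap.mean())                             # global pooling
    return np.array(feats)

rng = np.random.default_rng(2)
spec = rng.normal(size=(40, 120))                      # 40 mel bands x 120 frames
kernels = [rng.normal(size=(3, 3)) for _ in range(8)]  # 8 random filters
feats = cnn_features(spec, kernels)
print(feats.shape)  # fixed length regardless of clip duration
```

The fixed-length vector is what would be handed to the discriminative classifier in place of hand-crafted acoustic features.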
The "Narratives" fMRI dataset for evaluating models of naturalistic language comprehension
The "Narratives" collection aggregates a variety of functional MRI datasets collected while human subjects listened to naturalistic spoken stories. The current release includes 345 subjects, 891 functional scans, and 27 diverse stories of varying duration totaling ~4.6 hours of unique stimuli (~43,000 words). This data collection is well-suited for naturalistic neuroimaging analysis, and is intended to serve as a benchmark for models of language and narrative comprehension. We provide standardized MRI data accompanied by rich metadata, preprocessed versions of the data ready for immediate use, and the spoken story stimuli with time-stamped phoneme- and word-level transcripts. All code and data are publicly available with full provenance in keeping with current best practices in transparent and reproducible neuroimaging.