Unsupervised Learning of Semantic Audio Representations
Even in the absence of any explicit semantic annotation, vast collections of
audio recordings provide valuable information for learning the categorical
structure of sounds. We consider several class-agnostic semantic constraints
that apply to unlabeled nonspeech audio: (i) noise and translations in time do
not change the underlying sound category, (ii) a mixture of two sound events
inherits the categories of the constituents, and (iii) the categories of events
in close temporal proximity are likely to be the same or related. Without
labels to ground them, these constraints are incompatible with classification
loss functions. However, they may still be leveraged to identify geometric
inequalities needed for triplet loss-based training of convolutional neural
networks. The result is low-dimensional embeddings of the input spectrograms
that recover 41% and 84% of the performance of their fully-supervised
counterparts when applied to downstream query-by-example sound retrieval and
sound event classification tasks, respectively. Moreover, in
limited-supervision settings, our unsupervised embeddings double the
state-of-the-art classification performance.
Comment: Submitted to ICASSP 201
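The triplet formulation described above can be sketched in a few lines. This is an illustrative minimal version only: the embedding dimensionality, margin, and the use of a small additive perturbation to stand in for constraint (i)'s noise/time-shift positives are assumptions, not the paper's exact setup.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive embedding toward the
    anchor and push the negative at least `margin` farther away."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

# Constraint (i): a noisy / time-shifted copy of a clip yields a positive
# embedding; a randomly drawn clip serves as the negative.
rng = np.random.default_rng(0)
anchor = rng.normal(size=(4, 128))                    # batch of 4 embeddings
positive = anchor + 0.05 * rng.normal(size=(4, 128))  # perturbed copies
negative = rng.normal(size=(4, 128))                  # unrelated clips

loss = triplet_loss(anchor, positive, negative)
print(loss.shape)  # one loss per triplet in the batch
```

In the actual system these embeddings would come from a convolutional network applied to spectrograms, with the loss backpropagated through it; the numpy version only shows the geometric inequality being enforced.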
Audio-Based Semantic Concept Classification for Consumer Video
This paper presents a novel method for automatically classifying consumer video clips based on their soundtracks. We use a set of 25 overlapping semantic classes, chosen for their usefulness to users, the viability of automatic detection and annotator labeling, and sufficient representation in available video collections. A set of 1873 videos from real users has been annotated with these concepts. Starting with a basic representation of each video clip as a sequence of mel-frequency cepstral coefficient (MFCC) frames, we experiment with three clip-level representations: single Gaussian modeling, Gaussian mixture modeling, and probabilistic latent semantic analysis of a Gaussian component histogram. Using such summary features, we produce support vector machine (SVM) classifiers based on the Kullback-Leibler, Bhattacharyya, or Mahalanobis distance measures. Quantitative evaluation shows that our approaches are effective for detecting interesting concepts in a large collection of real-world consumer video clips.
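The single-Gaussian clip representation and the divergence-based SVM kernel can be sketched as follows. This is a hedged illustration, not the paper's implementation: the diagonal-covariance assumption, the variance floor, the gamma value, and the synthetic stand-in MFCC frames are all choices made for the example.

```python
import numpy as np

def clip_gaussian(mfcc_frames):
    """Summarize a clip's MFCC frame sequence (T x D) as a single
    diagonal-covariance Gaussian: per-dimension mean and variance."""
    mu = mfcc_frames.mean(axis=0)
    var = mfcc_frames.var(axis=0) + 1e-6  # floor keeps the KL finite
    return mu, var

def symmetric_kl(p, q):
    """Symmetrized KL divergence between two diagonal Gaussians."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    def kl(m1, v1, m2, v2):
        return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    return kl(mu_p, var_p, mu_q, var_q) + kl(mu_q, var_q, mu_p, var_p)

def kl_kernel(p, q, gamma=0.1):
    """Turn the divergence into an SVM kernel value via exp(-gamma * D)."""
    return np.exp(-gamma * symmetric_kl(p, q))

# Synthetic stand-ins for 13-dimensional MFCC frames from two clips.
rng = np.random.default_rng(1)
clip_a = clip_gaussian(rng.normal(0.0, 1.0, size=(200, 13)))
clip_b = clip_gaussian(rng.normal(0.5, 1.2, size=(180, 13)))
print(kl_kernel(clip_a, clip_b))  # larger when the clips sound more alike
```

An off-the-shelf SVM that accepts a precomputed kernel matrix could then be trained on pairwise `kl_kernel` values between all clips.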
Recognizing and Classifying Environmental Sounds
Prof. Ellis presents a summary of LabROSA's new work, with a focus on recognizing environmental sounds, particularly for video classification by soundtrack.
Automatic social role recognition and its application in structuring multiparty interactions
Automatic processing of multiparty interactions is a research domain with important applications in content browsing, summarization, and information retrieval. In recent years, several works have been devoted to finding regular patterns that speakers exhibit in multiparty interactions, also known as social roles. Most of the research in the literature has focused on recognition of scenario-specific formal roles. More recently, role coding schemes based on informal social roles have been proposed, defining roles based on the behavior speakers exhibit in the functioning of a small-group interaction. Informal social roles represent a flexible classification scheme that can generalize across different scenarios of multiparty interaction. In this thesis, we focus on automatic recognition of informal social roles and exploit the influence of informal social roles on speaker behavior for structuring multiparty interactions. To model speaker behavior, we systematically explore various verbal and nonverbal cues extracted from turn-taking patterns, vocal expression, and linguistic style. The influence of social roles on the behavior cues exhibited by a speaker is modeled using a discriminative approach based on conditional random fields. Experiments performed on several hours of meeting data reveal that classification using conditional random fields improves the role recognition performance. We demonstrate the effectiveness of our approach by evaluating it on previously unseen scenarios of multiparty interaction. Furthermore, we also consider whether formal roles and informal roles can be predicted by the same verbal and nonverbal features. We exploit the influence of social roles on turn-taking patterns to improve speaker diarization under the distant-microphone condition.
Our work extends the hidden Markov model / Gaussian mixture model (HMM-GMM) speaker diarization system, and is based on jointly estimating both the speaker segmentation and social roles in an audio recording. We modify the minimum duration constraint in the HMM-GMM diarization system by using role information to model the expected duration of a speaker's turn. We also use social role n-grams as prior information to model speaker interaction patterns. Finally, we demonstrate the application of social roles to the problem of topic segmentation in meetings. We exploit our finding that social roles can change dynamically in conversations, and use this information to predict topic changes in meetings. We also present an unsupervised method for topic segmentation which combines social roles and lexical cohesion. Experimental results show that social roles improve the performance of both speaker diarization and topic segmentation.
Pervasive Sound Sensing: A Weakly Supervised Training Approach
Modern smartphones present an ideal device for pervasive sensing of human behaviour. Microphones have the potential to reveal key information about a person's behaviour. However, they have been utilized to a significantly lesser extent than other smartphone sensors in the context of human behaviour sensing. We postulate that, in order for microphones to be useful in behaviour sensing applications, the analysis techniques must be flexible and allow easy modification of the types of sounds to be sensed. A simplification of the training data collection process could allow a more flexible sound classification framework. We hypothesize that detailed training, a prerequisite for the majority of sound sensing techniques, is not necessary, and that a significantly less detailed and less time-consuming data collection process can be carried out, allowing even a non-expert to conduct the collection, labeling, and training process. To test this hypothesis, we implement a diverse density-based multiple instance learning framework to identify a target sound, and a bag trimming algorithm which, using the target sound, automatically segments weakly labeled sound clips to construct an accurate training set. Experiments reveal that our hypothesis is a valid one, and results show that classifiers trained using the automatically segmented training sets were able to accurately classify unseen sound samples with accuracies comparable to supervised classifiers, achieving average F-measures of 0.969 and 0.87 on two weakly supervised datasets.
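The diverse density idea behind this framework can be illustrated with a toy numpy sketch: a candidate target point scores highly when every positive (weakly labeled) bag contains an instance near it and no negative bag does. The Gaussian-style instance likelihood and the 2-D toy data below are assumptions for illustration, not the paper's feature space.

```python
import numpy as np

def diverse_density(target, pos_bags, neg_bags):
    """Diverse density of a candidate target point: high when every
    positive bag has an instance close to it and no negative bag does.
    Instance likelihood is the common exp(-||x - t||^2) form."""
    def bag_prob(bag):
        return np.max(np.exp(-np.sum((bag - target) ** 2, axis=1)))
    p = 1.0
    for bag in pos_bags:
        p *= bag_prob(bag)        # some instance in the bag matches the target
    for bag in neg_bags:
        p *= 1.0 - bag_prob(bag)  # no instance in the bag matches the target
    return p

# Toy 2-D instances: both positive bags share a point near (1, 1),
# standing in for the target sound present in every weakly labeled clip.
pos_bags = [np.array([[1.0, 1.0], [5.0, 5.0]]),
            np.array([[0.9, 1.1], [8.0, 2.0]])]
neg_bags = [np.array([[5.0, 5.0], [7.0, 1.0]])]

on_target = diverse_density(np.array([1.0, 1.0]), pos_bags, neg_bags)
off_target = diverse_density(np.array([5.0, 5.0]), pos_bags, neg_bags)
assert on_target > off_target  # the shared sound stands out
```

Bag trimming would then keep, from each weakly labeled clip, the segment whose instances score highest against the located target concept.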
Automatic Speech Recognition System to Analyze Autism Spectrum Disorder in Young Children
It is possible to learn a great deal about a person just by listening to their voice. When trying to construct an abstract concept of a speaker, it is essential to extract significant features from audio signals that are modulation-insensitive. This research assessed how individuals with autism spectrum disorder (ASD) recognize and recall voice identity. Both the ASD group and the control group performed equally well in a task in which they were asked to choose the name of a newly learned speaker based on his or her voice. However, the ASD group outperformed the control group in a subsequent familiarity test in which they were asked to differentiate between previously trained voices and untrained voices. Persons with ASD classified voices numerically according to exact acoustic characteristics, whereas non-autistic individuals classified voices qualitatively depending on the acoustic patterns associated with the speakers' physical and psychological traits. Child vocalizations show potential as an objective marker of developmental problems such as autism. In typical detection systems, hand-crafted acoustic features are input into a discriminative classifier, but their accuracy and resilience are limited by the amount of training data. This research addresses the use of CNN-learned feature representations to classify the speech of children with developmental problems. On the Child Pathological and Emotional Speech database, we compare several acoustic feature sets. CNN-based approaches perform comparably to conventional paradigms in terms of unweighted average recall.
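The CNN-learned feature extraction described above can be sketched at its simplest: convolve a spectrogram with a bank of filters, apply a nonlinearity, and pool to a fixed-length vector for a downstream classifier. The filter count, kernel size, random (untrained) filters, and global average pooling here are illustrative assumptions; a real system would learn the filters end-to-end.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive valid-mode 2-D convolution (strictly, cross-correlation,
    as in most CNN libraries)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def cnn_features(spectrogram, kernels):
    """One conv -> ReLU -> global-average-pool pass per kernel, giving a
    fixed-length feature vector for a variable-length clip."""
    feats = []
    for k in kernels:
        fmap = np.maximum(conv2d_valid(spectrogram, k), 0.0)  # ReLU
        feats.append(fmap.mean())                             # global pooling
    return np.array(feats)

rng = np.random.default_rng(2)
spec = rng.normal(size=(40, 120))                      # 40 mel bands x 120 frames
kernels = [rng.normal(size=(3, 3)) for _ in range(8)]  # 8 random filters
feats = cnn_features(spec, kernels)
print(feats.shape)  # fixed length regardless of clip duration
```

The fixed-length vector is what would be handed to the discriminative classifier in place of hand-crafted acoustic features.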
The "Narratives" fMRI dataset for evaluating models of naturalistic language comprehension
The "Narratives" collection aggregates a variety of functional MRI datasets collected while human subjects listened to naturalistic spoken stories. The current release includes 345 subjects, 891 functional scans, and 27 diverse stories of varying duration totaling ~4.6 hours of unique stimuli (~43,000 words). This data collection is well-suited for naturalistic neuroimaging analysis, and is intended to serve as a benchmark for models of language and narrative comprehension. We provide standardized MRI data accompanied by rich metadata, preprocessed versions of the data ready for immediate use, and the spoken story stimuli with time-stamped phoneme- and word-level transcripts. All code and data are publicly available with full provenance in keeping with current best practices in transparent and reproducible neuroimaging.