
    Unsupervised Learning of Semantic Audio Representations

    Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance. Comment: Submitted to ICASSP 201
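    The constraints above translate into a sampling strategy for (anchor, positive, negative) triplets rather than into labels. As a rough illustration, the sketch below (not the authors' code; the encoder architecture, margin, and triplet sampling are assumptions) trains a small convolutional embedder on spectrogram patches with a triplet margin loss:

```python
# Minimal sketch of triplet-loss training on log-mel spectrogram patches.
# Triplets are assumed to be sampled with class-agnostic constraints, e.g.
# anchor/positive from temporally adjacent or mixed segments, negative elsewhere.
import torch
import torch.nn as nn

class SmallEmbedder(nn.Module):
    """Toy convolutional encoder mapping a spectrogram patch to a low-dim embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x):
        z = self.conv(x).flatten(1)
        z = self.fc(z)
        return nn.functional.normalize(z, dim=1)  # unit-norm embeddings

model = SmallEmbedder()
criterion = nn.TripletMarginLoss(margin=0.5)  # enforces d(a, p) + margin < d(a, n)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(anchor, positive, negative):
    """anchor/positive/negative: (batch, 1, n_mels, n_frames) spectrogram tensors."""
    optimizer.zero_grad()
    loss = criterion(model(anchor), model(positive), model(negative))
    loss.backward()
    optimizer.step()
    return loss.item()
```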

    Automatic social role recognition and its application in structuring multiparty interactions

    Automatic processing of multiparty interactions is a research domain with important applications in content browsing, summarization and information retrieval. In recent years, several works have been devoted to finding regular patterns that speakers exhibit in multiparty interactions, also known as social roles. Most of the research in the literature has focused on recognition of scenario-specific formal roles. More recently, role coding schemes based on informal social roles have been proposed, defining roles based on the behavior speakers exhibit in the functioning of a small-group interaction. Informal social roles represent a flexible classification scheme that can generalize across different scenarios of multiparty interaction. In this thesis, we focus on automatic recognition of informal social roles and exploit the influence of informal social roles on speaker behavior for structuring multiparty interactions. To model speaker behavior, we systematically explore various verbal and nonverbal cues extracted from turn-taking patterns, vocal expression and linguistic style. The influence of social roles on the behavior cues exhibited by a speaker is modeled using a discriminative approach based on conditional random fields. Experiments performed on several hours of meeting data reveal that classification using conditional random fields improves role recognition performance. We demonstrate the effectiveness of our approach by evaluating it on previously unseen scenarios of multiparty interaction. Furthermore, we also consider whether formal roles and informal roles can be automatically predicted by the same verbal and nonverbal features. We exploit the influence of social roles on turn-taking patterns to improve speaker diarization under distant-microphone conditions. Our work extends the hidden Markov model (HMM)-Gaussian mixture model (GMM) speaker diarization system and is based on jointly estimating both the speaker segmentation and social roles in an audio recording. We modify the minimum duration constraint in the HMM-GMM diarization system by using role information to model the expected duration of a speaker's turn. We also use social role n-grams as prior information to model speaker interaction patterns. Finally, we demonstrate the application of social roles to the problem of topic segmentation in meetings. We exploit our finding that social roles can change dynamically in conversations and use this information to predict topic changes in meetings. We also present an unsupervised method for topic segmentation which combines social roles and lexical cohesion. Experimental results show that social roles improve the performance of both speaker diarization and topic segmentation.
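    To make the discriminative modelling step concrete, here is a minimal sketch (an assumption, not the thesis implementation) of a linear-chain conditional random field over the sequence of speaker turns in a meeting, with illustrative turn-taking and prosodic features and informal-role labels:

```python
# Sketch of CRF-based social role recognition over speaker turns.
# Feature names and the meeting data structure are hypothetical.
import sklearn_crfsuite  # pip install sklearn-crfsuite

def turn_features(turn):
    """Map one speaker turn to a feature dict; features are illustrative."""
    return {
        "turn_duration": turn["duration"],
        "num_interruptions": turn["interruptions"],
        "mean_pitch": turn["mean_pitch"],
        "word_count": turn["word_count"],
    }

def meeting_to_sequence(meeting):
    X = [turn_features(t) for t in meeting["turns"]]
    y = [t["role"] for t in meeting["turns"]]   # e.g. "protagonist", "supporter", ...
    return X, y

def train_crf(meetings):
    """meetings: list of role-annotated meetings (hypothetical format)."""
    X, y = zip(*(meeting_to_sequence(m) for m in meetings))
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(list(X), list(y))
    return crf
```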

    Pervasive Sound Sensing: A Weakly Supervised Training Approach

    Modern smartphones present an ideal device for pervasive sensing of human behaviour. Microphones have the potential to reveal key information about a person's behaviour. However, they have been utilized to a significantly lesser extent than other smartphone sensors in the context of human behaviour sensing. We postulate that, in order for microphones to be useful in behaviour sensing applications, the analysis techniques must be flexible and allow easy modification of the types of sounds to be sensed. A simplification of the training data collection process could allow a more flexible sound classification framework. We hypothesize that detailed training, a prerequisite for the majority of sound sensing techniques, is not necessary and that a significantly less detailed and time-consuming data collection process can be carried out, allowing even a non-expert to conduct the collection, labeling, and training process. To test this hypothesis, we implement a diverse density-based multiple instance learning framework to identify a target sound, and a bag trimming algorithm which, using the target sound, automatically segments weakly labeled sound clips to construct an accurate training set. Experiments reveal that our hypothesis is a valid one, and results show that classifiers trained using the automatically segmented training sets were able to accurately classify unseen sound samples with accuracies comparable to supervised classifiers, achieving average F-measures of 0.969 and 0.87 for two weakly supervised datasets.
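    For concreteness, the diverse density idea can be sketched as follows (a simplified noisy-or formulation under assumed notation, not the paper's exact algorithm): a candidate target point scores highly if it lies close to at least one instance of every positive, weakly labeled bag while staying far from the instances of negative bags.

```python
# Simplified noisy-or diverse density score for multiple instance learning.
import numpy as np

def diverse_density(target, positive_bags, negative_bags, scale=1.0):
    """target: (d,) candidate concept point; bags: lists of (n_i, d) instance arrays."""
    def pr_instance(bag):
        # Probability each instance matches the target, decaying with distance.
        return np.exp(-scale * np.sum((bag - target) ** 2, axis=1))

    dd = 1.0
    for bag in positive_bags:              # should contain at least one target-like instance
        dd *= 1.0 - np.prod(1.0 - pr_instance(bag))
    for bag in negative_bags:              # should contain no target-like instances
        dd *= np.prod(1.0 - pr_instance(bag))
    return dd

# In practice the target concept is found by maximising this score, e.g. by
# starting a gradient-based search from every instance in the positive bags.
```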

    Automatic Speech Recognition System to Analyze Autism Spectrum Disorder in Young Children

    It is possible to learn things about a person just by listening to their voice. When trying to construct an abstract concept of a speaker, it is essential to extract significant features from audio signals that are modulation-insensitive. This research assessed how individuals with autism spectrum disorder (ASD) recognize and recall voice identity. Both the ASD group and the control group performed equally well in a task in which they were asked to choose the name of a newly-learned speaker based on his or her voice. However, the ASD group outperformed the control group in a subsequent familiarity test in which they were asked to differentiate between previously trained voices and untrained voices. Persons with ASD classified voices numerically according to exact acoustic characteristics, whereas non-autistic individuals classified voices qualitatively depending on the acoustic patterns associated with the speakers' physical and psychological traits. Child vocalizations show potential as an objective marker of developmental problems such as autism. In typical detection systems, hand-crafted acoustic features are input into a discriminative classifier, but its accuracy and resilience are limited by the amount of training data. This research addresses using CNN-learned feature representations to classify the speech of children with developmental problems. On the Child Pathological and Emotional Speech database, we compare several acoustic feature sets. CNN-based approaches perform comparably to conventional paradigms in terms of unweighted average recall.
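    As a hedged illustration of the CNN-based approach (the architecture, input format, and metric implementation below are assumptions, not the study's code), a small convolutional network can learn clip-level representations from log-mel spectrograms and be evaluated with unweighted average recall:

```python
# Sketch of CNN feature learning for clip-level classification of child speech.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # pooled, learned clip-level representation
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                     # x: (batch, 1, n_mels, n_frames)
        h = self.features(x).flatten(1)       # CNN-learned feature vector
        return self.classifier(h)

def unweighted_average_recall(y_true, y_pred, n_classes=2):
    """Macro-averaged recall, the metric reported above; y_true/y_pred: 1-D tensors."""
    recalls = []
    for c in range(n_classes):
        mask = (y_true == c)
        if mask.any():
            recalls.append((y_pred[mask] == c).float().mean())
    return torch.stack(recalls).mean()
```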

    Audio-Based Semantic Concept Classification for Consumer Video


    The “Narratives” fMRI dataset for evaluating models of naturalistic language comprehension

    The “Narratives” collection aggregates a variety of functional MRI datasets collected while human subjects listened to naturalistic spoken stories. The current release includes 345 subjects, 891 functional scans, and 27 diverse stories of varying duration totaling ~4.6 hours of unique stimuli (~43,000 words). This data collection is well-suited for naturalistic neuroimaging analysis and is intended to serve as a benchmark for models of language and narrative comprehension. We provide standardized MRI data accompanied by rich metadata, preprocessed versions of the data ready for immediate use, and the spoken story stimuli with time-stamped phoneme- and word-level transcripts. All code and data are publicly available with full provenance, in keeping with current best practices in transparent and reproducible neuroimaging.
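    Assuming the release follows standard BIDS conventions as described, a minimal sketch of loading one subject's functional scan and a time-stamped transcript might look like the following; the subject, task, and file paths here are illustrative, not taken from the dataset documentation:

```python
# Sketch of reading one BIDS-style functional run and a word-level transcript.
import nibabel as nib
import pandas as pd

# Functional scan for one story (hypothetical path within the dataset)
img = nib.load("sub-001/func/sub-001_task-pieman_bold.nii.gz")
bold = img.get_fdata()        # 4-D array: x, y, z, time (volumes/TRs)
print(bold.shape)

# Time-stamped word-level transcript for the same stimulus (illustrative name)
words = pd.read_csv("stimuli/pieman_word_transcript.csv")
# Each row would pair a word with its onset/offset, so word timings can be
# aligned to fMRI volumes via the repetition time (TR).
```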