
    Learning Audio Sequence Representations for Acoustic Event Classification

    Acoustic Event Classification (AEC) has become a significant task for machines that perceive the surrounding auditory scene. However, extracting effective representations that capture the underlying characteristics of acoustic events remains challenging. Previous methods mainly focused on designing audio features in a 'hand-crafted' manner; more recently, features learnt from data have been reported to perform better, but so far only at the frame level. In this paper, we propose an unsupervised learning framework that learns a vector representation of an audio sequence for AEC. The framework consists of a Recurrent Neural Network (RNN) encoder and an RNN decoder: the encoder transforms the variable-length audio sequence into a fixed-length vector, and the decoder reconstructs the input sequence from that vector. After training the encoder-decoder, we feed audio sequences to the encoder and take the learnt vectors as the audio sequence representations. Compared with previous methods, the proposed approach not only handles audio streams of arbitrary length but also learns the salient information of the sequence. Extensive evaluation on a large acoustic event database shows that the learnt audio sequence representation outperforms state-of-the-art hand-crafted sequence features for AEC by a large margin.
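
    A minimal sketch of the encoder-decoder idea described above, assuming PyTorch; the GRU cells, layer sizes and mean-squared-error reconstruction loss are illustrative assumptions rather than the authors' exact configuration.

        # Sequence autoencoder sketch: the encoder compresses a variable-length
        # frame sequence into one fixed-length vector, and the decoder tries to
        # reconstruct the input frames from that vector.
        import torch
        import torch.nn as nn

        class SeqAutoencoder(nn.Module):
            def __init__(self, n_feats=40, hidden=128):
                super().__init__()
                self.encoder = nn.GRU(n_feats, hidden, batch_first=True)
                self.decoder = nn.GRU(n_feats, hidden, batch_first=True)
                self.out = nn.Linear(hidden, n_feats)

            def forward(self, x):
                # x: (batch, time, n_feats); the final hidden state is the
                # fixed-length representation later used for classification.
                _, h = self.encoder(x)
                # Condition the decoder on the encoder state and feed the
                # time-shifted input frames, as in a standard seq2seq autoencoder.
                shifted = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
                dec_out, _ = self.decoder(shifted, h)
                return self.out(dec_out), h.squeeze(0)

        model = SeqAutoencoder()
        frames = torch.randn(8, 120, 40)              # 8 clips, 120 frames, 40 features
        recon, embedding = model(frames)              # embedding: (8, 128)
        loss = nn.functional.mse_loss(recon, frames)  # unsupervised reconstruction objective

    After training, only the encoder is kept and its output vector feeds a standard event classifier, mirroring the procedure in the abstract.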

    Sound Object Recognition

    Humans are constantly exposed to a variety of acoustic stimuli, ranging from music and speech to more complex acoustic scenes such as a noisy marketplace. The human auditory perception mechanism is able to analyze these different kinds of sounds and extract meaningful information, suggesting that the same processing mechanism is capable of representing different sound classes. In this thesis, we test this hypothesis by proposing a high-dimensional sound object representation framework that captures the various modulations of sound by performing a multi-resolution mapping. We then show that this model is able to capture a wide variety of sound classes (speech, music, soundscapes) by applying it to the tasks of speech recognition, speaker verification, musical instrument recognition and acoustic soundscape recognition. We propose a multi-resolution analysis approach that captures the detailed variations in the spectral characteristics as a basis for recognizing sound objects. We then show how such a system can be fine-tuned to capture both the message information (speech content) and the messenger information (speaker identity). This system is shown to outperform state-of-the-art systems in noise robustness on both automatic speech recognition and speaker verification tasks. The proposed analysis scheme, with its ability to analyze temporal modulations, was also used to capture musical sound objects. Using a model of cortical processing, we were able to accurately replicate human perceptual similarity judgments and obtained good classification performance on a large set of musical instruments. We also show that neither the spectral features alone nor the marginals of the proposed model are sufficient to capture human perception. Moreover, we extended this model to continuous musical recordings by proposing a new method to extract notes from the recordings. Complex acoustic scenes like a sports stadium have multiple sources producing sounds at the same time. We show that the proposed representation scheme can not only capture these complex acoustic scenes but also provides a flexible mechanism to adapt to target sources of interest. The human auditory perception system is known to be a complex system with both bottom-up analysis pathways and top-down feedback mechanisms; the top-down feedback enhances the output of the bottom-up system to better realize the target sounds. In this thesis we propose an implementation of a top-down attention module that is complementary to the high-dimensional acoustic feature extraction mechanism. This attention module is a distributed system operating at multiple stages of representation, effectively acting as a retuning mechanism that adapts the same system to different tasks. We show that such an adaptation mechanism substantially improves the performance of the system at detecting the target source in the presence of various distracting background sources.
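
    A rough illustration of the multi-resolution idea, assuming librosa; analyzing the same signal at several window lengths is only a loose stand-in for the cortical multi-resolution model of the thesis, and the summary statistics are an assumed simplification.

        # Multi-resolution sketch: analyze the same signal with several window
        # lengths so both fast temporal detail and fine spectral detail are kept.
        import numpy as np
        import librosa

        def multires_features(y, sr, win_lengths=(256, 1024, 4096)):
            feats = []
            for n_fft in win_lengths:
                mel = librosa.feature.melspectrogram(
                    y=y, sr=sr, n_fft=n_fft, hop_length=n_fft // 4, n_mels=64)
                logmel = librosa.power_to_db(mel)
                # Per-band mean and variance give each resolution a fixed-length
                # descriptor that can be concatenated across resolutions.
                feats.append(np.concatenate([logmel.mean(axis=1), logmel.var(axis=1)]))
            return np.concatenate(feats)

        y, sr = librosa.load(librosa.ex('trumpet'))   # bundled example clip
        print(multires_features(y, sr).shape)         # (3 resolutions * 128 values,)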

    Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)


    URBAN SOUND RECOGNITION USING DIFFERENT FEATURE EXTRACTION TECHNIQUES

    Applying advanced methods for noise analysis in urban areas through the development of systems for classifying sound events significantly improves and simplifies noise assessment. The main purpose of sound recognition and classification systems is to provide algorithms that detect and classify sound events occurring in the chosen environment and give an appropriate response to their users. In this research, a supervised system for recognizing and classifying sound events was established by developing feature extraction techniques based on digital signal processing of the audio signals; the extracted features are then used as input to machine learning algorithms that classify the sound events. Various audio parameters were extracted and processed in order to choose the set of parameters that best identifies the class to which a sound belongs. The resulting acoustic event detection and classification (AED/C) system could be implemented in sound sensors for automatic monitoring of environmental noise: classifying the source clearly identifies the target noise source and thus reduces the amount of human validation required for sound level measurements.
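
    A minimal sketch of the supervised pipeline described above, assuming librosa and scikit-learn; the feature set, the random-forest learner and the placeholder file names are illustrative, not the paper's exact choices.

        # Extract a fixed-length feature vector per clip, then train a classifier
        # on the labelled clips and predict the class of a new recording.
        import numpy as np
        import librosa
        from sklearn.ensemble import RandomForestClassifier

        def clip_features(path):
            y, sr = librosa.load(path, sr=22050)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
            zcr = librosa.feature.zero_crossing_rate(y)
            centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
            # Summary statistics turn variable-length clips into fixed-length vectors.
            parts = [mfcc, zcr, centroid]
            return np.concatenate([np.r_[p.mean(axis=1), p.std(axis=1)] for p in parts])

        # Placeholder file names and labels standing in for an annotated urban sound set.
        paths = ["siren_001.wav", "drill_007.wav", "traffic_012.wav"]
        labels = ["siren", "drill", "traffic"]

        X = np.vstack([clip_features(p) for p in paths])
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X, labels)
        print(clf.predict([clip_features("unknown_clip.wav")]))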

    Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks

    We present in this paper a simple yet efficient convolutional neural network (CNN) architecture for robust audio event recognition. In contrast to deep CNN architectures with multiple convolutional and pooling layers topped by multiple fully connected layers, the proposed network consists of only three layers: a convolutional layer, a pooling layer, and a softmax layer. Two further features distinguish it from the deep architectures proposed for the task: varying-size convolutional filters at the convolutional layer and a 1-max pooling scheme at the pooling layer. Intuitively, the network tends to select the most discriminative features from the whole audio signal for recognition. Our proposed CNN not only shows state-of-the-art performance on the standard task of robust audio event recognition but also outperforms other deep architectures by up to 4.5% in recognition accuracy, which is equivalent to a 76.3% relative error reduction.
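
    A sketch of the three-layer architecture described above, assuming PyTorch and log-mel spectrogram input; the filter widths, filter count and class count are illustrative values, not the paper's reported settings.

        # Parallel convolutional filters of several temporal widths, 1-max pooling
        # per filter map, then a single softmax (linear) classification layer.
        import torch
        import torch.nn as nn

        class OneMaxPoolCNN(nn.Module):
            def __init__(self, n_mels=60, n_classes=50, widths=(3, 5, 7), n_filters=100):
                super().__init__()
                # Each branch spans the full frequency axis with a different
                # temporal width: the "varying-size" convolutional filters.
                self.branches = nn.ModuleList(
                    nn.Conv2d(1, n_filters, kernel_size=(n_mels, w)) for w in widths)
                self.fc = nn.Linear(n_filters * len(widths), n_classes)

            def forward(self, x):                      # x: (batch, 1, n_mels, frames)
                pooled = []
                for conv in self.branches:
                    h = torch.relu(conv(x))            # (batch, n_filters, 1, frames')
                    # 1-max pooling keeps only the strongest response per filter,
                    # i.e. the most discriminative position in the whole signal.
                    pooled.append(h.amax(dim=(2, 3)))
                return self.fc(torch.cat(pooled, dim=1))

        model = OneMaxPoolCNN()
        logits = model(torch.randn(4, 1, 60, 101))     # 4 clips, 60 mel bands, 101 frames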

    A HIERARCHY BASED ACOUSTIC FRAMEWORK FOR AUDITORY SCENE ANALYSIS

    The acoustic environment surrounding us is extremely dynamic and unstructured in nature. Humans exhibit a great ability to navigate these complex acoustic environments and can parse a complex acoustic scene into its perceptually meaningful objects, a process referred to as "auditory scene analysis". Current neuro-computational strategies developed for auditory scene analysis tasks are primarily based on prior knowledge of the acoustic environment and hence fail to match human performance under realistic settings, i.e. when the acoustic environment is dynamic and multiple competing auditory objects are present in the same scene. In this thesis, we explore hierarchy-based computational frameworks that not only solve different auditory scene analysis paradigms but also explain the processes driving these paradigms from physiological, psychophysical and computational viewpoints. In the first part of the thesis, we explore computational strategies that can extract varying degrees of detail from a complex acoustic scene, with the aim of capturing non-trivial commonalities within a sound class as well as differences across sound classes. We specifically demonstrate that a rich feature space of spectro-temporal modulation representations, complemented with Markovian temporal dynamics information, captures the fine and subtle changes in the spectral and temporal structure of sound events in a complex and dynamic acoustic environment. We further extend this computational model to incorporate a biologically plausible network capable of learning a rich hierarchy of localized spectro-temporal bases and their corresponding long-term temporal regularities from natural soundscapes in a data-driven fashion. We demonstrate that the unsupervised nature of the network yields physiologically and perceptually meaningful tuning functions that drive the organization of the acoustic scene into distinct auditory objects. Next, we explore computational models based on hierarchical acoustic representations in the context of bottom-up salient event detection. We demonstrate that a rich hierarchy of local and global cues captures the salient details upon which the bottom-up saliency mechanisms operate to make a "new" event pop out in a complex acoustic scene. We further show that top-down, event-specific knowledge gathered by a scene classification framework biases bottom-up computational resources towards events of "interest" rather than any new event. We then extend the top-down framework to the modeling of a broad and heterogeneous acoustic class. We demonstrate that when an acoustic scene comprises multiple events, modeling the global details in the hierarchy as a mixture of temporal trajectories helps capture its semantic categorization and provides a detailed understanding of the scene. Overall, the results of this thesis improve our understanding of how a rich hierarchy of acoustic representations drives various auditory scene analysis paradigms and how to integrate multiple theories of scene analysis into a unified strategy, hence providing a platform for further development of computational scene analysis research.
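
    A rough sketch pairing spectro-temporal modulation features with Markovian temporal dynamics, assuming numpy, librosa and hmmlearn; the 2-D FFT of log-mel patches and the per-class Gaussian HMMs are illustrative stand-ins for the thesis' modulation representation and dynamics model, and the tiny synthetic training set is a placeholder.

        import numpy as np
        import librosa
        from hmmlearn.hmm import GaussianHMM

        def modulation_frames(y, sr, patch=16):
            logmel = librosa.power_to_db(
                librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))
            frames = []
            for t in range(0, logmel.shape[1] - patch, patch):
                # The 2-D Fourier magnitude of a spectrogram patch approximates
                # its joint spectral/temporal modulation content.
                frames.append(np.abs(np.fft.rfft2(logmel[:, t:t + patch])).ravel())
            return np.array(frames)

        # Tiny synthetic placeholder set standing in for labelled recordings.
        sr = 22050
        t = np.arange(3 * sr) / sr
        train_clips = {
            "tone":  [(np.sin(2 * np.pi * 440 * t), sr)],
            "noise": [(np.random.randn(3 * sr), sr)],
        }

        # One HMM per sound class captures that class's temporal regularities.
        models = {}
        for name, clips in train_clips.items():
            seqs = [modulation_frames(y, s) for y, s in clips]
            models[name] = GaussianHMM(n_components=3, covariance_type='diag',
                                       n_iter=20).fit(np.vstack(seqs),
                                                      [len(s) for s in seqs])

        def classify(y, sr):
            X = modulation_frames(y, sr)
            # Pick the class whose dynamics model gives the highest log-likelihood.
            return max(models, key=lambda name: models[name].score(X))

        print(classify(np.sin(2 * np.pi * 440 * t), sr))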