223 research outputs found

    Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

    Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher-level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer-term temporal context in audio signals. CNNs and RNNs as classifiers have recently shown improved performance over established methods in various sound recognition tasks. We combine these two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it to a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events. Comment: Accepted for IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Sound Scene and Event Analysis

    Polyphonic Sound Event Tracking Using Linear Dynamical Systems

    In this paper, a system for polyphonic sound event detection and tracking is proposed, based on spectrogram factorisation techniques and state space models. The system extends probabilistic latent component analysis (PLCA) and is modelled around a 4-dimensional spectral template dictionary of frequency, sound event class, exemplar index, and sound state. In order to jointly track multiple overlapping sound events over time, the integration of linear dynamical systems (LDS) within the PLCA inference is proposed. The system assumes that the PLCA sound event activation is the (noisy) observation in an LDS, with the latent states corresponding to the true event activations. LDS training is achieved using fully observed data, making use of ground truth-informed event activations produced by the PLCA-based model. Several LDS variants are evaluated, using polyphonic datasets of office sounds generated from an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the integration of LDS tracking within PLCA leads to an improvement of +8.5-10.5% in terms of frame-based F-measure as compared to the use of the PLCA model alone. In addition, the proposed system outperforms several state-of-the-art methods for the task of polyphonic sound event detection.
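
    The frame-based F-measure cited in the abstract above can be illustrated with a minimal sketch. It assumes the common micro-averaged definition over binary (frame x class) activity matrices; the paper's exact evaluation protocol is not given in the abstract, and the function name `frame_f_measure` is hypothetical.

```python
def frame_f_measure(reference, estimate):
    """Micro-averaged frame-based F-measure.

    Both arguments are lists of frames; each frame is a list of 0/1 flags,
    one per sound event class (assumed layout, not taken from the paper).
    """
    tp = fp = fn = 0
    for ref_frame, est_frame in zip(reference, estimate):
        for ref_active, est_active in zip(ref_frame, est_frame):
            if ref_active and est_active:
                tp += 1  # event correctly detected in this frame
            elif est_active:
                fp += 1  # event reported but not present
            elif ref_active:
                fn += 1  # event present but missed
    denom = 2 * tp + fp + fn
    # F = 2TP / (2TP + FP + FN); defined as 0.0 when nothing is active anywhere
    return 2 * tp / denom if denom else 0.0


# Three frames, two classes: 2 true positives, 1 false positive, 1 false negative
reference = [[1, 0], [1, 1], [0, 0]]
estimate = [[1, 0], [0, 1], [0, 1]]
score = frame_f_measure(reference, estimate)
```

    Counting over all frame-class pairs at once means overlapping (polyphonic) events each contribute their own hits and misses.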

    A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection

    In this paper, we propose a novel four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection (SELD). First, we explore two spatial augmentation techniques, namely audio channel swapping (ACS) and multi-channel simulation (MCS), to deal with data sparsity in SELD. ACS and MCS focus on augmenting the limited training data by expanding direction of arrival (DOA) representations such that the acoustic models trained with the augmented data are robust to localization variations of acoustic sources. Next, time-domain mixing (TDM) and time-frequency masking (TFM) are also investigated to deal with overlapping sound events and data diversity. Finally, ACS, MCS, TDM and TFM are combined in a step-by-step manner to form an effective four-stage data augmentation scheme. Tested on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 data sets, our proposed augmentation approach greatly improves the system performance, ranking our submitted system in first place in the SELD task of the DCASE 2020 Challenge. Furthermore, we employ a ResNet-Conformer architecture to model both global and local context dependencies of an audio sequence to yield further gains over those architectures used in the DCASE 2020 SELD evaluations. Comment: 12 pages, 8 figures
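
    As a rough illustration of the time-frequency masking (TFM) step named in the abstract above, the sketch below zeroes one random band of frequency bins and one random span of time frames in a spectrogram. This is a generic SpecAugment-style variant chosen as an assumption; the abstract does not specify the paper's masking parameters, and the function name and arguments are hypothetical.

```python
import random


def time_frequency_mask(spec, max_freq_width, max_time_width, rng=None):
    """Return a copy of `spec` (list of frames, each a list of bin values)
    with one frequency band and one time span set to 0.0."""
    rng = rng or random.Random()
    n_frames = len(spec)
    n_bins = len(spec[0])
    masked = [frame[:] for frame in spec]  # copy so the input is left untouched

    # Pick the width and start of the frequency mask (width may be 0)
    f_width = rng.randint(0, min(max_freq_width, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    # Pick the width and start of the time mask (width may be 0)
    t_width = rng.randint(0, min(max_time_width, n_frames))
    t_start = rng.randint(0, n_frames - t_width)

    for frame in masked:
        for f in range(f_start, f_start + f_width):
            frame[f] = 0.0  # frequency mask: same bins in every frame
    for t in range(t_start, t_start + t_width):
        masked[t] = [0.0] * n_bins  # time mask: whole frames
    return masked


# Example: a 5-frame, 6-bin spectrogram of ones, with a seeded generator
spec = [[1.0] * 6 for _ in range(5)]
augmented = time_frequency_mask(spec, max_freq_width=2, max_time_width=2,
                                rng=random.Random(0))
```

    Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing training runs.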

    Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)

    Deep Neural Networks for Sound Event Detection

    The objective of this thesis is to develop novel classification and feature learning techniques for the task of sound event detection (SED) in real-world environments. Throughout their lives, humans experience a consistent learning process on how to assign meanings to sounds. Thanks to this, most humans can easily recognize the sound of thunder, a dog bark, a doorbell, a bird singing, etc. In this work, we aim to develop systems that can automatically detect the sound events commonly present in our daily lives. Such systems can be utilized in e.g. context-aware devices, acoustic surveillance, bio-acoustical and healthcare monitoring, and smart-home cities. In this thesis, we propose to apply the modern machine learning methods called deep learning for SED. The relationship between the commonly used time-frequency representations for SED (such as mel spectrogram and magnitude spectrogram) and the target sound event labels is highly complex. Deep learning methods such as deep neural networks (DNN) utilize a layered structure of units to extract features from the given sound representation input with increased abstraction at each layer. This increases the network’s capacity to efficiently learn the highly complex relationship between the sound representation and the target sound event labels. We found that the proposed DNN approach performs significantly better than the established classifier techniques for SED such as Gaussian mixture models. In a time-frequency representation of an audio recording, a sound event can often be recognized as a distinct pattern that may exhibit shifts in both dimensions. The intra-class variability of the sound events may cause small shifts in the frequency domain content, and the time domain shift results from the fact that a sound event can occur at any time in a given audio recording. We found that convolutional neural networks (CNN) are useful to learn shift-invariant filters that are essential for robust modeling of sound events. In addition, we show that recurrent neural networks (RNN) are effective in modeling the long-term temporal characteristics of the sound events. Finally, we combine the convolutional and recurrent layers in a single classifier called convolutional recurrent neural networks (CRNN), which emphasizes the benefits of both and provides state-of-the-art results in multiple SED benchmark datasets. Aside from learning the mappings between the time-frequency representations and the sound event labels, we show that deep learning methods can also be utilized to learn a direct mapping between the target labels and a lower-level representation such as the magnitude spectrogram or even the raw audio signals. In this thesis, the feature learning capabilities of the deep learning methods and the empirical knowledge on human auditory perception are proposed to be integrated by means of layer weight initialization with filterbank coefficients. This results in an optimal, ad-hoc filterbank that is obtained through gradient-based optimization of the original coefficients to improve the SED performance.

    Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)
