367 research outputs found
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Sound events often occur in unstructured environments where they exhibit wide
variations in their frequency content and temporal structure. Convolutional
neural networks (CNN) are able to extract higher level features that are
invariant to local spectral and temporal variations. Recurrent neural networks
(RNNs) are powerful in learning the longer term temporal context in the audio
signals. CNNs and RNNs as classifiers have recently shown improved performances
over established methods in various sound recognition tasks. We combine these
two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it
on a polyphonic sound event detection task. We compare the performance of the
proposed CRNN method with CNN, RNN, and other established methods, and observe
a considerable improvement for four different datasets consisting of everyday
sound events.Comment: Accepted for IEEE Transactions on Audio, Speech and Language
Processing, Special Issue on Sound Scene and Event Analysi
Sound Event Detection in Synthetic Audio: Analysis of the DCASE 2016 Task Results
As part of the 2016 public evaluation challenge on Detection and
Classification of Acoustic Scenes and Events (DCASE 2016), the second task
focused on evaluating sound event detection systems using synthetic mixtures of
office sounds. This task, which follows the `Event Detection - Office
Synthetic' task of DCASE 2013, studies the behaviour of tested algorithms when
facing controlled levels of audio complexity with respect to background noise
and polyphony/density, with the added benefit of a very accurate ground truth.
This paper presents the task formulation, evaluation metrics, submitted
systems, and provides a statistical analysis of the results achieved, with
respect to various aspects of the evaluation dataset
Recommended from our members
Characterisation of acoustic scenes using a temporally-constrained shift-invariant model
International audienceIn this paper, we propose a method for modeling and classifying acoustic scenes using temporally-constrained shift-invariant probabilistic latent component analysis (SIPLCA). SIPLCA can be used for extracting time-frequency patches from spectrograms in an unsupervised manner. Component-wise hidden Markov models are incorporated to the SIPLCA formulation for enforcing temporal constraints on the activation of each acoustic component. The time-frequency patches are converted to cepstral coefficients in order to provide a compact representation of acoustic events within a scene. Experiments are made using a corpus of train station recordings, classified into 6 scene classes. Results show that the proposed model is able to model salient events within a scene and outperforms the non-negative matrix factorization algorithm for the same task. In addition, it is demonstrated that the use of temporal constraints can lead to improved performance
Classification of Overlapped Audio Events Based on AT, PLSA, and the Combination of Them
Audio event classification, as an important part of Computational Auditory Scene Analysis, has attracted much attention. Currently, the classification technology is mature enough to classify isolated audio events accurately, but for overlapped audio events, it performs much worse. While in real life, most audio documents would have certain percentage of overlaps, and so the overlap classification problem is an important part of audio classification. Nowadays, the work on overlapped audio event classification is still scarce, and most existing overlap classification systems can only recognize one audio event for an overlap. In this paper, in order to deal with overlaps, we innovatively introduce the author-topic (AT) model which was first proposed for text analysis into audio classification, and innovatively combine it with PLSA (Probabilistic Latent Semantic Analysis). We propose 4 systems, i.e. AT, PLSA, AT-PLSA and PLSA-AT, to classify overlaps. The 4 proposed systems have the ability to recognize two or more audio events for an overlap. The experimental results show that the 4 systems perform well in classifying overlapped audio events, whether it is the overlap in training set or the overlap out of training set. Also they perform well in classifying isolated audio events
Eventness: Object Detection on Spectrograms for Temporal Localization of Audio Events
In this paper, we introduce the concept of Eventness for audio event
detection, which can, in part, be thought of as an analogue to Objectness from
computer vision. The key observation behind the eventness concept is that audio
events reveal themselves as 2-dimensional time-frequency patterns with specific
textures and geometric structures in spectrograms. These time-frequency
patterns can then be viewed analogously to objects occurring in natural images
(with the exception that scaling and rotation invariance properties do not
apply). With this key observation in mind, we pose the problem of detecting
monophonic or polyphonic audio events as an equivalent visual object(s)
detection problem under partial occlusion and clutter in spectrograms. We adapt
a state-of-the-art visual object detection model to evaluate the audio event
detection task on publicly available datasets. The proposed network has
comparable results with a state-of-the-art baseline and is more robust on
minority events. Provided large-scale datasets, we hope that our proposed
conceptual model of eventness will be beneficial to the audio signal processing
community towards improving performance of audio event detection.Comment: 5 pages, 3 figures, accepted to ICASSP 201
Polyphonic Sound Event Tracking Using Linear Dynamical Systems
In this paper, a system for polyphonic sound event detection and tracking is proposed, based on spectrogram factorisation techniques and state space models. The system extends probabilistic latent component analysis (PLCA) and is modelled around a 4-dimensional spectral template dictionary of frequency, sound event class, exemplar index, and sound state. In order to jointly track multiple overlapping sound events over time, the integration of linear dynamical systems (LDS) within the PLCA inference is proposed. The system assumes that the PLCA sound event activation is the (noisy) observation in an LDS, with the latent states corresponding to the true event activations. LDS training is achieved using fully observed data, making use of ground truth-informed event activations produced by the PLCA-based model. Several LDS variants are evaluated, using polyphonic datasets of office sounds generated from an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the integration of LDS tracking within PLCA leads to an improvement of +8.5-10.5% in terms of frame-based F-measure as compared to the use of the PLCA model alone. In addition, the proposed system outperforms several state-of-the-art methods for the task of polyphonic sound event detection
Automatic transcription of polyphonic music exploiting temporal evolution
PhDAutomatic music transcription is the process of converting an audio recording
into a symbolic representation using musical notation. It has numerous applications
in music information retrieval, computational musicology, and the
creation of interactive systems. Even for expert musicians, transcribing polyphonic
pieces of music is not a trivial task, and while the problem of automatic
pitch estimation for monophonic signals is considered to be solved, the creation
of an automated system able to transcribe polyphonic music without setting
restrictions on the degree of polyphony and the instrument type still remains
open.
In this thesis, research on automatic transcription is performed by explicitly
incorporating information on the temporal evolution of sounds. First efforts address
the problem by focusing on signal processing techniques and by proposing
audio features utilising temporal characteristics. Techniques for note onset and
offset detection are also utilised for improving transcription performance. Subsequent
approaches propose transcription models based on shift-invariant probabilistic
latent component analysis (SI-PLCA), modeling the temporal evolution
of notes in a multiple-instrument case and supporting frequency modulations in
produced notes. Datasets and annotations for transcription research have also
been created during this work. Proposed systems have been privately as well as
publicly evaluated within the Music Information Retrieval Evaluation eXchange
(MIREX) framework. Proposed systems have been shown to outperform several
state-of-the-art transcription approaches.
Developed techniques have also been employed for other tasks related to music
technology, such as for key modulation detection, temperament estimation,
and automatic piano tutoring. Finally, proposed music transcription models
have also been utilized in a wider context, namely for modeling acoustic scenes
Recognition of Harmonic Sounds in Polyphonic Audio using a Missing Feature Approach: Extended Report
A method based on local spectral features and missing feature techniques
is proposed for the recognition of harmonic sounds in mixture
signals. A mask estimation algorithm is proposed for identifying
spectral regions that contain reliable information for each sound
source and then bounded marginalization is employed to treat the
feature vector elements that are determined as unreliable. The proposed
method is tested on musical instrument sounds due to the
extensive availability of data but it can be applied on other sounds
(i.e. animal sounds, environmental sounds), whenever these are harmonic.
In simulations the proposed method clearly outperformed a
baseline method for mixture signals
- …