112 research outputs found

    A joint separation-classification model for sound event detection of weakly labelled data

    Source separation (SS) aims to separate individual sources from an audio recording. Sound event detection (SED) aims to detect sound events in an audio recording. We propose a joint separation-classification (JSC) model trained only on weakly labelled audio data, that is, data where only the tags of a recording are known but the times of the events are not. First, we propose a separation mapping from the time-frequency (T-F) representation of an audio clip to the T-F segmentation masks of the audio events. Second, a classification mapping is built from each T-F segmentation mask to the presence probability of the corresponding audio event. In the source separation stage, both the separated sources and the times of the sound events can be obtained from the T-F segmentation masks. The proposed method achieves an equal error rate (EER) of 0.14 in SED, outperforming a deep neural network baseline with an EER of 0.29. Using global weighted rank pooling (GWRP) as the probability mapping yields a source separation SDR of 8.08 dB, outperforming the global max pooling (GMP) based probability mapping, which yields an SDR of 0.03 dB. The source code of our work is published. Comment: Accepted by ICASSP 2018.
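    Below is a minimal NumPy sketch of the two clip-level pooling functions the abstract compares. The decay rate r = 0.9995, the mask shape, and the function names are illustrative assumptions, not the paper's actual settings; GWRP interpolates between global max pooling (r -> 0) and global average pooling (r = 1).

    import numpy as np

    def global_weighted_rank_pooling(mask, r=0.9995):
        # Collapse a T-F segmentation mask (values in [0, 1]) to a clip-level
        # presence probability: sort the bins in descending order, weight the
        # j-th largest value by r**j, and normalise by the sum of the weights.
        x = np.sort(mask.ravel())[::-1]   # T-F bins, largest first
        w = r ** np.arange(x.size)        # geometric rank weights
        return float(np.dot(w, x) / w.sum())

    def global_max_pooling(mask):
        # Baseline probability mapping: the clip-level probability is simply
        # the largest value in the T-F segmentation mask.
        return float(mask.max())

    # Illustrative usage on a random 64-frame x 128-bin mask:
    mask = np.random.rand(64, 128)
    print(global_weighted_rank_pooling(mask), global_max_pooling(mask))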

    GMM classification of environmental sounds for surveillance applications

    This thesis describes an audio event detection system that automatically classifies an impulsive audio event as a scream, gunshot, breaking glass, or barking dog against arbitrary background noise. The classification system uses four parallel Gaussian Mixture Model (GMM) classifiers, each of which decides whether the sound belongs to its class or is only noise. Each classifier is trained on a different set of features, chosen from a pool of 40 audio features. In addition, the system can detect any kind of impulsive sound using a single feature with very high precision. The classification system is implemented in the Network-Integrated Multimedia Middleware (NMM) for real-time processing and communication with other surveillance applications. To validate the proposed detection algorithm, we carried out extensive experiments (both off-line and real-time) on a hand-made set of sounds mixed with ambient noise at different signal-to-noise ratios (SNRs). Our results demonstrate that the system guarantees 70% accuracy and 90% precision at 0 dB SNR, rising to 100% accuracy and precision on clean sounds at 20 dB SNR.
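    As a rough illustration of the one-class-versus-noise design, here is a sketch using scikit-learn's GaussianMixture: each of the four parallel classifiers could pair an event GMM with a noise GMM and decide by log-likelihood ratio. The class name GmmEventDetector, the component count, and the zero threshold are assumptions for illustration, not the thesis's actual configuration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    class GmmEventDetector:
        # One binary detector per event class (scream, gunshot, breaking glass,
        # barking dog): one GMM models the event, a second GMM models noise, and
        # a sample is accepted when the log-likelihood ratio exceeds a threshold.
        def __init__(self, n_components=8, threshold=0.0):
            self.event_gmm = GaussianMixture(n_components=n_components)
            self.noise_gmm = GaussianMixture(n_components=n_components)
            self.threshold = threshold

        def fit(self, event_features, noise_features):
            # Each detector is trained on its own feature subset, mirroring the
            # per-classifier feature selection described in the abstract.
            self.event_gmm.fit(event_features)
            self.noise_gmm.fit(noise_features)
            return self

        def detect(self, features):
            llr = (self.event_gmm.score_samples(features)
                   - self.noise_gmm.score_samples(features))
            return llr > self.threshold

    # Illustrative usage with synthetic 10-dimensional features:
    rng = np.random.default_rng(0)
    detector = GmmEventDetector().fit(rng.normal(1.0, 1.0, (500, 10)),
                                      rng.normal(0.0, 1.0, (500, 10)))
    print(detector.detect(rng.normal(1.0, 1.0, (5, 10))))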