
    Large-scale weakly supervised audio classification using gated convolutional neural network

    In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, extracted from YouTube videos, are manually labeled with one or a few audio tags but without timestamps of the audio events, which is called weakly labeled data. Two sub-tasks are defined in this challenge, audio tagging and sound event detection, both using this weakly labeled data. A convolutional recurrent neural network (CRNN) with learnable gated linear unit (GLU) non-linearities applied on the log Mel spectrogram is proposed. In addition, a temporal attention method along the frames is proposed to predict the location of each audio event in a chunk from the weakly labeled data. We ranked 1st and 2nd as a team in these two sub-tasks of the DCASE 2017 challenge, with an F-value of 55.6% and an equal error of 0.73, respectively. Comment: submitted to ICASSP 2018; summary of the 1st-place system in the DCASE2017 task 4 challenge
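
    A minimal sketch of the gated linear unit (GLU) idea described above, assuming PyTorch: a convolutional path is multiplied element-wise by a sigmoid-gated path, letting the network select informative time-frequency regions of the log Mel spectrogram. Layer sizes and names are illustrative assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn

    class GatedConvBlock(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            # Two parallel convolutions: one for the linear path, one for the gate.
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

        def forward(self, x):
            # GLU non-linearity: linear path modulated by a sigmoid gate.
            return self.conv(x) * torch.sigmoid(self.gate(x))

    # Example: a batch of 8 log-Mel spectrograms (1 channel, 240 frames, 64 Mel bins).
    x = torch.randn(8, 1, 240, 64)
    features = GatedConvBlock(1, 32)(x)   # -> (8, 32, 240, 64)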

    Surrey-cvssp system for DCASE2017 challenge task4

    In this technical report, we present a set of methods for task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warning sounds, chosen for their industrial applications. There are two sub-tasks, audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing specific events in a weakly supervised mode. A new batch-level balancing strategy is also proposed to tackle the data imbalance problem. Fusion of posteriors from different systems is found to be effective for improving performance. In summary, we obtain a 61% F-value for the audio tagging sub-task and a 0.73 error rate (ER) for the sound event detection sub-task on the development set, while the official multilayer perceptron (MLP) based baseline obtained only a 13.1% F-value for audio tagging and a 1.02 ER for sound event detection. Comment: DCASE2017 challenge 1st-ranked system, task 4, technical report
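
    A minimal sketch of attention-based temporal pooling for weakly labelled data, assuming PyTorch: frame-level class probabilities are aggregated into a clip-level tag prediction through learned attention weights, and the per-frame probabilities can then be used for localization. Dimensions and names are illustrative assumptions rather than the submitted system.

    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        def __init__(self, feat_dim, n_classes):
            super().__init__()
            self.cla = nn.Linear(feat_dim, n_classes)   # frame-level classifier
            self.att = nn.Linear(feat_dim, n_classes)   # frame-level attention scores

        def forward(self, h):                           # h: (batch, frames, feat_dim)
            p = torch.sigmoid(self.cla(h))              # per-frame class probabilities
            w = torch.softmax(self.att(h), dim=1)       # attention weights over frames
            clip_prob = (p * w).sum(dim=1)              # clip-level tag probabilities
            return clip_prob, p                         # p can be thresholded for event detection

    h = torch.randn(4, 240, 128)                        # e.g. CRNN frame embeddings
    clip_prob, frame_prob = AttentionPooling(128, 17)(h)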

    Unifying Isolated and Overlapping Audio Event Detection with Multi-Label Multi-Task Convolutional Recurrent Neural Networks

    We propose a multi-label multi-task framework based on a convolutional recurrent neural network to unify detection of isolated and overlapping audio events. The framework leverages the power of convolutional recurrent neural network architectures: convolutional layers learn effective features over which higher recurrent layers perform sequential modelling. Furthermore, the output layer is designed to handle arbitrary degrees of event overlap. At each time step in the recurrent output sequence, an output triple is dedicated to each event category of interest to jointly model event occurrence and temporal boundaries. That is, the network jointly determines whether an event of this category occurs, and when it occurs, by estimating onset and offset positions at each recurrent time step. We then introduce three sequential losses for network training: multi-label classification loss, distance estimation loss, and confidence loss. We demonstrate good generalization on two datasets: ITC-Irst for isolated audio event detection, and TUT-SED-Synthetic-2016 for overlapping audio event detection.
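
    A minimal sketch of the per-class output-triple idea, assuming PyTorch: at every recurrent time step the network predicts, for each event class, an occurrence score plus onset and offset distances, trained with a classification loss and a masked distance-regression loss. The exact loss terms and weighting here are simplified assumptions, not the paper's formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    n_classes, feat_dim = 10, 128
    head = nn.Linear(feat_dim, n_classes * 3)           # (occurrence, onset, offset) per class

    h = torch.randn(4, 240, feat_dim)                   # recurrent outputs: (batch, frames, feat)
    out = head(h).view(4, 240, n_classes, 3)
    occ_logit, onset, offset = out[..., 0], out[..., 1], out[..., 2]

    # Dummy targets for illustration only.
    occ_target = torch.randint(0, 2, (4, 240, n_classes)).float()
    onset_target = torch.rand(4, 240, n_classes)
    offset_target = torch.rand(4, 240, n_classes)

    cls_loss = F.binary_cross_entropy_with_logits(occ_logit, occ_target)
    # Regress boundary distances only on frames where the event is active.
    mask = occ_target
    dist_loss = (mask * ((onset - onset_target) ** 2 + (offset - offset_target) ** 2)).sum() / mask.sum().clamp(min=1)
    loss = cls_loss + dist_loss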

    Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging

    Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn from the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure reduces the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features further reduce the EER to 0.10. The performance of end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we achieve state-of-the-art performance with an EER of 0.12, while the best existing system obtains an EER of 0.15. Comment: Accepted to IJCNN 2017, Anchorage, Alaska, USA
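
    A minimal sketch of the two-stream idea, assuming PyTorch: a main CNN over Mel-filter-bank features and an auxiliary CNN over spatial (stereo) features are concatenated and fed to a bidirectional GRU that produces clip-level tag probabilities. Layer sizes are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class TwoStreamCRNN(nn.Module):
        def __init__(self, n_tags=7):
            super().__init__()
            self.main_cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                          nn.MaxPool2d((1, 4)))
            self.aux_cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                         nn.MaxPool2d((1, 4)))
            self.gru = nn.GRU(input_size=16 * 16 * 2, hidden_size=64,
                              batch_first=True, bidirectional=True)
            self.out = nn.Linear(128, n_tags)

        def forward(self, mfb, spatial):                  # each: (batch, 1, frames, 64)
            a = self.main_cnn(mfb)                        # (batch, 16, frames, 16)
            b = self.aux_cnn(spatial)
            x = torch.cat([a, b], dim=1)                  # merge the two streams
            x = x.permute(0, 2, 1, 3).flatten(2)          # (batch, frames, 16*16*2)
            h, _ = self.gru(x)
            return torch.sigmoid(self.out(h.mean(dim=1))) # clip-level tag probabilities

    model = TwoStreamCRNN()
    tags = model(torch.randn(2, 1, 240, 64), torch.randn(2, 1, 240, 64))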

    An Audio-Based Vehicle Classifier Using Convolutional Neural Network

    Audio-based event and scene classification have received increasing attention in recent years. Many examples of environmental noise detection, vehicle classification, and soundscape analysis have been developed using state-of-the-art deep learning techniques. The major noise source in urban and rural areas is road traffic noise. Environmental noise parameters for urban and rural small roads have not been investigated, for practical reasons. The purpose of this study is to develop an audio-based traffic classifier for rural and urban small roads, which have limited or no traffic flow data, to supply values for noise mapping and other noise metrics. An audio-based vehicle classifier, a convolutional neural network-based algorithm, was proposed using the Mel spectrogram of audio signals as the input feature. Different variations of the network were generated by changing the parameters of the convolutional layers and the length of the network. Filter size and number of filters were tested with a dataset prepared from various real-life traffic recordings and audio extracts from traffic videos. The precision of the networks was evaluated with common performance metrics. Further assessments were conducted with longer audio files, and the predictions of the system were compared with actual traffic flow. The results showed that convolutional neural networks can be used to classify road traffic noise sources and perform outstandingly for single- or double-lane roads.
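
    A minimal sketch of preparing the Mel-spectrogram input feature described above, assuming librosa; the sampling rate, frame, and Mel-band settings are illustrative assumptions rather than the paper's exact configuration.

    import librosa
    import numpy as np

    def mel_spectrogram(path, sr=22050, n_fft=2048, hop_length=512, n_mels=64):
        y, sr = librosa.load(path, sr=sr, mono=True)
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                           hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(S, ref=np.max)   # (n_mels, frames), in dB

    # Each fixed-length traffic excerpt becomes one (n_mels, frames) image-like input for the CNN,
    # e.g. mel = mel_spectrogram("traffic_clip.wav")  (hypothetical file name).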

    R-CRNN: Region-based Convolutional Recurrent Neural Network for Audio Event Detection

    This paper proposes a Region-based Convolutional Recurrent Neural Network (R-CRNN) for audio event detection (AED). The proposed network is inspired by Faster-RCNN, a well-known region-based convolutional network framework for visual object detection. Different from the original Faster-RCNN, a recurrent layer is added on top of the convolutional network to capture long-term temporal context from the extracted high-level features. While most previous works on AED generate predictions at the frame level first and then use post-processing to predict the onset/offset timestamps of events from a probability sequence, the proposed method generates predictions at the event level directly and can be trained end-to-end with a multi-task loss, which optimizes the classification and localization of audio events simultaneously. The proposed method is tested on the DCASE 2017 Challenge dataset. To the best of our knowledge, R-CRNN is the best-performing single-model method among all methods that do not use ensembles, on both the development and evaluation sets. Compared to the other region-based network for AED (R-FCN), which achieves an event-based error rate (ER) of 0.18 on the development set, our method cuts the ER in half. Comment: Accepted by Interspeech 2018
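
    A minimal sketch of an event-level multi-task objective in the spirit of the Faster-RCNN-style training described above, assuming PyTorch: candidate event regions receive a classification loss plus an onset/offset regression loss computed only on non-background proposals. The shapes, targets, and loss weighting are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    n_proposals, n_classes = 32, 17
    cls_logits = torch.randn(n_proposals, n_classes + 1)      # +1 for "background"
    box_pred = torch.randn(n_proposals, 2)                    # predicted (onset, offset)

    # Dummy targets for illustration only.
    cls_target = torch.randint(0, n_classes + 1, (n_proposals,))
    box_target = torch.rand(n_proposals, 2)

    cls_loss = F.cross_entropy(cls_logits, cls_target)
    # Regress boundaries only for proposals matched to a real (non-background) event.
    fg = cls_target < n_classes
    loc_loss = F.smooth_l1_loss(box_pred[fg], box_target[fg]) if fg.any() else box_pred.sum() * 0
    loss = cls_loss + loc_loss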

    Data Reduction Methods of Audio Signals for Embedded Sound Event Recognition

    Sound event detection is a typical Internet of Things (IoT) application task, which could be used in many scenarios such as dedicated security applications, where cameras might be unsuitable due to environmental variations like lighting and movement. In realistic applications, models for this task are usually implemented on embedded devices with microphones. The idea of edge computing is to process the data near the place where it is produced, because reacting in real time is very important in some applications. Transmitting collected audio clips to the cloud may cause large delays and sometimes results in serious consequences, but local processing has its own problem: heavy computation may exceed what embedded devices can handle, which is precisely their weakness. Work on this problem has made great progress in recent years, for example model compression and hardware acceleration. This thesis provides a new perspective on embedded deep learning for audio tasks, aimed at reducing the data amount of audio signals for the sound event recognition task. Instead of following the idea of compressing the model or designing a hardware accelerator, our methods focus on the analog front-end signal acquisition side, reducing the data amount of audio signal clips directly by using specific sampling methods. The state-of-the-art approaches to sound event detection are mainly based on deep learning models. For deep learning models, a smaller input size means lower latency, due to fewer time steps for a recurrent neural network (RNN) or fewer convolutional computations for a convolutional neural network (CNN). So a smaller amount of input data leads to less computation and fewer parameters in the neural network classifier, naturally resulting in less delay during inference. Our experiments implement three kinds of data reduction methods for this sound event detection task, all based on reducing the number of sample points of an audio signal: using a lower sampling rate and sampling width, using a sigma-delta analog-to-digital converter (ADC), and using a level-crossing (LC) ADC for audio signals. We simulated these three kinds of signals and fed them into the neural network to train the classifier. Finally, we conclude that there is still some redundancy in audio signals sampled in the traditional way for audio classification, and that using specific ADC modules yields better classification performance for the same data amount than the original sampling.
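
    A minimal sketch of two of the data-reduction ideas, assuming NumPy: lowering the sampling rate and bit width, and simulating a level-crossing (LC) ADC that only records a sample when the signal moves by at least one quantisation level. Parameter values are illustrative assumptions, not the thesis's exact settings.

    import numpy as np

    def reduce_rate_and_width(x, factor=4, bits=8):
        x = x[::factor]                              # naive downsampling (no anti-alias filter)
        levels = 2 ** bits
        return np.round(x * (levels / 2 - 1)) / (levels / 2 - 1)   # coarser amplitude quantisation

    def level_crossing(x, delta=0.05):
        # Keep (index, value) pairs only where the signal has moved by at least one level.
        kept, last = [], x[0]
        for i, v in enumerate(x):
            if abs(v - last) >= delta:
                kept.append((i, v))
                last = v
        return kept

    x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of a 440 Hz tone at 16 kHz
    print(len(reduce_rate_and_width(x)), len(level_crossing(x)))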

    Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

    Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work studies the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The first one is the training of two different neural networks, one for speech detection and another for music detection. The second approach consists of training a single neural network to tackle both tasks at the same time. The studied architectures include fully connected, convolutional and LSTM (long short-term memory) recurrent networks. Comparative results are provided in terms of classification performance and model complexity. We would like to highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore, a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most harmful for the performance of the models, showing some difficult scenarios for the detection of music and speech. This work has been supported by project “DSSL: Redes Profundas y Modelos de Subespacios para Deteccion y Seguimiento de Locutor, Idioma y Enfermedades Degenerativas a partir de la Voz” (TEC2015-68172-C2-1-P), funded by the Ministry of Economy and Competitiveness of Spain and FEDER
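
    A minimal sketch of a hybrid convolutional-LSTM detector over mel-spectrogram segments with two independent sigmoid outputs (speech present, music present), corresponding to the single-network variant described above, assuming PyTorch. Layer sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ConvLSTMDetector(nn.Module):
        def __init__(self, n_mels=64):
            super().__init__()
            self.cnn = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                     nn.MaxPool2d((2, 4)))
            self.lstm = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=64,
                                batch_first=True, bidirectional=True)
            self.out = nn.Linear(128, 2)                  # [speech, music]

        def forward(self, x):                             # x: (batch, 1, frames, n_mels)
            x = self.cnn(x)                               # (batch, 32, frames//2, n_mels//4)
            x = x.permute(0, 2, 1, 3).flatten(2)          # (batch, time, 32 * n_mels//4)
            h, _ = self.lstm(x)
            return torch.sigmoid(self.out(h.mean(dim=1))) # segment-level probabilities

    probs = ConvLSTMDetector()(torch.randn(2, 1, 430, 64))  # e.g. 10-second mel-spectrogram segments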