1,890 research outputs found

    Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging

    Get PDF
    Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.Comment: Accepted to IJCNN2017, Anchorage, Alaska, US

    Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging

    Get PDF
    Audio tagging aims to perform multi-label classification on audio chunks and it is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that it only has a chunk-level label without a frame-level label. This paper presents a weakly supervised method to not only predict the tags but also indicate the temporal locations of the occurred acoustic events. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on the Task 4 of DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set with equal error rate (EER) reduced from 0.13 to 0.11, compared with the convolutional recurrent baseline system.Comment: 5 pages, submitted to interspeech201

    Supervised Principal Component Regression for Functional Responses with High Dimensional Predictors

    Full text link
    We propose a supervised principal component regression method for relating functional responses with high dimensional predictors. Unlike the conventional principal component analysis, the proposed method builds on a newly defined expected integrated residual sum of squares, which directly makes use of the association between the functional response and the predictors. Minimizing the integrated residual sum of squares gives the supervised principal components, which is equivalent to solving a sequence of nonconvex generalized Rayleigh quotient optimization problems. We reformulate the nonconvex optimization problems into a simultaneous linear regression with a sparse penalty to deal with high dimensional predictors. Theoretically, we show that the reformulated regression problem can recover the same supervised principal subspace under suitable conditions. Statistically, we establish non-asymptotic error bounds for the proposed estimators. We demonstrate the advantages of the proposed method through both numerical experiments and an application to the Human Connectome Project fMRI data

    DiffSketching: Sketch Control Image Synthesis with Diffusion Models

    Full text link
    Creative sketch is a universal way of visual expression, but translating images from an abstract sketch is very challenging. Traditionally, creating a deep learning model for sketch-to-image synthesis needs to overcome the distorted input sketch without visual details, and requires to collect large-scale sketch-image datasets. We first study this task by using diffusion models. Our model matches sketches through the cross domain constraints, and uses a classifier to guide the image synthesis more accurately. Extensive experiments confirmed that our method can not only be faithful to user's input sketches, but also maintain the diversity and imagination of synthetic image results. Our model can beat GAN-based method in terms of generation quality and human evaluation, and does not rely on massive sketch-image datasets. Additionally, we present applications of our method in image editing and interpolation
    • …
    corecore