Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging
Environmental audio tagging is a newly proposed task to predict the presence
or absence of a specific audio event in a chunk. Deep neural network (DNN)
based methods have been successfully adopted for predicting the audio tags in
the domestic audio scene. In this paper, we propose to use a convolutional
neural network (CNN) to extract robust features from mel-filter banks (MFBs),
spectrograms or even raw waveforms for audio tagging. Gated recurrent unit
(GRU) based recurrent neural networks (RNNs) are then cascaded to model the
long-term temporal structure of the audio signal. To complement the input
information, an auxiliary CNN is designed to learn on the spatial features of
stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging)
of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE
2016) challenge. Compared with our recent DNN-based method, the proposed
structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the
development set. The spatial features can further reduce the EER to 0.10. The
performance of the end-to-end learning on raw waveforms is also comparable.
Finally, on the evaluation set, we get the state-of-the-art performance with
0.12 EER while the performance of the best existing system is 0.15 EER.
Comment: Accepted to IJCNN2017, Anchorage, Alaska, US
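The gated recurrence that lets the RNN stage model long-term temporal structure can be sketched in plain NumPy. This is the standard GRU cell, not the paper's exact implementation; the dimensions, weights, and per-frame inputs are hypothetical stand-ins for the CNN feature maps.

```python
import numpy as np

def gru_step(x, h, params):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(Wz @ x + Uz @ h)                   # update gate
    r = sig(Wr @ x + Ur @ h)                   # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate activation
    return (1.0 - z) * h + z * h_tilde         # interpolate old and new state

rng = np.random.default_rng(0)
d_in, d_hid, T = 8, 4, 20   # hypothetical feature size, hidden size, frames
# Even indices are input weights (d_hid x d_in), odd are recurrent weights.
params = [rng.standard_normal((d_hid, d_in)) * 0.1 if i % 2 == 0
          else rng.standard_normal((d_hid, d_hid)) * 0.1 for i in range(6)]
h = np.zeros(d_hid)
for x in rng.standard_normal((T, d_in)):   # CNN features, frame by frame
    h = gru_step(x, h, params)
# The final state summarizes the whole chunk for the tag classifier.
```

Because the update gate interpolates between the previous state and the candidate, the hidden state stays bounded and can carry information across many frames.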
Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging
Audio tagging aims to perform multi-label classification on audio chunks and
it is a newly proposed task in the Detection and Classification of Acoustic
Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research
efforts to better analyze and understand the content of the huge amounts of
audio data on the web. The difficulty in audio tagging is that it only has a
chunk-level label without a frame-level label. This paper presents a weakly
supervised method to not only predict the tags but also indicate the temporal
locations of the occurring acoustic events. The attention scheme is found to be
effective in identifying the important frames while ignoring the unrelated
frames. The proposed framework is a deep convolutional recurrent model with two
auxiliary modules: an attention module and a localization module. The proposed
algorithm was evaluated on the Task 4 of DCASE 2016 challenge. State-of-the-art
performance was achieved on the evaluation set with equal error rate (EER)
reduced from 0.13 to 0.11, compared with the convolutional recurrent baseline
system.
Comment: 5 pages, submitted to interspeech201
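The frame-weighting idea behind the attention module can be sketched as follows. The shapes, logits, and attention scores here are invented for illustration; a real system learns both the per-frame tag predictions and the attention scores from data.

```python
import numpy as np

def attend_and_pool(frame_logits, frame_attn):
    """Weight per-frame tag probabilities by attention, pool to chunk level."""
    w = np.exp(frame_attn - frame_attn.max())          # stable softmax over time
    w = w / w.sum()
    frame_probs = 1.0 / (1.0 + np.exp(-frame_logits))  # sigmoid: multi-label tags
    return w @ frame_probs, w   # chunk-level tag probabilities, frame weights

T, K = 10, 7                    # hypothetical: number of frames, tag classes
rng = np.random.default_rng(1)
chunk_probs, w = attend_and_pool(rng.standard_normal((T, K)),
                                 rng.standard_normal(T))
# w highlights which frames drove the chunk-level decision, which is how a
# chunk-level label can still yield approximate temporal localization.
```

The attention weights sum to one over time, so frames with low scores are effectively ignored in the pooled prediction, matching the paper's goal of identifying important frames without frame-level labels.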
Supervised Principal Component Regression for Functional Responses with High Dimensional Predictors
We propose a supervised principal component regression method for relating
functional responses with high dimensional predictors. Unlike the conventional
principal component analysis, the proposed method builds on a newly defined
expected integrated residual sum of squares, which directly makes use of the
association between the functional response and the predictors. Minimizing the
integrated residual sum of squares gives the supervised principal components,
which is equivalent to solving a sequence of nonconvex generalized Rayleigh
quotient optimization problems. We reformulate the nonconvex optimization
problems into a simultaneous linear regression with a sparse penalty to deal
with high dimensional predictors. Theoretically, we show that the reformulated
regression problem can recover the same supervised principal subspace under
suitable conditions. Statistically, we establish non-asymptotic error bounds
for the proposed estimators. We demonstrate the advantages of the proposed
method through both numerical experiments and an application to the Human
Connectome Project fMRI data.
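The generalized Rayleigh quotient subproblem mentioned above can be illustrated with a small NumPy example that maximizes (v'Av)/(v'Bv) by whitening with B^{-1/2}. The matrices A and B below are arbitrary symmetric stand-ins for the paper's covariance-type operators, not its actual estimators.

```python
import numpy as np

def top_generalized_eigvec(A, B):
    """argmax_v (v'Av)/(v'Bv): whiten by B^{-1/2}, take the top eigenvector."""
    vals, vecs = np.linalg.eigh(B)
    B_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    evals, evecs = np.linalg.eigh(B_inv_half @ A @ B_inv_half)
    v = B_inv_half @ evecs[:, -1]   # map back from whitened coordinates
    return v / np.linalg.norm(v)

rng = np.random.default_rng(2)
p = 5
M = rng.standard_normal((p, p)); A = M @ M.T                   # symmetric PSD
N = rng.standard_normal((p, p)); B = N @ N.T + p * np.eye(p)   # SPD
v = top_generalized_eigvec(A, B)
quotient = (v @ A @ v) / (v @ B @ v)   # the maximal Rayleigh quotient value
```

Solving a sequence of such problems (with deflation) yields successive supervised components; the paper's contribution is reformulating this nonconvex sequence as a penalized simultaneous regression that scales to high-dimensional predictors.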
DiffSketching: Sketch Control Image Synthesis with Diffusion Models
Sketching is a universal form of visual expression, but synthesizing
images from an abstract sketch is very challenging. Traditionally, a deep
learning model for sketch-to-image synthesis must overcome the distortion of
the input sketch, which lacks visual details, and requires collecting
large-scale sketch-image datasets. We first study this task using diffusion
models. Our model matches sketches through cross-domain constraints, and
uses a classifier to guide the image synthesis more accurately. Extensive
experiments confirm that our method is not only faithful to the user's input
sketches, but also maintains the diversity and imagination of the synthesized
images. Our model outperforms GAN-based methods in generation quality and
human evaluation, and does not rely on massive sketch-image datasets.
Additionally, we present applications of our method in image editing and
interpolation.
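The classifier-guidance idea can be illustrated with a deliberately simplified sketch: a toy logistic classifier stands in for the real noisy-image classifier, the denoising network itself is omitted, and the dimensions and step sizes are made up. Only the guidance shift, a step along the gradient of the classifier's log-probability for the target class, is shown.

```python
import numpy as np

def classifier_log_grad(x, w):
    """grad_x log p(y=1|x) for a logistic classifier p = sigmoid(w.x)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return (1.0 - p) * w

def guided_step(x, w, sigma=0.1, scale=2.0):
    # The denoising model's mean update is omitted; in a full sampler the
    # guidance term below is added to that mean at every reverse step.
    return x + scale * sigma ** 2 * classifier_log_grad(x, w)

w = np.array([1.0, -1.0])   # hypothetical classifier weights
x = np.zeros(2)
for _ in range(50):
    x = guided_step(x, w)
# The sample drifts toward the region the classifier assigns to the target
# class, which is how the classifier steers synthesis toward the sketch.
```

The `scale` parameter trades fidelity to the classifier (and hence to the sketch) against sample diversity, echoing the faithfulness/diversity balance the abstract describes.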