4,083 research outputs found
auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks
auDeep is a Python toolkit for deep unsupervised representation learning from
acoustic data. It is based on a recurrent sequence to sequence autoencoder
approach which can learn representations of time series data by taking into
account their temporal dynamics. We provide an extensive command line interface
in addition to a Python API for users and developers, both of which are
comprehensively documented and publicly available at
https://github.com/auDeep/auDeep. Experimental results indicate that auDeep
features are competitive with state-of-the art audio classification
A Deep Representation for Invariance And Music Classification
Representations in the auditory cortex might be based on mechanisms similar
to the visual ventral stream; modules for building invariance to
transformations and multiple layers for compositionality and selectivity. In
this paper we propose the use of such computational modules for extracting
invariant and discriminative audio representations. Building on a theory of
invariance in hierarchical architectures, we propose a novel, mid-level
representation for acoustical signals, using the empirical distributions of
projections on a set of templates and their transformations. Under the
assumption that, by construction, this dictionary of templates is composed from
similar classes, and samples the orbit of variance-inducing signal
transformations (such as shift and scale), the resulting signature is
theoretically guaranteed to be unique, invariant to transformations and stable
to deformations. Modules of projection and pooling can then constitute layers
of deep networks, for learning composite representations. We present the main
theoretical and computational aspects of a framework for unsupervised learning
of invariant audio representations, empirically evaluated on music genre
classification.Comment: 5 pages, CBMM Memo No. 002, (to appear) IEEE 2014 International
Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014
Learning Sparse Adversarial Dictionaries For Multi-Class Audio Classification
Audio events are quite often overlapping in nature, and more prone to noise
than visual signals. There has been increasing evidence for the superior
performance of representations learned using sparse dictionaries for
applications like audio denoising and speech enhancement. This paper
concentrates on modifying the traditional reconstructive dictionary learning
algorithms, by incorporating a discriminative term into the objective function
in order to learn class-specific adversarial dictionaries that are good at
representing samples of their own class at the same time poor at representing
samples belonging to any other class. We quantitatively demonstrate the
effectiveness of our learned dictionaries as a stand-alone solution for both
binary as well as multi-class audio classification problems.Comment: Accepted in Asian Conference of Pattern Recognition (ACPR-2017
Audio-based music classification with a pretrained convolutional network
Recently the ‘Million Song Dataset’, containing audio features and metadata for one million songs, was made available. In this paper, we build a convolutional network that is then trained to perform artist recognition, genre recognition and key detection. The network is tailored to summarize the audio features over musically significant timescales. It is infeasible to train the network on all available data in a supervised fashion, so we use unsupervised pretraining to be able to harness the entire dataset: we train a convolutional deep belief network on all data, and then use the learnt parameters to initialize a convolutional multilayer perceptron with the same architecture. The MLP is then trained on a labeled subset of the data for each task. We also train the same MLP with randomly initialized weights. We find that our convolutional approach improves accuracy for the genre recognition and artist recognition tasks. Unsupervised pretraining improves convergence speed in all cases. For artist recognition it improves accuracy as well
Unsupervised decoding of long-term, naturalistic human neural recordings with automated video and audio annotations
Fully automated decoding of human activities and intentions from direct
neural recordings is a tantalizing challenge in brain-computer interfacing.
Most ongoing efforts have focused on training decoders on specific, stereotyped
tasks in laboratory settings. Implementing brain-computer interfaces (BCIs) in
natural settings requires adaptive strategies and scalable algorithms that
require minimal supervision. Here we propose an unsupervised approach to
decoding neural states from human brain recordings acquired in a naturalistic
context. We demonstrate our approach on continuous long-term
electrocorticographic (ECoG) data recorded over many days from the brain
surface of subjects in a hospital room, with simultaneous audio and video
recordings. We first discovered clusters in high-dimensional ECoG recordings
and then annotated coherent clusters using speech and movement labels extracted
automatically from audio and video recordings. To our knowledge, this
represents the first time techniques from computer vision and speech processing
have been used for natural ECoG decoding. Our results show that our
unsupervised approach can discover distinct behaviors from ECoG data, including
moving, speaking and resting. We verify the accuracy of our approach by
comparing to manual annotations. Projecting the discovered cluster centers back
onto the brain, this technique opens the door to automated functional brain
mapping in natural settings
- …