Kymatio: Scattering Transforms in Python
The wavelet scattering transform is an invariant signal representation
suitable for many signal processing and machine learning applications. We
present the Kymatio software package, an easy-to-use, high-performance Python
implementation of the scattering transform in 1D, 2D, and 3D that is compatible
with modern deep learning frameworks. All transforms may be executed on a GPU
(in addition to CPU), offering a considerable speed up over CPU
implementations. The package also has a small memory footprint, resulting in
efficient memory usage. The source code, documentation, and examples are
available under a BSD license at https://www.kymat.io.
Discriminative Segmental Cascades for Feature-Rich Phone Recognition
Discriminative segmental models, such as segmental conditional random fields
(SCRFs) and segmental structured support vector machines (SSVMs), have had
success in speech recognition via both lattice rescoring and first-pass
decoding. However, such models suffer from slow decoding, hampering the use of
computationally expensive features, such as segment neural networks or other
high-order features. A typical solution is to use approximate decoding, either
by beam pruning in a single pass or by beam pruning to generate a lattice
followed by a second pass. In this work, we study discriminative segmental
models trained with a hinge loss (i.e., segmental structured SVMs). We show
that beam search is not suitable for learning rescoring models in this
approach, though it gives good approximate decoding performance when the model
is already well-trained. Instead, we consider an approach inspired by
structured prediction cascades, which use max-marginal pruning to generate
lattices. We obtain a high-accuracy phonetic recognition system with several
expensive feature types: a segment neural network, a second-order language
model, and second-order phone boundary features.
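The max-marginal pruning mentioned in this abstract can be sketched on a toy lattice. The block below is a minimal illustration, not the authors' system: the function name `max_marginal_prune`, the edge format, and the toy scores are all hypothetical, and the threshold (a mix of the best path score and the mean max-marginal, weighted by `alpha`) is an assumed choice borrowed from the structured prediction cascades literature, since the abstract does not state one.

```python
def max_marginal_prune(num_nodes, edges, alpha=0.5):
    """Keep lattice edges whose max-marginal reaches a threshold.

    edges: list of (src, dst, label, score) with src < dst, so node ids are
    already in topological order. Node 0 is the start, num_nodes - 1 the end.
    The max-marginal of an edge is the score of the best full path through it.
    """
    neg_inf = float("-inf")
    last = num_nodes - 1

    # forward[v]: best score of any path from the start node to v.
    forward = [neg_inf] * num_nodes
    forward[0] = 0.0
    for src, dst, _label, score in sorted(edges, key=lambda e: e[0]):
        forward[dst] = max(forward[dst], forward[src] + score)

    # backward[v]: best score of any path from v to the end node.
    backward = [neg_inf] * num_nodes
    backward[last] = 0.0
    for src, dst, _label, score in sorted(edges, key=lambda e: -e[0]):
        backward[src] = max(backward[src], score + backward[dst])

    max_marginals = [forward[s] + w + backward[d] for s, d, _l, w in edges]
    finite = [m for m in max_marginals if m > neg_inf]

    # Assumed threshold: interpolate between the best path score and the
    # mean max-marginal. alpha = 1 keeps only edges on a best path.
    mean_mm = sum(finite) / len(finite)
    threshold = alpha * forward[last] + (1.0 - alpha) * mean_mm
    return [e for e, m in zip(edges, max_marginals) if m >= threshold]


# Toy lattice with hypothetical segment labels and scores.
toy_edges = [
    (0, 1, "a", 2.0),
    (0, 1, "b", 1.0),
    (1, 2, "c", 1.0),
    (1, 2, "d", -3.0),
    (2, 3, "e", 0.5),
    (0, 2, "f", 1.0),
]
kept = max_marginal_prune(4, toy_edges, alpha=0.5)
print([label for _s, _d, label, _w in kept])
```

Unlike beam pruning, which discards hypotheses based on partial path scores, this rule scores each edge by the best complete path that uses it, so an edge on the best path can never be removed.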
Adaptive DCTNet for Audio Signal Classification
In this paper, we investigate DCTNet for audio signal classification. Its
output feature is related to Cohen's class of time-frequency distributions. We
introduce the use of adaptive DCTNet (A-DCTNet) for audio signal feature
extraction. The A-DCTNet applies the idea of the constant-Q transform, with the
center frequencies of its filterbanks geometrically spaced. The A-DCTNet is
adaptive to different acoustic scales, and it can better capture low-frequency
acoustic information that is sensitive to human audio perception than features
such as Mel-frequency spectral coefficients (MFSC). We use features extracted
by the A-DCTNet as input for classifiers. Experimental results show that the
A-DCTNet and Recurrent Neural Networks (RNN) achieve a state-of-the-art
bird song classification rate, and improve artist identification
accuracy in music data. They demonstrate A-DCTNet's applicability to signal
processing problems.
Comment: International Conference of Acoustic and Speech Signal Processing
(ICASSP). New Orleans, United States, March, 201
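The geometric spacing of center frequencies behind the constant-Q idea can be illustrated with a short sketch. It is not the A-DCTNet filterbank itself: the function names, the starting frequency of 55 Hz, and the 12 bins per octave are assumptions chosen only for illustration.

```python
def constant_q_centers(f_min, bins_per_octave, num_bins):
    """Center frequencies spaced geometrically: f_k = f_min * 2^(k / B)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(num_bins)]


def quality_factor(bins_per_octave):
    """Ratio of center frequency to the spacing up to the next bin.

    With geometric spacing this ratio is the same for every bin,
    which is where the name "constant-Q" comes from.
    """
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


# Hypothetical settings: two octaves above 55 Hz, 12 bins per octave.
centers = constant_q_centers(55.0, 12, 25)
print(round(centers[0], 2), round(centers[12], 2), round(centers[24], 2))
print(round(quality_factor(12), 2))
```

Because each octave receives the same number of bins, low frequencies are sampled much more densely in Hz than high frequencies, which matches the abstract's point about capturing low-frequency information.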
A Deep Representation for Invariance And Music Classification
Representations in the auditory cortex might be based on mechanisms similar
to the visual ventral stream: modules for building invariance to
transformations and multiple layers for compositionality and selectivity. In
this paper we propose the use of such computational modules for extracting
invariant and discriminative audio representations. Building on a theory of
invariance in hierarchical architectures, we propose a novel, mid-level
representation for acoustical signals, using the empirical distributions of
projections on a set of templates and their transformations. Under the
assumption that, by construction, this dictionary of templates is composed from
similar classes, and samples the orbit of variance-inducing signal
transformations (such as shift and scale), the resulting signature is
theoretically guaranteed to be unique, invariant to transformations and stable
to deformations. Modules of projection and pooling can then constitute layers
of deep networks, for learning composite representations. We present the main
theoretical and computational aspects of a framework for unsupervised learning
of invariant audio representations, empirically evaluated on music genre
classification.
Comment: 5 pages, CBMM Memo No. 002, (to appear) IEEE 2014 International
Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014)
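The signature described in this abstract, built from empirical distributions of projections onto templates and their transformations, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the transformation group is taken to be circular shifts, the empirical distribution is approximated by a fixed-bin histogram, and all names, templates, and bin edges are hypothetical.

```python
def circular_shift(v, s):
    """Return v rotated right by s positions."""
    s %= len(v)
    return v[-s:] + v[:-s]


def orbit_projections(x, template):
    """Dot products of x with every circular shift of the template."""
    return [
        sum(a * b for a, b in zip(x, circular_shift(template, s)))
        for s in range(len(x))
    ]


def histogram(values, edges):
    """Count values falling in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def signature(x, templates, edges):
    """Concatenate one histogram of orbit projections per template."""
    sig = []
    for t in templates:
        sig.extend(histogram(orbit_projections(x, t), edges))
    return sig


# Hypothetical signal, templates, and bin edges.
x = [0, 1, 3, 1, 0, 0, 0, 2]
templates = [[1, 2, 1, 0, 0, 0, 0, 0], [1, -1, 0, 0, 0, 0, 0, 0]]
edges = [-4, -2, 0, 2, 4, 6, 8, 10]

print(signature(x, templates, edges))
print(signature(circular_shift(x, 3), templates, edges))
```

Shifting the input only permutes the set of projections onto the template orbit, so the histogram, which pools over that set, is unchanged. This is the invariance property the abstract refers to, here restricted to exact circular shifts.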