22,474 research outputs found
Audio Classification from Time-Frequency Texture
Time-frequency representations of audio signals often resemble texture
images. This paper derives a simple audio classification algorithm based on
treating sound spectrograms as texture images. The algorithm is inspired by an
earlier visual classification scheme particularly efficient at classifying
textures. While solely based on time-frequency texture features, the algorithm
achieves surprisingly good performance in musical instrument classification
experiments
Histogram of gradients of Time-Frequency Representations for Audio scene detection
This paper addresses the problem of audio scenes classification and
contributes to the state of the art by proposing a novel feature. We build this
feature by considering histogram of gradients (HOG) of time-frequency
representation of an audio scene. Contrarily to classical audio features like
MFCC, we make the hypothesis that histogram of gradients are able to encode
some relevant informations in a time-frequency {representation:} namely, the
local direction of variation (in time and frequency) of the signal spectral
power. In addition, in order to gain more invariance and robustness, histogram
of gradients are locally pooled. We have evaluated the relevance of {the novel
feature} by comparing its performances with state-of-the-art competitors, on
several datasets, including a novel one that we provide, as part of our
contribution. This dataset, that we make publicly available, involves
classes and contains about minutes of audio scene recording. We thus
believe that it may be the next standard dataset for evaluating audio scene
classification algorithms. Our comparison results clearly show that our
HOG-based features outperform its competitor
On Using Backpropagation for Speech Texture Generation and Voice Conversion
Inspired by recent work on neural network image generation which rely on
backpropagation towards the network inputs, we present a proof-of-concept
system for speech texture synthesis and voice conversion based on two
mechanisms: approximate inversion of the representation learned by a speech
recognition neural network, and on matching statistics of neuron activations
between different source and target utterances. Similar to image texture
synthesis and neural style transfer, the system works by optimizing a cost
function with respect to the input waveform samples. To this end we use a
differentiable mel-filterbank feature extraction pipeline and train a
convolutional CTC speech recognition network. Our system is able to extract
speaker characteristics from very limited amounts of target speaker data, as
little as a few seconds, and can be used to generate realistic speech babble or
reconstruct an utterance in a different voice.Comment: Accepted to ICASSP 201
A Compact and Discriminative Feature Based on Auditory Summary Statistics for Acoustic Scene Classification
One of the biggest challenges of acoustic scene classification (ASC) is to
find proper features to better represent and characterize environmental sounds.
Environmental sounds generally involve more sound sources while exhibiting less
structure in temporal spectral representations. However, the background of an
acoustic scene exhibits temporal homogeneity in acoustic properties, suggesting
it could be characterized by distribution statistics rather than temporal
details. In this work, we investigated using auditory summary statistics as the
feature for ASC tasks. The inspiration comes from a recent neuroscience study,
which shows the human auditory system tends to perceive sound textures through
time-averaged statistics. Based on these statistics, we further proposed to use
linear discriminant analysis to eliminate redundancies among these statistics
while keeping the discriminative information, providing an extreme com-pact
representation for acoustic scenes. Experimental results show the outstanding
performance of the proposed feature over the conventional handcrafted features.Comment: Accepted as a conference paper of Interspeech 201
- …