29,096 research outputs found
Audio Classification from Time-Frequency Texture
Time-frequency representations of audio signals often resemble texture
images. This paper derives a simple audio classification algorithm based on
treating sound spectrograms as texture images. The algorithm is inspired by an
earlier visual classification scheme particularly efficient at classifying
textures. While solely based on time-frequency texture features, the algorithm
achieves surprisingly good performance in musical instrument classification
experiments.
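A minimal sketch of the spectrogram-as-texture idea (not the paper's algorithm, which builds on a specific visual texture classifier) might summarize each log-mel spectrogram with simple per-band texture statistics and feed them to an off-the-shelf classifier; the file names and labels below are hypothetical:

```python
# Sketch only: per-band spectrogram texture statistics + SVM classifier.
import numpy as np
import librosa
from sklearn.svm import SVC

def texture_features(path, n_mels=64):
    y, sr = librosa.load(path, sr=22050)
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    # Treat each mel band as a texture row: keep its mean, variance, and the
    # mean absolute difference between neighboring frames (local "roughness").
    d = np.abs(np.diff(S, axis=1))
    return np.concatenate([S.mean(1), S.var(1), d.mean(1)])

# Hypothetical dataset: (file path, instrument label) pairs.
train = [("flute_01.wav", "flute"), ("cello_01.wav", "cello")]
X = np.stack([texture_features(p) for p, _ in train])
clf = SVC(kernel="rbf").fit(X, [lbl for _, lbl in train])
print(clf.predict([texture_features("unknown.wav")]))
```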
Eventness: Object Detection on Spectrograms for Temporal Localization of Audio Events
In this paper, we introduce the concept of Eventness for audio event
detection, which can, in part, be thought of as an analogue to Objectness from
computer vision. The key observation behind the eventness concept is that audio
events reveal themselves as 2-dimensional time-frequency patterns with specific
textures and geometric structures in spectrograms. These time-frequency
patterns can then be viewed analogously to objects occurring in natural images
(with the exception that scaling and rotation invariance properties do not
apply). With this key observation in mind, we pose the problem of detecting
monophonic or polyphonic audio events as an equivalent problem of detecting
visual objects under partial occlusion and clutter in spectrograms. We adapt
a state-of-the-art visual object detection model to evaluate the audio event
detection task on publicly available datasets. The proposed network achieves results comparable to a state-of-the-art baseline and is more robust on minority events. Given large-scale datasets, we hope that our proposed conceptual model of eventness will benefit the audio signal processing community in improving the performance of audio event detection.

Comment: 5 pages, 3 figures, accepted to ICASSP 201
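As a hedged illustration of the plumbing this implies (not the authors' network), one can render a spectrogram as an image, pass it through an off-the-shelf torchvision detector, and read the returned boxes back as time-frequency extents; in practice the detector would need fine-tuning on labeled spectrograms, and clip.wav is a hypothetical input:

```python
# Illustrative plumbing only: spectrogram -> pseudo-RGB image -> detector,
# detector boxes interpreted as (time frame, mel bin) event extents.
import torch, torchvision, librosa

y, sr = librosa.load("clip.wav")  # hypothetical audio clip
S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
img = (S - S.min()) / (S.max() - S.min())             # normalize to [0, 1]
img = torch.tensor(img, dtype=torch.float32).flip(0)  # low freqs at bottom
img = img.unsqueeze(0).repeat(3, 1, 1)                # fake RGB channels

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()
with torch.no_grad():
    dets = model([img])[0]
for box, score in zip(dets["boxes"], dets["scores"]):
    if score > 0.5:
        x0, y0, x1, y1 = box.tolist()  # x = time frame, y = mel bin
        print(f"event frames {x0:.0f}-{x1:.0f}, mel bins {y0:.0f}-{y1:.0f}")
```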
To “Sketch-a-Scratch”
A surface can be harsh and raspy, or smooth and silky, and everything in between. We are used to sensing these features with our fingertips as well as with our eyes and ears: the exploration of a surface is a multisensory experience. Tools, too, are often employed in the interaction with surfaces, since they augment our manipulation capabilities. “Sketch-a-Scratch” is a tool for the multisensory exploration and sketching of surface textures. The user’s actions drive a physical sound model of real materials’ response to interactions such as scraping, rubbing or rolling. Moreover, different input signals can be converted into 2D visual surface profiles, thus enabling users to experience them visually, aurally and haptically.
On Using Backpropagation for Speech Texture Generation and Voice Conversion
Inspired by recent work on neural network image generation that relies on
backpropagation towards the network inputs, we present a proof-of-concept
system for speech texture synthesis and voice conversion based on two
mechanisms: approximate inversion of the representation learned by a speech
recognition neural network, and matching statistics of neuron activations
between source and target utterances. Similar to image texture
synthesis and neural style transfer, the system works by optimizing a cost
function with respect to the input waveform samples. To this end we use a
differentiable mel-filterbank feature extraction pipeline and train a
convolutional CTC speech recognition network. Our system is able to extract
speaker characteristics from very limited amounts of target speaker data, as
little as a few seconds, and can be used to generate realistic speech babble or
reconstruct an utterance in a different voice.

Comment: Accepted to ICASSP 201
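A toy version of the underlying optimization loop is sketched below, assuming PyTorch/torchaudio; a differentiable mel-spectrogram stands in for the trained CTC recognizer's activations, and target_speaker.wav is a hypothetical file:

```python
# Toy sketch: optimize the input waveform by backpropagation so that its
# per-band feature statistics match those of a target speaker's clip.
import torch, torchaudio

sr = 16000
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=80)

target, _ = torchaudio.load("target_speaker.wav")  # hypothetical file
with torch.no_grad():
    T = torch.log(mel(target) + 1e-6)
    t_mean, t_std = T.mean(-1), T.std(-1)          # per-band statistics

x = torch.randn(1, sr * 2, requires_grad=True)     # 2 s of trainable audio
opt = torch.optim.Adam([x], lr=1e-3)
for step in range(500):
    F = torch.log(mel(x) + 1e-6)
    loss = ((F.mean(-1) - t_mean) ** 2).mean() + ((F.std(-1) - t_std) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

torchaudio.save("babble.wav", x.detach().clamp(-1, 1), sr)
```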
Ambient Sound Provides Supervision for Visual Learning
The sound of crashing waves, the roar of fast-moving cars -- sound conveys
important information about the objects in our surroundings. In this work, we
show that ambient sounds can be used as a supervisory signal for learning
visual models. To demonstrate this, we train a convolutional neural network to
predict a statistical summary of the sound associated with a video frame. We
show that, through this process, the network learns a representation that
conveys information about objects and scenes. We evaluate this representation
on several recognition tasks, finding that its performance is comparable to
that of other state-of-the-art unsupervised learning methods. Finally, we show
through visualizations that the network learns units that are selective to
objects that are often associated with characteristic sounds.

Comment: ECCV 201
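Schematically, the supervisory signal could be wired up as in the sketch below (not the paper's architecture; the frames and audio summaries are stand-in tensors, and the paper predicts clustered sound statistics rather than raw vectors):

```python
# Sketch: a small CNN maps a video frame to a fixed-length summary of the
# frame's co-occurring audio, trained with a plain regression loss.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 128),            # 128-dim audio-statistic prediction
)
opt = torch.optim.Adam(cnn.parameters(), lr=1e-4)

frames = torch.rand(8, 3, 128, 128)     # stand-in video frames
sound_stats = torch.rand(8, 128)        # stand-in audio summaries
loss = nn.functional.mse_loss(cnn(frames), sound_stats)
opt.zero_grad(); loss.backward(); opt.step()
# After training, the CNN's features serve as the learned visual representation.
```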
Audio style transfer
'Style transfer' among images has recently emerged as a very active research
topic, fuelled by the power of convolutional neural networks (CNNs), and has
fast become a very popular technology in social media. This paper investigates
the analogous problem in the audio domain: How to transfer the style of a
reference audio signal to a target audio content? We propose a flexible
framework for the task, which uses a sound texture model to extract statistics
characterizing the reference audio style, followed by an optimization-based
audio texture synthesis to modify the target content. In contrast to mainstream
optimization-based visual style transfer methods, the proposed process is
initialized with the target content instead of random noise, and the optimized
loss concerns only texture, not structure. These differences proved key for
audio style transfer in our experiments. In order to extract features of interest, we
investigate different architectures, whether pre-trained on other tasks, as
done in image style transfer, or engineered based on the human auditory system.
Experimental results on different types of audio signals confirm the potential of the proposed approach.

Comment: ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2018, Calgary, Canada. IEEE
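A hedged sketch of the described procedure (content initialization, texture-only loss) is given below, using a fixed random filter bank as the feature extractor rather than the authors' pre-trained or auditory-model features; content.wav and style.wav are hypothetical inputs:

```python
# Sketch: start from the content waveform (not noise) and minimize a
# Gram-matrix texture loss against the style clip's spectrogram features.
import torch, torchaudio
import torch.nn.functional as F

spec = torchaudio.transforms.MelSpectrogram(sample_rate=22050, n_mels=128)
filt = torch.randn(256, 1, 3, 11)                  # fixed random filter bank

def gram(wav):
    f = F.relu(F.conv2d(torch.log(spec(wav) + 1e-6).unsqueeze(1), filt))
    f = f.flatten(2)                               # (batch, channels, positions)
    return f @ f.transpose(1, 2) / f.shape[-1]     # channel co-occurrences

content, _ = torchaudio.load("content.wav")        # hypothetical inputs
style, _ = torchaudio.load("style.wav")
with torch.no_grad():
    G_style = gram(style)

x = content.clone().requires_grad_(True)           # init from content
opt = torch.optim.Adam([x], lr=1e-3)
for _ in range(300):
    loss = F.mse_loss(gram(x), G_style)            # texture only, no structure term
    opt.zero_grad(); loss.backward(); opt.step()
torchaudio.save("stylized.wav", x.detach().clamp(-1, 1), 22050)
```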