29,096 research outputs found
Audio Classification from Time-Frequency Texture
Time-frequency representations of audio signals often resemble texture
images. This paper derives a simple audio classification algorithm based on
treating sound spectrograms as texture images. The algorithm is inspired by an
earlier visual classification scheme particularly efficient at classifying
textures. While solely based on time-frequency texture features, the algorithm
achieves surprisingly good performance in musical instrument classification
experiments.
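A minimal sketch of the spectrogram-as-texture idea (not the paper's algorithm, which builds on a specific visual texture classifier) might summarize each log-mel spectrogram with simple per-band texture statistics and feed them to an off-the-shelf classifier; the file names and labels below are hypothetical:

```python
# Sketch only: per-band spectrogram texture statistics + SVM classifier.
import numpy as np
import librosa
from sklearn.svm import SVC

def texture_features(path, n_mels=64):
    y, sr = librosa.load(path, sr=22050)
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    # Treat each mel band as a texture row: keep its mean, variance, and the
    # mean absolute difference between neighboring frames (local "roughness").
    d = np.abs(np.diff(S, axis=1))
    return np.concatenate([S.mean(1), S.var(1), d.mean(1)])

# Hypothetical dataset: (file path, instrument label) pairs.
train = [("flute_01.wav", "flute"), ("cello_01.wav", "cello")]
X = np.stack([texture_features(p) for p, _ in train])
clf = SVC(kernel="rbf").fit(X, [lbl for _, lbl in train])
print(clf.predict([texture_features("unknown.wav")]))
```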
Eventness: Object Detection on Spectrograms for Temporal Localization of Audio Events
In this paper, we introduce the concept of Eventness for audio event
detection, which can, in part, be thought of as an analogue to Objectness from
computer vision. The key observation behind the eventness concept is that audio
events reveal themselves as 2-dimensional time-frequency patterns with specific
textures and geometric structures in spectrograms. These time-frequency
patterns can then be viewed analogously to objects occurring in natural images
(with the exception that scaling and rotation invariance properties do not
apply). With this key observation in mind, we pose the problem of detecting
monophonic or polyphonic audio events as an equivalent problem of detecting
visual objects under partial occlusion and clutter in spectrograms. We adapt
a state-of-the-art visual object detection model to evaluate the audio event
detection task on publicly available datasets. The proposed network achieves results comparable to a state-of-the-art baseline and is more robust on minority events. Given large-scale datasets, we hope that our proposed conceptual model of eventness will benefit the audio signal processing community in improving the performance of audio event detection.

Comment: 5 pages, 3 figures, accepted to ICASSP 201
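As a hedged illustration of the plumbing this implies (not the authors' network), one can render a spectrogram as an image, pass it through an off-the-shelf torchvision detector, and read the returned boxes back as time-frequency extents; in practice the detector would need fine-tuning on labeled spectrograms, and clip.wav is a hypothetical input:

```python
# Illustrative plumbing only: spectrogram -> pseudo-RGB image -> detector,
# detector boxes interpreted as (time frame, mel bin) event extents.
import torch, torchvision, librosa

y, sr = librosa.load("clip.wav")  # hypothetical audio clip
S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
img = (S - S.min()) / (S.max() - S.min())             # normalize to [0, 1]
img = torch.tensor(img, dtype=torch.float32).flip(0)  # low freqs at bottom
img = img.unsqueeze(0).repeat(3, 1, 1)                # fake RGB channels

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()
with torch.no_grad():
    dets = model([img])[0]
for box, score in zip(dets["boxes"], dets["scores"]):
    if score > 0.5:
        x0, y0, x1, y1 = box.tolist()  # x = time frame, y = mel bin
        print(f"event frames {x0:.0f}-{x1:.0f}, mel bins {y0:.0f}-{y1:.0f}")
```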
To “Sketch-a-Scratch”
A surface can be harsh and raspy, or smooth and silky, and everything in between. We are used to sensing these features with our fingertips as well as with our eyes and ears: the exploration of a surface is a multisensory experience. Tools, too, are often employed in the interaction with surfaces, since they augment our manipulation capabilities. “Sketch-a-Scratch” is a tool for the multisensory exploration and sketching of surface textures. The user’s actions drive a physical sound model of real materials’ response to interactions such as scraping, rubbing or rolling. Moreover, different input signals can be converted into 2D visual surface profiles, thus enabling users to experience them visually, aurally and haptically.
On Using Backpropagation for Speech Texture Generation and Voice Conversion
Inspired by recent work on neural network image generation that relies on
backpropagation towards the network inputs, we present a proof-of-concept
system for speech texture synthesis and voice conversion based on two
mechanisms: approximate inversion of the representation learned by a speech
recognition neural network, and matching statistics of neuron activations
between source and target utterances. Similar to image texture
synthesis and neural style transfer, the system works by optimizing a cost
function with respect to the input waveform samples. To this end we use a
differentiable mel-filterbank feature extraction pipeline and train a
convolutional CTC speech recognition network. Our system is able to extract
speaker characteristics from very limited amounts of target speaker data, as
little as a few seconds, and can be used to generate realistic speech babble or
reconstruct an utterance in a different voice.

Comment: Accepted to ICASSP 201
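A toy version of the underlying optimization loop is sketched below, assuming PyTorch/torchaudio; a differentiable mel-spectrogram stands in for the trained CTC recognizer's activations, and target_speaker.wav is a hypothetical file:

```python
# Toy sketch: optimize the input waveform by backpropagation so that its
# per-band feature statistics match those of a target speaker's clip.
import torch, torchaudio

sr = 16000
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=80)

target, _ = torchaudio.load("target_speaker.wav")  # hypothetical file
with torch.no_grad():
    T = torch.log(mel(target) + 1e-6)
    t_mean, t_std = T.mean(-1), T.std(-1)          # per-band statistics

x = torch.randn(1, sr * 2, requires_grad=True)     # 2 s of trainable audio
opt = torch.optim.Adam([x], lr=1e-3)
for step in range(500):
    F = torch.log(mel(x) + 1e-6)
    loss = ((F.mean(-1) - t_mean) ** 2).mean() + ((F.std(-1) - t_std) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

torchaudio.save("babble.wav", x.detach().clamp(-1, 1), sr)
```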
Ambient Sound Provides Supervision for Visual Learning
The sound of crashing waves, the roar of fast-moving cars -- sound conveys
important information about the objects in our surroundings. In this work, we
show that ambient sounds can be used as a supervisory signal for learning
visual models. To demonstrate this, we train a convolutional neural network to
predict a statistical summary of the sound associated with a video frame. We
show that, through this process, the network learns a representation that
conveys information about objects and scenes. We evaluate this representation
on several recognition tasks, finding that its performance is comparable to
that of other state-of-the-art unsupervised learning methods. Finally, we show
through visualizations that the network learns units that are selective to
objects that are often associated with characteristic sounds.

Comment: ECCV 201
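Schematically, the supervisory signal could be wired up as in the sketch below (not the paper's architecture; the frames and audio summaries are stand-in tensors, and the paper predicts clustered sound statistics rather than raw vectors):

```python
# Sketch: a small CNN maps a video frame to a fixed-length summary of the
# frame's co-occurring audio, trained with a plain regression loss.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 128),            # 128-dim audio-statistic prediction
)
opt = torch.optim.Adam(cnn.parameters(), lr=1e-4)

frames = torch.rand(8, 3, 128, 128)     # stand-in video frames
sound_stats = torch.rand(8, 128)        # stand-in audio summaries
loss = nn.functional.mse_loss(cnn(frames), sound_stats)
opt.zero_grad(); loss.backward(); opt.step()
# After training, the CNN's features serve as the learned visual representation.
```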
Audio style transfer
'Style transfer' among images has recently emerged as a very active research
topic, fuelled by the power of convolutional neural networks (CNNs), and has
fast become a very popular technology in social media. This paper investigates
the analogous problem in the audio domain: How to transfer the style of a
reference audio signal to a target audio content? We propose a flexible
framework for the task, which uses a sound texture model to extract statistics
characterizing the reference audio style, followed by an optimization-based
audio texture synthesis to modify the target content. In contrast to mainstream
optimization-based visual style transfer methods, the proposed process is
initialized with the target content instead of random noise, and the optimized
loss concerns only texture, not structure. These differences proved key for
audio style transfer in our experiments. In order to extract features of interest, we
investigate different architectures, whether pre-trained on other tasks, as
done in image style transfer, or engineered based on the human auditory system.
Experimental results on different types of audio signals confirm the potential of the proposed approach.

Comment: ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2018, Calgary, Canada. IEEE
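A hedged sketch of the described procedure (content initialization, texture-only loss) is given below, using a fixed random filter bank as the feature extractor rather than the authors' pre-trained or auditory-model features; content.wav and style.wav are hypothetical inputs:

```python
# Sketch: start from the content waveform (not noise) and minimize a
# Gram-matrix texture loss against the style clip's spectrogram features.
import torch, torchaudio
import torch.nn.functional as F

spec = torchaudio.transforms.MelSpectrogram(sample_rate=22050, n_mels=128)
filt = torch.randn(256, 1, 3, 11)                  # fixed random filter bank

def gram(wav):
    f = F.relu(F.conv2d(torch.log(spec(wav) + 1e-6).unsqueeze(1), filt))
    f = f.flatten(2)                               # (batch, channels, positions)
    return f @ f.transpose(1, 2) / f.shape[-1]     # channel co-occurrences

content, _ = torchaudio.load("content.wav")        # hypothetical inputs
style, _ = torchaudio.load("style.wav")
with torch.no_grad():
    G_style = gram(style)

x = content.clone().requires_grad_(True)           # init from content
opt = torch.optim.Adam([x], lr=1e-3)
for _ in range(300):
    loss = F.mse_loss(gram(x), G_style)            # texture only, no structure term
    opt.zero_grad(); loss.backward(); opt.step()
torchaudio.save("stylized.wav", x.detach().clamp(-1, 1), 22050)
```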