Acoustic Space Learning for Sound Source Separation and Localization on Binaural Manifolds
In this paper we address the problems of modeling the acoustic space
generated by a full-spectrum sound source and of using the learned model for
the localization and separation of multiple sources that simultaneously emit
sparse-spectrum sounds. We lay theoretical and methodological grounds in order
to introduce the binaural manifold paradigm. We perform an in-depth study of
the latent low-dimensional structure of the high-dimensional interaural
spectral data, based on a corpus recorded with a human-like audiomotor robot
head. A non-linear dimensionality reduction technique is used to show that
these data lie on a two-dimensional (2D) smooth manifold parameterized by the
motor states of the listener, or equivalently, the sound source directions. We
propose a probabilistic piecewise affine mapping model (PPAM) specifically
designed to deal with high-dimensional data exhibiting an intrinsic piecewise
linear structure. We derive a closed-form expectation-maximization (EM)
procedure for estimating the model parameters, followed by Bayes inversion for
obtaining the full posterior density function of a sound source direction. We
extend this solution to deal with missing data and redundancy in real-world
spectrograms, and hence for 2D localization of natural sound sources such as
speech. We further generalize the model to the challenging case of multiple
sound sources and we propose a variational EM framework. The associated
algorithm, referred to as variational EM for source separation and localization
(VESSL), yields Bayesian estimates of the 2D locations and time-frequency
masks of all the sources. Comparisons of the proposed approach with several
existing methods reveal that the combination of acoustic-space learning with
Bayesian inference enables our method to outperform state-of-the-art methods.Comment: 19 pages, 9 figures, 3 table
Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression
This paper addresses the problem of localizing audio sources using binaural
measurements. We propose a supervised formulation that simultaneously localizes
multiple sources at different locations. The approach is intrinsically
efficient because, contrary to prior work, it relies neither on source
separation nor on monaural segregation. The method starts with a training
stage that establishes a locally-linear Gaussian regression model between the
directional coordinates of all the sources and the auditory features extracted
from binaural measurements. While fixed-length wide-spectrum sounds (white
noise) are used for training to reliably estimate the model parameters, we show
that the testing (localization) can be extended to variable-length
sparse-spectrum sounds (such as speech), thus enabling a wide range of
realistic applications. Indeed, we demonstrate that the method can be used for
audio-visual fusion, namely to map speech signals onto images and hence to
spatially align the audio and visual modalities, thus making it possible to discriminate
between speaking and non-speaking faces. We release a novel corpus of real-room
recordings that allow quantitative evaluation of the co-localization method in
the presence of one or two sound sources. Experiments demonstrate increased
accuracy and speed relative to several state-of-the-art methods.
Comment: 15 pages, 8 figures
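In this formulation, the trained model amounts to a set of local affine experts with Gaussian gates on the binaural feature space, and localization is a responsibility-weighted blend of their predictions. A minimal sketch of that prediction step, under assumed parameter shapes and placeholder names (pi, mu, Sigma, A, b are not the paper's notation):

```python
# Sketch of the localization step with a locally-linear Gaussian regression:
# a responsibility-weighted blend of K local affine predictions.
import numpy as np

def predict_direction(f, pi, mu, Sigma, A, b):
    """f: binaural feature vector of length D.
    Expert k has weight pi[k], Gaussian gate N(mu[k], Sigma[k]) over the
    feature space, and affine map A[k] (2 x D), b[k] (2,) to a direction."""
    K = len(pi)
    logw = np.empty(K)
    for k in range(K):
        d = f - mu[k]
        _, logdet = np.linalg.slogdet(Sigma[k])
        logw[k] = np.log(pi[k]) - 0.5 * (logdet + d @ np.linalg.solve(Sigma[k], d))
    w = np.exp(logw - logw.max())
    w /= w.sum()                       # responsibilities of the K experts
    return sum(w[k] * (A[k] @ f + b[k]) for k in range(K))

# Example call with random placeholder parameters (K=3 experts, D=10 features).
K, D = 3, 10
rng = np.random.default_rng(1)
theta = (np.full(K, 1.0 / K), rng.standard_normal((K, D)),
         np.stack([np.eye(D)] * K), rng.standard_normal((K, 2, D)),
         rng.standard_normal((K, 2)))
print(predict_direction(rng.standard_normal(D), *theta))
```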
Objects that Sound
In this paper our objectives are, first, networks that can embed audio and
visual inputs into a common space that is suitable for cross-modal retrieval;
and second, a network that can localize the object that sounds in an image,
given the audio signal. We achieve both these objectives by training from
unlabelled video using only audio-visual correspondence (AVC) as the objective
function. This is a form of cross-modal self-supervision from video.
To this end, we design new network architectures that can be trained for
cross-modal retrieval and localizing the sound source in an image, by using the
AVC task. We make the following contributions: (i) show that audio and visual
embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and
between-mode retrieval; (ii) explore various architectures for the AVC task,
including those for the visual stream that ingest a single image, or multiple
images, or a single image and multi-frame optical flow; (iii) show that the
semantic object that sounds within an image can be localized (using only the
sound, no motion or flow information); and (iv) give a cautionary tale on how
to avoid undesirable shortcuts in the data preparation.
Comment: Appears in: European Conference on Computer Vision (ECCV) 2018
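The wiring behind contribution (i) is that the correspondence decision is computed from the distance between the two embeddings, so the embeddings remain useful for retrieval on their own. A schematic PyTorch sketch of the AVC objective, with the ConvNet towers stubbed out by single linear layers and all shapes invented for the illustration (not the paper's architecture):

```python
# Schematic AVC training step: two encoders embed an image and an audio
# clip; a binary classifier decides, from the embedding distance alone,
# whether they come from the same video. Encoder bodies are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVCNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # Stand-ins for the vision / audio ConvNet towers.
        self.vision = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.audio = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        # Correspondence is scored from the distance scalar, so the
        # embeddings themselves stay usable for cross-modal retrieval.
        self.classify = nn.Linear(1, 2)

    def forward(self, image, spectrogram):
        v = F.normalize(self.vision(image), dim=1)
        a = F.normalize(self.audio(spectrogram), dim=1)
        dist = (v - a).pow(2).sum(dim=1, keepdim=True).sqrt()
        return self.classify(dist)

# Positives pair a frame with audio from the same video at the same time;
# negatives pair a frame with audio from a different video.
net = AVCNet()
image = torch.randn(8, 3, 224, 224)
spec = torch.randn(8, 1, 257, 200)
labels = torch.randint(0, 2, (8,))  # 1 = corresponding pair
loss = F.cross_entropy(net(image, spec), labels)
loss.backward()
```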
Localization and Rendering of Sound Sources in Acoustic Fields
This doctoral thesis deals with sound source localization and acoustic zooming. The primary goal of the dissertation is to design an acoustic zooming system that can zoom in on the sound of one speaker among a group of speakers, even when they speak simultaneously. The system is compatible with surround-sound techniques. The main contributions of the thesis are as follows: 1. Design of a method for estimating multiple directions of arriving sound. 2. Design of a method for acoustic zooming using DirAC. 3. Design of a combined system based on the two previous steps, which can be used in teleconferencing.
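Contribution 2 rests on DirAC's per-bin direction analysis: in a first-order B-format STFT, the active intensity vector points along the acoustic energy flow, and the arrival direction is its opposite. A generic sketch of that analysis plus a toy per-bin zoom gain; the cosine-power window and all names are assumptions, not the thesis design:

```python
# DirAC-style direction estimation from first-order B-format STFTs, plus a
# toy per-bin zoom gain. Generic illustration, not the thesis implementation.
import numpy as np

def dirac_doa(W, X, Y, Z):
    """W, X, Y, Z: complex STFT coefficients, each of shape (frames, bins).
    Returns per-bin azimuth and elevation in radians."""
    # Active intensity: Re{ conj(pressure) * particle velocity }, where the
    # dipole channels X, Y, Z are proportional to the velocity components.
    Ix = np.real(np.conj(W) * X)
    Iy = np.real(np.conj(W) * Y)
    Iz = np.real(np.conj(W) * Z)
    # Sound arrives from the direction opposite to the energy flow.
    azimuth = np.arctan2(-Iy, -Ix)
    elevation = np.arctan2(-Iz, np.hypot(Ix, Iy))
    return azimuth, elevation

def zoom_gain(azimuth, focus_az, sharpness=4.0):
    """Boost bins whose estimated azimuth lies near the focus direction,
    attenuate the rest (hypothetical cosine-power window)."""
    d = np.angle(np.exp(1j * (azimuth - focus_az)))  # wrapped to (-pi, pi]
    return np.clip(np.cos(d), 0.0, 1.0) ** sharpness
```

Multiplying each time-frequency bin of the omnidirectional channel by this gain before resynthesis emphasises the focused speaker, which is the essence of the acoustic zoom.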