Acoustic Space Learning for Sound Source Separation and Localization on Binaural Manifolds
In this paper we address the problems of modeling the acoustic space
generated by a full-spectrum sound source and of using the learned model for
the localization and separation of multiple sources that simultaneously emit
sparse-spectrum sounds. We lay theoretical and methodological grounds in order
to introduce the binaural manifold paradigm. We perform an in-depth study of
the latent low-dimensional structure of the high-dimensional interaural
spectral data, based on a corpus recorded with a human-like audiomotor robot
head. A non-linear dimensionality reduction technique is used to show that
these data lie on a two-dimensional (2D) smooth manifold parameterized by the
motor states of the listener, or equivalently, the sound source directions. We
propose a probabilistic piecewise affine mapping model (PPAM) specifically
designed to deal with high-dimensional data exhibiting an intrinsic piecewise
linear structure. We derive a closed-form expectation-maximization (EM)
procedure for estimating the model parameters, followed by Bayes inversion for
obtaining the full posterior density function of a sound source direction. We
extend this solution to deal with missing data and redundancy in real world
spectrograms, and hence for 2D localization of natural sound sources such as
speech. We further generalize the model to the challenging case of multiple
sound sources and we propose a variational EM framework. The associated
algorithm, referred to as variational EM for source separation and localization
(VESSL), yields a Bayesian estimation of the 2D locations and time-frequency
masks of all the sources. Comparisons of the proposed approach with several
existing methods reveal that the combination of acoustic-space learning with
Bayesian inference enables our method to outperform state-of-the-art methods.Comment: 19 pages, 9 figures, 3 table
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
Audio-visual speaker tracking has drawn increasing attention over the past
few years owing to its academic value and wide range of applications. Audio and
visual modalities can provide complementary information for localization and
tracking. Given audio and visual observations, Bayesian filters can address the
problems of data association, audio-visual fusion, and track management. In this
paper, we present a comprehensive overview of audio-visual speaker tracking. To
our knowledge, this is the first extensive survey of the field in the past five
years. We
introduce the family of Bayesian filters and summarize the methods for
obtaining audio-visual measurements. In addition, the existing trackers and
their performance on the AV16.3 dataset are summarized. In recent years, deep
learning techniques have thrived, which has also boosted the development of
audio-visual speaker tracking. The influence of deep learning techniques on
measurement extraction and state estimation is also discussed. Finally, we
discuss the connections between audio-visual speaker tracking and other areas
such as speech separation and distributed speaker tracking.
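As a generic illustration of the Bayesian filtering backbone mentioned above (not drawn from any particular surveyed tracker; the constant-velocity model, measurement models, and noise covariances are assumptions), the Python sketch below fuses a visual position measurement and a noisier audio-derived position measurement of a single speaker with a Kalman filter.

```python
import numpy as np

# Hypothetical sketch: a constant-velocity Kalman filter that fuses a visual
# position measurement and a (noisier) audio-derived position measurement of a
# speaker. Real trackers differ in measurement models and data association.

dt = 0.04                                    # frame period (assumed 25 fps)
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], float)          # state: [x, y, vx, vy]
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)          # both modalities observe position
Q = 1e-3 * np.eye(4)                         # process noise (assumed)
R_visual = 1e-2 * np.eye(2)                  # visual measurement noise (assumed)
R_audio = 1e-1 * np.eye(2)                   # audio measurement noise (assumed)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4)
for z_vis, z_aud in [(np.array([0.10, 0.00]), np.array([0.15, -0.02])),
                     (np.array([0.12, 0.01]), np.array([0.11, 0.04]))]:
    x, P = predict(x, P)
    x, P = update(x, P, z_vis, R_visual)     # fuse the visual measurement
    x, P = update(x, P, z_aud, R_audio)      # then fuse the audio measurement
    print("estimated position:", x[:2].round(3))
```

Sequentially applying the update step once per modality is one simple fusion scheme; the surveyed trackers differ mainly in how measurements are extracted, associated with tracks, and weighted.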