Who Spoke What? A Latent Variable Framework for the Joint Decoding of Multiple Speakers and their Keywords
In this paper, we present a latent variable (LV) framework to identify all
the speakers and their keywords given a multi-speaker mixture signal. We
introduce two separate LVs to denote active speakers and the keywords uttered.
The dependency of a spoken keyword on the speaker is modeled through a
conditional probability mass function. The distribution of the mixture signal
is expressed in terms of the LV mass functions and speaker-specific-keyword
models. The proposed framework admits stochastic models, representing the
probability density function of the observation vectors given that a particular
speaker uttered a specific keyword, as speaker-specific-keyword models. The LV
mass functions are estimated in a Maximum Likelihood framework using the
Expectation Maximization (EM) algorithm. The active speakers and their keywords
are detected as modes of the joint distribution of the two LVs. In mixture
signals containing two speakers uttering keywords simultaneously, the
proposed framework achieves an accuracy of 82% for detecting both the speakers
and their respective keywords, using Student's-t mixture models as
speaker-specific-keyword models.
Comment: 6 pages, 2 figures. Submitted to: IEEE Signal Processing Letter
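The estimation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the joint mass function P(s)P(k|s) is stored as one table `w[s][k]`, that the speaker-specific-keyword models are fixed in advance, and that each model is a 1-D Gaussian (a stand-in for the Student's-t mixture models named above). The names `gauss_pdf`, `em_joint_pmf`, and `top_modes` are hypothetical.

```python
import math


def gauss_pdf(x, mean, std=1.0):
    """1-D Gaussian density; an assumed stand-in for a speaker-specific-keyword model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_joint_pmf(frames, means, n_iter=50):
    """EM for the joint pmf w[s][k] = P(s) P(k|s), with component densities held fixed.

    frames: list of scalar observations from the mixture signal (assumed 1-D here).
    means:  means[s][k] is the mean of the model for speaker s uttering keyword k.
    """
    n_s, n_k = len(means), len(means[0])
    w = [[1.0 / (n_s * n_k)] * n_k for _ in range(n_s)]  # uniform initialisation
    # The component likelihoods do not change across iterations, so compute them once.
    lik = [[[gauss_pdf(x, means[s][k]) for k in range(n_k)] for s in range(n_s)]
           for x in frames]
    for _ in range(n_iter):
        acc = [[0.0] * n_k for _ in range(n_s)]
        for frame_lik in lik:
            # E-step: posterior of each (speaker, keyword) pair for this frame.
            post = [[w[s][k] * frame_lik[s][k] for k in range(n_k)] for s in range(n_s)]
            total = sum(sum(row) for row in post)
            for s in range(n_s):
                for k in range(n_k):
                    acc[s][k] += post[s][k] / total
        # M-step: the new mass function is the average posterior over frames.
        w = [[a / len(frames) for a in row] for row in acc]
    return w


def top_modes(w, n):
    """Return the n (speaker, keyword) pairs with the largest joint mass."""
    pairs = [(s, k) for s in range(len(w)) for k in range(len(w[0]))]
    pairs.sort(key=lambda p: w[p[0]][p[1]], reverse=True)
    return pairs[:n]
```

With two active speakers, reading off the two largest entries of `w` via `top_modes(w, 2)` corresponds to detecting the speakers and keywords as modes of the joint distribution. The marginal P(s) can be recovered by summing each row of `w`.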
Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments
We address the problem of online localization and tracking of multiple moving
speakers in reverberant environments. The paper makes the following
contributions. We use the direct-path relative transfer function (DP-RTF), an
inter-channel feature that encodes acoustic information robust against
reverberation, and we propose an online algorithm well suited for estimating
DP-RTFs associated with moving audio sources. Another crucial ingredient of the
proposed method is its ability to properly assign DP-RTFs to audio-source
directions. Towards this goal, we adopt a maximum-likelihood formulation and we
propose to use an exponentiated gradient (EG) to efficiently update
source-direction estimates starting from their currently available values. The
problem of multiple speaker tracking is computationally intractable because the
number of possible associations between observed source directions and physical
speakers grows exponentially with time. We adopt a Bayesian framework and we
propose a variational approximation of the posterior filtering distribution
associated with multiple speaker tracking, as well as an efficient variational
expectation-maximization (VEM) solver. The proposed online localization and
tracking method is thoroughly evaluated using two datasets that contain
recordings performed in real environments.
Comment: IEEE Journal of Selected Topics in Signal Processing, 201
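The exponentiated gradient update mentioned in this abstract can be illustrated with a generic sketch. It is not the paper's algorithm. It assumes the source-direction estimate is a weight vector over a fixed set of candidate directions, constrained to the probability simplex, and that each incoming frame supplies one likelihood value per candidate. The function name `eg_step`, the step size `eta`, and the toy likelihoods are hypothetical.

```python
import math


def eg_step(weights, likelihoods, eta=0.1):
    """One exponentiated-gradient step on the negative log-likelihood of a mixture.

    weights:     current weights over candidate directions (non-negative, sum to 1).
    likelihoods: likelihood of the current observation under each candidate.
    eta:         step size (an assumed value, not taken from the paper).
    """
    mix = sum(w * l for w, l in zip(weights, likelihoods))
    # Gradient of -log(sum_i w_i * l_i) with respect to each weight.
    grads = [-l / mix for l in likelihoods]
    # Multiplicative update followed by renormalisation keeps the weights on the simplex.
    unnorm = [w * math.exp(-eta * g) for w, g in zip(weights, grads)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Because each step starts from the current weights and only rescales them, the update fits an online setting where estimates are refined frame by frame rather than recomputed from scratch.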