Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression
This paper addresses the problem of localizing audio sources using binaural
measurements. We propose a supervised formulation that simultaneously localizes
multiple sources at different locations. The approach is intrinsically
efficient because, contrary to prior work, it relies neither on source
separation nor on monaural segregation. The method starts with a training
stage that establishes a locally-linear Gaussian regression model between the
directional coordinates of all the sources and the auditory features extracted
from binaural measurements. While fixed-length wide-spectrum sounds (white
noise) are used for training to reliably estimate the model parameters, we show
that the testing (localization) can be extended to variable-length
sparse-spectrum sounds (such as speech), thus enabling a wide range of
realistic applications. Indeed, we demonstrate that the method can be used for
audio-visual fusion, namely to map speech signals onto images and hence to
spatially align the audio and visual modalities, thus making it possible to
discriminate between speaking and non-speaking faces. We release a novel
corpus of real-room recordings that allows quantitative evaluation of the
co-localization method in the presence of one or two sound sources.
Experiments demonstrate increased accuracy and speed relative to several
state-of-the-art methods.
Comment: 15 pages, 8 figures
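As a rough illustration of the locally-linear regression idea (not the paper's exact Gaussian mixture formulation), the following sketch partitions the binaural feature space with k-means and fits one affine map per region; the function names, shapes, and hard partitioning are illustrative assumptions.

```python
# Minimal sketch of locally-linear regression from binaural features to source
# directions. The paper's model is a probabilistic (Gaussian) locally-linear
# mapping; here a hard k-means partition with one affine regressor per region
# conveys the core idea only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def fit_locally_linear(features, directions, n_regions=8):
    """features: (N, D) binaural features; directions: (N, 2) angles."""
    km = KMeans(n_clusters=n_regions, n_init=10).fit(features)
    models = [LinearRegression().fit(features[km.labels_ == r],
                                     directions[km.labels_ == r])
              for r in range(n_regions)]
    return km, models

def predict_directions(km, models, features):
    """Predict a direction for each feature vector via its region's map."""
    labels = km.predict(features)
    return np.array([models[r].predict(x[None, :])[0]
                     for x, r in zip(features, labels)])
```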
Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function
This paper addresses the problem of speech separation and enhancement from
multichannel convolutive and noisy mixtures, assuming known mixing filters. We
propose to perform the speech separation and enhancement task in the
short-time Fourier transform domain, using the convolutive transfer function
(CTF) approximation. Compared to time-domain filters, the CTF has far fewer
taps; consequently, it has fewer near-common zeros among channels and lower
computational complexity. The work proposes three speech-source recovery
methods, namely: i) the multichannel inverse filtering method, i.e. the
multiple input/output inverse theorem (MINT), exploited in the CTF domain and
extended to the multi-source case; ii) a beamforming-like multichannel inverse
filtering method applying single-source MINT and using power minimization,
which is suitable whenever the source CTFs are not all known; and iii) a
constrained Lasso method, where the sources are recovered by minimizing the
ℓ1-norm of their coefficients to impose spectral sparsity, with the constraint
that the ℓ2-norm fitting cost, between the microphone signals and the mixing
model involving the unknown source signals, is less than a tolerance. The
noise can be reduced by setting this tolerance according to the noise power.
Experiments under
various acoustic conditions are carried out to evaluate the three proposed
methods. The comparison between them as well as with the baseline methods is
presented.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
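As an illustration of method iii), the constrained Lasso can be approximated by its penalized (Lagrangian) form and solved with ISTA in a single frequency bin; the paper's exact constrained solver may differ, and H, x, and lam below are illustrative assumptions.

```python
# Sketch of Lasso-based source recovery in one frequency bin: H is the CTF
# mixing matrix (mics x source-frames), x stacks microphone STFT coefficients,
# and lam trades spectral sparsity against the l2 fitting cost. This penalized
# form stands in for the paper's tolerance-constrained formulation.
import numpy as np

def ista_lasso(H, x, lam=0.1, n_iter=200):
    L = np.linalg.norm(H, 2) ** 2              # Lipschitz constant of the l2 term
    s = np.zeros(H.shape[1], dtype=complex)
    for _ in range(n_iter):
        g = s - H.conj().T @ (H @ s - x) / L   # gradient step on the fit
        mag = np.abs(g)                        # complex soft-thresholding
        s = g * np.maximum(mag - lam / L, 0.0) / np.maximum(mag, 1e-12)
    return s
```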
Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments
We address the problem of online localization and tracking of multiple moving
speakers in reverberant environments. The paper makes the following
contributions. We use the direct-path relative transfer function (DP-RTF), an
inter-channel feature that encodes acoustic information robust against
reverberation, and we propose an online algorithm well suited for estimating
DP-RTFs associated with moving audio sources. Another crucial ingredient of the
proposed method is its ability to properly assign DP-RTFs to audio-source
directions. Towards this goal, we adopt a maximum-likelihood formulation and we
propose an exponentiated gradient (EG) method to efficiently update
source-direction estimates starting from their currently available values. The
problem of multiple speaker tracking is computationally intractable because the
number of possible associations between observed source directions and physical
speakers grows exponentially with time. We adopt a Bayesian framework and we
propose a variational approximation of the posterior filtering distribution
associated with multiple speaker tracking, as well as an efficient variational
expectation-maximization (VEM) solver. The proposed online localization and
tracking method is thoroughly evaluated using two datasets that contain
recordings performed in real environments.
Comment: IEEE Journal of Selected Topics in Signal Processing, 2019
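The exponentiated-gradient update used for the source-direction estimates can be sketched as a multiplicative update on a probability vector over candidate directions; the gradient of the negative log-likelihood is model-specific and assumed given here.

```python
# Sketch of an exponentiated-gradient (EG) step: w is a probability vector
# over a grid of candidate directions, grad the gradient of the negative
# log-likelihood w.r.t. w (assumed supplied by the model). The multiplicative
# form keeps w non-negative and the renormalization keeps it on the simplex,
# which suits online updates starting from the current estimate.
import numpy as np

def eg_update(w, grad, eta=0.5):
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()
```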
Informed Source Separation from Compressed Mixtures Using Spatial Wiener Filter and Quantization Noise Estimation
In a previous work, we proposed an Informed Source Separation system based on Wiener filtering for active listening of music from uncompressed (16-bit PCM) multichannel mix signals. In the present work, the system is improved to work with (MPEG-2 AAC) compressed mix signals: the quantization noise is estimated from the AAC bitstream at the decoder and explicitly taken into account in the source separation process. A direct MDCT-to-STFT transform is also used to optimize the computational efficiency of the process in the STFT domain from AAC-decoded MDCT coefficients.
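The role of the estimated quantization noise can be sketched with a per-bin Wiener gain in which the noise PSD enters the denominator; this single-channel caricature of the paper's spatial (multichannel) filter uses purely illustrative variable names.

```python
# Sketch of a per-bin Wiener gain with an explicit quantization-noise term:
# the target-source PSD is divided by the total PSD of the compressed mix,
# which now includes the quantization-noise PSD estimated from the bitstream.
import numpy as np

def wiener_gain(source_psd, other_sources_psd, quant_noise_psd):
    total = source_psd + other_sources_psd + quant_noise_psd
    return source_psd / np.maximum(total, 1e-12)

# Per time-frequency bin: source_estimate = wiener_gain(...) * mix_coefficient
```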
What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS
In incremental text to speech synthesis (iTTS), the synthesizer produces an
audio output before it has access to the entire input sentence. In this paper,
we study the behavior of a neural sequence-to-sequence TTS system when used in
an incremental mode, i.e. when generating speech output for token n, the system
has access to n + k tokens from the text sequence. We first analyze the impact
of this incremental policy on the evolution of the encoder representations of
token n for different values of k (the lookahead parameter). The results show
that, on average, tokens travel 88% of the way to their full context
representation with a one-word lookahead and 94% after 2 words. We then
investigate which text features are the most influential on the evolution
towards the final representation using a random forest analysis. The results
show that the most salient factors are related to token length. We finally
evaluate the effects of lookahead k at the decoder level, using a MUSHRA
listening test. This test shows results that contrast with the above high
figures: the speech synthesis quality obtained with a two-word lookahead is
significantly lower than that obtained with the full sentence.
Comment: 5 pages, 4 figures
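The "distance traveled" figures can be illustrated as a ratio of representation distances between the incremental and full-context encodings of a token; the exact metric used in the paper may differ, and encode below is a stand-in for the TTS encoder.

```python
# Sketch of the lookahead analysis: encode(tokens) -> (len, dim) array is a
# placeholder for the sequence-to-sequence TTS encoder. The ratio measures how
# far token n's representation with k tokens of lookahead has moved toward
# its full-sentence representation, relative to the no-lookahead case.
import numpy as np

def progress_ratio(encode, tokens, n, k):
    base = encode(tokens[: n + 1])[n]          # context up to token n (k = 0)
    partial = encode(tokens[: n + 1 + k])[n]   # k tokens of lookahead
    full = encode(tokens)[n]                   # full-sentence context
    return 1.0 - np.linalg.norm(full - partial) / np.linalg.norm(full - base)
```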
Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization
In this paper we address speaker-independent multichannel speech enhancement
in unknown noisy environments. Our work is based on a well-established
multichannel local Gaussian modeling framework. We propose to use a neural
network for modeling the speech spectro-temporal content. The parameters of
this supervised model are learned using the framework of variational
autoencoders. The noisy recording environment is assumed to be unknown, so the
noise spectro-temporal modeling remains unsupervised and is based on
non-negative matrix factorization (NMF). We develop a Monte Carlo
expectation-maximization algorithm and we experimentally show that the proposed
approach outperforms its NMF-based counterpart, where speech is modeled using
supervised NMF.
Comment: 5 pages, 2 figures; audio examples and code available online at https://team.inria.fr/perception/icassp-2019-mvae
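The unsupervised noise model can be illustrated with plain multiplicative-update NMF on a noise power spectrogram; the paper embeds NMF inside a Monte Carlo EM with a local Gaussian model, so this standalone sketch (Euclidean cost, random initialization) is only indicative.

```python
# Sketch of low-rank NMF for the noise power spectrogram V (freq x frames):
# V is approximated as W @ H with standard multiplicative updates under the
# Euclidean cost. The rank, iteration count, and cost are assumptions here.
import numpy as np

def nmf_noise(V, rank=8, n_iter=100, eps=1e-12):
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```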
Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation
In this paper, we propose a latent-variable generative model called mixture
of dynamical variational autoencoders (MixDVAE) to model the dynamics of a
system composed of multiple moving sources. A DVAE model is pre-trained on a
single-source dataset to capture the source dynamics. Then, multiple instances
of the pre-trained DVAE model are integrated into a multi-source mixture model
with a discrete observation-to-source assignment latent variable. The posterior
distributions of both the discrete observation-to-source assignment variable
and the continuous DVAE variables representing the sources' content/position
are estimated using a variational expectation-maximization algorithm, leading
to multi-source trajectory estimation. We illustrate the versatility of the
proposed MixDVAE model on two tasks: a computer vision task, namely
multi-object tracking, and an audio processing task, namely single-channel
audio source separation. Experimental results show that the proposed method
works well on these two tasks and outperforms several baseline methods.
Comment: arXiv admin note: substantial text overlap with arXiv:2202.0931
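The discrete part of the variational E-step can be sketched as computing per-observation responsibilities over sources; the Gaussian observation model and the shapes below are illustrative assumptions, not the exact MixDVAE posteriors.

```python
# Sketch of observation-to-source assignment posteriors: each of N
# observations gets a responsibility over K sources, from per-source diagonal
# Gaussian likelihoods (means/variances predicted by each DVAE instance).
import numpy as np

def assignment_posteriors(obs, means, variances, priors):
    """obs: (N, D); means, variances: (K, D); priors: (K,) -> (N, K)."""
    log_p = (np.log(priors)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * variances)[None, :, :]
                            + (obs[:, None, :] - means[None, :, :]) ** 2
                            / variances[None, :, :], axis=-1))
    log_p -= log_p.max(axis=1, keepdims=True)    # stabilize the softmax
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)
```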
Non-Stationary Noise Power Spectral Density Estimation Based on Regional Statistics
Estimating the noise power spectral density (PSD) is essential for single-channel speech enhancement algorithms. In this paper, we propose a noise PSD estimation approach based on regional statistics. The proposed regional statistics consist of four features representing the statistics of the past and present periodograms in a short-time period. We show that these features are efficient in characterizing the statistical difference between noise PSD and noisy-speech PSD. We therefore propose to use these features for estimating the speech presence probability (SPP). The noise PSD is recursively estimated by averaging past spectral power values with a time-varying smoothing parameter controlled by the SPP. The proposed method exhibits good tracking capability for non-stationary noise, even for abruptly increasing noise levels.
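The SPP-controlled recursive averaging can be sketched per time-frequency bin as below; alpha_min and the SPP input are illustrative assumptions, not the paper's exact parameterization.

```python
# Sketch of the recursive noise PSD update: a high speech presence probability
# (SPP) pushes the smoothing parameter toward 1, freezing the estimate, while
# a low SPP lets the estimate track the current periodogram.
import numpy as np

def update_noise_psd(noise_psd, periodogram, spp, alpha_min=0.8):
    alpha = alpha_min + (1.0 - alpha_min) * spp  # per-bin smoothing in [alpha_min, 1]
    return alpha * noise_psd + (1.0 - alpha) * periodogram
```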