An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony
In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative to existing acoustic echo reduction approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the user’s signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short-Time Fourier Transform (STFT) domain. One of these bases is for the spectral energy of the acoustic echo signal, and is formed from the incoming far-end user’s speech, while the other basis is for the spectral energy of the near-end speaker, and is trained with speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes for similar computational requirements. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address DTD for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal that is produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on double-talk.
Using a standard evaluation technique, the proposed algorithm is shown to have comparable detection performance to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room change insensitivity of speaker extraction, with the proposed DTD algorithm generating minimal false double-talk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non-minimum phase Room Impulse Response (RIR). We describe the process by which perceptually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other approach modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique.
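The decomposition onto a union of two fixed bases described in this abstract can be illustrated with a small sketch. It is not the thesis's implementation: the Euclidean cost, the multiplicative update for the activations, the function names (`fit_activations`, `separate`) and the toy matrices are all assumptions made for illustration, and the abstract does not say which cost or update the author uses.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def fit_activations(V, W, n_iter=200, eps=1e-12):
    """Estimate H >= 0 with W held fixed so that W H approximates V.

    Assumed cost: squared Euclidean distance, minimised with the
    multiplicative rule H <- H * (W^T V) / (W^T W H).
    """
    Wt = transpose(W)
    WtV = matmul(Wt, V)
    WtW = matmul(Wt, W)
    n_atoms, n_frames = len(Wt), len(V[0])
    H = [[1.0] * n_frames for _ in range(n_atoms)]
    for _ in range(n_iter):
        WtWH = matmul(WtW, H)
        H = [[H[k][t] * WtV[k][t] / (WtWH[k][t] + eps)
              for t in range(n_frames)] for k in range(n_atoms)]
    return H


def separate(V, W_echo, W_speech, n_iter=200):
    """Split a magnitude spectrogram V into echo and near-end speech parts."""
    # Union of the two bases: echo atoms first, then speech atoms.
    W = [row_e + row_s for row_e, row_s in zip(W_echo, W_speech)]
    H = fit_activations(V, W, n_iter)
    k_echo = len(W_echo[0])
    echo = matmul(W_echo, H[:k_echo])
    speech = matmul(W_speech, H[k_echo:])
    return echo, speech


# Toy data (hypothetical): 3 frequency bins, 2 frames, one atom per source.
W_echo = [[1.0], [1.0], [0.0]]
W_speech = [[0.0], [1.0], [1.0]]
V = [[2.0, 1.0], [5.0, 3.0], [3.0, 2.0]]  # echo and speech overlap in bin 1
echo_est, speech_est = separate(V, W_echo, W_speech)
print(speech_est)
```

In practice the magnitude estimates would typically be turned into a time-domain signal by reusing the mixture phase or applying a mask; that step is omitted here.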
Rectified Gaussian Scale Mixtures and the Sparse Non-Negative Least Squares Problem
In this paper, we develop a Bayesian evidence maximization framework to solve
the sparse non-negative least squares (S-NNLS) problem. We introduce a family
of probability densities referred to as the Rectified Gaussian Scale Mixture
(R-GSM) to model the sparsity enforcing prior distribution for the solution.
The R-GSM prior encompasses a variety of heavy-tailed densities such as the
rectified Laplacian and rectified Student-t distributions with a proper choice
of the mixing density. We utilize the hierarchical representation induced by
the R-GSM prior and develop an evidence maximization framework based on the
Expectation-Maximization (EM) algorithm. Using the EM-based method, we estimate
the hyper-parameters and obtain a point estimate for the solution. We refer to
the proposed method as rectified sparse Bayesian learning (R-SBL). We provide
four R-SBL variants that offer a range of options for computational complexity
and the quality of the E-step computation. These methods include the Markov
chain Monte Carlo EM, linear minimum mean-square-error estimation, approximate
message passing and a diagonal approximation. Using numerical experiments, we
show that the proposed R-SBL method outperforms existing S-NNLS solvers in
terms of both signal and support recovery performance, and is also very robust
against the structure of the design matrix. Comment: Under review by IEEE Transactions on Signal Processin
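For orientation, the objects named in this abstract can be written out as follows. The notation is assumed here rather than taken from the paper: $y$ for the measurements, $\Phi$ for the design matrix, $v$ for noise, $\gamma_i$ for the per-coefficient scale, and $u(\cdot)$ for the unit step. The rectified Gaussian shown is the standard zero-mean half-normal density; the paper's exact parameterization may differ.

```latex
\begin{align}
  % Sparse non-negative least squares: recover a sparse x >= 0 from y
  y &= \Phi x + v, \qquad x \ge 0 \ \text{and sparse}, \\
  % Rectified (half-)Gaussian density with scale gamma_i
  \mathcal{N}^{R}(x_i; 0, \gamma_i)
    &= \sqrt{\frac{2}{\pi \gamma_i}}
       \exp\!\left(-\frac{x_i^{2}}{2\gamma_i}\right) u(x_i), \\
  % Scale mixture: the mixing density p(gamma_i) selects the heavy-tailed prior
  p(x_i) &= \int_{0}^{\infty} \mathcal{N}^{R}(x_i; 0, \gamma_i)\,
            p(\gamma_i)\, d\gamma_i .
\end{align}
```

Under this reading, different choices of the mixing density $p(\gamma_i)$ would yield the rectified Laplacian or rectified Student-t priors that the abstract mentions.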
A Unified Framework for Sparse Non-Negative Least Squares using Multiplicative Updates and the Non-Negative Matrix Factorization Problem
We study the sparse non-negative least squares (S-NNLS) problem. S-NNLS
occurs naturally in a wide variety of applications where an unknown,
non-negative quantity must be recovered from linear measurements. We present a
unified framework for S-NNLS based on a rectified power exponential scale
mixture prior on the sparse codes. We show that the proposed framework
encompasses a large class of S-NNLS algorithms and provide a computationally
efficient inference procedure based on multiplicative update rules. Such update
rules are convenient for solving large sets of S-NNLS problems simultaneously,
which is required in contexts like sparse non-negative matrix factorization
(S-NMF). We provide theoretical justification for the proposed approach by
showing that the local minima of the objective function being optimized are
sparse and the S-NNLS algorithms presented are guaranteed to converge to a set
of stationary points of the objective function. We then extend our framework to
S-NMF, showing that our framework leads to many well-known S-NMF algorithms
under specific choices of prior and providing a guarantee that a popular
subclass of the proposed algorithms converges to a set of stationary points of
the objective function. Finally, we study the performance of the proposed
approaches on synthetic and real-world data. Comment: To appear in Signal Processin
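As a concrete instance of a multiplicative update for S-NNLS, the sketch below uses the familiar rule for the l1-penalised objective 0.5*||y - A x||^2 + lam*sum(x) with x >= 0. This is an illustrative special case chosen here, not necessarily one of the rules derived in the paper; the function names and the toy problem are hypothetical, and the rule assumes that `A` and `y` have non-negative entries.

```python
def objective(A, y, x, lam):
    """0.5 * ||y - A x||^2 + lam * sum(x), for x >= 0."""
    residual = [yi - sum(a * xj for a, xj in zip(row, x)) for row, yi in zip(A, y)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(x)


def sparse_nnls_mu(A, y, lam=0.1, n_iter=500, eps=1e-12):
    """Multiplicative updates x <- x * (A^T y) / (A^T A x + lam).

    Starting from a positive x keeps every iterate non-negative, because
    each step only rescales x by a non-negative factor.
    """
    At = [list(col) for col in zip(*A)]
    Aty = [sum(a * b for a, b in zip(col, y)) for col in At]
    AtA = [[sum(a * b for a, b in zip(ci, cj)) for cj in At] for ci in At]
    x = [1.0] * len(At)
    for _ in range(n_iter):
        AtAx = [sum(g * xj for g, xj in zip(row, x)) for row in AtA]
        x = [xi * b / (d + lam + eps) for xi, b, d in zip(x, Aty, AtAx)]
    return x


# Toy problem (hypothetical): with an identity design matrix the minimiser
# is the soft-thresholded measurement, so small entries are driven to zero.
A = [[1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.05]
x_hat = sparse_nnls_mu(A, y, lam=0.1)
print(x_hat, objective(A, y, x_hat, 0.1))
```

Because the update is elementwise, the same rule can be applied to every column of a coefficient matrix at once, which is the property the abstract points to when it connects S-NNLS to S-NMF.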
Few-Shot Audio-Visual Learning of Environment Acoustics
Room impulse response (RIR) functions capture how the surrounding physical
environment transforms the sounds heard by a listener, with implications for
various applications in AR, VR, and robotics. Whereas traditional methods to
estimate RIRs assume dense geometry and/or sound measurements throughout the
environment, we explore how to infer RIRs based on a sparse set of images and
echoes observed in the space. Towards that goal, we introduce a
transformer-based method that uses self-attention to build a rich acoustic
context, then predicts RIRs of arbitrary query source-receiver locations
through cross-attention. Additionally, we design a novel training objective
that improves the match in the acoustic signature between the RIR predictions
and the targets. In experiments using a state-of-the-art audio-visual simulator
for 3D environments, we demonstrate that our method successfully generates
arbitrary RIRs, outperforming state-of-the-art methods and -- in a major
departure from traditional methods -- generalizing to novel environments in a
few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir. Comment: Accepted to NeurIPS 202
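The cross-attention step mentioned in this abstract can be pictured with a generic single-head scaled dot-product attention. This is a textbook sketch, not the paper's model: there are no learned projections, and the function name, the query vector, and the context vectors are hypothetical stand-ins for a source-receiver query and the encoded image/echo observations.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each query attends over all context items (keys) and returns a
    softmax-weighted average of the corresponding values.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Hypothetical toy inputs: one query attending over three context items.
query = [[1.0, 0.0]]
context_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_values = [[0.2, 0.1], [0.5, 0.9], [0.3, 0.4]]
print(cross_attention(query, context_keys, context_values))
```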