Reverberation: models, estimation and application
The use of reverberation models is required in many applications such as acoustic measurements,
speech dereverberation and robust automatic speech recognition. The aim of this thesis is to
investigate different models and propose a perceptually-relevant reverberation model with suitable
parameter estimation techniques for different applications.
Reverberation can be modelled in both the time and frequency domain. The model parameters
give direct information about both physical and perceptual characteristics. These characteristics
create a multidimensional parameter space of reverberation, which can be to a large extent captured
by a time-frequency domain model. In this thesis, the relationship between physical and perceptual
model parameters will be discussed. In the first application, an intrusive technique is proposed to
measure reverberance, the perception of reverberation, and colouration. The
room decay rate parameter is of particular interest.
In practical applications, a blind estimate of the decay rate of acoustic energy in a room
is required. A statistical model for the distribution of the decay rate of the reverberant signal,
named the eagleMax distribution, is proposed. The eagleMax distribution describes the reverberant
speech decay rates as a random variable that is the maximum of the room decay rates and anechoic
speech decay rates. Three methods were developed to estimate the mean room decay rate from
the eagleMax distributions alone. The estimated room decay rates form a reverberation model that
will be discussed in the context of room acoustic measurements, speech dereverberation and robust
automatic speech recognition individually.
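The maximum construction behind the eagleMax distribution can be illustrated with a short simulation; the Gaussian decay-rate distributions and their parameters below are illustrative assumptions, not values from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical decay-rate distributions (illustrative units and parameters).
# The room decay rate varies little within a room, while anechoic speech
# decay rates vary widely from frame to frame.
room_decay = rng.normal(loc=25.0, scale=2.0, size=n)
speech_decay = rng.normal(loc=15.0, scale=10.0, size=n)

# The observed reverberant-speech decay rate is modelled as the maximum
# of the room and anechoic-speech decay rates in each frame.
observed = np.maximum(room_decay, speech_decay)
```

Frames in which speech decays faster than the room cluster near the room decay rate, which is what makes blind estimation of the mean room decay rate from the observed distribution alone feasible.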
Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments
This paper describes the practical response- and performance-aware
development of online speech enhancement for an augmented reality (AR) headset
that helps a user understand conversations made in real noisy echoic
environments (e.g., cocktail party). One may use a state-of-the-art blind
source separation method called fast multichannel nonnegative matrix
factorization (FastMNMF) that works well in various environments thanks to its
unsupervised nature. Its heavy computational cost, however, prevents its
application to real-time processing. In contrast, a supervised beamforming
method that uses a deep neural network (DNN) for estimating spatial information
of speech and noise readily fits real-time processing, but suffers from drastic
performance degradation in mismatched conditions. Given such complementary
characteristics, we propose a dual-process robust online speech enhancement
method based on DNN-based beamforming with FastMNMF-guided adaptation. FastMNMF
(back end) is performed in a mini-batch style and the noisy and enhanced speech
pairs are used together with the original parallel training data for updating
the direction-aware DNN (front end) with backpropagation at a
computationally-allowable interval. This method is used together with a blind
dereverberation method called weighted prediction error (WPE) to transcribe,
in a streaming manner, the noisy reverberant speech of a speaker, who can be
detected from video or selected by a user's hand gesture or eye gaze, and to
display the transcriptions spatially with an AR technique. Our experiment
showed that the word error rate was improved by more than 10 points with the
run-time adaptation using only twelve minutes of observation.
Comment: IEEE/RSJ IROS 202
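As a sketch of the kind of beamforming the front end performs, the following computes narrowband MVDR weights for one frequency bin from a steering vector and a noise spatial covariance. In the paper this spatial information is estimated by the direction-aware DNN; here both quantities are simply assumed known and generated randomly for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_mics = 4

# Steering vector toward the target speaker (assumed known in this sketch).
d = rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics)
d /= np.linalg.norm(d)

# Noise spatial covariance: a random Hermitian positive-definite matrix.
A = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
Phi_n = A @ A.conj().T + n_mics * np.eye(n_mics)

# MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
Phi_inv_d = np.linalg.solve(Phi_n, d)
w = Phi_inv_d / (d.conj() @ Phi_inv_d)

# Distortionless constraint: w^H d = 1 (exact up to floating-point error).
constraint = w.conj() @ d
```

The weights minimize output noise power subject to passing the target direction undistorted, which is why accurate speech/noise spatial statistics (here, from FastMNMF-guided DNN adaptation) matter so much.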
Informed algorithms for sound source separation in enclosed reverberant environments
While humans can separate a sound of interest amidst a cacophony of contending sounds in an echoic environment, machine-based methods still lag behind on this task. This thesis thus aims at improving the performance of audio separation algorithms when they are informed, i.e. have access to source location information. These locations are assumed to be known a priori in this work, for example by video processing.
Initially, a multi-microphone array based method combined with binary
time-frequency masking is proposed. A robust least-squares frequency-invariant data-independent beamformer designed with the location information is
utilized to estimate the sources. To further enhance the estimated sources, binary time-frequency masking based post-processing is used, but cepstral domain smoothing is required to mitigate musical noise.
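A minimal sketch of binary time-frequency masking, operating on synthetic magnitude spectrograms rather than real beamformer outputs (all shapes and signals below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic magnitude spectrograms of the two beamformer output estimates.
S1 = rng.rayleigh(size=(257, 100))
S2 = rng.rayleigh(size=(257, 100))
mixture = S1 + S2  # crude magnitude-domain mixture for illustration

# Assign each T-F bin entirely to whichever estimate dominates it.
mask1 = (S1 > S2).astype(float)
enhanced1 = mask1 * mixture
enhanced2 = (1.0 - mask1) * mixture
```

Because every bin goes to exactly one source, isolated misassigned bins produce the "musical noise" artifacts that the cepstral-domain smoothing of the mask is designed to mitigate.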
To tackle the under-determined case and further improve separation performance
at higher reverberation times, a two-microphone based method
which is inspired by human auditory processing and generates soft time-frequency masks is described. In this approach, interaural level difference,
interaural phase difference and mixing vectors are probabilistically modelled in the time-frequency domain and the model parameters are learned
through the expectation-maximization (EM) algorithm. A direction vector is estimated for each source, using the location information, which is used as
the mean parameter of the mixing vector model. Soft time-frequency masks are used to reconstruct the sources. A spatial covariance model is then integrated into the probabilistic model framework that encodes the spatial
characteristics of the enclosure and further improves the separation performance
in challenging scenarios, i.e. when sources are in close proximity and
when the level of reverberation is high.
Finally, new dereverberation based pre-processing is proposed, based on a cascade of three dereverberation stages, each of which enhances the two-microphone
reverberant mixture. The dereverberation stages are based on amplitude spectral subtraction, where the late reverberation is estimated and suppressed. The combination of such dereverberation based pre-processing and soft mask separation yields the best separation performance. All methods are evaluated with real and synthetic mixtures formed, for example, from speech signals from the TIMIT database and measured room impulse responses.
Convolutive Blind Source Separation Methods
In this chapter, we provide an overview of existing algorithms for blind source separation of convolutive audio mixtures. We provide a taxonomy, wherein many of the existing algorithms can be organized, and we present published results from those algorithms that have been applied to real-world audio separation tasks
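The convolutive mixing model underlying these algorithms can be written out directly; the toy impulse responses below are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two source signals and toy room impulse responses from each source to
# microphone 1 (random filters, illustrative only).
s1 = rng.standard_normal(1000)
s2 = rng.standard_normal(1000)
h11 = 0.1 * rng.standard_normal(64)
h12 = 0.1 * rng.standard_normal(64)

# Convolutive mixture at microphone 1: x1[n] = (h11 * s1)[n] + (h12 * s2)[n]
x1 = np.convolve(s1, h11)[:1000] + np.convolve(s2, h12)[:1000]
```

Because each source reaches each microphone through a different filter rather than a scalar gain, instantaneous-mixing BSS methods do not apply directly, which is what motivates the convolutive algorithms surveyed in the chapter.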
An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony
In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the user's signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short Time Fourier Transform domain. One of these bases is for the spectral energy of the acoustic echo signal, and is formed from the incoming far-end user's speech, while the other basis is for the spectral energy of the near-end speaker, and is trained with speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes for similar computational requirements. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address DTD for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal that is produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on double-talk.
Using a standard evaluation technique, the proposed algorithm is shown to have comparable detection performance to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room change insensitivity of speaker extraction, with the proposed DTD algorithm generating minimal false double-talk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non-minimum phase Room Impulse Response (RIR). We describe the process by which perceptually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other approach modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique.
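A minimal sketch of the NMF decomposition described above: the mixture magnitude spectrogram is factorized onto the concatenation of a fixed echo basis and a fixed pre-trained speech basis, with only the activations updated. Random data, Euclidean-cost multiplicative updates, and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
F, T, K = 129, 50, 10

# Fixed nonnegative bases: one formed from far-end (echo) speech, one
# trained a priori on near-end speech. Here both are random placeholders.
W_echo = rng.random((F, K))
W_near = rng.random((F, K))
W = np.hstack([W_echo, W_near])  # union of the two bases

# |STFT| of the near-end microphone signal (synthetic).
V = rng.random((F, T)) + 1e-6
H = rng.random((2 * K, T))       # activations, the only free parameters

# Multiplicative updates for the Euclidean cost, holding W fixed.
for _ in range(100):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)

# Near-end speaker estimate: the part of the model its basis explains.
near_est = W_near @ H[K:]
```

Because the echo basis tracks the far-end signal directly, this decomposition separates echo from near-end speech without an explicit double-talk detector, which is the property the thesis exploits.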
Deep neural networks for monaural source separation
In monaural source separation (MSS), only one recording is available and
spatial information generally cannot be extracted; it is also an underdetermined inverse problem. Recently, the development of the deep neural network
(DNN) has provided a framework to address this problem. How to select the
types of neural network models and training targets is the research question.
Moreover, in real room environments, reverberation from the floor, walls,
ceiling and furniture is challenging: it distorts the received
mixture and degrades separation performance. In many real-world applications, hardware size constraints mean that multiple microphones are not
always available. Hence, deep learning based MSS is the focus of this
thesis.
The first contribution is on improving separation performance by enhancing the generalization ability of deep learning-based MSS methods.
According to the no free lunch (NFL) theorem, it is impossible to find a neural
network model which can estimate the training target perfectly in all cases.
From the acquired speech mixture, the information of the clean speech signal
could be over- or underestimated. Besides, a discriminative criterion in the objective function can be used to address the ambiguous information problem in
the training stage of deep learning. Based on this, an adaptive discriminative criterion is proposed and better separation performance is obtained. In
addition, an alternative method is to use sequentially trained
neural network models with different training targets to further estimate
the clean speech signal. By using different training targets, the generalization ability of the neural network models is improved, and thereby better
separation performance is achieved.
The second contribution addresses the MSS problem in reverberant room
environments. To achieve this goal, a novel time-frequency (T-F) mask, the
dereverberation mask (DM), is proposed to estimate the relationship between
the reverberant noisy speech mixture and the dereverberated mixture. Then,
a separation mask is exploited to extract the desired clean speech signal from
the noisy speech mixture. The DM can be integrated with the ideal ratio mask
(IRM) to generate the ideal enhanced mask (IEM), addressing both the dereverberation and separation problems. Based on the DM and the IEM, a two-stage
approach is proposed with different system structures.
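The ideal ratio mask that the DM is combined with can be sketched as follows, under the usual assumption that speech and noise powers add in each T-F bin (synthetic magnitudes and illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic magnitude spectrograms of clean speech and noise.
S = rng.rayleigh(size=(257, 100))
N = rng.rayleigh(size=(257, 100))

# Ideal ratio mask: per-bin square root of the speech-to-mixture power
# ratio, assuming the speech and noise powers are additive.
irm = np.sqrt(S**2 / (S**2 + N**2))

# Applied to the mixture magnitude (additive-power assumption again).
mixture_mag = np.sqrt(S**2 + N**2)
estimate = irm * mixture_mag
```

The DM plays the analogous role for dereverberation, and multiplying the two per-bin masks yields an IEM-style target that handles both problems at once.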
In the final contribution, both the phase information of the clean speech signal
and the long short-term memory (LSTM) recurrent neural network (RNN) are
introduced. A novel complex signal approximation (SA)-based method is
proposed, operating in the complex domain of signals. By utilizing the LSTM RNN
as the neural network model, the temporal information is better exploited, and
the desired speech signal can be estimated more accurately. Besides, the
phase information of the clean speech signal is applied to mitigate the negative
influence of the noisy phase.
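A small numerical illustration of why the complex domain matters: reconstructing with an oracle magnitude but the noisy mixture phase leaves a residual error, while an oracle complex mask, as targeted by complex signal approximation, does not. Synthetic STFTs are used and no network is involved in this sketch:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic complex STFTs of clean speech and additive noise.
S = rng.standard_normal((64, 50)) + 1j * rng.standard_normal((64, 50))
N = 0.5 * (rng.standard_normal((64, 50)) + 1j * rng.standard_normal((64, 50)))
Y = S + N

# Magnitude-domain approach: oracle |S| combined with the noisy phase.
mag_phase_recon = np.abs(S) * np.exp(1j * np.angle(Y))

# Complex-domain approach: an oracle complex mask recovers S exactly.
complex_recon = (S / Y) * Y

err_mag = np.linalg.norm(mag_phase_recon - S)
err_cpx = np.linalg.norm(complex_recon - S)
```

The gap between the two errors is exactly the phase-induced residual that motivates complex SA in this contribution.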
The proposed MSS algorithms are evaluated with various challenging
datasets, such as the TIMIT and IEEE corpora and the NOISEX database. The
algorithms are assessed against state-of-the-art techniques using established
performance measures, confirming that the proposed MSS algorithms provide novel solutions.
Towards Single-Channel Speech Separation in Noise and Reverberation
Many speech technologies, such as automatic speech recognition and speaker identification, are conventionally designed to only work on single speech streams. As a result, these systems can suffer severely degraded performance in cases of overlapping speech, i.e. when two or more people are speaking at the same time. Speech separation systems aim to address this problem by taking a recording of a speech mixture and outputting a single recording for each speaker in the mixture, where the interfering speech has been removed. The advancements in speech technology provided by deep neural networks have extended to speech separation, resulting in the first effectively functional single-channel speech separation systems. As performance of these systems has improved, there has been a desire to extend their capabilities beyond the clean studio recordings using close-talking microphones that the technology was initially developed on. In this dissertation, we focus on the extension of these technologies to the noisy and reverberant conditions more representative of real-world applications. Contributions of this dissertation include producing and releasing new data appropriate for training and evaluation of single-channel speech separation techniques, performing benchmark experiments to establish the degradation of conventional methods in more realistic settings, theoretical analysis of the impact, and development of new techniques targeted at improving system performance in these adverse conditions
Learning earthquake sources using symmetric autoencoders
We introduce Symmetric Autoencoder (SymAE), a neural-network architecture
designed to automatically extract earthquake information from far-field seismic
waves. SymAE represents the measured displacement field using a code that is
partitioned into two interpretable components: source and path-scattering
information. We achieve this source-path representation using the scale
separation principle and stochastic regularization, which traditional
autoencoding methods lack. According to the scale separation principle, the
variations in far-field band-limited seismic measurements resulting from finite
faulting occur across two spatial scales: a slower scale associated with the
source processes and a faster scale corresponding to path effects. Once
trained, SymAE facilitates the generation of virtual seismograms, engineered to
not contain subsurface scattering effects. We present time-reversal imaging of
virtual seismograms to accurately infer the kinematic rupture parameters
without knowledge of an empirical Green's function. SymAE is an unsupervised
learning method that can efficiently scale with large amounts of seismic data
and does not require labeled seismograms, making it the first framework that
can learn from all available previous earthquakes to accurately characterize a
given earthquake. The paper presents an analysis of nearly thirty complex
earthquake events, revealing differences between earthquakes in energy rise
times and stopping phases, and providing insights into their rupture
complexity.