Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments
Multichannel linear filters, such as the Multichannel Wiener Filter (MWF) and
the Generalized Eigenvalue (GEV) beamformer, are popular signal processing
techniques that can improve speech recognition performance. In this paper, we
present an experimental study on these linear filters in a specific speech
recognition task, namely the CHiME-4 challenge, which features real recordings
in multiple noisy environments. Specifically, the rank-1 MWF is employed for
noise reduction, and a new constant residual noise power constraint is derived
that enhances the recognition performance. To fulfill the underlying rank-1
assumption, the speech covariance matrix is reconstructed based on eigenvectors
or generalized eigenvectors. Then the rank-1 constrained MWF is evaluated with
alternative multichannel linear filters under the same framework, which
involves a Bidirectional Long Short-Term Memory (BLSTM) network for mask
estimation. The proposed filter outperforms alternative ones, leading to a 40%
relative Word Error Rate (WER) reduction compared with the baseline Weighted
Delay and Sum (WDAS) beamformer on the real test set, and a 15% relative WER
reduction compared with the GEV-BAN method. The results also suggest that the
speech recognition accuracy correlates more with the Mel-frequency cepstral
coefficients (MFCC) feature variance than with the noise reduction or the
speech distortion level.
Comment: for Computer Speech and Language
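The core computation can be sketched in a few lines of numpy. This is a minimal illustration of the rank-1 idea only (reconstructing the speech covariance from its principal eigenvector before forming the MWF), not the paper's exact filter with the constant residual noise power constraint; the covariance estimates `R_s` and `R_n` are assumed given, e.g. derived from BLSTM-estimated masks.

```python
import numpy as np

def rank1_mwf(R_s, R_n, ref=0):
    """Rank-1 constrained MWF for one frequency bin.

    R_s, R_n: (M, M) Hermitian speech / noise covariance estimates.
    ref: index of the reference microphone.
    """
    # Enforce the rank-1 assumption: keep only the principal
    # eigenpair of the (possibly full-rank) speech covariance.
    eigvals, eigvecs = np.linalg.eigh(R_s)
    u = eigvecs[:, -1]
    R_s1 = eigvals[-1] * np.outer(u, u.conj())
    # Multichannel Wiener filter for the reference channel.
    return np.linalg.solve(R_s1 + R_n, R_s1[:, ref])
```

In the generalized-eigenvector variant discussed in the paper, the plain eigendecomposition above would be replaced by a generalized one of the pair (R_s, R_n).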
Contributions to speech processing and ambient sound analysis
We are constantly surrounded by sounds that we continuously exploit to adapt our actions to the situations we are facing. Some sounds, like speech, have a particular structure from which we can infer information, explicit or not. This is one reason why speech is possibly the most intuitive way for humans to communicate. Within the last decade, there has been significant progress in the domain of speech and audio processing, and in particular in machine learning applied to speech and audio processing. Thanks to this progress, speech has become a central element in many human-to-human distant communication tools as well as in human-machine communication systems. These solutions work well on clean speech or under controlled conditions. However, in scenarios that involve acoustic perturbations such as noise or reverberation, performance tends to degrade severely. In this habilitation thesis (HDR), we focus on processing speech and its environment from an audio perspective. The algorithms proposed here rely on a variety of solutions, from signal-processing-based approaches to data-driven solutions based on supervised matrix factorization or deep neural networks. We propose solutions to problems ranging from speech recognition to speech enhancement and ambient sound analysis. The goal is to offer a panorama of the different aspects that could improve a speech processing algorithm operating in real environments. We start by describing automatic speech recognition as a potential end application, then progressively unravel the limitations and the proposed solutions, ending with the more general problem of ambient sound analysis.
RTF-Based Binaural MVDR Beamformer Exploiting an External Microphone in a Diffuse Noise Field
Besides suppressing all undesired sound sources, an important objective of a
binaural noise reduction algorithm for hearing devices is the preservation of
the binaural cues, aiming at preserving the spatial perception of the acoustic
scene. A well-known binaural noise reduction algorithm is the binaural minimum
variance distortionless response beamformer, which can be steered using the
relative transfer function (RTF) vector of the desired source, relating the
acoustic transfer functions between the desired source and all microphones to a
reference microphone. In this paper, we propose a computationally efficient
method to estimate the RTF vector in a diffuse noise field, requiring an
additional microphone that is spatially separated from the head-mounted
microphones. Assuming that the spatial coherence between the noise components
in the head-mounted microphone signals and the additional microphone signal is
zero, we show that an unbiased estimate of the RTF vector can be obtained.
Based on real-world recordings, experimental results for several reverberation
times show that the proposed RTF estimator outperforms the widely used RTF
estimator based on covariance whitening and a simple biased RTF estimator in
terms of noise reduction and binaural cue preservation performance.
Comment: Accepted at ITG Conference on Speech Communication 201
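One plausible numpy reading of the proposed estimator: under the zero-coherence assumption, the noise terms average out in the cross-correlation between each head-mounted microphone signal and the external microphone signal, so the ratio of cross-correlations yields an unbiased RTF estimate. Shapes and names below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def rtf_external_mic(Y_head, y_ext, ref=0):
    """Estimate the RTF vector from one frequency bin's STFT frames.

    Y_head: (M, T) head-mounted mic signals; y_ext: (T,) external mic signal.
    Assumes zero spatial coherence between the noise at the head-mounted
    mics and at the external mic, so only the coherent desired-speech
    terms survive in the cross-correlations.
    """
    c = Y_head @ y_ext.conj() / y_ext.size
    return c / c[ref]

def binaural_mvdr(R_n, rtf):
    """MVDR beamformer steered by the RTF vector (w^H rtf = 1)."""
    w = np.linalg.solve(R_n, rtf)
    return w / (rtf.conj() @ w)
```

In practice the cross-correlations would be recursively averaged over frames rather than computed in one batch.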
DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays
Deep neural network (DNN)-based speech enhancement algorithms in microphone
arrays have now proven to be effective solutions for speech understanding and
speech recognition in noisy environments. However, in the context of ad-hoc
microphone arrays, many challenges remain and raise the need for distributed
processing. In this paper, we propose to extend a previously introduced
distributed DNN-based time-frequency mask estimation scheme that can
efficiently use spatial information in the form of so-called compressed signals,
which are pre-filtered target estimates. We study the performance of this
algorithm under realistic acoustic conditions and investigate practical aspects
of its optimal application. We show that the nodes in the microphone array
cooperate by taking advantage of their spatial coverage in the room. We also
propose to use the compressed signals to convey not only the target estimate
but also the noise estimate, in order to exploit the acoustic diversity
recorded throughout the microphone array.
Comment: Submitted to TASL
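A toy sketch of the idea at a single node, under assumed shapes and a simple MWF-style local filter (the actual distributed scheme and its DNN mask estimator are more elaborate): the time-frequency mask splits the local mixture into target and noise statistics, and the node shares compressed single-channel target and noise signals instead of its raw multichannel recordings.

```python
import numpy as np

def node_compressed_signals(Y, mask, ref=0):
    """Y: (M, F, T) local STFT; mask: (F, T) target presence mask in [0, 1].

    Returns single-channel compressed target and noise estimates, each (F, T).
    """
    M, F, T = Y.shape
    z_s = np.empty((F, T), dtype=complex)
    z_n = np.empty((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[:, f, :]
        # Mask-weighted covariance estimates for this frequency bin.
        R_s = (mask[f] * Yf) @ Yf.conj().T / T
        R_n = ((1 - mask[f]) * Yf) @ Yf.conj().T / T
        # Simple MWF-like local filter towards the reference channel.
        w = np.linalg.solve(R_n + 1e-6 * np.eye(M), R_s[:, ref])
        z_s[f] = w.conj() @ Yf
        z_n[f] = Yf[ref] - z_s[f]   # crude complementary noise estimate
    return z_s, z_n
```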
Relative Transfer Function Vector Estimation for Acoustic Sensor Networks Exploiting Covariance Matrix Structure
In many multi-microphone algorithms for noise reduction, an estimate of the
relative transfer function (RTF) vector of the target speaker is required. The
state-of-the-art covariance whitening (CW) method estimates the RTF vector as
the principal eigenvector of the whitened noisy covariance matrix, where
whitening is performed using an estimate of the noise covariance matrix. In
this paper, we consider an acoustic sensor network consisting of multiple
microphone nodes. Assuming uncorrelated noise between the nodes but not within
the nodes, we propose two RTF vector estimation methods that leverage the
block-diagonal structure of the noise covariance matrix. The first method
modifies the CW method by considering only the diagonal blocks of the estimated
noise covariance matrix. In contrast, the second method considers only the
off-diagonal blocks of the noisy covariance matrix, but cannot be solved using
a simple eigenvalue decomposition. When applying the estimated RTF vector in a
minimum variance distortionless response beamformer, simulation results for
real-world recordings in a reverberant environment with multiple noise sources
show that the modified CW method performs slightly better than the CW method in
terms of SNR improvement, while the off-diagonal selection method outperforms a
biased RTF vector estimate obtained as the principal eigenvector of the noisy
covariance matrix.
Comment: Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz NY, USA, Oct. 202
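For reference, the baseline covariance whitening (CW) method mentioned above can be sketched as follows. This is the standard construction, with a Cholesky factor used as a square root of the noise covariance; variable names are illustrative.

```python
import numpy as np

def rtf_covariance_whitening(R_y, R_n, ref=0):
    """CW estimate of the RTF vector from noisy and noise covariances."""
    L = np.linalg.cholesky(R_n)          # R_n = L L^H
    Li = np.linalg.inv(L)
    Rw = Li @ R_y @ Li.conj().T          # whitened noisy covariance
    _, V = np.linalg.eigh(Rw)            # eigh returns ascending eigenvalues
    a = L @ V[:, -1]                     # de-whitened principal eigenvector
    return a / a[ref]                    # normalise to the reference mic
```

The modified method described in the paper would whiten with only the diagonal blocks of `R_n`, which is cheaper when the noise is uncorrelated across nodes.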
Dual-Channel Speech Enhancement Based on Extended Kalman Filter Relative Transfer Function Estimation
This paper deals with speech enhancement in dual-microphone smartphones using
beamforming along with postfiltering techniques. The performance of these algorithms relies on
a good estimation of the acoustic channel and speech and noise statistics. In this work we present
a speech enhancement system that combines the estimation of the relative transfer function (RTF)
between microphones using an extended Kalman filter framework with a novel speech presence
probability estimator intended to track the noise statistics' variability. The available dual-channel
information is exploited to obtain more reliable estimates of clean speech statistics. Noise reduction
is further improved by means of postfiltering techniques that take advantage of the speech presence
estimation. Our proposal is evaluated in different reverberant and noisy environments when the
smartphone is used in both close-talk and far-talk positions. The experimental results show that our
system achieves improvements in terms of noise reduction, lower speech distortion
and better speech intelligibility compared with other state-of-the-art approaches.
Funding: Spanish MINECO/FEDER Project TEC2016-80141-P; Spanish Ministry of Education through the National Program FPU under Grant FPU15/0416
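Since the dual-channel observation model is linear in the RTF (the secondary-microphone bin is approximately the RTF times the reference bin plus noise), the tracking step reduces to a per-bin scalar Kalman update. The sketch below shows a plain scalar update with an assumed random-walk state model and illustrative noise variances `q` and `r`; the paper's full system couples this with the speech presence probability estimator.

```python
import numpy as np

def kalman_rtf_update(rtf, P, y_ref, y_sec, q=1e-3, r=1e-2):
    """One scalar Kalman update of the RTF estimate for one frequency bin.

    Model: rtf follows a random walk (variance q);
    observation: y_sec = rtf * y_ref + noise (variance r).
    """
    P = P + q                                # predict
    S = P * abs(y_ref) ** 2 + r              # innovation variance
    K = P * np.conj(y_ref) / S               # Kalman gain
    rtf = rtf + K * (y_sec - y_ref * rtf)    # correct
    P = (1.0 - K * y_ref) * P                # K * y_ref is real-valued
    return rtf, P
```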
A Phoneme-Scale Assessment of Multichannel Speech Enhancement Algorithms
In the intricate acoustic landscapes where speech intelligibility is
challenged by noise and reverberation, multichannel speech enhancement emerges
as a promising solution for individuals with hearing loss. Such algorithms are
commonly evaluated at the utterance level. However, this approach overlooks the
granular acoustic nuances revealed by phoneme-specific analysis, potentially
obscuring key insights into their performance. This paper presents an in-depth
phoneme-scale evaluation of 3 state-of-the-art multichannel speech enhancement
algorithms. These algorithms -- FasNet, MVDR, and Tango -- are extensively
evaluated across different noise conditions and spatial setups, employing
realistic acoustic simulations with measured room impulse responses, and
leveraging diversity offered by multiple microphones in a binaural hearing
setup. The study emphasizes fine-grained phoneme-level analysis, revealing
that while some phonemes, such as plosives, are heavily impacted by environmental
acoustics and are difficult for the algorithms to handle, others, such as nasals
and sibilants, see substantial improvements after enhancement. These
investigations demonstrate important improvements in phoneme clarity in noisy
conditions, with insights that could drive the development of more personalized
and phoneme-aware hearing aid technologies.
Comment: This is the preprint of the paper that we submitted to the Trends in Hearing journal
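The phoneme-scale bookkeeping itself is straightforward; below is a hypothetical sketch (function names, argument shapes, and the frame hop are assumptions, not the paper's code) that pools any per-frame quality metric by phoneme class using forced-alignment segments.

```python
import numpy as np
from collections import defaultdict

def per_phoneme_scores(alignments, frame_metric, hop_s=0.016):
    """alignments: iterable of (phoneme, start_s, end_s) segments;
    frame_metric: 1-D array of a per-frame quality metric.
    Returns the mean metric per phoneme class."""
    pooled = defaultdict(list)
    for ph, t0, t1 in alignments:
        i0, i1 = int(round(t0 / hop_s)), int(round(t1 / hop_s))
        pooled[ph].extend(frame_metric[i0:i1])
    return {ph: float(np.mean(v)) for ph, v in pooled.items() if v}
```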
Sampling Rate Offset Estimation and Compensation for Distributed Adaptive Node-Specific Signal Estimation in Wireless Acoustic Sensor Networks
Sampling rate offsets (SROs) between devices in a heterogeneous wireless
acoustic sensor network (WASN) can hinder the ability of distributed adaptive
algorithms to perform as intended when they rely on coherent signal processing.
In this paper, we present an SRO estimation and compensation method to allow
the deployment of the distributed adaptive node-specific signal estimation
(DANSE) algorithm in WASNs composed of asynchronous devices. The signals
available at each node are first utilised in a coherence-drift-based method to
blindly estimate SROs, which are then compensated for via phase shifts in the
frequency domain. A modification of the weighted overlap-add (WOLA)
implementation of DANSE is introduced to account for SRO-induced full-sample
drifts, permitting per-sample signal transmission via an approximation of the
WOLA process as a time-domain convolution. The performance of the proposed
algorithm is evaluated in the context of distributed noise reduction for the
estimation of a target speech signal in an asynchronous WASN.
Comment: 9 pages, 6 figures
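The phase-shift compensation step can be sketched for a single node's STFT. This is a simplified illustration assuming the SRO estimate is already available and the drift grows linearly with time; the full-sample-drift handling and the WOLA modification described in the paper are not shown.

```python
import numpy as np

def compensate_sro(X, sro_ppm, hop, n_fft):
    """X: (n_fft, T) full-band STFT frames of one microphone signal.

    Applies a per-bin phase rotation that undoes the accumulated
    sub-sample drift of sro_ppm * 1e-6 * hop * t samples at frame t.
    """
    F, T = X.shape
    k = np.arange(F)[:, None]                              # bin index
    drift = sro_ppm * 1e-6 * hop * np.arange(T)[None, :]   # drift in samples
    return X * np.exp(-2j * np.pi * k * drift / n_fft)
```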
- âŠ