Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments
Multichannel linear filters, such as the Multichannel Wiener Filter (MWF) and
the Generalized Eigenvalue (GEV) beamformer, are popular signal processing
techniques which can improve speech recognition performance. In this paper, we
present an experimental study on these linear filters in a specific speech
recognition task, namely the CHiME-4 challenge, which features real recordings
in multiple noisy environments. Specifically, the rank-1 MWF is employed for
noise reduction and a new constant residual noise power constraint is derived
which enhances the recognition performance. To fulfill the underlying rank-1
assumption, the speech covariance matrix is reconstructed based on eigenvectors
or generalized eigenvectors. Then the rank-1 constrained MWF is evaluated with
alternative multichannel linear filters under the same framework, which
involves a Bidirectional Long Short-Term Memory (BLSTM) network for mask
estimation. The proposed filter outperforms alternative ones, leading to a 40%
relative Word Error Rate (WER) reduction compared with the baseline Weighted
Delay and Sum (WDAS) beamformer on the real test set, and a 15% relative WER
reduction compared with the GEV-BAN method. The results also suggest that the
speech recognition accuracy correlates more with the Mel-frequency cepstral
coefficients (MFCC) feature variance than with the noise reduction or the
speech distortion level.
Comment: for Computer Speech and Language
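As a rough illustration of the rank-1 constrained MWF described above, the sketch below rebuilds the filter from the principal generalized eigenvector of the speech/noise covariance pencil. The function name and the noise-weighting parameter `mu` are illustrative assumptions; in the paper the covariance matrices would come from BLSTM mask estimates rather than being given directly.

```python
import numpy as np
from scipy.linalg import eigh

def rank1_mwf(R_s, R_n, ref=0, mu=1.0):
    """Rank-1 constrained MWF sketch: the speech covariance is replaced
    by its rank-1 reconstruction from the principal generalized
    eigenvector of the pencil (R_s, R_n)."""
    # GEVD of (R_s, R_n); eigh sorts eigenvalues in ascending order and
    # normalises the eigenvectors so that Q^H R_n Q = I.
    lam, Q = eigh(R_s, R_n)
    lam1, q1 = lam[-1], Q[:, -1]
    # Closed form of w = (R_s1 + mu*R_n)^{-1} R_s1 e_ref for the rank-1
    # reconstruction R_s1 = lam1 * (R_n q1)(R_n q1)^H.
    return lam1 / (mu + lam1) * q1 * (q1.conj() @ R_n[:, ref])
```

Applied as `w.conj() @ x` to a multichannel STFT snapshot `x`, the filter points in the maximum-SNR (GEV) direction, while the leading scalar controls the residual noise level.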
Speech enhancement using ego-noise references with a microphone array embedded in an unmanned aerial vehicle
A method is proposed for performing speech enhancement using ego-noise
references with a microphone array embedded in an unmanned aerial vehicle
(UAV). The ego-noise reference signals are captured with microphones located
near the UAV's propellers and used in the prior knowledge multichannel Wiener
filter (PK-MWF) to obtain the speech correlation matrix estimate. Speech
presence probability (SPP) can be estimated for detecting speech activity from
an external microphone near the speech source, providing a performance
benchmark, or from one of the embedded microphones, assuming a more realistic
scenario. Experimental measurements are performed in a semi-anechoic chamber,
with a UAV mounted on a stand and a loudspeaker playing a speech signal, while
setting three distinct and fixed propeller rotation speeds, resulting in three
different signal-to-noise ratios (SNRs). The recordings obtained and made
available online are used to compare the proposed method to the use of the
standard multichannel Wiener filter (MWF) estimated with and without the
propellers' microphones being used in its formulation. Results show that
compared to those, the use of PK-MWF achieves higher levels of improvement in
speech intelligibility and quality, measured by STOI and PESQ, while the SNR
improvement is similar.
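One way to picture the role of the SPP in such a pipeline is as a gate on recursive covariance estimates: speech-dominated frames update the speech statistics, noise-dominated frames update the noise statistics. The sketch below assumes a per-frame scalar SPP; the function name and the exact weighting scheme are illustrative, not the PK-MWF formulation itself.

```python
import numpy as np

def update_covariances(R_s, R_n, x, spp, alpha=0.95):
    """One SPP-gated recursive update of the speech and noise spatial
    covariance estimates from a multichannel snapshot x.

    spp in [0, 1] is the speech presence probability, estimated e.g.
    from an external reference microphone or an embedded one."""
    xx = np.outer(x, x.conj())
    # Noise covariance: updated only when speech is (likely) absent.
    R_n = spp * R_n + (1 - spp) * (alpha * R_n + (1 - alpha) * xx)
    # Speech covariance: updated only when speech is (likely) present,
    # subtracting the current noise estimate from the snapshot.
    R_s = (1 - spp) * R_s + spp * (alpha * R_s + (1 - alpha) * (xx - R_n))
    return R_s, R_n
```

A soft SPP interpolates smoothly between the two regimes, which is what makes the embedded-microphone (less reliable) SPP usable in practice.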
DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays
Deep neural network (DNN)-based speech enhancement algorithms in microphone
arrays have now proven to be effective solutions for speech understanding and
speech recognition in noisy environments. However, in the context of ad-hoc
microphone arrays, many challenges remain and raise the need for distributed
processing. In this paper, we propose to extend a previously introduced
distributed DNN-based time-frequency mask estimation scheme that can
efficiently use spatial information in the form of so-called compressed signals,
which are pre-filtered target estimates. We study the performance of this
algorithm under realistic acoustic conditions and investigate practical aspects
of its optimal application. We show that the nodes in the microphone array
cooperate by taking advantage of their spatial coverage in the room. We also
propose to use the compressed signals not only to convey the target estimation
but also the noise estimation in order to exploit the acoustic diversity
recorded throughout the microphone array.
Comment: Submitted to TASL
Sampling Rate Offset Estimation and Compensation for Distributed Adaptive Node-Specific Signal Estimation in Wireless Acoustic Sensor Networks
Sampling rate offsets (SROs) between devices in a heterogeneous wireless
acoustic sensor network (WASN) can hinder the ability of distributed adaptive
algorithms to perform as intended when they rely on coherent signal processing.
In this paper, we present an SRO estimation and compensation method to allow
the deployment of the distributed adaptive node-specific signal estimation
(DANSE) algorithm in WASNs composed of asynchronous devices. The signals
available at each node are first utilised in a coherence-drift-based method to
blindly estimate SROs which are then compensated for via phase shifts in the
frequency domain. A modification of the weighted overlap-add (WOLA)
implementation of DANSE is introduced to account for SRO-induced full-sample
drifts, permitting per-sample signal transmission via an approximation of the
WOLA process as a time-domain convolution. The performance of the proposed
algorithm is evaluated in the context of distributed noise reduction for the
estimation of a target speech signal in an asynchronous WASN.
Comment: 9 pages, 6 figures
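The compensation step can be sketched as follows, assuming the SRO (here in parts-per-million) has already been estimated blindly with the coherence-drift method; the function name and framing parameters are hypothetical. The accumulated drift grows linearly with the frame index and is removed by a per-bin phase shift in the frequency domain.

```python
import numpy as np

def compensate_sro(stft_frames, sro_ppm, hop, nfft):
    """Compensate a known sampling-rate offset via per-bin phase shifts
    in the STFT domain (the compensation half of the method; the blind
    SRO estimation itself is not shown).

    stft_frames: (n_frames, n_bins) one-sided STFT, n_bins = nfft//2 + 1.
    """
    eps = sro_ppm * 1e-6
    n_frames, n_bins = stft_frames.shape
    k = np.arange(n_bins)
    out = np.empty_like(stft_frames)
    for l in range(n_frames):
        delay = eps * l * hop  # accumulated drift in samples at frame l
        out[l] = stft_frames[l] * np.exp(2j * np.pi * k * delay / nfft)
    return out
```

Note that this only removes the sub-sample phase drift; as the abstract points out, full-sample drifts additionally require realigning the WOLA frames, which is what the proposed DANSE modification handles.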
A Phoneme-Scale Assessment of Multichannel Speech Enhancement Algorithms
In the intricate acoustic landscapes where speech intelligibility is
challenged by noise and reverberation, multichannel speech enhancement emerges
as a promising solution for individuals with hearing loss. Such algorithms are
commonly evaluated at the utterance level. However, this approach overlooks the
granular acoustic nuances revealed by phoneme-specific analysis, potentially
obscuring key insights into their performance. This paper presents an in-depth
phoneme-scale evaluation of three state-of-the-art multichannel speech enhancement
algorithms. These algorithms -- FasNet, MVDR, and Tango -- are extensively
evaluated across different noise conditions and spatial setups, employing
realistic acoustic simulations with measured room impulse responses, and
leveraging diversity offered by multiple microphones in a binaural hearing
setup. The study emphasizes the fine-grained phoneme-level analysis, revealing
that while some phonemes like plosives are heavily impacted by environmental
acoustics and challenging for the algorithms to handle, others like nasals
and sibilants see substantial improvements after enhancement. These
investigations demonstrate important improvements in phoneme clarity in noisy
conditions, with insights that could drive the development of more personalized
and phoneme-aware hearing aid technologies.
Comment: This is the preprint of the paper that we submitted to the Trends in Hearing journal
DREGON: Dataset and Methods for UAV-Embedded Sound Source Localization
This paper introduces DREGON, a novel publicly-available dataset that aims at pushing research in sound source localization using a microphone array embedded in an unmanned aerial vehicle (UAV). The dataset contains both clean and noisy in-flight audio recordings continuously annotated with the 3D position of the target sound source using an accurate motion capture system. In addition, various signals of interest are available, such as the rotational speed of individual rotors and inertial measurements at all times. Besides introducing the dataset, this paper sheds light on the specific properties, challenges and opportunities brought by the emerging task of UAV-embedded sound source localization. Several baseline methods are evaluated and compared on the dataset, with real-time applicability in mind. Very promising results are obtained for the localization of a broad-band source in loud noise conditions, while speech localization remains a challenge under extreme noise levels.
Instantaneous PSD Estimation for Speech Enhancement based on Generalized Principal Components
Power spectral density (PSD) estimates of various microphone signal
components are essential to many speech enhancement procedures. As speech is
highly non-stationary, performance improvements may be gained by maintaining
time-variations in PSD estimates. In this paper, we propose an instantaneous
PSD estimation approach based on generalized principal components. Similarly to
other eigenspace-based PSD estimation approaches, we rely on recursive
averaging in order to obtain a microphone signal correlation matrix estimate to
be decomposed. However, instead of estimating the PSDs directly from the
temporally smooth generalized eigenvalues of this matrix, yielding temporally
smooth PSD estimates, we propose to estimate the PSDs from newly defined
instantaneous generalized eigenvalues, yielding instantaneous PSD estimates.
The instantaneous generalized eigenvalues are defined from the generalized
principal components, i.e. a generalized eigenvector-based transform of the
microphone signals. We further show that the smooth generalized eigenvalues can
be understood as a recursive average of the instantaneous generalized
eigenvalues. Simulation results comparing the multi-channel Wiener filter (MWF)
with smooth and instantaneous PSD estimates indicate better speech enhancement
performance for the latter. A MATLAB implementation is available online.
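A minimal sketch of the proposed quantity, assuming the correlation matrices are given (the paper's MATLAB implementation is the reference; the Python below and its function name are illustrative): the generalized eigenvector transform of a microphone snapshot yields the generalized principal components, and their squared magnitudes serve as instantaneous generalized eigenvalues. Averaging these over frames recovers the smooth generalized eigenvalues, as stated in the abstract.

```python
import numpy as np
from scipy.linalg import eigh

def instantaneous_gevals(x, R_x, R_n):
    """Instantaneous generalized eigenvalues of (R_x, R_n) for one
    snapshot x, plus the smooth eigenvalues for comparison."""
    # eigh normalises Q so that Q^H R_n Q = I and Q^H R_x Q = diag(lam).
    lam, Q = eigh(R_x, R_n)
    y = Q.conj().T @ x            # generalized principal components
    return np.abs(y) ** 2, lam    # instantaneous and smooth eigenvalues
```

Replacing the smooth eigenvalues with these per-frame values in the PSD estimator is what removes the temporal smoothing from the resulting PSD estimates.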
DNN-Based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays
Submitted to ICASSP 2020. Multichannel processing is widely used for speech enhancement, but several limitations appear when deploying these solutions in the real world. Distributed sensor arrays, consisting of several devices with a few microphones each, are a viable alternative that exploits the many microphone-equipped devices we use in our everyday life. In this context, we propose to extend the distributed adaptive node-specific signal estimation approach to a neural network framework. At each node, local filtering is performed to send one signal to the other nodes, where a mask is estimated by a neural network in order to compute a global multichannel Wiener filter. In an array of two nodes, we show that this additional signal can be efficiently taken into account to predict the masks, leading to better speech enhancement performance than when the mask estimation relies only on the local signals.
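The mask-to-filter step common to this line of work can be sketched for a single frequency bin as follows. In the paper the mask comes from the neural network fed with local and compressed signals; here the mask is simply an input, and the function name and normalization are illustrative assumptions.

```python
import numpy as np

def mask_based_mwf(stft, speech_mask, ref=0):
    """Mask-based multichannel Wiener filter for one frequency bin.

    stft: (n_mics, n_frames) complex STFT coefficients.
    speech_mask: (n_frames,) values in [0, 1], e.g. DNN-predicted.
    Returns the filtered single-channel output for the reference mic."""
    # Mask-weighted speech and noise spatial covariance estimates.
    w_s = speech_mask / max(speech_mask.sum(), 1e-8)
    w_n = (1 - speech_mask) / max((1 - speech_mask).sum(), 1e-8)
    R_s = (stft * w_s) @ stft.conj().T
    R_n = (stft * w_n) @ stft.conj().T
    # MWF for the reference channel: w = (R_s + R_n)^{-1} R_s e_ref.
    w = np.linalg.solve(R_s + R_n, R_s[:, ref])
    return w.conj() @ stft
```

In the distributed setting, each node would run this with its local channels plus the compressed signals received from the other nodes, so that the spatial information of the whole array enters both the mask prediction and the filter.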
Contributions to speech processing and ambient sound analysis
We are constantly surrounded by sounds that we continuously exploit to adapt our actions to the situations we are facing. Some sounds, like speech, have a particular structure from which we can infer information, explicit or not. This is one reason why speech is possibly the most intuitive way for humans to communicate. Within the last decade, there has been significant progress in the domain of speech and audio processing, and in particular in machine learning applied to speech and audio processing. Thanks to this progress, speech has become a central element in many human-to-human distant communication tools as well as in human-machine communication systems. These solutions work well on clean speech or under controlled conditions. However, in scenarios that involve acoustic perturbations such as noise or reverberation, performance tends to degrade severely. In this thesis we focus on processing speech and its environment from an audio perspective. The algorithms proposed here rely on a variety of solutions, from signal-processing-based approaches to data-driven solutions based on supervised matrix factorization or deep neural networks. We propose solutions to problems ranging from speech recognition to speech enhancement and ambient sound analysis. The goal is to offer a panorama of the different aspects that could improve a speech processing algorithm operating in real environments. We start by describing automatic speech recognition as a potential end application and progressively unravel the limitations and the proposed solutions, ending with the more general problem of ambient sound analysis.