Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments
Multichannel linear filters, such as the Multichannel Wiener Filter (MWF) and
the Generalized Eigenvalue (GEV) beamformer, are popular signal processing
techniques that can improve speech recognition performance. In this paper, we
present an experimental study on these linear filters in a specific speech
recognition task, namely the CHiME-4 challenge, which features real recordings
in multiple noisy environments. Specifically, the rank-1 MWF is employed for
noise reduction, and a new constant residual noise power constraint is derived
that enhances the recognition performance. To fulfill the underlying rank-1
assumption, the speech covariance matrix is reconstructed based on eigenvectors
or generalized eigenvectors. Then the rank-1 constrained MWF is evaluated with
alternative multichannel linear filters under the same framework, which
involves a Bidirectional Long Short-Term Memory (BLSTM) network for mask
estimation. The proposed filter outperforms alternative ones, leading to a 40%
relative Word Error Rate (WER) reduction compared with the baseline Weighted
Delay and Sum (WDAS) beamformer on the real test set, and a 15% relative WER
reduction compared with the GEV-BAN method. The results also suggest that the
speech recognition accuracy correlates more with the Mel-frequency cepstral
coefficients (MFCC) feature variance than with the noise reduction or the
speech distortion level.
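As a rough illustration of the filter described above, the following NumPy sketch reconstructs a rank-1 speech covariance matrix from the principal generalized eigenvector of the noisy and noise covariance matrices and forms the corresponding rank-1 MWF for one frequency bin. The function name, the GEVD-based reconstruction and the trade-off parameter mu are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np
from scipy.linalg import eigh

def rank1_mwf(R_x, R_n, mu=1.0, ref=0):
    """Rank-1 constrained MWF for a single frequency bin (illustrative).

    R_x : (M, M) noisy-speech covariance matrix
    R_n : (M, M) noise covariance matrix
    mu  : noise-reduction / speech-distortion trade-off (mu = 1 is the MWF)
    ref : index of the reference microphone
    """
    # Rank-1 reconstruction of the speech covariance from the principal
    # generalized eigenvector v of (R_x, R_n): R_s = (lam - 1) R_n v v^H R_n,
    # using scipy's v^H R_n v = 1 normalisation.
    eigvals, eigvecs = eigh(R_x, R_n)                  # ascending order
    v = eigvecs[:, -1]                                 # principal eigenvector
    R_s = (eigvals[-1] - 1.0) * np.outer(R_n @ v, (R_n @ v).conj())
    # With a rank-1 R_s, (R_s + mu R_n)^{-1} R_s e_ref simplifies to
    # R_n^{-1} R_s e_ref / (mu + tr(R_n^{-1} R_s)).
    Rn_inv_Rs = np.linalg.solve(R_n, R_s)
    return Rn_inv_Rs[:, ref] / (mu + np.real(np.trace(Rn_inv_Rs)))
```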
Contributions to speech processing and ambient sound analysis
We are constantly surrounded by sounds that we continuously exploit to adapt our actions to the situations we are facing. Some sounds, like speech, have a particular structure from which we can infer information, explicit or not. This is one reason why speech is possibly the most intuitive way for humans to communicate. Within the last decade, there has been significant progress in the domain of speech and audio processing, and in particular in the domain of machine learning applied to speech and audio processing. Thanks to this progress, speech has become a central element in many human-to-human distant communication tools as well as in human-to-machine communication systems. These solutions work well on clean speech or under controlled conditions. However, in scenarios that involve acoustic perturbations such as noise or reverberation, performance tends to degrade severely. In this thesis we focus on processing speech and its environment from an audio perspective. The algorithms proposed here rely on a variety of solutions, from signal processing based approaches to data-driven solutions based on supervised matrix factorization or deep neural networks. We propose solutions to problems ranging from speech recognition to speech enhancement and ambient sound analysis. The goal is to offer a panorama of the different aspects that can improve a speech processing algorithm operating in real environments. We start by describing automatic speech recognition as a potential end application and progressively unravel its limitations and the proposed solutions, ending up at the more general problem of ambient sound analysis.
Binaural Source Separation with Convolutional Neural Networks
This work is a study of source separation techniques for binaural music mixtures. The chosen framework uses a Convolutional Neural Network (CNN) to estimate time-frequency soft masks. These masks are used to extract the different sources from the original two-channel mixture signal. The baseline single-channel architecture achieved state-of-the-art results on monaural music mixtures under low-latency conditions. It has been extended to perform separation on two-channel signals, making it the first two-channel CNN joint-estimation architecture: filters are learned for each source by taking the information of both channels into account. Furthermore, a specific binaural condition is included during the training stage, which uses Interaural Level Difference (ILD) information to improve the spatial images of the extracted sources. Concurrently, we present a novel tool to create binaural scenes for testing purposes. Multiple binaural scenes are rendered from a music dataset of four instruments (voice, drums, bass and others). The CNN framework has been tested on these binaural scenes and compared with monaural and stereo results. The system showed great adaptability and good separation results in all scenarios. These results are used to evaluate the impact of spatial information on separation performance.
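To make the masking and ILD steps concrete, here is a minimal sketch of how a soft time-frequency mask could be applied jointly to both channels and how a per-bin ILD could be computed; the function names and the exact ILD formulation are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def apply_soft_mask(X_left, X_right, mask):
    """Extract one source from a two-channel mixture STFT by applying the
    same time-frequency soft mask (values in [0, 1]) to both channels."""
    return mask * X_left, mask * X_right

def ild_db(X_left, X_right, eps=1e-8):
    """Per-bin Interaural Level Difference in dB, the kind of spatial cue
    the binaural training condition relies on (formulation may differ)."""
    return 20.0 * np.log10((np.abs(X_left) + eps) / (np.abs(X_right) + eps))
```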
RTF-Based Binaural MVDR Beamformer Exploiting an External Microphone in a Diffuse Noise Field
Besides suppressing all undesired sound sources, an important objective of a
binaural noise reduction algorithm for hearing devices is the preservation of
the binaural cues, aiming at preserving the spatial perception of the acoustic
scene. A well-known binaural noise reduction algorithm is the binaural minimum
variance distortionless response beamformer, which can be steered using the
relative transfer function (RTF) vector of the desired source, relating the
acoustic transfer functions between the desired source and all microphones to a
reference microphone. In this paper, we propose a computationally efficient
method to estimate the RTF vector in a diffuse noise field, requiring an
additional microphone that is spatially separated from the head-mounted
microphones. Assuming that the spatial coherence between the noise components
in the head-mounted microphone signals and the additional microphone signal is
zero, we show that an unbiased estimate of the RTF vector can be obtained.
Based on real-world recordings, experimental results for several reverberation
times show that the proposed RTF estimator outperforms the widely used RTF
estimator based on covariance whitening and a simple biased RTF estimator in
terms of noise reduction and binaural cue preservation performance.
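A minimal sketch of the idea, under the zero-coherence assumption stated above: the cross-PSD vector between the head-mounted microphones and the external microphone is then proportional to the RTF vector, and the estimate can be used to steer an MVDR beamformer. Variable names and the plain averaging over frames are illustrative assumptions.

```python
import numpy as np

def rtf_from_external_mic(Y_head, y_ext, ref=0):
    """Estimate the RTF vector in one frequency bin from the cross-PSD
    between the M head-mounted microphones and an external microphone.

    Y_head : (M, T) STFT coefficients of the head-mounted microphones
    y_ext  : (T,)  STFT coefficients of the external microphone

    If the noise at the head-mounted microphones is spatially uncorrelated
    with the noise at the external microphone, E[y_m conj(y_e)] is
    proportional to the acoustic transfer function h_m, so normalising by
    the reference entry yields an unbiased RTF estimate.
    """
    cpsd = (Y_head * y_ext.conj()).mean(axis=1)   # cross-PSD vector
    return cpsd / cpsd[ref]

def mvdr(R_n, h):
    """MVDR beamformer steered by the RTF vector h."""
    a = np.linalg.solve(R_n, h)                   # R_n^{-1} h
    return a / (h.conj() @ a)
```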
DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays
Deep neural network (DNN)-based speech enhancement algorithms in microphone
arrays have now proven to be efficient solutions to speech understanding and
speech recognition in noisy environments. However, in the context of ad-hoc
microphone arrays, many challenges remain and raise the need for distributed
processing. In this paper, we propose to extend a previously introduced
distributed DNN-based time-frequency mask estimation scheme that can
efficiently use spatial information in the form of so-called compressed signals,
which are pre-filtered target estimates. We study the performance of this
algorithm under realistic acoustic conditions and investigate practical aspects
of its optimal application. We show that the nodes in the microphone array
cooperate by taking advantage of their spatial coverage in the room. We also
propose to use the compressed signals to convey not only the target estimate
but also the noise estimate, in order to exploit the acoustic diversity
recorded throughout the microphone array.
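The abstract does not spell out the compression step; as a purely illustrative sketch, one plausible reading is that each node shares a mask-weighted, pre-filtered estimate of the target and stacks the compressed signals received from the other nodes with its own channels as DNN input. Everything below (names, shapes, the averaging compressor) is an assumption, not the paper's exact scheme.

```python
import numpy as np

def compressed_signal(Y_node, mask):
    """One plausible 'compressed signal' for a node: a mask-weighted average
    of its local channels, acting as a pre-filtered target estimate.
    Y_node: (C, F, T) local STFTs; mask: (F, T) local target mask."""
    return (mask * Y_node).mean(axis=0)

def dnn_input(Y_node, z_received):
    """Stack the local channels with the compressed signals received from
    the other nodes to form the node's mask-estimation DNN input."""
    return np.concatenate([Y_node, np.stack(z_received)], axis=0)
```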
Relative Transfer Function Vector Estimation for Acoustic Sensor Networks Exploiting Covariance Matrix Structure
In many multi-microphone algorithms for noise reduction, an estimate of the
relative transfer function (RTF) vector of the target speaker is required. The
state-of-the-art covariance whitening (CW) method estimates the RTF vector as
the principal eigenvector of the whitened noisy covariance matrix, where
whitening is performed using an estimate of the noise covariance matrix. In
this paper, we consider an acoustic sensor network consisting of multiple
microphone nodes. Assuming uncorrelated noise between the nodes but not within
the nodes, we propose two RTF vector estimation methods that leverage the
block-diagonal structure of the noise covariance matrix. The first method
modifies the CW method by considering only the diagonal blocks of the estimated
noise covariance matrix. In contrast, the second method only considers the
off-diagonal blocks of the noisy covariance matrix, but cannot be solved using
a simple eigenvalue decomposition. When applying the estimated RTF vector in a
minimum variance distortionless response beamformer, simulation results for
real-world recordings in a reverberant environment with multiple noise sources
show that the modified CW method performs slightly better than the CW method in
terms of SNR improvement, while the off-diagonal selection method outperforms a
biased RTF vector estimate obtained as the principal eigenvector of the noisy
covariance matrix.
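A compact sketch of the first (modified CW) method, assuming Cholesky-based whitening: only the per-node diagonal blocks of the noise covariance matrix are used to whiten the noisy covariance before extracting the principal eigenvector. The helper name and the whitening choice are illustrative.

```python
import numpy as np
from scipy.linalg import block_diag, cholesky

def modified_cw_rtf(R_x, R_n_blocks, ref=0):
    """Modified covariance-whitening RTF estimator keeping only the
    per-node diagonal blocks of the noise covariance (illustrative).

    R_x        : (M, M) noisy covariance matrix for one frequency bin
    R_n_blocks : list of per-node noise covariance blocks
    """
    R_bd = block_diag(*R_n_blocks)            # block-diagonal noise model
    L = cholesky(R_bd, lower=True)            # R_bd = L L^H
    # Whiten the noisy covariance: Rx_w = L^{-1} R_x L^{-H}
    A = np.linalg.solve(L, R_x)
    Rx_w = np.linalg.solve(L, A.conj().T).conj().T
    _, V = np.linalg.eigh(Rx_w)               # ascending eigenvalues
    h = L @ V[:, -1]                          # de-whiten principal eigenvector
    return h / h[ref]                         # normalise to the reference mic
```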
Dual-Channel Speech Enhancement Based on Extended Kalman Filter Relative Transfer Function Estimation
This paper deals with speech enhancement in dual-microphone smartphones using
beamforming along with postfiltering techniques. The performance of these algorithms relies on
a good estimation of the acoustic channel and speech and noise statistics. In this work we present
a speech enhancement system that combines the estimation of the relative transfer function (RTF)
between microphones using an extended Kalman filter framework with a novel speech presence
probability estimator intended to track the noise statistics’ variability. The available dual-channel
information is exploited to obtain more reliable estimates of clean speech statistics. Noise reduction
is further improved by means of postfiltering techniques that take advantage of the speech presence
estimation. Our proposal is evaluated in different reverberant and noisy environments when the
smartphone is used in both close-talk and far-talk positions. The experimental results show that our
system achieves improvements in terms of noise reduction, low speech distortion and better speech
intelligibility compared to other state-of-the-art approaches.
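To illustrate the RTF-tracking idea, here is a deliberately simplified per-bin Kalman-style update for the observation model y_sec = h * x_ref + noise with random-walk state dynamics; the paper's extended Kalman filter formulation is richer, and the noise variances q and r are tuning assumptions.

```python
import numpy as np

def kalman_rtf_update(h, p, x_ref, y_sec, q=1e-4, r=1e-2):
    """Single scalar Kalman-style update of the RTF h linking the reference
    channel to the secondary channel in one frequency bin (toy sketch).

    h, p : prior RTF estimate and its variance
    q, r : process and observation noise variances (assumed tuning values)
    """
    p = p + q                                              # predict
    k = p * np.conj(x_ref) / (np.abs(x_ref) ** 2 * p + r)  # Kalman gain
    h = h + k * (y_sec - h * x_ref)                        # correct
    p = (1.0 - np.real(k * x_ref)) * p                     # posterior variance
    return h, p
```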
Sampling Rate Offset Estimation and Compensation for Distributed Adaptive Node-Specific Signal Estimation in Wireless Acoustic Sensor Networks
Sampling rate offsets (SROs) between devices in a heterogeneous wireless
acoustic sensor network (WASN) can hinder the ability of distributed adaptive
algorithms to perform as intended when they rely on coherent signal processing.
In this paper, we present an SRO estimation and compensation method to allow
the deployment of the distributed adaptive node-specific signal estimation
(DANSE) algorithm in WASNs composed of asynchronous devices. The signals
available at each node are first utilised in a coherence-drift-based method to
blindly estimate SROs which are then compensated for via phase shifts in the
frequency domain. A modification of the weighted overlap-add (WOLA)
implementation of DANSE is introduced to account for SRO-induced full-sample
drifts, permitting per-sample signal transmission via an approximation of the
WOLA process as a time-domain convolution. The performance of the proposed
algorithm is evaluated in the context of distributed noise reduction for the
estimation of a target speech signal in an asynchronous WASN.
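The compensation step lends itself to a short sketch: once the SRO is estimated, the drift accumulated up to a given frame is removed with a per-bin linear phase shift in the frequency domain. Sign conventions and names below are illustrative assumptions.

```python
import numpy as np

def compensate_sro(X_frame, sro_ppm, n_start, n_fft):
    """Compensate an accumulated sampling-rate offset by a linear phase
    shift in the STFT domain (sign conventions vary; illustrative).

    X_frame : (n_fft,) full FFT of one frame at the drifting node
    sro_ppm : estimated SRO in parts per million
    n_start : start index of the frame, counted at the reference rate
    """
    drift = sro_ppm * 1e-6 * n_start          # accumulated drift in samples
    k = np.fft.fftfreq(n_fft) * n_fft         # signed frequency-bin indices
    return X_frame * np.exp(-2j * np.pi * k * drift / n_fft)
```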
Independent component analysis and source analysis of auditory evoked potentials for assessment of cochlear implant users
Source analysis of the Auditory Evoked Potential (AEP) has previously been used to evaluate the maturation of the auditory system in both adults and children. In the same way, this technique can be applied to ongoing EEG recordings, in response to frequency-specific acoustic stimuli, from children with cochlear implants (CI), in order to objectively assess the performance of this electronic device and the maturation of the child's hearing. However, these recordings are contaminated by an artifact produced by the normal operation of the CI; this artifact makes the detection and analysis of AEPs much harder and generates errors in the source analysis process. The artifact can be spatially filtered using Independent Component Analysis (ICA); in this research, three different ICA algorithms were compared in order to establish the algorithm best suited to removing the CI artifact. Additionally, we show that pre-processing the EEG recording using a temporal ICA algorithm facilitates not only the identification of the AEP peaks but also the source analysis procedure. From the results obtained in this research, based on a limited dataset of CI versus normal-hearing recordings, it is possible to conclude that the AEP source locations change from the inferior temporal areas in the first two years after implantation to the superior temporal area after three years of CI use, close to the locations obtained in normal-hearing children. It is intended that the results of this research be used as an objective technique for a general evaluation of the performance of children with CIs.
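As a minimal sketch of the artifact-removal step, the snippet below uses FastICA from scikit-learn as a stand-in for the ICA algorithms compared in the study; the manual component selection is an assumption, since the paper's selection procedure is not detailed here.

```python
import numpy as np
from sklearn.decomposition import FastICA

def remove_ci_artifact(eeg, artifact_components):
    """Remove cochlear-implant artifact components from multichannel EEG
    with temporal ICA (illustrative sketch).

    eeg : (n_samples, n_channels) EEG recording
    artifact_components : indices of independent components to discard
    """
    ica = FastICA(n_components=eeg.shape[1], whiten="unit-variance",
                  random_state=0)
    sources = ica.fit_transform(eeg)            # (n_samples, n_components)
    sources[:, artifact_components] = 0.0       # zero out artifact components
    return ica.inverse_transform(sources)       # reconstruct cleaned channels
```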