End-to-End Multi-Look Keyword Spotting
The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under far-field and noisy conditions. In this paper, we propose multi-look neural network modeling for speech enhancement, which simultaneously steers to listen in multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end KWS model that integrates the enhanced signals from the multiple look directions and leverages an attention mechanism to dynamically tune the model's attention to the reliable sources. We demonstrate, on our large noisy and far-field evaluation sets, that the proposed approach significantly improves KWS performance over both the baseline KWS system and a recent beamformer-based multi-beam KWS system.
Comment: Submitted to Interspeech 2020
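As a rough illustration of the attention-based integration described above (not the authors' architecture; the feature shapes, the tanh projection, and the softmax scoring are our assumptions), the enhanced features from the sampled look directions could be pooled as follows:

    import numpy as np

    def attention_pool(look_feats, W, v):
        # look_feats: (L, D) array, one D-dim enhanced feature per look direction.
        # W: (D, D) projection and v: (D,) scoring vector, learned in practice.
        scores = np.tanh(look_feats @ W) @ v        # relevance score per look
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                        # softmax attention weights
        return alpha @ look_feats                   # weighted sum over looks

The downstream KWS classifier would then consume the pooled feature, so unreliable look directions receive small weights.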
Speech Separation Using Partially Asynchronous Microphone Arrays Without Resampling
We consider the problem of separating speech sources captured by multiple
spatially separated devices, each of which has multiple microphones and samples
its signals at a slightly different rate. Most asynchronous array processing
methods rely on sample rate offset estimation and resampling, but these offsets
can be difficult to estimate if the sources or microphones are moving. We
propose a source separation method that does not require offset estimation or
signal resampling. Instead, we divide the distributed array into several
synchronous subarrays. All subarrays are used jointly to estimate the time-varying signal statistics, and those statistics are used to design a separate time-varying spatial filter for each subarray. We demonstrate the method on speech mixtures recorded with both stationary and moving microphone arrays.
Comment: To appear at the International Workshop on Acoustic Signal Enhancement (IWAENC 2018)
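A minimal sketch of this idea, under assumptions not in the abstract (mask-based covariance estimates and a rank-one, Souden-style MVDR solution): a speech presence mask, estimated jointly from all devices, drives a separate spatial filter within each synchronous subarray, so no cross-device resampling is ever needed:

    import numpy as np

    def per_subarray_mvdr(X_subs, mask, ref=0):
        # X_subs: list of (M_i, F, T) STFTs, one per synchronous subarray.
        # mask: (F, T) speech presence mask estimated jointly from all devices.
        outputs = []
        for X in X_subs:
            M, F, T = X.shape
            Y = np.zeros((F, T), dtype=complex)
            for f in range(F):
                Xf = X[:, f, :]                              # (M, T)
                Rs = (mask[f] * Xf) @ Xf.conj().T / T        # speech covariance
                Rn = ((1 - mask[f]) * Xf) @ Xf.conj().T / T  # noise covariance
                Rn += 1e-6 * np.eye(M)                       # diagonal loading
                H = np.linalg.solve(Rn, Rs)
                w = H[:, ref] / (np.trace(H) + 1e-12)        # MVDR weights
                Y[f] = w.conj() @ Xf
            outputs.append(Y)                                # per-subarray output
        return outputs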
Low-Latency Speaker-Independent Continuous Speech Separation
Speaker-independent continuous speech separation (SI-CSS) is the task of converting a continuous audio stream, which may contain overlapping voices of unknown speakers, into a fixed number of continuous signals, each of which contains no overlapping speech segments. A separated, or cleaned, version of each utterance is emitted in full on one of the SI-CSS output channels, chosen nondeterministically, rather than being split up and distributed across multiple channels. A typical application scenario is transcribing multi-party conversations, such as meetings, recorded with microphone arrays. Because the output signals contain no speech overlaps, they can simply be fed to a speech recognition engine. The previous SI-CSS method uses a neural network trained with permutation invariant training together with a data-driven beamformer, and thus incurs substantial processing latency. This paper proposes a low-latency SI-CSS method whose performance is comparable to that of the previous method on a microphone-array-based meeting transcription task. This is achieved (1) by using a new speech separation network architecture combined with a double-buffering scheme and (2) by performing enhancement with a set of fixed beamformers followed by a neural post-filter.
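A hedged sketch of step (2) above (the beamformer weights, beam count, and post-filter interface below are placeholders, not the paper's): a bank of fixed beamformers is applied in parallel and a neural post-filter produces a time-frequency mask, which keeps the per-frame processing cost, and hence the latency, low:

    import numpy as np

    def fixed_beams_plus_postfilter(X, W, postfilter):
        # X: (M, F, T) multichannel STFT; W: (B, M, F) precomputed fixed
        # beamformer weights for B look directions; postfilter: a callable
        # (a neural net in practice) mapping (B, F, T) beam magnitudes
        # to an (F, T) enhancement mask.
        beams = np.einsum('bmf,mft->bft', W.conj(), X)   # all fixed beams at once
        mask = postfilter(np.abs(beams))                 # neural post-filter mask
        energy = np.mean(np.abs(beams) ** 2, axis=(1, 2))
        return mask * beams[np.argmax(energy)]           # mask the strongest beam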
Enhanced Robot Audition Based on Microphone Array Source Separation with Post-Filter
We propose a system that gives a mobile robot the ability to separate
simultaneous sound sources. A microphone array is used along with a real-time
dedicated implementation of Geometric Source Separation and a post-filter that
further reduces interference from other sources. We present results and comparisons for the separation of multiple non-stationary speech sources mixed with noise sources. The main advantage of our approach for mobile robots resides in the fact that both the frequency-domain Geometric Source Separation algorithm and the post-filter are able to adapt rapidly to new sources and non-stationarity. Separation results are presented for three simultaneous interfering speakers in the presence of noise. A reduction in log spectral distortion (LSD) of approximately 10 dB and an increase in signal-to-noise ratio (SNR) of approximately 14 dB are observed.
Comment: 6 pages
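The post-filter stage can be caricatured as follows (a simplification under our own assumptions, not the paper's estimator: a fixed leakage fraction eta stands in for the adaptive interference-leakage estimate):

    import numpy as np

    def leakage_postfilter(S, eta=0.3, floor=0.1):
        # S: (N, F, T) magnitude spectra of the N sources separated by GSS.
        # Treat a fraction eta of the other sources' power as leakage into
        # each channel and apply a Wiener-like gain to suppress it.
        power = S ** 2
        leak = eta * (power.sum(axis=0, keepdims=True) - power)
        gain = power / (power + leak + 1e-12)
        return np.maximum(gain, floor) * S    # gain floor limits musical noise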
Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR
In this paper, we present Hitachi and Paderborn University's joint effort for
automatic speech recognition (ASR) in a dinner party scenario. The main
challenges of ASR systems for dinner party recordings obtained by multiple
microphone arrays are (1) heavy speech overlaps, (2) severe noise and
reverberation, (3) very natural conversational content, and possibly (4)
insufficient training data. As an example of a dinner party scenario, we have
chosen the data presented during the CHiME-5 speech recognition challenge,
where the baseline ASR had a 73.3% word error rate (WER) and even the best-performing system at the CHiME-5 challenge had a 46.1% WER. We extensively investigated a combination of the guided source separation-based speech enhancement technique and a previously proposed strong ASR backend, and found that a tight combination of these techniques provides substantial accuracy improvements. Our final system achieved WERs of 39.94% and 41.64% on the development and evaluation data, respectively, both of which are the best published results for the dataset. We also investigated the use of additional training data on top of the official small training set of the CHiME-5 corpus to assess the intrinsic difficulty of this ASR task.
Comment: Accepted to INTERSPEECH 2019
A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation
This paper proposes a method for estimating a convolutional beamformer that
can perform denoising and dereverberation simultaneously in an optimal way. The
application of dereverberation based on a weighted prediction error (WPE)
method followed by denoising based on a minimum variance distortionless
response (MVDR) beamformer has conventionally been considered a promising
approach; however, the optimality of this approach cannot be guaranteed. To
realize the optimal integration of denoising and dereverberation, we present a
method that unifies the WPE dereverberation method and a variant of the MVDR
beamformer, namely a minimum power distortionless response (MPDR) beamformer,
into a single convolutional beamformer, and we optimize it based on a single
unified optimization criterion. The proposed beamformer is referred to as a
Weighted Power minimization Distortionless response (WPD) beamformer.
Experiments show that the proposed method substantially improves the speech
enhancement performance in terms of both objective speech enhancement measures
and automatic speech recognition (ASR) performance.
Comment: Published in IEEE Signal Processing Letters
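In symbols, the unified criterion can be restated as follows (our condensed notation, consistent with the construction described above: $\bar{\mathbf{x}}_t$ stacks the current frame with delayed past frames, $\lambda_t$ is the time-varying desired-signal power, and $\bar{\mathbf{v}}$ is the steering vector zero-padded over the delayed taps):

    \hat{\mathbf{w}}
      = \arg\min_{\mathbf{w}} \sum_t \frac{|\mathbf{w}^{\mathsf{H}}\bar{\mathbf{x}}_t|^2}{\lambda_t}
      \quad \text{s.t.} \quad \mathbf{w}^{\mathsf{H}}\bar{\mathbf{v}} = 1
      \quad\Longrightarrow\quad
      \hat{\mathbf{w}} = \frac{\mathbf{R}^{-1}\bar{\mathbf{v}}}{\bar{\mathbf{v}}^{\mathsf{H}}\mathbf{R}^{-1}\bar{\mathbf{v}}},
      \qquad
      \mathbf{R} = \sum_t \frac{\bar{\mathbf{x}}_t\bar{\mathbf{x}}_t^{\mathsf{H}}}{\lambda_t}.

A single quadratic objective with one distortionless constraint thus covers both dereverberation (through the delayed taps) and denoising (through the weighted power minimization).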
Neural Spatio-Temporal Beamformer for Target Speech Separation
Purely neural network (NN) based speech separation and enhancement methods, although able to achieve good objective scores, inevitably introduce nonlinear speech distortions that are harmful to automatic speech recognition (ASR). On the other hand, the minimum variance distortionless response (MVDR) beamformer with NN-predicted masks, although it significantly reduces speech distortions, has limited noise reduction capability. In this paper, we propose a multi-tap MVDR beamformer with complex-valued masks for speech separation and enhancement. Compared to the state-of-the-art NN-mask-based MVDR beamformer, the multi-tap MVDR beamformer exploits the inter-frame correlation in addition to the inter-microphone correlation already utilized in prior art. Further improvements include the replacement of real-valued masks with complex-valued masks and the joint training of the complex-mask NN. An evaluation on our multi-modal multi-channel target speech separation and enhancement platform demonstrates that the proposed multi-tap MVDR beamformer improves both ASR accuracy and perceptual speech quality over prior art.
Comment: Accepted to Interspeech 2020. Demo: https://yongxuustc.github.io/mtmvdr
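A hedged sketch of the multi-tap idea (frame stacking plus mask-based covariances in a Souden-style MVDR solution; the tap count, masking scheme, and loading constants are our assumptions, not the authors' code):

    import numpy as np

    def multitap_mvdr(X, cmask, taps=3, ref=0):
        # X: (M, F, T) multichannel STFT; cmask: (F, T) complex-valued
        # speech mask predicted by the NN. Stacking `taps` consecutive
        # frames lets the filter exploit inter-frame correlation on top
        # of the usual inter-microphone correlation.
        M, F, T = X.shape
        pad = np.pad(X, ((0, 0), (0, 0), (taps - 1, 0)))
        Xbar = np.concatenate(
            [pad[:, :, taps - 1 - k:T + taps - 1 - k] for k in range(taps)],
            axis=0)                                   # (M * taps, F, T)
        Y = np.zeros((F, T), dtype=complex)
        for f in range(F):
            Xf = Xbar[:, f, :]
            S = cmask[f] * Xf                         # masked speech estimate
            Rs = S @ S.conj().T / T                   # speech covariance
            N = Xf - S
            Rn = N @ N.conj().T / T + 1e-6 * np.eye(M * taps)
            H = np.linalg.solve(Rn, Rs)
            w = H[:, ref] / (np.trace(H) + 1e-12)     # distortionless weights
            Y[f] = w.conj() @ Xf
        return Y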
Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments
We propose a spatial diffuseness feature for deep neural network (DNN)-based
automatic speech recognition to improve recognition accuracy in reverberant and
noisy environments. The feature is computed in real-time from multiple
microphone signals without requiring knowledge or estimation of the direction
of arrival, and represents the relative amount of diffuse noise in each time
and frequency bin. It is shown that using the diffuseness feature as an additional input to a DNN-based acoustic model reduces the word error rate on the REVERB challenge corpus, compared both to logmelspec features extracted from the noisy signals and to features enhanced by spectral subtraction.
Comment: Accepted for ICASSP 2015
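One plausible way to compute such a feature for a single microphone pair (a sketch with assumed smoothing and normalization; the paper's coherent-to-diffuse ratio estimator differs in detail):

    import numpy as np

    def diffuseness_feature(X1, X2, d, fs=16000, c=343.0, alpha=0.8):
        # X1, X2: (F, T) STFTs of one microphone pair with spacing d (meters).
        # Compares the measured complex coherence with the sinc coherence of
        # an ideal diffuse field; 1 means fully diffuse, 0 fully directional.
        F, T = X1.shape
        f = np.linspace(0.0, fs / 2.0, F)[:, None]

        def smooth(P):                      # recursive averaging over time
            out = np.empty_like(P)
            acc = np.zeros(P.shape[0], dtype=P.dtype)
            for t in range(T):
                acc = alpha * acc + (1.0 - alpha) * P[:, t]
                out[:, t] = acc
            return out

        P11 = smooth(np.abs(X1) ** 2)
        P22 = smooth(np.abs(X2) ** 2)
        P12 = smooth(X1 * X2.conj())
        gamma = P12 / np.sqrt(P11 * P22 + 1e-12)       # measured coherence
        gamma_d = np.sinc(2.0 * f * d / c)             # diffuse-field coherence
        coherent = np.clip((np.abs(gamma) - np.abs(gamma_d))
                           / (1.0 - np.abs(gamma_d) + 1e-6), 0.0, 1.0)
        return 1.0 - coherent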
Noise Robust IOA/CAS Speech Separation and Recognition System for the Third 'CHiME' Challenge
This paper presents our contribution to the third 'CHiME' speech separation and recognition challenge, covering both front-end signal processing and back-end speech recognition. In the front-end, a multi-channel Wiener filter (MWF) is designed to achieve background noise reduction. Unlike the traditional MWF, the parameter controlling the tradeoff between noise reduction and target signal distortion is set according to the desired noise reduction level. In the back-end, several techniques are exploited to improve noisy automatic speech recognition (ASR) performance, including deep neural network (DNN), convolutional neural network (CNN), and long short-term memory (LSTM) models with a medium vocabulary, lattice rescoring with a large-vocabulary language model finite-state transducer, and a ROVER combination scheme. Experimental results show that the proposed system, combining the front-end and back-end, effectively improves ASR performance.
Comment: 5 pages, 1 figure
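The tradeoff mentioned above is commonly expressed as a speech-distortion-weighted MWF (our notation; the paper's exact parameterization may differ), where increasing $\mu$ above 1 buys more noise reduction at the price of more target distortion, and $\mu = 1$ recovers the standard MWF:

    \mathbf{w}_{\mathrm{SDW\text{-}MWF}}
      = \left(\boldsymbol{\Phi}_{ss} + \mu\,\boldsymbol{\Phi}_{nn}\right)^{-1}
        \boldsymbol{\Phi}_{ss}\,\mathbf{e}_{\mathrm{ref}},

with $\boldsymbol{\Phi}_{ss}$ and $\boldsymbol{\Phi}_{nn}$ the speech and noise spatial covariance matrices and $\mathbf{e}_{\mathrm{ref}}$ selecting the reference microphone.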
Microphone Subset Selection for MVDR Beamformer Based Noise Reduction
In large-scale wireless acoustic sensor networks (WASNs), many of the sensors
will only have a marginal contribution to a certain estimation task. Involving
all sensors increases the energy budget unnecessarily and decreases the
lifetime of the WASN. Using microphone subset selection, also termed sensor
selection, the most informative sensors can be chosen from a set of candidate
sensors to achieve a prescribed inference performance. In this paper, we
consider microphone subset selection for minimum variance distortionless
response (MVDR) beamformer based noise reduction. The best subset of sensors is
determined by minimizing the transmission cost while constraining the output
noise power (or signal-to-noise ratio). Assuming the statistical information on
correlation matrices of the sensor measurements is available, the sensor
selection problem for this model-driven scheme is first solved by utilizing
convex optimization techniques. In addition, to avoid estimating the statistics
related to all the candidate sensors beforehand, we also propose a data-driven
approach that selects the best subset using a greedy strategy. The performance of the greedy algorithm converges to that of the model-driven method, while offering advantages in dynamic scenarios and in computational complexity. Experiments show that, compared to a sparse MVDR or radius-based beamformer, the proposed methods can guarantee the desired performance at a significantly lower transmission cost.
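A toy version of the greedy, data-driven variant might look as follows (our sketch, not the paper's algorithm: selection here simply maximizes the drop in MVDR output noise power and reports the accumulated transmission cost, whereas the paper minimizes cost subject to the noise-power constraint):

    import numpy as np

    def greedy_mic_selection(Rn, d, cost, noise_budget):
        # Rn: (M, M) noise covariance; d: (M,) steering vector to the target;
        # cost: (M,) per-sensor transmission cost (numpy arrays); noise_budget:
        # maximum allowed MVDR output noise power 1 / (d_S^H Rn_S^{-1} d_S).
        M = len(d)
        chosen = []

        def out_noise(S):
            RnS = Rn[np.ix_(S, S)]
            dS = d[S]
            return 1.0 / np.real(dS.conj() @ np.linalg.solve(RnS, dS))

        while len(chosen) < M:
            best = min((m for m in range(M) if m not in chosen),
                       key=lambda m: out_noise(chosen + [m]))
            chosen.append(best)               # most informative sensor next
            if out_noise(chosen) <= noise_budget:
                break
        return chosen, sum(cost[m] for m in chosen)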