7 research outputs found
Multi-Talker MVDR Beamforming Based on Extended Complex Gaussian Mixture Model
In this letter, we present a novel multi-talker minimum variance
distortionless response (MVDR) beamforming approach as the front-end of an
automatic speech recognition (ASR) system in a dinner party scenario. The
CHiME-5 dataset is selected to evaluate our proposal for the overlapping
multi-talker scenario with
severe noise. A detailed study on beamforming is conducted based on the
proposed extended complex Gaussian mixture model (CGMM) integrated with various
speech separation and speech enhancement masks. Three main changes are made to
adapt the original CGMM-based MVDR to the multi-talker scenario. First, the
number of Gaussian distributions is extended to 3 with an additional
interference speaker model. Second, the mixture coefficients are introduced as
a supervisor to generate more elaborate masks and avoid the permutation
problem. Moreover,
we reorganize the MVDR and mask-based speech separation to achieve both noise
reduction and target speaker extraction. With the official baseline ASR
back-end, our front-end algorithm gained an absolute WER reduction of 13.87%
compared with the baseline front-end.
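The letter's equations are not reproduced in this abstract, but the mask-driven MVDR pipeline it builds on can be summarized in a short NumPy sketch: time-frequency masks weight the spatial covariance estimates, and the filter follows the standard reference-channel MVDR solution. Function and variable names below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def mask_mvdr_weights(Y, speech_mask, noise_mask, ref_mic=0):
    """Y: (mics, frames) complex STFT of a single frequency bin.
    speech_mask, noise_mask: (frames,) values in [0, 1]."""
    # Mask-weighted spatial covariance matrices (mics x mics)
    phi_s = (speech_mask * Y) @ Y.conj().T / max(speech_mask.sum(), 1e-8)
    phi_n = (noise_mask * Y) @ Y.conj().T / max(noise_mask.sum(), 1e-8)
    # Reference-channel MVDR solution: w = (Phi_n^-1 Phi_s) u / tr(Phi_n^-1 Phi_s)
    num = np.linalg.solve(phi_n, phi_s)
    return num[:, ref_mic] / max(np.trace(num).real, 1e-8)

# The enhanced bin is then obtained as s_hat = w.conj() @ Y
```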
Multi-Channel Speech Enhancement using Graph Neural Networks
Multi-channel speech enhancement aims to extract clean speech from a noisy
mixture using signals captured from multiple microphones. Recently proposed
methods tackle this problem by incorporating deep neural network models with
spatial filtering techniques such as the minimum variance distortionless
response (MVDR) beamformer. In this paper, we introduce a different research
direction by viewing each audio channel as a node lying in a non-Euclidean
space and, specifically, a graph. This formulation allows us to apply graph
neural networks (GNN) to find spatial correlations among the different channels
(nodes). We utilize graph convolution networks (GCN) by incorporating them in
the embedding space of a U-Net architecture. We use the LibriSpeech dataset
and simulated room acoustics data to experiment extensively with our approach
using different array types and numbers of microphones. Results indicate the
superiority of our approach when compared to a prior state-of-the-art method.
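As an illustration of the channels-as-graph-nodes idea, the sketch below applies one graph convolution (H' = act(A_norm H W)) to per-channel embeddings such as those a U-Net encoder might produce. The fully connected microphone graph, the shapes, and all names are assumptions made for the example, not the paper's architecture.

```python
import numpy as np

def gcn_layer(node_feats, adjacency, weight):
    """node_feats: (channels, feat_dim); adjacency: (channels, channels)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])          # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))  # degree normalization
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    return np.maximum(a_norm @ node_feats @ weight, 0.0)    # ReLU activation

channels, feat_dim = 4, 16
H = np.random.randn(channels, feat_dim)                # per-channel embeddings
A = np.ones((channels, channels)) - np.eye(channels)   # fully connected mic graph
W = np.random.randn(feat_dim, feat_dim) * 0.1
H_mixed = gcn_layer(H, A, W)                           # spatially mixed embeddings
```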
Robust Multi-channel Speech Recognition using Frequency Aligned Network
Conventional speech enhancement techniques such as beamforming have known
benefits for far-field speech recognition. Our own work in frequency-domain
multi-channel acoustic modeling has shown additional improvements by training a
spatial filtering layer jointly within an acoustic model. In this paper, we
further develop this idea and use frequency aligned network for robust
multi-channel automatic speech recognition (ASR). Unlike an affine layer in the
frequency domain, the proposed frequency aligned component prevents one
frequency bin from influencing other frequency bins. We show that this
modification not only reduces the number of parameters in the model but also
significantly improves ASR performance. We investigate the effects of the
frequency aligned network through ASR experiments on real-world far-field data
where users
are interacting with an ASR system in uncontrolled acoustic environments. We
show that our multi-channel acoustic model with a frequency aligned network
achieves up to an 18% relative reduction in word error rate.
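A toy contrast between a full affine layer over the frequency axis and a per-bin ("frequency aligned") transform may help illustrate why the latter keeps frequency bins independent and shrinks the parameter count. The shapes and names below are illustrative assumptions only.

```python
import numpy as np

freq_bins, channels = 257, 4
X = np.random.randn(freq_bins, channels) + 1j * np.random.randn(freq_bins, channels)

# Full affine layer: every output bin can mix all (bin, channel) inputs.
W_full = np.random.randn(freq_bins, freq_bins * channels) * 0.01   # F x (F*C) parameters
y_full = W_full @ X.reshape(-1)

# Frequency aligned layer: an independent weight vector per bin, so no
# frequency bin influences another and the parameter count drops to F*C.
W_aligned = np.random.randn(freq_bins, channels) * 0.1
y_aligned = np.sum(W_aligned * X, axis=1)
```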
Implicit Filter-and-sum Network for Multi-channel Speech Separation
Various neural network architectures have been proposed in recent years for
the task of multi-channel speech separation. Among them, the filter-and-sum
network (FaSNet) performs end-to-end time-domain filter-and-sum beamforming and
has proven effective in both ad-hoc and fixed microphone array geometries. In
this paper, we investigate multiple ways to improve the performance of FaSNet.
From the problem formulation perspective, we change the explicit time-domain
filter-and-sum operation which involves all the microphones into an implicit
filter-and-sum operation in the latent space of only the reference microphone.
The filter-and-sum operation is applied on a context around the frame to be
separated. This allows the problem formulation to better match the objective of
end-to-end separation. From the feature extraction perspective, we modify the
calculation of sample-level normalized cross correlation (NCC) features into
feature-level NCC (fNCC) features. This makes the model better match the
implicit filter-and-sum formulation. Experimental results on both ad-hoc and
fixed microphone array geometries show that the proposed modification to the
FaSNet, which we refer to as iFaSNet, is able to significantly outperform the
benchmark FaSNet across all conditions with comparable model complexity.
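As a rough illustration of feature-level NCC, the sketch below computes a frame-wise cosine similarity between the reference microphone's latent features and another microphone's, which is the kind of quantity fNCC measures; the exact definition in the paper may differ, and all names here are assumptions.

```python
import numpy as np

def fncc(ref_feats, other_feats, eps=1e-8):
    """Frame-wise cosine similarity between latent features.
    ref_feats, other_feats: (frames, feat_dim)."""
    ref_n = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + eps)
    oth_n = other_feats / (np.linalg.norm(other_feats, axis=1, keepdims=True) + eps)
    return np.sum(ref_n * oth_n, axis=1)    # (frames,)

frames, feat_dim = 100, 64
similarity = fncc(np.random.randn(frames, feat_dim), np.random.randn(frames, feat_dim))
```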
Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming
In this paper, we propose two mask-based beamforming methods using a deep
neural network (DNN) trained by multichannel loss functions. Beamforming
techniques using time-frequency (TF) masks estimated by a DNN have been
applied in many applications, where the TF masks are used to estimate spatial
covariance matrices. To train a DNN for mask-based beamforming, loss functions
designed
for monaural speech enhancement/separation have been employed. Although such a
training criterion is simple, it does not directly correspond to the
performance of mask-based beamforming. To overcome this problem, we use
multichannel loss functions which evaluate the estimated spatial covariance
matrices based on the multichannel Itakura--Saito divergence. DNNs trained by
the multichannel loss functions can be applied to construct several
beamformers. Experimental results confirmed their effectiveness and robustness
to microphone configurations.
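For intuition, a commonly used form of the multichannel Itakura-Saito divergence between an estimated and a reference spatial covariance matrix is D(R_hat || R) = tr(R^-1 R_hat) - log det(R^-1 R_hat) - M. The sketch below implements that form, which may differ in detail from the loss actually used in the paper.

```python
import numpy as np

def multichannel_is_divergence(R_hat, R_ref, eps=1e-8):
    """R_hat, R_ref: (mics, mics) Hermitian positive-definite matrices."""
    mics = R_ref.shape[0]
    P = np.linalg.solve(R_ref + eps * np.eye(mics), R_hat)   # R_ref^-1 R_hat
    _, logdet = np.linalg.slogdet(P)
    return np.trace(P).real - logdet - mics   # non-negative, zero iff R_hat == R_ref
```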
Exploring Optimal DNN Architecture for End-to-End Beamformers Based on Time-frequency References
Acoustic beamformers have been widely used to enhance audio signals.
Currently, the best methods are the deep neural network (DNN)-powered variants
of the generalized eigenvalue and minimum-variance distortionless response
beamformers and the DNN-based filter-estimation methods that are used to
directly compute beamforming filters. Both approaches are effective; however,
they have blind spots in their generalizability. Therefore, we propose a novel
approach for combining these two methods into a single framework that attempts
to exploit the best features of both. The resulting model, called the W-Net
beamformer, includes two components: the first computes time-frequency
references that the second uses to estimate beamforming filters. The results on
data that include a wide variety of room and noise conditions, including static
and mobile noise sources, show that the proposed beamformer outperforms other
methods on all tested evaluation metrics, which signifies that the proposed
architecture allows for effective computation of the beamforming filters.
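The description above implies a simple two-stage structure. The PyTorch sketch below shows one possible shape of such a pipeline: a reference-estimation network followed by a filter-estimation network whose complex outputs would be applied by filter-and-sum. Layer sizes, input features, and all names are assumptions, not the W-Net architecture itself.

```python
import torch
import torch.nn as nn

class TwoStageBeamformer(nn.Module):
    """Stage 1 predicts time-frequency references; stage 2 maps the input plus
    those references to per-channel, per-bin beamforming filters."""
    def __init__(self, mics=4, freq_bins=257, hidden=256):
        super().__init__()
        in_dim = mics * freq_bins
        self.reference_net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, freq_bins), nn.Sigmoid())
        self.filter_net = nn.Sequential(
            nn.Linear(in_dim + freq_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * mics * freq_bins))
        self.mics, self.freq_bins = mics, freq_bins

    def forward(self, feats):                       # feats: (batch, frames, mics*freq_bins)
        ref = self.reference_net(feats)             # (batch, frames, freq_bins)
        raw = self.filter_net(torch.cat([feats, ref], dim=-1))
        raw = raw.view(*feats.shape[:2], 2, self.mics, self.freq_bins)
        w = torch.complex(raw[..., 0, :, :], raw[..., 1, :, :])
        return ref, w   # apply w to the multichannel STFT and sum over mics
```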
Block-Online Guided Source Separation
We propose a block-online algorithm of guided source separation (GSS). GSS is
a speech separation method that uses diarization information to update
parameters of the generative model of observation signals. Previous studies
have shown that GSS performs well in multi-talker scenarios. However, it
requires a large amount of computation, which is an obstacle to deployment in
online applications. Another problem is that offline GSS is an utterance-wise
algorithm, so its latency grows with the length of the utterance. With the
proposed algorithm, block-wise input samples and
corresponding time annotations are concatenated with those in the preceding
context and used to update the parameters. Using the context enables the
algorithm to estimate time-frequency masks accurately from only one iteration
of optimization for each block, and its latency depends not on the utterance
length but on a predetermined block length. It also reduces the calculation
cost by updating only the parameters of active speakers in each block and its
context. Evaluation on the CHiME-6 corpus and a meeting corpus showed that the
proposed algorithm achieved almost the same performance as the conventional
offline GSS algorithm but with 32x faster calculation, which is sufficient for
real-time applications.
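Read as pseudocode, the block-wise procedure above can be sketched as a short loop: each incoming block is concatenated with its preceding context, one optimization iteration updates only the currently active speakers, and masks for the new block are emitted. The function names and array layouts below are placeholders, not the authors' implementation.

```python
import numpy as np

def block_online_gss(stft_blocks, activity_blocks, update_params, estimate_masks,
                     context_blocks=2):
    """stft_blocks: list of (freq, frames, mics) arrays for consecutive blocks.
    activity_blocks: list of (speakers, frames) 0/1 diarization arrays.
    update_params / estimate_masks: stand-ins for the GSS parameter update and
    mask-estimation steps."""
    history, params, outputs = [], None, []
    for block, activity in zip(stft_blocks, activity_blocks):
        history.append((block, activity))
        history = history[-(context_blocks + 1):]            # current block + preceding context
        buf = np.concatenate([b for b, _ in history], axis=1)
        act = np.concatenate([a for _, a in history], axis=1)
        active = np.flatnonzero(act.sum(axis=1) > 0)          # speakers active in this span
        params = update_params(params, buf, act, active)      # one iteration, active speakers only
        masks = estimate_masks(params, buf)                   # (speakers, freq, frames)
        outputs.append(masks[:, :, -block.shape[1]:])         # emit masks for the new block only
    return outputs
```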