DNN-Based Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement
Multi-frame approaches for single-microphone speech enhancement, e.g., the
multi-frame minimum-variance-distortionless-response (MVDR) filter, are able to
exploit speech correlations across neighboring time frames. In contrast to
single-frame approaches such as the Wiener gain, multi-frame approaches have
been shown to achieve substantial noise reduction with hardly any speech
distortion, provided that accurate estimates of the correlation matrices and
especially of the speech interframe correlation (IFC) vector are available.
Typical estimation procedures for the correlation matrices and the speech IFC
vector require an estimate of the speech presence probability (SPP) in each
time-frequency bin. In this paper, we propose to use a bi-directional long
short-term memory deep neural network (DNN) to estimate a speech mask and a
noise mask for each time-frequency bin, from which two different SPP estimates
are derived. To achieve robust performance, the DNN is trained on various
noise types and signal-to-noise ratios.
Experimental results show that the multi-frame MVDR filter, in combination
with the proposed data-driven SPP estimator, yields higher speech quality than
a state-of-the-art model-based estimator.
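
As a rough illustration of the processing described above, the sketch below
applies a multi-frame MVDR filter to a single frequency bin, with the
correlation matrices updated by SPP-weighted recursive averaging. The update
rule, the number of taps, and the smoothing constant are illustrative
assumptions; the paper's BLSTM-based SPP estimator is represented here only by
a precomputed spp array.

import numpy as np

def mfmvdr_bin(noisy, spp, n_taps=4, lam=0.9):
    """Apply a multi-frame MVDR filter to one STFT frequency bin.

    noisy: complex STFT coefficients of the bin over time, shape (L,)
    spp:   speech presence probability per frame, shape (L,)
    """
    L = len(noisy)
    e = np.zeros(n_taps); e[0] = 1.0                  # selects the current frame
    phi_y = 1e-3 * np.eye(n_taps, dtype=complex)      # noisy-speech correlation matrix
    phi_n = 1e-3 * np.eye(n_taps, dtype=complex)      # noise correlation matrix
    out = np.array(noisy, dtype=complex)              # pass-through as fallback
    for l in range(n_taps - 1, L):
        y = noisy[l - n_taps + 1:l + 1][::-1]         # current frame plus past frames
        yyH = np.outer(y, y.conj())
        # Assumed SPP-weighted recursive averaging of the correlation matrices
        phi_y = lam * phi_y + (1.0 - lam) * yyH
        phi_n = lam * phi_n + (1.0 - lam) * (1.0 - spp[l]) * yyH
        phi_x = phi_y - phi_n                         # speech correlation matrix estimate
        gamma = phi_x @ e / max(np.real(e @ phi_x @ e), 1e-10)   # speech IFC vector
        num = np.linalg.solve(phi_n + 1e-10 * np.eye(n_taps), gamma)
        denom = gamma.conj() @ num
        if np.abs(denom) > 1e-10:                     # multi-frame MVDR solution
            w = num / denom
            out[l] = w.conj() @ y
    return out

In this sketch the spp values would be derived from the speech and noise masks
predicted by the BLSTM for the corresponding time-frequency bins.
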
Deep Beamforming for Speech Enhancement and Speaker Localization with an Array Response-Aware Loss Function
Recent research advances in deep neural network (DNN)-based beamformers have
shown great promise for speech enhancement under adverse acoustic conditions.
Different network architectures and input features have been explored in
estimating beamforming weights. In this paper, we propose a deep beamformer
based on an efficient convolutional recurrent network (CRN) trained with a
novel ARray RespOnse-aWare (ARROW) loss function. The ARROW loss exploits the
array responses of the target and interferer by using the ground truth relative
transfer functions (RTFs). The DNN-based beamforming system, trained with the
ARROW loss through supervised learning, is able to perform speech enhancement
and speaker localization jointly. Experimental results show that the proposed
deep beamformer, trained with a linearly weighted combination of the
scale-invariant source-to-noise ratio (SI-SNR) and ARROW loss functions,
achieves superior performance in speech enhancement and speaker localization
compared to two baselines.
Comment: 6 pages
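
The abstract does not give the exact form of the ARROW loss. One plausible
reading is a penalty that keeps the beamformer response to the ground-truth
target RTF close to unity while driving the response to the interferer RTF
toward zero, combined linearly with a negative SI-SNR term. The sketch below
illustrates that reading; the function names, weighting, and array shapes are
assumptions, not the paper's definitions.

import numpy as np

def arrow_style_penalty(w, rtf_target, rtf_interf, alpha=1.0):
    """Hypothetical array-response-aware penalty on beamforming weights.

    w, rtf_target, rtf_interf: complex arrays of shape (F, M)
    (F frequency bins, M microphones).
    """
    resp_t = np.sum(w.conj() * rtf_target, axis=-1)   # w^H a_target per bin
    resp_i = np.sum(w.conj() * rtf_interf, axis=-1)   # w^H a_interferer per bin
    distortion = np.mean(np.abs(resp_t - 1.0) ** 2)   # deviation from distortionless target response
    leakage = np.mean(np.abs(resp_i) ** 2)            # residual interferer response
    return distortion + alpha * leakage

def si_snr_db(est, ref, eps=1e-8):
    """Scale-invariant source-to-noise ratio (dB) between time-domain signals."""
    est, ref = est - est.mean(), ref - ref.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10.0 * np.log10((np.dot(proj, proj) + eps) / (np.dot(noise, noise) + eps))

# Illustrative combined training objective (the weight beta is an assumption):
# loss = -si_snr_db(enhanced, clean) + beta * arrow_style_penalty(w, rtf_t, rtf_i)
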
Similarity-and-Independence-Aware Beamformer: Method for Target Source Extraction using Magnitude Spectrogram as Reference
This study presents a novel method for source extraction, referred to as the
similarity-and-independence-aware beamformer (SIBF). The SIBF extracts the
target signal using a rough magnitude spectrogram as the reference signal. The
advantage of the SIBF is that it can obtain a more accurate target signal than
the spectrogram generated by target-enhancing methods such as deep neural
network (DNN)-based speech enhancement. For the extraction, we extend the
framework of deflationary independent component analysis by considering the
similarity between the reference and the extracted target, as well as the
mutual independence of all potential sources. To solve the extraction
problem by maximum-likelihood estimation, we introduce two source model types
that can reflect the similarity. The experimental results from the CHiME3
dataset show that the target signal extracted by the SIBF is more accurate than
the reference signal generated by the DNN.
Index Terms: semiblind source separation, similarity-and-independence-aware
beamformer, deflationary independent component analysis, source model
Comment: Accepted in INTERSPEECH 202
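
As a much simplified illustration of how a rough magnitude reference can steer
the extraction, the sketch below pre-whitens the observations as in
deflationary ICA and, assuming a time-varying Gaussian source model whose
variance follows the reference magnitude, takes the maximum-likelihood filter
as the minimum-eigenvalue eigenvector of a reference-weighted covariance
matrix. This is only one reading of the source models mentioned above, not the
authors' exact algorithm, and the scale of the output is left unresolved.

import numpy as np

def reference_guided_extraction(Y, ref_mag, eps=1e-6):
    """Extract one source from multichannel STFT observations of a single
    frequency bin, guided by a rough reference magnitude.

    Y:       complex observations, shape (M, T)  (M microphones, T frames)
    ref_mag: reference magnitude for this bin, shape (T,)
    """
    M, T = Y.shape
    # Pre-whitening, as in deflationary ICA
    cov = Y @ Y.conj().T / T
    d, E = np.linalg.eigh(cov)
    whiten = E @ np.diag(1.0 / np.sqrt(np.maximum(d, eps))) @ E.conj().T
    Z = whiten @ Y
    # Reference-weighted covariance: frames where the reference magnitude is
    # small are assumed to carry little target energy
    var = np.maximum(ref_mag, eps) ** 2
    C = (Z / var) @ Z.conj().T / T
    # Assumed ML solution for a time-varying Gaussian source model:
    # the eigenvector of C with the smallest eigenvalue
    _, V = np.linalg.eigh(C)
    w = V[:, 0]
    return w.conj() @ Z        # extracted target for this bin (scale ambiguous)

In the full SIBF this step would sit inside the deflationary framework and use
the paper's own source model types; the sketch only conveys how the reference
magnitude weights the statistics used to choose the extraction filter.
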