STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency
Deep learning based speech enhancement in the short-time Fourier transform
(STFT) domain typically uses a large window length such as 32 ms. A larger
window contains more samples and the frequency resolution can be higher for
potentially better enhancement. This however incurs an algorithmic latency of
32 ms in an online setup, because the overlap-add algorithm used in the inverse
STFT (iSTFT) is also performed based on the same 32 ms window size. To reduce
this inherent latency, we adapt a conventional dual window size approach, where
a regular input window size is used for STFT but a shorter output window is
used for the overlap-add in the iSTFT, for STFT-domain deep learning based
frame-online speech enhancement. Based on this STFT and iSTFT configuration, we
employ single- or multi-microphone complex spectral mapping for frame-online
enhancement, where a deep neural network (DNN) is trained to predict the real
and imaginary (RI) components of target speech from the mixture RI components.
In addition, we use the RI components predicted by the DNN to conduct
frame-online beamforming, the results of which are then used as extra features
for a second DNN to perform frame-online post-filtering. The frequency-domain
beamforming in between the two DNNs can be easily integrated with complex
spectral mapping and is designed to not incur any algorithmic latency.
Additionally, we propose a future-frame prediction technique to further reduce
the algorithmic latency. Evaluation results on a noisy-reverberant speech
enhancement task demonstrate the effectiveness of the proposed algorithms.
Compared with Conv-TasNet, our STFT-domain system can achieve better
enhancement performance for a comparable amount of computation, or comparable
performance with less computation, maintaining strong performance at an
algorithmic latency as low as 2 ms.
Comment: in submissio
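
The latency argument in the abstract above can be illustrated with a small calculation. This is only a sketch based on the abstract's statement that the algorithmic latency comes from the window used for overlap-add in the iSTFT. The function names, the 16 kHz sampling rate, and the 4 ms output window are assumptions chosen for illustration and are not taken from the paper.

```python
def overlap_add_latency_ms(output_window_ms: float) -> float:
    """Latency attributed to overlap-add in the iSTFT.

    Assumption for this sketch: an output sample is final only once the
    overlap-add window covering it is complete, so the latency equals the
    length of the window used for overlap-add.
    """
    if output_window_ms <= 0:
        raise ValueError("window length must be positive")
    return float(output_window_ms)


def window_samples(window_ms: float, sample_rate_hz: int = 16000) -> int:
    """Window length in samples (16 kHz is an assumed sampling rate)."""
    return round(window_ms * sample_rate_hz / 1000)


# Single window size: the 32 ms window is used for both STFT and overlap-add.
single_window_latency = overlap_add_latency_ms(32.0)

# Dual window size: 32 ms input window, shorter (hypothetical 4 ms) output window.
dual_window_latency = overlap_add_latency_ms(4.0)

print(single_window_latency, window_samples(32.0))
print(dual_window_latency, window_samples(4.0))
```

Under these assumptions, the input window can stay long for frequency resolution while the latency follows the shorter output window.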
Multichannel Online Dereverberation based on Spectral Magnitude Inverse Filtering
This paper addresses the problem of multichannel online dereverberation. The
proposed method is carried out in the short-time Fourier transform (STFT)
domain, and for each frequency band independently. In the STFT domain, the
time-domain room impulse response is approximately represented by the
convolutive transfer function (CTF). The multichannel CTFs are adaptively
identified using the cross-relation method with the recursive least-squares
criterion. Instead of the complex-valued CTF convolution model, we use a
nonnegative convolution model between the STFT magnitude of the source signal
and the CTF magnitude, which is just a coarse approximation of the former
model, but is shown to be more robust against the CTF perturbations. Based on
this nonnegative model, we propose an online STFT magnitude inverse filtering
method. The inverse filters of the CTF magnitude are formulated based on the
multiple-input/output inverse theorem (MINT), and adaptively estimated based on
the gradient descent criterion. Finally, the inverse filtering is applied to
the STFT magnitude of the microphone signals, obtaining an estimate of the STFT
magnitude of the source signal. Experiments regarding both speech enhancement
and automatic speech recognition are conducted, which demonstrate that the
proposed method can effectively suppress reverberation, even for the difficult
case of a moving speaker.
Comment: Paper submitted to IEEE/ACM Transactions on Audio, Speech and
Language Processing. IEEE Signal Processing Letters, 201
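
The nonnegative convolution model mentioned in the abstract above can be sketched for a single frequency band as follows. This is an illustrative forward model only, not the authors' implementation: the function name, the tap count, and the toy magnitudes are assumptions, and the adaptive CTF identification and MINT-based inverse filtering are not shown.

```python
from typing import List


def nonnegative_ctf_convolve(source_mag: List[float], ctf_mag: List[float]) -> List[float]:
    """Approximate microphone STFT magnitudes in one frequency band.

    Frame p of the output is the sum over taps l of
    ctf_mag[l] * source_mag[p - l], using magnitudes only (no phase).
    Frames before the start of the signal are treated as zero.
    """
    mic_mag = []
    for p in range(len(source_mag)):
        acc = 0.0
        for l, h in enumerate(ctf_mag):
            if p - l >= 0:
                acc += h * source_mag[p - l]
        mic_mag.append(acc)
    return mic_mag


# Toy example: two isolated source frames and a decaying 3-tap CTF magnitude.
source = [1.0, 0.0, 0.0, 2.0]
ctf = [1.0, 0.5, 0.25]
mic = nonnegative_ctf_convolve(source, ctf)
print(mic)  # [1.0, 0.5, 0.25, 2.0]
```

The decaying tail after the first source frame is the reverberation that an inverse filter on the magnitudes would aim to remove.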