1,620 research outputs found
Online Monaural Speech Enhancement Using Delayed Subband LSTM
This paper proposes a delayed subband LSTM network for online monaural
(single-channel) speech enhancement. The proposed method is developed in the
short time Fourier transform (STFT) domain. Online processing requires
frame-by-frame signal reception and processing. A paramount feature of the
proposed method is that the same LSTM is used across frequencies, which
drastically reduces the number of network parameters, the amount of training
data and the computational burden. Training is performed in a subband manner:
the input consists of one frequency, together with a few context frequencies.
The network learns a speech-to-noise discriminative function relying on the
signal stationarity and on the local spectral pattern, based on which it
predicts a clean-speech mask at each frequency. To exploit future information,
i.e. look-ahead, we propose an output-delayed subband architecture, which
allows the unidirectional forward network to process a few future frames in
addition to the current frame. We leverage the proposed method to participate
to the DNS real-time speech enhancement challenge. Experiments with the DNS
dataset show that the proposed method achieves better performance-measuring
scores than the DNS baseline method, which learns the full-band spectra using a
gated recurrent unit network.Comment: Paper submitted to Interspeech 202
Structured Sparsity Models for Multiparty Speech Recovery from Reverberant Recordings
We tackle the multi-party speech recovery problem through modeling the
acoustic of the reverberant chambers. Our approach exploits structured sparsity
models to perform room modeling and speech recovery. We propose a scheme for
characterizing the room acoustic from the unknown competing speech sources
relying on localization of the early images of the speakers by sparse
approximation of the spatial spectra of the virtual sources in a free-space
model. The images are then clustered exploiting the low-rank structure of the
spectro-temporal components belonging to each source. This enables us to
identify the early support of the room impulse response function and its unique
map to the room geometry. To further tackle the ambiguity of the reflection
ratios, we propose a novel formulation of the reverberation model and estimate
the absorption coefficients through a convex optimization exploiting joint
sparsity model formulated upon spatio-spectral sparsity of concurrent speech
representation. The acoustic parameters are then incorporated for separating
individual speech signals through either structured sparse recovery or inverse
filtering the acoustic channels. The experiments conducted on real data
recordings demonstrate the effectiveness of the proposed approach for
multi-party speech recovery and recognition.Comment: 31 page
- …