Real-time binaural speech separation with preserved spatial cues
Deep learning speech separation algorithms have achieved great success in
improving the quality and intelligibility of separated speech from mixed audio.
Most previous methods focused on generating a single-channel output for each of
the target speakers, hence discarding the spatial cues needed for the
localization of sound sources in space. However, preserving the spatial
information is important in many applications that aim to accurately render the
acoustic scene, such as hearing aids and augmented reality (AR). Here, we
propose a speech separation algorithm that preserves the interaural cues of
separated sound sources and can be implemented with low latency and high
fidelity, therefore enabling a real-time modification of the acoustic scene.
Based on the time-domain audio separation network (TasNet), a single-channel
time-domain speech separation system that can be implemented in real-time, we
propose a multi-input-multi-output (MIMO) end-to-end extension of TasNet that
takes binaural mixed audio as input and simultaneously separates target
speakers in both channels. Experimental results show that the proposed
end-to-end MIMO system is able to significantly improve the separation
performance and keep the perceived location of the modified sources intact in
various acoustic scenes.
Comment: To appear in ICASSP 202
SAGRNN: Self-Attentive Gated RNN for Binaural Speaker Separation with Interaural Cue Preservation
Most existing deep-learning-based binaural speaker separation systems focus
on producing a monaural estimate for each of the target speakers, and thus do
not preserve the interaural cues, which are crucial for human listeners to
perform sound localization and lateralization. In this study, we address
talker-independent binaural speaker separation with interaural cues preserved
in the estimated binaural signals. Specifically, we extend a newly-developed
gated recurrent neural network for monaural separation by additionally
incorporating self-attention mechanisms and dense connectivity. We develop an
end-to-end multiple-input multiple-output system, which directly maps from the
binaural waveform of the mixture to those of the speech signals. The
experimental results show that our proposed approach achieves significantly
better separation performance than a recent binaural separation approach. In
addition, our approach effectively preserves the interaural cues, which
improves the accuracy of sound localization.
Comment: 5 pages, accepted by IEEE Signal Processing Letter
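The two abstracts above both turn on preserving interaural cues. As a rough illustration of what that means, the sketch below computes a broadband interaural level difference (ILD) and a cross-correlation-based interaural time difference (ITD) for a binaural pair, then compares a reference pair against an estimate. The function names (`ild_db`, `itd_samples`, `cue_errors`) and the broadband definitions are assumptions made for this example; the abstracts do not state which cue metrics the papers actually report.

```python
import math


def ild_db(left, right, eps=1e-12):
    """Broadband ILD in dB; positive means the left channel carries more energy."""
    e_left = sum(x * x for x in left)
    e_right = sum(x * x for x in right)
    return 10.0 * math.log10((e_left + eps) / (e_right + eps))


def itd_samples(left, right, max_lag):
    """Lag in samples that maximizes the cross-correlation.

    Positive means the right channel lags the left channel.
    """
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def cue_errors(ref_left, ref_right, est_left, est_right, max_lag):
    """Absolute ILD error (dB) and ITD error (samples) of an estimate vs. a reference."""
    d_ild = abs(ild_db(est_left, est_right) - ild_db(ref_left, ref_right))
    d_itd = abs(itd_samples(est_left, est_right, max_lag)
                - itd_samples(ref_left, ref_right, max_lag))
    return d_ild, d_itd


# Toy source: the right ear receives the left-ear pulse 3 samples later at half amplitude.
pulse = [0.0] * 10 + [1.0, 2.0, 3.0, 2.0, 1.0] + [0.0] * 10
ref_left = pulse
ref_right = [0.0] * 3 + [0.5 * x for x in pulse[:-3]]
```

An estimate that reproduces both channels exactly gives zero error on both cues, whereas a monaural estimate copied to both ears collapses the ILD and ITD to zero and so loses the perceived source position.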
Online Self-Attentive Gated RNNs for Real-Time Speaker Separation
Deep neural networks have recently shown great success in the task of blind
source separation, both under monaural and binaural settings. Although these
methods were shown to produce high-quality separations, they were mainly
applied under offline settings, in which the model has access to the full input
signal while separating the signal. In this study, we convert a non-causal
state-of-the-art separation model into a causal and real-time model and
evaluate its performance under both online and offline settings. We compare the
performance of the proposed model to several baseline methods under anechoic,
noisy, and noisy-reverberant recording conditions while exploring both monaural
and binaural inputs and outputs. Our findings shed light on the relative
difference between causal and non-causal models when performing separation. Our
stateful implementation for online separation leads to only a minor drop in
performance compared to the offline model (0.8 dB for monaural inputs and 0.3 dB
for binaural inputs) while reaching a real-time factor of 0.65. Samples can be
found under the following link:
https://kwanum.github.io/sagrnnc-stream-results/.
Comment: Appears at the Workshop on Machine Learning in Speech and Language Processing 202
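The abstract above mentions a stateful online implementation and a real-time factor. The sketch below illustrates both ideas under stated assumptions: a toy causal filter (`OnePoleSmoother`, hypothetical and not the separation model) carries its state across chunks, so chunked processing matches whole-signal processing, and the real-time factor is taken as processing time divided by audio duration, the usual definition, under which a value below 1 means faster than real time. None of these names come from the paper.

```python
import time


class OnePoleSmoother:
    """Toy causal stand-in for a streaming model: y[n] = a * y[n-1] + (1 - a) * x[n]."""

    def __init__(self, a=0.9):
        self.a = a
        self.state = 0.0  # internal state carried from one chunk to the next

    def process(self, chunk):
        out = []
        for x in chunk:
            self.state = self.a * self.state + (1.0 - self.a) * x
            out.append(self.state)
        return out


def stream(model, signal, chunk_size):
    """Feed the signal chunk by chunk; the model keeps its own state between calls."""
    out = []
    for start in range(0, len(signal), chunk_size):
        out.extend(model.process(signal[start:start + chunk_size]))
    return out


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration."""
    return processing_seconds / audio_seconds


def measure_rtf(model, signal, chunk_size, sample_rate):
    """Time a streaming pass and return (output, real-time factor)."""
    t0 = time.perf_counter()
    out = stream(model, signal, chunk_size)
    elapsed = time.perf_counter() - t0
    return out, real_time_factor(elapsed, len(signal) / sample_rate)


signal = [float(i % 7) for i in range(1000)]
offline = OnePoleSmoother().process(signal)               # whole signal at once
online = stream(OnePoleSmoother(), signal, chunk_size=64)  # chunked, state carried over
```

For a fully causal model the chunked and whole-signal outputs coincide, as in this toy case; the performance gap reported in the abstract comes from converting a non-causal model to a causal one, which this sketch does not attempt to reproduce.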