Improved Source Counting and Separation for Monaural Mixture
Single-channel speech separation in both the time and frequency domains has been
widely studied for voice-driven applications over the past few years. Most
previous work, however, assumes that the number of speakers is known in advance,
which is rarely available from a monaural mixture in practice. In this paper, we
propose a novel single-channel multi-speaker separation model that jointly
learns the time-frequency features and the unknown number of speakers.
Specifically, our model integrates the time-domain convolutional encoder feature
map with the frequency-domain spectrogram via an attention mechanism, and the
integrated features are projected into high-dimensional embedding vectors that
are then clustered by a deep attractor network to modify the encoded features.
Meanwhile, the number of speakers is counted by computing the Gerschgorin disks
of the embedding vectors, which are orthogonal across speakers. Finally, the
modified encoded features are inverted back to the sound waveform by a linear
decoder. Experimental evaluation on the GRID dataset shows that a single
instance of the proposed model accurately estimates the number of speakers with
a 96.7% success rate, while achieving state-of-the-art separation on
multi-speaker mixtures in terms of scale-invariant signal-to-noise ratio
improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
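The abstract does not spell out the exact counting criterion, but the core idea can be sketched: if the embedding vectors of different speakers are orthogonal, the sample covariance of the embeddings is (near) low-rank, and in its eigenvector basis the covariance is diagonal, so each Gerschgorin disk degenerates to a point at an eigenvalue. Counting the disks that sit well above the noise floor then gives the number of speakers. The sketch below uses synthetic data, and the median-based threshold rule is an illustrative assumption, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

def count_speakers(embeddings, ratio=10.0):
    """Count speakers from high-dimensional embedding vectors.

    In the eigenvector basis the sample covariance is diagonal, so each
    Gerschgorin disk degenerates to a point at an eigenvalue; we count the
    disks that sit well above the noise floor. The threshold rule here is
    a hypothetical stand-in for the paper's criterion.
    """
    n, _ = embeddings.shape
    cov = embeddings.T @ embeddings / n    # sample covariance of embeddings
    eigvals = np.linalg.eigvalsh(cov)      # disk centers in the eigenbasis
    noise_floor = np.median(eigvals)       # bulk of eigenvalues = noise
    return int(np.sum(eigvals > ratio * noise_floor))

# Synthetic check: 3 orthogonal "speaker" directions in a 20-dimensional
# embedding space, plus small isotropic noise.
D, N, K = 20, 5000, 3
directions = np.linalg.qr(rng.standard_normal((D, K)))[0]   # orthonormal
activations = rng.standard_normal((N, K))
V = activations @ directions.T + 0.05 * rng.standard_normal((N, D))
print(count_speakers(V))  # expect 3
```

With orthonormal speaker directions and small noise, the covariance has exactly K eigenvalues near 1 while the remaining ones stay near the noise variance, so the disk count is insensitive to the exact threshold ratio.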