An empirical study of Conv-TasNet
Conv-TasNet is a recently proposed waveform-based deep neural network that
achieves state-of-the-art performance in speech source separation. Its
architecture consists of a learnable encoder/decoder and a separator that
operates on top of this learned space. Various improvements have been proposed
to Conv-TasNet. However, they mostly focus on the separator, leaving its
encoder/decoder as a (shallow) linear operator. In this paper, we conduct an
empirical study of Conv-TasNet and propose an enhancement to the
encoder/decoder that is based on a (deep) non-linear variant of it. In
addition, we experiment with the larger and more diverse LibriTTS dataset and
investigate the generalization capabilities of the studied models when trained
on a much larger dataset. We propose cross-dataset evaluation that includes
assessing separations from the WSJ0-2mix, LibriTTS and VCTK databases. Our
results show that enhancements to the encoder/decoder can improve average
SI-SNR performance by more than 1 dB. Furthermore, we offer insights into the
generalization capabilities of Conv-TasNet and the potential value of
improvements to the encoder/decoder.
Comment: In proceedings of ICASSP 2020
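
As a point of reference for the SI-SNR figures quoted above, here is a
minimal sketch of the scale-invariant signal-to-noise ratio as it is
commonly defined in the Conv-TasNet literature. This is our illustration,
not the authors' code:

    # Minimal SI-SNR sketch (illustrative, not the authors' code).
    import numpy as np

    def si_snr(estimate, target, eps=1e-8):
        """Scale-invariant SNR in dB between an estimate and a reference."""
        # Zero-mean both signals so the measure ignores DC offset.
        estimate = estimate - estimate.mean()
        target = target - target.mean()
        # Project the estimate onto the target to get the scaled reference.
        s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
        e_noise = estimate - s_target
        # Energy of the scaled reference over energy of the residual, in dB.
        return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

Because the reference is rescaled by the projection, the metric is blind to
overall gain and only penalizes distortion of the waveform's shape.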
LibriMix: An Open-Source Dataset for Generalizable Speech Separation
In recent years, wsj0-2mix has become the reference dataset for
single-channel speech separation. Most deep learning-based speech separation
models today are benchmarked on it. However, recent studies have shown
important performance drops when models trained on wsj0-2mix are evaluated on
other, similar datasets. To address this generalization issue, we created
LibriMix, an open-source alternative to wsj0-2mix, and to its noisy extension,
WHAM!. Based on LibriSpeech, LibriMix consists of two- or three-speaker
mixtures combined with ambient noise samples from WHAM!. Using Conv-TasNet, we
achieve competitive performance on all LibriMix versions. In order to fairly
evaluate across datasets, we introduce a third test set based on VCTK for
speech and WHAM! for noise. Our experiments show that the generalization error
is smaller for models trained with LibriMix than with WHAM!, in both clean and
noisy conditions. Aiming towards evaluation in more realistic,
conversation-like scenarios, we also release a sparsely overlapping version of
LibriMix's test set.
Comment: Submitted to INTERSPEECH 2020
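
For readers unfamiliar with how such corpora are assembled, here is a
hypothetical sketch of constructing a LibriMix-style two-speaker noisy
mixture. The file names, target SNR, and min-length truncation are
illustrative assumptions, not the released generation scripts:

    # Hypothetical LibriMix-style mixing sketch; NOT the official scripts.
    import numpy as np
    import soundfile as sf

    def mix_at_snr(speech, noise, snr_db):
        """Scale `noise` so the speech-to-noise ratio is snr_db, then add."""
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-8
        gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + gain * noise

    s1, sr = sf.read("librispeech_spk1.flac")  # placeholder file names
    s2, _ = sf.read("librispeech_spk2.flac")
    noise, _ = sf.read("wham_noise.wav")
    n = min(len(s1), len(s2), len(noise))      # truncate to the shortest signal
    mixture = mix_at_snr(s1[:n] + s2[:n], noise[:n], snr_db=5.0)
    sf.write("mix_both.wav", mixture, sr)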
Music Source Separation in the Waveform Domain
Source separation for music is the task of isolating contributions, or
stems, from different instruments recorded individually and arranged
together to form a song. Such components include voice, bass, drums and any
other accompaniments. In contrast to many audio synthesis tasks, where the
best performance is achieved by models that directly generate the waveform,
the state of the art in music source separation is to compute masks on the
magnitude spectrum. In this paper, we first show that an adaptation of
Conv-Tasnet (Luo & Mesgarani, 2019), a waveform-to-waveform model for
speech source separation, significantly beats the state of the art on the
MusDB dataset, the standard benchmark for multi-instrument source
separation. Second, we observe that Conv-Tasnet applies a masking approach
to the input signal, which has the potential drawback of removing parts of
the relevant source without the capacity to reconstruct them. We propose
Demucs, a new waveform-to-waveform model with an architecture closer to
models for audio generation and more capacity in the decoder. Experiments
on the MusDB dataset show that Demucs beats previously reported results in
terms of signal-to-distortion ratio (SDR), although it remains below
Conv-Tasnet. Human evaluations show that Demucs achieves significantly
higher quality (as assessed by mean opinion score) than Conv-Tasnet, but
with slightly more contamination from other sources, which explains the
difference in SDR. Additional experiments with a larger dataset suggest
that the gap in SDR between Demucs and Conv-Tasnet shrinks, showing that
our approach is promising.
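
The masking-versus-synthesis distinction drawn in this abstract can be made
concrete with a schematic sketch (our illustration, not the released
Conv-Tasnet or Demucs code): a masking separator can only reweight features
extracted from the mixture, whereas a synthesis decoder is free to
regenerate the source waveform.

    # Schematic contrast between masking and direct synthesis; the layer
    # sizes are toy values, not those of either published model.
    import torch
    import torch.nn as nn

    class MaskingSeparator(nn.Module):
        """Conv-TasNet-style: encode, predict a mask, decode mask * features."""
        def __init__(self, channels=64, kernel=16):
            super().__init__()
            self.encoder = nn.Conv1d(1, channels, kernel, stride=kernel // 2)
            self.mask_net = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())
            self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=kernel // 2)

        def forward(self, mix):
            feats = torch.relu(self.encoder(mix))
            # The output is bounded by what the encoder extracted from the mix.
            return self.decoder(self.mask_net(feats) * feats)

    class SynthesisSeparator(nn.Module):
        """Demucs-like in spirit: the decoder regenerates the waveform itself."""
        def __init__(self, channels=64, kernel=16):
            super().__init__()
            self.encoder = nn.Conv1d(1, channels, kernel, stride=kernel // 2)
            self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=kernel // 2)

        def forward(self, mix):
            # No mask: the decoder can synthesize content a mask would lose.
            return self.decoder(torch.relu(self.encoder(mix)))

    mix = torch.randn(1, 1, 16000)  # batch of one, one second at 16 kHz
    print(MaskingSeparator()(mix).shape, SynthesisSeparator()(mix).shape)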
On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments
This paper introduces a new method for multi-channel time domain speech
separation in reverberant environments. A fully-convolutional neural network
structure has been used to directly separate speech from multiple microphone
recordings, with no need for conventional spatial feature extraction. To reduce
the influence of reverberation on spatial feature extraction, a dereverberation
pre-processing method has been applied to further improve the separation
performance. A spatialized version of wsj0-2mix dataset has been simulated to
evaluate the proposed system. Both source separation and speech recognition
performance of the separated signals have been evaluated objectively.
Experiments show that the proposed fully-convolutional network improves the
source separation metric and the word error rate (WER) by more than 13% and 50%
relative, respectively, over a reference system with conventional features.
Applying dereverberation as pre-processing to the proposed system can further
reduce the WER by 29% relative using an acoustic model trained on clean and
reverberated data.
Comment: Presented at IEEE ICASSP 2020
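
As a quick sanity check on how to read the "relative" improvements above: a
relative reduction is the absolute drop divided by the baseline, so a 50%
relative WER reduction halves the error rate. The numbers below are
illustrative, not taken from the paper:

    # Relative reduction, e.g. for WER; the example values are made up.
    def relative_reduction(baseline, improved):
        """Reduction as a percentage of the baseline value."""
        return 100 * (baseline - improved) / baseline

    print(relative_reduction(30.0, 15.0))  # 50.0 -> a 50% relative reduction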