Self-supervised learning (SSL) methods such as WavLM have shown promising
speech separation (SS) results in small-scale simulation-based experiments. In
this work, we extend the exploration of SSL-based SS by massively scaling
up both the pre-training data (more than 300K hours) and fine-tuning data (10K
hours). We also investigate various techniques to efficiently integrate the
pre-trained model with the SS network under a limited computation budget,
including a low frame rate SSL model training setup and a fine-tuning scheme
using only part of the pre-trained model. Compared with a supervised
baseline and the WavLM-based SS model using feature embeddings obtained with
the previously released WavLM trained on 94K hours of data, our proposed model
achieves relative word error rate (WER) reductions of 15.9% and 11.2%,
respectively, on a simulated far-field speech mixture test set. For conversation transcription
on real meeting recordings using continuous speech separation, the proposed
model achieves relative WER reductions of 6.8% and 10.6% over the purely
supervised baseline on the AMI and ICSI evaluation sets, respectively, while
reducing the computational cost by 38%.