Investigating Zero-Shot Generalizability on Mandarin-English
Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models
with Self-Supervision and Weak Supervision
This work evaluated several cutting-edge large-scale foundation models trained with self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three code-switched corpora. We found that the self-supervised models achieved performance close to that of supervised models, indicating the effectiveness of multilingual self-supervised pre-training. We also observed that these models still have room for improvement, as they repeatedly made similar mistakes and performed poorly at modeling intra-sentential code-switching. In addition, we explored the validity of several Whisper variants and concluded that they remained effective in code-switching scenarios; similar techniques for self-supervised models are worth studying to boost performance on code-switched tasks.

Comment: Submitted to the ICASSP 2024 Self-supervision in Audio, Speech and Beyond workshop
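
As a rough illustration of the zero-shot setting described in the abstract, the sketch below transcribes and translates a Mandarin-English code-switched utterance with Whisper-large-v3 via the openai-whisper package. The audio file name is a placeholder, and the decoding options shown are the library's documented defaults, not necessarily the exact configuration used in this work.

```python
import whisper

# Load the weakly supervised Whisper-large-v3 checkpoint without any
# fine-tuning, i.e. the zero-shot setting evaluated in this work.
model = whisper.load_model("large-v3")

# Transcribe a (hypothetical) code-switched utterance. Leaving `language`
# unset lets Whisper detect the language; forcing language="zh" is another
# option when the matrix language is known in advance.
result = model.transcribe("cs_utterance.wav", task="transcribe")
print(result["text"])

# The same checkpoint can perform speech-to-text translation into English
# simply by switching the decoding task.
translation = model.transcribe("cs_utterance.wav", task="translate")
print(translation["text"])
```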