While audio-visual speech models can yield superior performance and
robustness compared to audio-only models, their development and adoption are
hindered by the scarcity of labeled and unlabeled audio-visual data and the cost
of deploying one model per modality. In this paper, we present u-HuBERT, a
self-supervised pre-training framework that can leverage both multimodal and
unimodal speech with a unified masked cluster prediction objective. By
utilizing modality dropout during pre-training, we demonstrate that a single
fine-tuned model can achieve performance on par with or better than the
state-of-the-art modality-specific models. Moreover, our model fine-tuned only
on audio can perform well with audio-visual and visual speech input, achieving
zero-shot modality generalization for speech recognition and speaker
verification. In particular, our single model yields 1.2%/1.4%/27.2% speech
recognition word error rate on LRS3 with audio-visual/audio/visual input