It has been shown that learning audiovisual features can lead to improved
speech recognition performance over audio-only features, especially for noisy
speech. However, in many common applications, the visual features are partially
or entirely missing, e.g.~the speaker might move off screen. Multi-modal models
need to be robust: missing video frames should not degrade the performance of
an audiovisual model to be worse than that of a single-modality audio-only
model. While there have been many attempts at building robust models, there is
little consensus on how robustness should be evaluated. To address this, we
introduce a framework that allows claims about robustness to be evaluated in a
precise and testable way. We also conduct a systematic empirical study of the
robustness of common audiovisual speech recognition architectures on a range of
acoustic noise conditions and test suites. Finally, we show that an
architecture-agnostic solution based on cascades can consistently achieve
robustness to missing video, even in settings where existing techniques for
robustness like dropout fall short