Video self-supervised learning (VSSL) has made significant progress in recent
years. However, the exact behavior and dynamics of these models under different
forms of distribution shift are not yet known. In this paper, we
comprehensively study the behavior of six popular self-supervised methods
(v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various
forms of natural distribution shift, i.e., (i) context shift, (ii) viewpoint
shift, (iii) actor shift, (iv) source shift, (v) generalizability to unknown
classes (zero-shot), and (vi) open-set recognition. To perform this extensive
study, we carefully craft a test bed consisting of 17 in-distribution and
out-of-distribution benchmark pairs using available public datasets and a
series of evaluation protocols to stress-test the different methods under the
intended shifts. Our study uncovers a series of intriguing findings about the
behavior of VSSL methods. For instance, we observe that while
video models generally struggle with context shifts, v-MAE and supervised
learning exhibit more robustness. Moreover, our study shows that v-MAE is a
strong temporal learner, whereas contrastive methods, v-SimCLR and v-MoCo,
exhibit strong performance under viewpoint shifts. When studying open-set
recognition, we observe a trade-off between closed-set and open-set
recognition performance if the pretrained VSSL encoders are used without
finetuning. We hope that our work will contribute to the development of robust
video representation learning frameworks for various real-world scenarios. The
project page and code are available at: https://pritamqu.github.io/OOD-VSSL.

Comment: NeurIPS 2023 Spotlight