Image pre-training, the current de-facto paradigm for a wide range of visual
tasks, is generally less favored in the field of video recognition. Instead,
a common strategy is to train spatiotemporal convolutional neural networks
(CNNs) directly from scratch. Interestingly, a closer look at these
from-scratch CNNs reveals that certain 3D kernels exhibit much stronger
appearance modeling ability than others, arguably suggesting that appearance
information is already well disentangled during learning. Inspired by this
observation, we hypothesize that the key to
effectively leveraging image pre-training lies in decomposing the learning of
spatial and temporal features, and in revisiting image pre-training as an
appearance prior for initializing 3D kernels. In addition, we propose
Spatial-Temporal Separable (STS) convolution, which explicitly splits the
feature channels into spatial and temporal groups, to further enable a more
thorough decomposition of spatiotemporal features for fine-tuning 3D CNNs. Our
experiments show that simply replacing 3D convolution with STS notably improves
a wide range of 3D CNNs on both Kinetics-400 and Something-Something V2 without
increasing parameters or computation. Moreover, this new training pipeline
consistently achieves better results on video recognition with significant
speedup. For instance, we improve the top-1 accuracy of SlowFast on Kinetics-400
by 0.6% over the strong 256-epoch 128-GPU baseline while fine-tuning for only
50 epochs with
4 GPUs. The code and models are available at
https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.

Comment: Published as a conference paper at ECCV 202
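
As a rough illustration of the two ideas named in the abstract (splitting feature channels into spatial and temporal groups, and using image pre-trained 2D weights as an appearance prior for 3D kernels), the following is a minimal pure-Python sketch. The function names, the 0.5 split ratio, and the center-slice initialization are assumptions made for illustration only; they are not taken from the released code, and the actual STS design and initialization scheme may differ.

```python
def split_channels(num_channels, spatial_ratio=0.5):
    """Split a channel count into (spatial, temporal) groups.

    `spatial_ratio` is a hypothetical hyperparameter, not a value from the paper.
    """
    if not 0.0 <= spatial_ratio <= 1.0:
        raise ValueError("spatial_ratio must lie in [0, 1]")
    spatial = int(round(num_channels * spatial_ratio))
    return spatial, num_channels - spatial


def inflate_2d_kernel(kernel_2d, temporal_size):
    """Build a (T, H, W) kernel from an (H, W) image kernel.

    Assumed scheme: copy the 2D weights into the center temporal slice and
    zero the other slices, so the 3D kernel initially acts on appearance only.
    """
    if temporal_size < 1:
        raise ValueError("temporal_size must be at least 1")
    height, width = len(kernel_2d), len(kernel_2d[0])
    center = temporal_size // 2
    kernel_3d = []
    for t in range(temporal_size):
        if t == center:
            # Appearance prior: reuse the image pre-trained weights.
            kernel_3d.append([list(row) for row in kernel_2d])
        else:
            # No temporal mixing at initialization.
            kernel_3d.append([[0.0] * width for _ in range(height)])
    return kernel_3d


if __name__ == "__main__":
    spatial, temporal = split_channels(64, spatial_ratio=0.5)
    print("spatial channels:", spatial, "temporal channels:", temporal)
    print(inflate_2d_kernel([[1.0, 2.0], [3.0, 4.0]], temporal_size=3))
```

Under these assumptions, the spatial group could be initialized from the image model while the temporal group is left to learn motion cues during fine-tuning; how the released implementation actually assigns and initializes the two groups should be checked against the repository above.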