Zero-shot video recognition (ZSVR) aims to recognize video categories that
have not been seen during model training. Recently,
vision-language models (VLMs) pre-trained on large-scale image-text pairs have
demonstrated impressive transferability for ZSVR. To make VLMs applicable to
the video domain, existing methods often use an additional temporal learning
module after the image-level encoder to learn the temporal relationships among
video frames.
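As a schematic of this common design, the per-frame features from the
image-level encoder are pooled into a spatial feature, while a small temporal
module (here an assumed two-layer Transformer; all sizes are illustrative)
produces the spatial-temporal feature:

    import torch
    import torch.nn as nn

    class TemporalHead(nn.Module):
        # Illustrative temporal learning module placed after a frozen
        # image-level encoder (e.g., CLIP ViT); hyperparameters are assumptions.
        def __init__(self, dim=512, n_layers=2, n_heads=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            self.temporal = nn.TransformerEncoder(layer, n_layers)

        def forward(self, frame_feats):                    # (B, T, D) per-frame features
            v_s = frame_feats.mean(dim=1)                  # spatial feature (pooling only)
            v_st = self.temporal(frame_feats).mean(dim=1)  # spatial-temporal feature
            return v_s, v_st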
Unfortunately, for videos from unseen categories, we observe an abnormal
phenomenon: a model that uses the spatial-temporal feature performs much worse
than one that removes the temporal learning module and uses only the spatial
feature. We conjecture that improper temporal modeling disrupts the spatial
feature of the video. To verify our hypothesis, we propose Feature
Factorization to retain the temporal feature of the video that is orthogonal
to its spatial feature, and use interpolation to construct a refined
spatial-temporal feature. A model using an appropriately refined
spatial-temporal feature performs better than one using only the spatial
feature, which verifies the effectiveness of the orthogonal temporal feature
for the ZSVR task.
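A minimal sketch of this factorization-and-interpolation step, reusing v_s and
v_st from above (the interpolation form and the coefficient alpha are
illustrative assumptions):

    import torch.nn.functional as F

    def refine(v_s, v_st, alpha=0.3):
        # Feature Factorization: subtract the projection of the spatial-temporal
        # feature onto the spatial feature, keeping only the orthogonal temporal part.
        v_s_hat = F.normalize(v_s, dim=-1)
        v_t_orth = v_st - (v_st * v_s_hat).sum(dim=-1, keepdim=True) * v_s_hat
        # Interpolate the spatial feature with the orthogonal temporal feature
        # to construct the refined spatial-temporal feature.
        return F.normalize((1.0 - alpha) * v_s + alpha * v_t_orth, dim=-1)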
Therefore, an Orthogonal Temporal Interpolation module is designed to learn a
better-refined spatial-temporal video feature during training. Additionally, a
Matching Loss is introduced to improve the quality of the orthogonal temporal
feature.
We propose a model called OTI for ZSVR, which builds on VLMs and employs
orthogonal temporal interpolation together with the matching loss. The ZSVR
accuracies on
popular video datasets (i.e., Kinetics-600, UCF101 and HMDB51) show that OTI
outperforms the previous state-of-the-art method by a clear margin.