Few-shot action recognition in videos is challenging due to the scarcity of labeled
supervision and the difficulty of generalizing to unseen actions. To address this
task, we propose a simple yet effective method, called knowledge prompting,
which leverages commonsense knowledge of actions from external resources to
prompt a powerful pre-trained vision-language model for few-shot
classification. We first collect large-scale language descriptions of actions,
defined as text proposals, to build an action knowledge base. The collection of
text proposals is done by filling in handcraft sentence templates with external
action-related corpus or by extracting action-related phrases from captions of
Web instruction videos.Then we feed these text proposals into the pre-trained
vision-language model along with video frames to generate matching scores between
the proposals and each frame; these scores can be treated as action semantics
with strong generalization. Finally, we design a lightweight temporal modeling
network to capture the temporal evolution of action semantics for
classification. Extensive experiments on six benchmark datasets demonstrate that
our method generally achieves state-of-the-art performance while reducing
the training overhead to 0.001 of that required by existing methods.
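
The following is a minimal sketch of the knowledge-prompting pipeline described above. It assumes CLIP as the pre-trained vision-language model; the toy proposals, the `frame_scores` helper, the `TemporalHead` module, and all hyperparameters are illustrative assumptions rather than the authors' actual implementation.

```python
# Illustrative sketch: per-frame matching scores from a frozen vision-language
# model (assumed here to be CLIP), followed by a lightweight temporal head.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text proposals from the action knowledge base (toy examples).
proposals = ["a person chopping vegetables",
             "hands kneading dough",
             "a person swinging a tennis racket"]
text_tokens = clip.tokenize(proposals).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text_tokens)                # [P, D]
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def frame_scores(frames):
    """frames: [T, 3, 224, 224] preprocessed video frames.
    Returns per-frame matching scores over all proposals: [T, P]."""
    with torch.no_grad():
        img_feat = model.encode_image(frames.to(device))      # [T, D]
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return img_feat @ text_feat.t()                           # cosine similarity

class TemporalHead(nn.Module):
    """Lightweight temporal modeling over the per-frame score sequence."""
    def __init__(self, num_proposals, num_classes, hidden=256):
        super().__init__()
        self.gru = nn.GRU(num_proposals, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, scores):                                # [B, T, P]
        _, h = self.gru(scores)
        return self.fc(h[-1])                                 # [B, num_classes]
```

In this sketch only `TemporalHead` carries trainable parameters while the vision-language model stays frozen, which would be consistent with the reported reduction in training overhead; the exact training setup of the original method is not specified here.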