Abstract
This paper sets out to solve the following problem: How
can we turn a generative video captioning model into an
open-world video/action classification model? Video captioning models can naturally produce open-ended, free-form descriptions of a given video; these descriptions, however, might not be discriminative enough for video/action recognition. Unfortunately, when fine-tuned to auto-regress the class names directly, video captioning models overfit the base classes, losing their open-world zero-shot capabilities. To alleviate base-class overfitting, in this work we propose to use reinforcement learning to make the output of the video captioning model more discriminative at the class level. Specifically, we propose ReGen, a novel reinforcement-learning-based framework with a three-fold reward objective: (1) a class-level discrimination reward that enforces the generated caption to be correctly classified into
the corresponding action class, (2) a CLIP reward that encourages the generated caption to continue to be descriptive
of the input video (i.e. video-specific), and (3) a grammar
reward that preserves the grammatical correctness of the
caption. We show that ReGen can train a model to produce
captions that are discriminative, video-specific, and grammatically correct. Importantly, when evaluated on standard benchmarks for zero- and few-shot action classification, ReGen significantly outperforms the previous state of the art.
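For illustration, the three-fold objective can be read as a weighted sum of the three reward terms, maximized over a caption c generated for a video v with ground-truth action class y; the weights and the exact form of each term below are assumptions of this sketch, not the paper's stated formulation:

\[
R(c, v, y) = \lambda_{\mathrm{disc}}\, R_{\mathrm{disc}}(c, y) + \lambda_{\mathrm{CLIP}}\, R_{\mathrm{CLIP}}(c, v) + \lambda_{\mathrm{gram}}\, R_{\mathrm{gram}}(c)
\]

where R_disc scores whether the caption c is classified into the target class y, R_CLIP measures the CLIP similarity between c and the input video v, and R_gram scores grammatical correctness; the captioning model would then be trained to maximize the expected reward with, e.g., a policy-gradient method.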