Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets
A wide range of image captioning models has been developed, achieving significant improvements on popular metrics such as BLEU, CIDEr, and SPICE. However, although the generated captions can accurately describe the image, they are generic across similar images and lack distinctiveness, i.e., they cannot properly capture the uniqueness of each image. In this paper, we aim to improve the distinctiveness of image captions through training with sets of similar images. First, we propose a distinctiveness metric, between-set CIDEr (CIDErBtw), to evaluate the distinctiveness of a caption with respect to those of similar images. Our metric shows that the human annotations of each image are not equivalent in terms of distinctiveness. We therefore propose several new training strategies that encourage the distinctiveness of the generated caption for each image, based on using CIDErBtw in a weighted loss function or as a reinforcement learning reward. Finally, extensive experiments show that our proposed approach significantly improves both distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy (e.g., as measured by CIDEr) for a wide variety of image captioning baselines. These results are further confirmed through a user study.
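To make the between-set idea concrete, below is a minimal Python sketch of a CIDErBtw-style distinctiveness score. For brevity it replaces the full CIDEr computation (TF-IDF weighted n-gram similarity) with a plain n-gram cosine similarity, so the function names and this simplification are illustrative assumptions, not the paper's exact implementation. A lower score against the reference captions of similar images indicates a more distinctive caption.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine(c1, c2):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(c1[g] * c2[g] for g in c1)
    n1 = sum(v * v for v in c1.values()) ** 0.5
    n2 = sum(v * v for v in c2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0

def cider_like(candidate, references, max_n=4):
    """Simplified CIDEr-style score: mean cosine similarity of 1..max_n-gram
    count vectors against each reference caption (IDF weighting omitted)."""
    cand_tokens = candidate.lower().split()
    score = 0.0
    for n in range(1, max_n + 1):
        cand_vec = Counter(ngrams(cand_tokens, n))
        per_ref = [cosine(cand_vec, Counter(ngrams(r.lower().split(), n)))
                   for r in references]
        score += sum(per_ref) / len(per_ref)
    return score / max_n

def cider_btw(candidate, similar_image_refs):
    """CIDErBtw-style distinctiveness: average similarity of a candidate
    caption to the reference captions of *similar* images.
    Lower means more distinctive."""
    scores = [cider_like(candidate, refs) for refs in similar_image_refs]
    return sum(scores) / len(scores)

# A generic caption scores higher (less distinctive) against the captions of
# visually similar images than a specific one does.
similar_refs = [
    ["a dog runs on the beach", "a dog playing near the ocean"],
    ["a dog on the sand by the water", "a wet dog at the seaside"],
]
print(cider_btw("a dog on the beach", similar_refs))                 # higher
print(cider_btw("a brown spaniel chases a red ball", similar_refs))  # lower
```

In the training strategies the abstract describes, such a score would then act as a per-caption weight in the likelihood loss or enter the reinforcement learning reward, e.g., rewarding high CIDEr against an image's own references while penalizing high CIDErBtw against its similar-image set.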
ReGen: A good Generative zero-shot video classifier should be Rewarded
This paper sets out to solve the following problem: How can we turn a generative video captioning model into an open-world video/action classification model? Video captioning models can naturally produce open-ended, free-form descriptions of a given video, which, however, might not be discriminative enough for video/action recognition. Unfortunately, when fine-tuned to auto-regress the class names directly, video captioning models overfit the base classes, losing their open-world zero-shot capabilities. To alleviate base class overfitting, in this work we propose to use reinforcement learning to make the output of the video captioning model more class-level discriminative. Specifically, we propose ReGen, a novel reinforcement learning based framework with a three-fold objective and reward functions: (1) a class-level discrimination reward that enforces the generated caption to be correctly classified into the corresponding action class, (2) a CLIP reward that encourages the generated caption to remain descriptive of the input video (i.e., video-specific), and (3) a grammar reward that preserves the grammatical correctness of the caption. We show that ReGen can train a model to produce captions that are discriminative, video-specific, and grammatically correct. Importantly, when evaluated on standard benchmarks for zero- and few-shot action classification, ReGen significantly outperforms the previous state-of-the-art …
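A rough sketch of how such a three-fold reward could drive a self-critical policy-gradient update follows. The scorer callables, weights, and the SCST-style greedy baseline are all assumptions chosen for illustration; the paper defines its own reward models and optimization details.

```python
# Toy stand-in scorers so the sketch runs end to end. In a real system these
# would be (1) a classifier over the action-class names, (2) a CLIP-style
# caption-video similarity, and (3) a grammar/fluency model; all three
# implementations below are placeholders.
def classify_prob(caption: str, target_class: str) -> float:
    """Pretend discriminator: probability the caption maps to the class."""
    return 1.0 if target_class in caption else 0.1

def clip_similarity(caption: str, video_id: str) -> float:
    """Placeholder caption-video similarity in [0, 1]."""
    return 0.5

def grammar_score(caption: str) -> float:
    """Placeholder grammaticality score in [0, 1]."""
    return 1.0 if caption and caption[0].isupper() and caption.endswith(".") else 0.5

def regen_style_reward(caption, video_id, target_class,
                       w_disc=1.0, w_clip=1.0, w_gram=1.0):
    """Weighted sum of the three reward terms (weights are assumed)."""
    return (w_disc * classify_prob(caption, target_class)
            + w_clip * clip_similarity(caption, video_id)
            + w_gram * grammar_score(caption))

def self_critical_loss(sample_logprob, sampled_caption, greedy_caption,
                       video_id, target_class):
    """REINFORCE with a greedy-decoding baseline (SCST-style): captions that
    beat the greedy caption's reward get their log-probability pushed up."""
    advantage = (regen_style_reward(sampled_caption, video_id, target_class)
                 - regen_style_reward(greedy_caption, video_id, target_class))
    return -advantage * sample_logprob

# Example: the sampled caption names the action class, so its reward exceeds
# the greedy baseline's; minimizing the loss then raises its log-probability.
loss = self_critical_loss(
    sample_logprob=-12.3,
    sampled_caption="A person is playing guitar on stage.",
    greedy_caption="A person is on stage.",
    video_id="vid_001",
    target_class="playing guitar",
)
print(loss)
```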