Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation
We present our submission to the Microsoft Video to Language Challenge of
generating short captions describing videos in the challenge dataset. Our model
is based on the encoder-decoder pipeline, popular in image and video
captioning systems. We propose to utilize two different kinds of video
features, one to capture the video content in terms of objects and attributes,
and the other to capture the motion and action information. Using these diverse
features we train models specializing in two separate input sub-domains. We
then train an evaluator model which is used to pick the best caption from the
pool of candidates generated by these domain expert models. We argue that this
approach is better suited for the current video captioning task, compared to
using a single model, due to the diversity in the dataset.
The efficacy of our method is demonstrated by the fact that it was rated best in
the MSR Video to Language Challenge according to human evaluation. Additionally,
we were ranked second in the table based on automatic evaluation metrics.
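The candidate-pool step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and `toy_evaluator` is a keyword-overlap stand-in for the trained evaluator model, which the abstract does not specify in detail.

```python
from typing import Callable, Sequence, Set


def pick_best_caption(candidates: Sequence[str],
                      score: Callable[[str], float]) -> str:
    """Return the candidate caption with the highest evaluator score."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    return max(candidates, key=score)


def toy_evaluator(video_keywords: Set[str]) -> Callable[[str], float]:
    """Hypothetical stand-in for a trained evaluator (assumption).

    Scores a caption by the fraction of its distinct words that appear
    in a set of keywords associated with the video.
    """
    def score(caption: str) -> float:
        words = set(caption.lower().split())
        return len(words & video_keywords) / max(len(words), 1)
    return score


# One candidate per "domain expert" model; the captions are made up.
pool = [
    "a man is talking",           # e.g. from the object/attribute model
    "a man is playing a guitar",  # e.g. from the motion/action model
]
best = pick_best_caption(pool, toy_evaluator({"man", "guitar", "playing"}))
print(best)
```

Keeping the evaluator behind a plain scoring function means the selection logic stays the same whichever model produces the scores.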
Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition
Recently, there has been a lot of interest in automatically generating
descriptions for an image. Most existing language-model based approaches for
this task learn to generate an image description word by word in its original
word order. However, for humans, it is more natural to locate the objects and
their relationships first, and then elaborate on each object, describing
notable attributes. We present a coarse-to-fine method that decomposes the
original image description into a skeleton sentence and its attributes, and
generates the skeleton sentence and attribute phrases separately. By this
decomposition, our method can generate more accurate and novel descriptions
than the previous state of the art. Experimental results on MS-COCO and the
larger-scale Stock3M dataset show that our algorithm yields consistent
improvements across different evaluation metrics, especially on the SPICE
metric, which has much higher correlation with human ratings than the
conventional metrics. Furthermore, our algorithm can generate descriptions with
varied length, benefiting from the separate control of the skeleton and
attributes. This enables image description generation that better accommodates
user preferences.
Comment: Accepted by CVPR 201
Perception Score, A Learned Metric for Open-ended Text Generation Evaluation
Automatic evaluation for open-ended natural language generation tasks remains
a challenge. Existing metrics such as BLEU show a low correlation with human
judgment. We propose a novel and powerful learning-based evaluation metric:
Perception Score. The method measures the overall quality of the generation and
scores holistically instead of only focusing on one evaluation criterion, such
as word overlap. Moreover, it also reports the amount of uncertainty about
its evaluation result. By incorporating this uncertainty, Perception Score gives a
more accurate evaluation for the generation system. Perception Score provides
state-of-the-art results on two conditional generation tasks and two
unconditional generation tasks.
Comment: 8 pages, 2 figures
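To make the "word overlap" criterion concrete, here is a small sketch of clipped unigram precision, the simplest ingredient of BLEU-style metrics. It is not full BLEU (no higher-order n-grams, no brevity penalty) and it is not Perception Score; the function name and example sentences are illustrative assumptions.

```python
from collections import Counter


def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also occur in the reference.

    Each candidate word is counted at most as many times as it appears
    in the reference (clipping).
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    matched = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return matched / len(cand_tokens)


reference = "the cat sat on the mat"
paraphrase_score = clipped_unigram_precision("a kitten rested on a rug", reference)
scrambled_score = clipped_unigram_precision("mat the on sat cat the", reference)
print(paraphrase_score, scrambled_score)
```

In this example, a reasonable paraphrase scores lower than a scrambled copy of the reference, which is one way a pure overlap criterion can disagree with human judgment.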