975 research outputs found

Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation

Authors: Laaksonen Jorma, Shetty Rakshith
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 17/08/2016

We present our submission to the Microsoft Video to Language Challenge of generating short captions describing videos in the challenge dataset. Our model is based on the encoder-decoder pipeline, popular in image and video captioning systems. We propose to utilize two different kinds of video features, one to capture the video content in terms of objects and attributes, and the other to capture the motion and action information. Using these diverse features we train models specializing in two separate input sub-domains. We then train an evaluator model which is used to pick the best caption from the pool of candidates generated by these domain expert models. We argue that this approach is better suited for the current video captioning task, compared to using a single model, due to the diversity in the dataset. Efficacy of our method is proven by the fact that it was rated best in the MSR Video to Language Challenge, as per human evaluation. Additionally, we were ranked second in the automatic evaluation metrics-based table.

arXiv.org e-Print Archive

Crossref

Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition

Authors: Cohen Scott, Cottrell Garrison W., Lin Zhe, Shen Xiaohui, Wang Yufei
Publication date: 23/04/2017

Recently, there has been a lot of interest in automatically generating descriptions for an image. Most existing language-model based approaches for this task learn to generate an image description word by word in its original word order. However, for humans, it is more natural to locate the objects and their relationships first, and then elaborate on each object, describing notable attributes. We present a coarse-to-fine method that decomposes the original image description into a skeleton sentence and its attributes, and generates the skeleton sentence and attribute phrases separately. By this decomposition, our method can generate more accurate and novel descriptions than the previous state-of-the-art. Experimental results on the MS-COCO and a larger scale Stock3M datasets show that our algorithm yields consistent improvements across different evaluation metrics, especially on the SPICE metric, which has much higher correlation with human ratings than the conventional metrics. Furthermore, our algorithm can generate descriptions with varied length, benefiting from the separate control of the skeleton and attributes. This enables image description generation that better accommodates user preferences.

Comment: Accepted by CVPR 201

arXiv.org e-Print Archive

Crossref

Perception Score, A Learned Metric for Open-ended Text Generation Evaluation

Authors: Gu Jing, Wu Qingyang, Yu Zhou
Publication date: 18/08/2020

Automatic evaluation for open-ended natural language generation tasks remains a challenge. Existing metrics such as BLEU show a low correlation with human judgment. We propose a novel and powerful learning-based evaluation metric: Perception Score. The method measures the overall quality of the generation and scores holistically instead of only focusing on one evaluation criterion, such as word overlap. Moreover, it also shows the amount of uncertainty about its evaluation result. By connecting the uncertainty, Perception Score gives a more accurate evaluation for the generation system. Perception Score provides state-of-the-art results on two conditional generation tasks and two unconditional generation tasks.

Comment: 8 pages, 2 figure

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications