Coherent Multi-Sentence Video Description with Variable Level of Detail

Amin, Sikandar; Andriluka, Mykhaylo; Friedrich, Annemarie; Pinkal, Manfred; Qiu, Wei; Rohrbach, Marcus; Schiele, Bernt; Senina, Anna

Authors: Sikandar Amin
Mykhaylo Andriluka
Annemarie Friedrich
Manfred Pinkal
Wei Qiu
Marcus Rohrbach
Bernt Schiele
Anna Senina
Publication date: 1 January 2014
Publisher: 'Springer Science and Business Media LLC'
Abstract

Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description mainly focus on single-sentence generation and produce descriptions at a fixed level of detail. In this paper, we address both of these limitations: for a variable level of detail, we produce coherent multi-sentence descriptions of complex videos. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. We also contribute to the visual recognition of objects, proposing a hand-centric approach, and to the robust generation of sentences using a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than related work. To understand the difference between more detailed and shorter descriptions, we collect and analyze a video description corpus of three levels of detail.

Comment: 10 pages

Available Versions

MPG.PuRe

oai:escidoc.org:escidoc:202901...

Last time updated on 23/08/2016

MPG.PuRe

oai:pure.mpg.de:item_2029010

Last time updated on 15/06/2019