Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning
In this paper, we address the problem of describing the visual content of a video sequence
in natural language. Unlike previous video captioning work, which mainly exploits cues
from the video content to generate a language description, we propose a reconstruction
network (RecNet) with a novel encoder-decoder-reconstructor architecture that leverages
both the forward (video to sentence) and backward (sentence to video) flows for video
captioning.
Specifically, the encoder-decoder component follows the forward flow to produce a
sentence description from the encoded video semantic features. Two types of
reconstructors are then proposed to exploit the backward flow, reproducing the video
features from local and global perspectives, respectively, by capitalizing on the hidden
state sequence generated by the decoder. Moreover, to obtain a comprehensive
reconstruction of the video features, we fuse the two types of reconstructors together;
the resulting forward-backward architecture is sketched below.
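As a rough illustration of this architecture (not the authors' implementation; the module choices and sizes such as `feat_dim` and `hid_dim`, and the greedy, attention-free decoder, are hypothetical simplifications), a PyTorch sketch might look like:

```python
import torch
import torch.nn as nn

class RecNetSketch(nn.Module):
    """Encoder-decoder-reconstructor sketch; all sizes are hypothetical."""
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid_dim, batch_first=True)      # video -> semantic features
        self.decoder = nn.LSTMCell(hid_dim, hid_dim)                     # forward flow: features -> sentence
        self.word_out = nn.Linear(hid_dim, vocab_size)
        # Backward flow: reproduce video features from the decoder's hidden states.
        self.global_recon = nn.Linear(hid_dim, feat_dim)                 # global: pooled hiddens -> video-level feature
        self.local_recon = nn.LSTM(hid_dim, feat_dim, batch_first=True)  # local: step-wise hiddens -> frame features

    def forward(self, frames, max_len=20):
        # frames: (batch, n_frames, feat_dim) pre-extracted frame features.
        enc_out, (h, c) = self.encoder(frames)
        h, c = h.squeeze(0), c.squeeze(0)
        x = enc_out.mean(dim=1)                 # simple context vector as the first decoder input
        logits, hiddens = [], []
        for _ in range(max_len):                # unrolled decoding, no attention for brevity
            h, c = self.decoder(x, (h, c))
            logits.append(self.word_out(h))
            hiddens.append(h)
            x = h                               # toy choice: feed the hidden state back as input
        hiddens = torch.stack(hiddens, dim=1)   # (batch, max_len, hid_dim)
        global_rec = self.global_recon(hiddens.mean(dim=1))  # video-level reconstruction
        local_rec, _ = self.local_recon(hiddens)             # frame-level reconstruction
        return torch.stack(logits, dim=1), global_rec, local_rec
```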
The generation loss yielded by the encoder-decoder component and the reconstruction loss
introduced by the reconstructor are combined to train the proposed RecNet jointly in an
end-to-end fashion; a sketch of this joint objective follows.
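A minimal sketch of the joint objective, assuming cross-entropy for the generation loss and mean-squared-error for both reconstruction terms, with a hypothetical weight `lambda_rec` balancing the two (the paper's exact weighting and distance measures may differ):

```python
import torch.nn.functional as F

def joint_loss(logits, targets, global_rec, local_rec, frames, lambda_rec=0.2):
    """Generation loss plus weighted local/global reconstruction losses."""
    # Generation loss: cross-entropy over the predicted word distributions.
    gen = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    # Global reconstruction: match the mean-pooled frame features.
    rec_global = F.mse_loss(global_rec, frames.mean(dim=1))
    # Local reconstruction: match frame features step by step (lengths assumed aligned).
    rec_local = F.mse_loss(local_rec, frames[:, :local_rec.size(1)])
    return gen + lambda_rec * (rec_global + rec_local)
```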
Furthermore, RecNet is fine-tuned by CIDEr optimization via reinforcement learning, which
significantly boosts the captioning performance; a minimal sketch of this fine-tuning
step is given below.
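CIDEr optimization via reinforcement learning is commonly realized as self-critical sequence training (SCST); the sketch below follows that recipe and assumes hypothetical helpers `sample_caption`, `greedy_caption`, and `cider_score` that are not defined in the paper:

```python
import torch

def scst_step(model, frames, references, optimizer):
    """One self-critical update: reward = CIDEr(sample) - CIDEr(greedy baseline)."""
    sampled, log_probs = sample_caption(model, frames)    # stochastic rollout + per-token log-probs
    with torch.no_grad():
        baseline = greedy_caption(model, frames)          # greedy rollout serves as the baseline
    # Per-sample advantage from the CIDEr metric (assumed to return a (batch,) tensor).
    advantage = cider_score(sampled, references) - cider_score(baseline, references)
    loss = -(advantage * log_probs.sum(dim=1)).mean()     # REINFORCE policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```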
Experimental results on benchmark datasets demonstrate that the proposed reconstructor
consistently boosts video captioning performance.

Comment: Accepted by TPAMI. arXiv admin note: substantial text overlap with arXiv:1803.1143