Generating Video Descriptions with Topic Guidance
Generating video descriptions in natural language (a.k.a. video captioning)
is a more challenging task than image captioning as the videos are
intrinsically more complicated than images in two aspects. First, videos cover
a broader range of topics, such as news, music, sports and so on. Second,
multiple topics could coexist in the same video. In this paper, we propose a
novel caption model, topic-guided model (TGM), to generate topic-oriented
descriptions for videos in the wild via exploiting topic information. In
addition to predefined topics, i.e., category tags crawled from the web, we
also mine topics in a data-driven way based on training captions by an
unsupervised topic mining model. We show that data-driven topics reflect a
better topic schema than the predefined topics. To predict topics for test
videos, we treat the topic mining model as a teacher and train a student, the
topic prediction model, on all the modalities in the video, especially the
speech modality. We propose a series of caption models to
exploit topic guidance, including implicitly using the topics as input features
to generate words related to the topic and explicitly modifying the weights in
the decoder with topics to function as an ensemble of topic-aware language
decoders. Our comprehensive experimental results on the current largest video
caption dataset MSR-VTT demonstrate the effectiveness of our topic-guided model,
which significantly surpasses the winning performance in the 2016 MSR video to
language challenge.
Comment: Appeared at ICMR 201
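The abstract's idea of explicitly modifying decoder weights with topics, so that the decoder behaves like an ensemble of topic-aware language decoders, can be illustrated with a minimal sketch. The abstract does not give the actual TGM formulation, so everything here is an assumption made for illustration: the function names, the use of one output projection matrix per topic, and the choice to blend them by the predicted topic distribution.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def topic_mixed_logits(topic_probs, topic_weights, hidden):
    """Blend K topic-specific output projections (hypothetical layout).

    topic_probs:   K probabilities, one per topic
    topic_weights: K matrices, each vocab_size x hidden_size
    hidden:        decoder hidden state of length hidden_size
    """
    vocab_size = len(topic_weights[0])
    logits = [0.0] * vocab_size
    for p_k, w_k in zip(topic_probs, topic_weights):
        for i, value in enumerate(matvec(w_k, hidden)):
            logits[i] += p_k * value
    return logits


def next_word_distribution(topic_probs, topic_weights, hidden):
    """Word distribution from the topic-weighted ensemble of projections."""
    return softmax(topic_mixed_logits(topic_probs, topic_weights, hidden))


# Toy example: 2 topics, vocabulary of 2 words, hidden size 2.
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # topic 0 favours word 0 for this hidden state
    [[0.0, 1.0], [1.0, 0.0]],  # topic 1 favours word 1 for this hidden state
]
hidden_state = [2.0, 0.0]
probs_topic0 = next_word_distribution([1.0, 0.0], weights, hidden_state)
probs_mixed = next_word_distribution([0.5, 0.5], weights, hidden_state)
```

With a one-hot topic distribution the sketch reduces to a single topic-specific decoder; with a soft distribution it interpolates between them, which is the ensemble behaviour the abstract alludes to.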
Video Captioning with Guidance of Multimodal Latent Topics
The topic diversity of open-domain videos leads to various vocabularies and
linguistic expressions in describing video contents, and therefore, makes the
video captioning task even more challenging. In this paper, we propose a
unified caption framework, M&M TGM, which mines multimodal topics in an
unsupervised fashion from data and guides the caption decoder with these
topics. Compared to pre-defined topics, the mined multimodal topics are more
semantically and visually coherent and can reflect the topic distribution of
videos better. We formulate the topic-aware caption generation as a multi-task
learning problem, in which we add a parallel task, topic prediction, in
addition to the caption task. For the topic prediction task, we use the mined
topics as the teacher to train a student topic prediction model, which learns
to predict the latent topics from multimodal contents of videos. The topic
prediction provides intermediate supervision to the learning process. As for
the caption task, we propose a novel topic-aware decoder to generate more
accurate and detailed video descriptions with the guidance from latent topics.
The entire learning procedure is end-to-end and it optimizes both tasks
simultaneously. The results from extensive experiments conducted on the MSR-VTT
and Youtube2Text datasets demonstrate the effectiveness of our proposed model.
M&M TGM not only outperforms prior state-of-the-art methods on multiple
evaluation metrics and on both benchmark datasets, but also achieves better
generalization ability.
Comment: ACM Multimedia 201
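The multi-task setup described in this abstract, a caption task trained jointly with a teacher-supervised topic prediction task, can be sketched as a single combined objective. This is only an illustration under stated assumptions: the abstract does not specify the loss functions or their weighting, so the cross-entropy topic loss, the `topic_weight` parameter, and all function names below are hypothetical.

```python
import math


def caption_nll(word_probs, target_ids):
    """Mean negative log-likelihood of the reference caption words.

    word_probs: one predicted distribution over the vocabulary per time step
    target_ids: index of the reference word at each time step
    """
    log_likelihood = sum(math.log(p[t]) for p, t in zip(word_probs, target_ids))
    return -log_likelihood / len(target_ids)


def topic_cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy of the student's topic prediction against the teacher's
    mined topic distribution (soft targets)."""
    return -sum(t * math.log(s + eps) for t, s in zip(teacher_probs, student_probs))


def multitask_loss(word_probs, target_ids, teacher_probs, student_probs,
                   topic_weight=0.5):
    """Caption loss plus a weighted topic prediction loss.

    topic_weight is an assumed hyperparameter, not a value from the paper.
    """
    return (caption_nll(word_probs, target_ids)
            + topic_weight * topic_cross_entropy(teacher_probs, student_probs))


# Toy example: a 2-step caption over a 2-word vocabulary and 2 latent topics.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
reference = [0, 1]
teacher = [1.0, 0.0]
good_student = [1.0, 0.0]
poor_student = [0.5, 0.5]
loss_good = multitask_loss(step_probs, reference, teacher, good_student)
loss_poor = multitask_loss(step_probs, reference, teacher, poor_student)
```

Because both terms are summed into one scalar, a single gradient step would update the captioner and the topic predictor together, which matches the end-to-end, simultaneous optimization the abstract describes.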