Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Recent progress has been made in using attention-based encoder-decoder
frameworks for video captioning. However, most existing decoders apply the
attention mechanism to every generated word, including both visual words (e.g.,
"gun" and "shooting") and non-visual words (e.g., "the", "a"), even though
non-visual words can be easily predicted by a natural language model without
considering visual signals or attention. Imposing the attention mechanism on
non-visual words can mislead the decoder and decrease the overall performance of video
captioning. To address this issue, we propose a hierarchical LSTM with adjusted
temporal attention (hLSTMat) approach for video captioning. Specifically, the
proposed framework utilizes the temporal attention for selecting specific
frames to predict the related words, while the adjusted temporal attention is
for deciding whether to depend on the visual information or the language
context information. Also, a hierarchical LSTM is designed to simultaneously
consider both low-level visual information and high-level language context
information to support the video caption generation. To demonstrate the
effectiveness of our proposed framework, we evaluate our method on two widely used
datasets, MSVD and MSR-VTT; experimental results show that our approach
outperforms state-of-the-art methods on both datasets.
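As a rough illustration of the adjusted-attention idea described above, the sketch below gates between an attended visual context and the decoder's language state. The layer names, sizes, and the sigmoid gate are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AdjustedTemporalAttention(nn.Module):
    """Sketch of temporal attention with a visual/language gate.

    The gate 'beta' decides how much the next word relies on attended
    frame features versus the language-context hidden state. All layer
    names and dimensions here are assumptions for illustration.
    """
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.att_feat = nn.Linear(feat_dim, hid_dim)  # project frame features
        self.att_hid = nn.Linear(hid_dim, hid_dim)    # project decoder state
        self.att_score = nn.Linear(hid_dim, 1)        # scalar attention score
        self.gate = nn.Linear(hid_dim, 1)             # adjusted-attention gate

    def forward(self, frame_feats, h_t):
        # frame_feats: (batch, n_frames, feat_dim); h_t: (batch, hid_dim)
        feats = self.att_feat(frame_feats)
        e = self.att_score(torch.tanh(feats + self.att_hid(h_t).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)               # temporal attention weights
        c_t = (alpha * feats).sum(dim=1)              # attended visual context
        beta = torch.sigmoid(self.gate(h_t))          # 1 -> rely on vision
        return beta * c_t + (1.0 - beta) * h_t        # adjusted context vector
```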
Move Forward and Tell: A Progressive Generator of Video Descriptions
We present an efficient framework that can generate a coherent paragraph to
describe a given video. Previous works on video captioning usually focus on
video clips. They typically treat an entire video as a whole and generate the
caption conditioned on a single embedding. In contrast, we consider videos
with rich temporal structures and aim to generate paragraph descriptions that
can preserve the story flow while being coherent and concise. Towards this
goal, we propose a new approach, which produces a descriptive paragraph by
assembling temporally localized descriptions. Given a video, it selects a
sequence of distinctive clips and generates sentences thereon in a coherent
manner. Particularly, the selection of clips and the production of sentences
are done jointly and progressively driven by a recurrent network -- what to
describe next depends on what has been said before. Here, the recurrent
network is learned via self-critical sequence training with both sentence-level
and paragraph-level rewards. On the ActivityNet Captions dataset, our method
demonstrated the capability of generating high-quality paragraph descriptions
for videos. Compared with other methods, the descriptions produced by our
method are often more relevant, more coherent, and more concise.
Comment: Accepted by ECCV 2018
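The self-critical sequence training mentioned above can be sketched as a policy-gradient loss with a greedy-decoding baseline (Rennie et al., 2017). How the sentence- and paragraph-level rewards are mixed, and the function name, are assumptions for illustration:

```python
import torch

def self_critical_loss(log_probs, sample_reward, greedy_reward):
    """Sketch of a self-critical policy-gradient loss.

    'sample_reward' could mix sentence- and paragraph-level scores,
    e.g. r = w_s * sentence_score + w_p * paragraph_score; the exact
    weighting used in the paper is an assumption here.

    log_probs:     summed log-probabilities of the sampled caption tokens
    sample_reward: reward of the sampled caption
    greedy_reward: reward of the greedily decoded caption (baseline)
    """
    advantage = sample_reward - greedy_reward          # baseline-subtracted reward
    return -(advantage.detach() * log_probs).mean()    # REINFORCE-style loss
```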
Reinforced Video Captioning with Entailment Rewards
Sequence-to-sequence models have shown promising improvements on the temporal
task of video captioning, but they optimize word-level cross-entropy loss
during training. First, using policy gradient and mixed-loss methods for
reinforcement learning, we directly optimize sentence-level task-based metrics
(as rewards), achieving significant improvements over the baseline, based on
both automatic metrics and human evaluation on multiple datasets. Next, we
propose a novel entailment-enhanced reward (CIDEnt) that corrects
phrase-matching based metrics (such as CIDEr) to only allow for
logically-implied partial matches and avoid contradictions, achieving further
significant improvements over the CIDEr-reward model. Overall, our
CIDEnt-reward model achieves the new state-of-the-art on the MSR-VTT dataset.
Comment: EMNLP 2017 (9 pages)
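A minimal sketch of what an entailment-corrected reward could look like, in the spirit of CIDEnt: the phrase-matching reward is docked when an entailment classifier judges the generated caption as not logically implied by the reference. The threshold and penalty values are illustrative assumptions:

```python
def cident_reward(cider_score, entailment_prob, threshold=0.5, penalty=1.0):
    """Sketch of an entailment-corrected reward (CIDEnt-style).

    cider_score:     phrase-matching reward (e.g., CIDEr) for the caption
    entailment_prob: classifier probability that the reference entails
                     the generated caption
    threshold and penalty are illustrative assumptions, not the
    paper's tuned values.
    """
    if entailment_prob < threshold:
        return cider_score - penalty   # not entailed: dock the reward
    return cider_score                 # logically implied match: keep CIDEr
```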
Video Storytelling: Textual Summaries for Events
Bridging vision and natural language is a longstanding goal in computer
vision and multimedia research. While earlier works focus on generating a
single-sentence description for visual content, recent works have studied
paragraph generation. In this work, we introduce the problem of video
storytelling, which aims at generating coherent and succinct stories for long
videos. Video storytelling introduces new challenges, mainly due to the
diversity of the story and the length and complexity of the video. We propose
novel methods to address the challenges. First, we propose a context-aware
framework for multimodal embedding learning, where we design a Residual
Bidirectional Recurrent Neural Network to leverage contextual information from
past and future. Second, we propose a Narrator model to discover the underlying
storyline. The Narrator is formulated as a reinforcement learning agent which
is trained by directly optimizing the textual metric of the generated story. We
evaluate our method on the Video Story dataset, a new dataset that we have
collected to enable the study. We compare our method with multiple
state-of-the-art baselines, and show that our method achieves better
performance in terms of both quantitative measures and a user study.
Comment: Published in IEEE Transactions on Multimedia
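A minimal sketch of a residual bidirectional recurrent layer of the kind described above, which contextualizes each clip embedding with past and future clips while preserving the original embedding through a residual connection. The choice of GRU cells and the dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ResidualBiRNN(nn.Module):
    """Sketch of a residual bidirectional recurrent layer.

    Enriches each clip embedding with context from past and future
    clips, adding the input back as a residual. Cell type and sizes
    are illustrative assumptions, not the paper's exact design.
    """
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * dim, dim)  # merge forward/backward states

    def forward(self, x):
        # x: (batch, n_clips, dim) sequence of clip embeddings
        ctx, _ = self.rnn(x)                 # bidirectional context
        return x + self.proj(ctx)            # residual connection
```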