Reinforced Video Captioning with Entailment Rewards
Sequence-to-sequence models have shown promising improvements on the temporal
task of video captioning, but they optimize word-level cross-entropy loss
during training. First, using policy gradient and mixed-loss methods for
reinforcement learning, we directly optimize sentence-level task-based metrics
(as rewards), achieving significant improvements over the baseline, based on
both automatic metrics and human evaluation on multiple datasets. Next, we
propose a novel entailment-enhanced reward (CIDEnt) that corrects
phrase-matching based metrics (such as CIDEr) to only allow for
logically-implied partial matches and avoid contradictions, achieving further
significant improvements over the CIDEr-reward model. Overall, our
CIDEnt-reward model achieves the new state-of-the-art on the MSR-VTT dataset.
Comment: EMNLP 2017 (9 pages)
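The abstract describes policy-gradient training with a mixed loss and an entailment-corrected CIDEr reward. Below is a minimal PyTorch-style sketch of one plausible reading; the self-critical baseline, the mixing weight `gamma`, and the threshold/penalty form of the reward correction are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def mixed_loss(log_probs_sampled, reward_sampled, reward_greedy,
               xent_loss, gamma=0.98):
    """Weighted sum of a policy-gradient (RL) term and the usual word-level
    cross-entropy term. `gamma` is an illustrative mixing weight."""
    # Self-critical-style baseline: advantage = reward of the sampled caption
    # minus reward of the greedily decoded caption (an assumption here).
    advantage = (reward_sampled - reward_greedy).detach()          # (batch,)
    rl_loss = -(advantage * log_probs_sampled.sum(dim=1)).mean()
    return gamma * rl_loss + (1.0 - gamma) * xent_loss

def cident_reward(cider, entail_prob, threshold=0.5, penalty=1.0):
    """Hypothetical entailment-corrected reward: keep the phrase-matching
    CIDEr score only when an entailment classifier judges the candidate to be
    implied by the reference caption; otherwise subtract a penalty."""
    return cider - penalty * (entail_prob < threshold).float()
```

The design intent is that a caption sharing many n-grams with the reference but contradicting it (or not logically implied by it) no longer receives a high reward, which is the correction CIDEnt makes over plain CIDEr.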
Shortcut-Stacked Sentence Encoders for Multi-Domain Inference
We present a simple sequential sentence encoder for multi-domain natural
language inference. Our encoder is based on stacked bidirectional LSTM-RNNs
with shortcut connections and fine-tuning of word embeddings. The overall
supervised model uses the above encoder to encode two input sentences into two
vectors, and then uses a classifier over the vector combination to label the
relationship between these two sentences as that of entailment, contradiction,
or neutral. Our Shortcut-Stacked sentence encoders achieve strong improvements
over existing encoders on matched and mismatched multi-domain natural language
inference (top non-ensemble single-model result in the EMNLP RepEval 2017
Shared Task (Nangia et al., 2017)). Moreover, they achieve the new
state-of-the-art encoding result on the original SNLI dataset (Bowman et al.,
2015).
Comment: EMNLP 2017 RepEval Multi-NLI Shared Task (6 pages)
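The architecture is described concretely enough to sketch: stacked bidirectional LSTMs where each layer's input is the word embeddings concatenated with the outputs of the previous layers, max-pooling over time to get a sentence vector, and a classifier over a combination of the two sentence vectors. The layer sizes, the [u; v; |u-v|; u*v] feature combination, and the MLP shape below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ShortcutStackedEncoder(nn.Module):
    """Sketch of a shortcut-stacked sentence encoder (dimensions illustrative)."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dims=(512, 512, 512)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.layers = nn.ModuleList()
        in_dim = emb_dim
        for h in hidden_dims:
            self.layers.append(nn.LSTM(in_dim, h, batch_first=True,
                                       bidirectional=True))
            in_dim += 2 * h          # shortcut: later layers also see this output

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        inputs = self.embed(tokens)
        for lstm in self.layers:
            out, _ = lstm(inputs)
            inputs = torch.cat([inputs, out], dim=-1)  # shortcut connection
        return out.max(dim=1).values                 # max-pool over time

class NLIClassifier(nn.Module):
    """Classifier over the two sentence vectors; the feature combination is an
    assumed, commonly used one."""
    def __init__(self, encoder, sent_dim, num_classes=3):
        super().__init__()
        self.encoder = encoder
        self.mlp = nn.Sequential(nn.Linear(4 * sent_dim, 800), nn.ReLU(),
                                 nn.Linear(800, num_classes))

    def forward(self, premise, hypothesis):
        u, v = self.encoder(premise), self.encoder(hypothesis)
        return self.mlp(torch.cat([u, v, (u - v).abs(), u * v], dim=-1))

# Example wiring (sizes illustrative):
# encoder = ShortcutStackedEncoder(vocab_size=30000)
# model = NLIClassifier(encoder, sent_dim=2 * 512)
```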
Multi-Task Video Captioning with Video and Entailment Generation
Video captioning, the task of describing the content of a video, has seen
some promising improvements in recent years with sequence-to-sequence models,
but accurately learning the temporal and logical dynamics involved in the task
still remains a challenge, especially given the lack of sufficient annotated
data. We improve video captioning by sharing knowledge with two related
directed-generation tasks: a temporally-directed unsupervised video prediction
task to learn richer context-aware video encoder representations, and a
logically-directed language entailment generation task to learn better
video-entailed caption decoder representations. For this, we present a
many-to-many multi-task learning model that shares parameters across the
encoders and decoders of the three tasks. We achieve significant improvements
and the new state-of-the-art on several standard video captioning datasets
using diverse automatic and human evaluations. We also show mutual multi-task
improvements on the entailment generation task.
Comment: ACL 2017 (14 pages w/ supplementary)
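The many-to-many sharing scheme can be sketched as a model with a video encoder shared between captioning and unsupervised video prediction, and a language decoder shared between captioning and entailment generation, trained by alternating mini-batches among the three tasks. The module interfaces and task-mixing weights below are placeholders, not the paper's configuration.

```python
import random
import torch.nn as nn

class ManyToManyMultiTask(nn.Module):
    """Sketch of many-to-many multi-task parameter sharing."""
    def __init__(self, video_encoder, text_encoder, video_decoder, text_decoder):
        super().__init__()
        self.video_encoder = video_encoder   # shared: captioning + video prediction
        self.text_encoder = text_encoder     # entailment generation only
        self.video_decoder = video_decoder   # video prediction only
        self.text_decoder = text_decoder     # shared: captioning + entailment gen

    def forward(self, task, src, tgt):
        if task == "caption":        # video -> caption
            return self.text_decoder(self.video_encoder(src), tgt)
        if task == "video_pred":     # video -> future frames (unsupervised)
            return self.video_decoder(self.video_encoder(src), tgt)
        if task == "entailment":     # premise -> entailed hypothesis
            return self.text_decoder(self.text_encoder(src), tgt)
        raise ValueError(f"unknown task: {task}")

def sample_task(weights=None):
    """Pick which task the next mini-batch trains on; the mixing ratios here
    are placeholders, not tuned values."""
    weights = weights or {"caption": 0.6, "video_pred": 0.2, "entailment": 0.2}
    tasks, probs = zip(*weights.items())
    return random.choices(tasks, probs)[0]
```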