265,499 research outputs found
Sequence to Sequence -- Video to Text
Real-world videos often have complex dynamics; and methods for generating
open-domain video descriptions should be sensitive to temporal structure and
allow both input (sequence of frames) and output (sequence of words) of
variable length. To approach this problem, we propose a novel end-to-end
sequence-to-sequence model to generate captions for videos. For this we exploit
recurrent neural networks, specifically LSTMs, which have demonstrated
state-of-the-art performance in image caption generation. Our LSTM model is
trained on video-sentence pairs and learns to associate a sequence of video
frames to a sequence of words in order to generate a description of the event
in the video clip. Our model naturally is able to learn the temporal structure
of the sequence of frames as well as the sequence model of the generated
sentences, i.e. a language model. We evaluate several variants of our model
that exploit different visual features on a standard set of YouTube videos and
two movie description datasets (M-VAD and MPII-MD).Comment: ICCV 2015 camera-ready. Includes code, project page and LSMDC
challenge result
Unconstrained Scene Text and Video Text Recognition for Arabic Script
Building robust recognizers for Arabic has always been challenging. We
demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid
architecture in recognizing Arabic text in videos and natural scenes. We
outperform previous state-of-the-art on two publicly available video text
datasets - ALIF and ACTIV. For the scene text recognition task, we introduce a
new Arabic scene text dataset and establish baseline results. For scripts like
Arabic, a major challenge in developing robust recognizers is the lack of large
quantity of annotated data. We overcome this by synthesising millions of Arabic
text images from a large vocabulary of Arabic words and phrases. Our
implementation is built on top of the model introduced here [37] which is
proven quite effective for English scene text recognition. The model follows a
segmentation-free, sequence to sequence transcription approach. The network
transcribes a sequence of convolutional features from the input image to a
sequence of target labels. This does away with the need for segmenting input
image into constituent characters/glyphs, which is often difficult for Arabic
script. Further, the ability of RNNs to model contextual dependencies yields
superior recognition results.Comment: 5 page
Text2Action: Generative Adversarial Synthesis from Language to Action
In this paper, we propose a generative model which learns the relationship
between language and human action in order to generate a human action sequence
given a sentence describing human behavior. The proposed generative model is a
generative adversarial network (GAN), which is based on the sequence to
sequence (SEQ2SEQ) model. Using the proposed generative network, we can
synthesize various actions for a robot or a virtual agent using a text encoder
recurrent neural network (RNN) and an action decoder RNN. The proposed
generative network is trained from 29,770 pairs of actions and sentence
annotations extracted from MSR-Video-to-Text (MSR-VTT), a large-scale video
dataset. We demonstrate that the network can generate human-like actions which
can be transferred to a Baxter robot, such that the robot performs an action
based on a provided sentence. Results show that the proposed generative network
correctly models the relationship between language and action and can generate
a diverse set of actions from the same sentence.Comment: 8 pages, 10 figure
A Neural, Interactive-predictive System for Multimodal Sequence to Sequence Tasks
We present a demonstration of a neural interactive-predictive system for
tackling multimodal sequence to sequence tasks. The system generates text
predictions to different sequence to sequence tasks: machine translation, image
and video captioning. These predictions are revised by a human agent, who
introduces corrections in the form of characters. The system reacts to each
correction, providing alternative hypotheses, compelling with the feedback
provided by the user. The final objective is to reduce the human effort
required during this correction process.
This system is implemented following a client-server architecture. For
accessing the system, we developed a website, which communicates with the
neural model, hosted in a local server. From this website, the different tasks
can be tackled following the interactive-predictive framework. We open-source
all the code developed for building this system. The demonstration in hosted in
http://casmacat.prhlt.upv.es/interactive-seq2seq.Comment: ACL 2019 - System demonstration
StoryBench: A Multifaceted Benchmark for Continuous Story Visualization
Generating video stories from text prompts is a complex task. In addition to
having high visual quality, videos need to realistically adhere to a sequence
of text prompts whilst being consistent throughout the frames. Creating a
benchmark for video generation requires data annotated over time, which
contrasts with the single caption used often in video datasets. To fill this
gap, we collect comprehensive human annotations on three existing datasets, and
introduce StoryBench: a new, challenging multi-task benchmark to reliably
evaluate forthcoming text-to-video models. Our benchmark includes three video
generation tasks of increasing difficulty: action execution, where the next
action must be generated starting from a conditioning video; story
continuation, where a sequence of actions must be executed starting from a
conditioning video; and story generation, where a video must be generated from
only text prompts. We evaluate small yet strong text-to-video baselines, and
show the benefits of training on story-like data algorithmically generated from
existing video captions. Finally, we establish guidelines for human evaluation
of video stories, and reaffirm the need of better automatic metrics for video
generation. StoryBench aims at encouraging future research efforts in this
exciting new area
- …