Hierarchical Photo-Scene Encoder for Album Storytelling
In this paper, we propose a novel model with a hierarchical photo-scene
encoder and a reconstructor for the task of album storytelling. The photo-scene
encoder contains two sub-encoders, namely the photo and scene encoders, which
are stacked together and behave hierarchically to fully exploit the structural
information of the photos within an album. Specifically, the photo encoder
generates a semantic representation for each photo while exploiting the temporal
relationships among them. The scene encoder, relying on the obtained photo
representations, is responsible for detecting the scene changes and generating
scene representations. Subsequently, the decoder dynamically and attentively
summarizes the encoded photo and scene representations to generate a sequence
of album representations, based on which a story consisting of multiple
coherent sentences is generated. In order to fully extract the useful semantic
information from an album, a reconstructor is employed to reproduce the
summarized album representations based on the hidden states of the decoder. The
proposed model can be trained in an end-to-end manner, which results in
improved performance over state-of-the-art methods on the public visual
storytelling (VIST) dataset. Ablation studies further demonstrate the
effectiveness of the proposed hierarchical photo-scene encoder and
reconstructor.
Comment: 8 pages, 4 figures
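A minimal PyTorch-style sketch of how such a hierarchical encoder, attentive decoder, and reconstructor could be wired together. The layer choices (GRUs, additive soft attention, an MSE reconstruction loss) and all module names and dimensions are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only; layer choices (GRUs, additive soft attention, MSE
# reconstruction loss) and all names/dimensions are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalPhotoSceneEncoder(nn.Module):
    """Photo encoder and scene encoder stacked hierarchically over an album."""
    def __init__(self, feat_dim=2048, hid_dim=512):
        super().__init__()
        self.photo_rnn = nn.GRU(feat_dim, hid_dim, batch_first=True)  # temporal relations among photos
        self.scene_rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)   # stacked on photo representations
        self.boundary = nn.Linear(hid_dim, 1)                         # scene-change score per photo

    def forward(self, photo_feats):                  # (B, N, feat_dim) CNN features of an album
        photo_repr, _ = self.photo_rnn(photo_feats)  # (B, N, hid) photo representations
        scene_repr, _ = self.scene_rnn(photo_repr)   # (B, N, hid) scene representations
        change_prob = torch.sigmoid(self.boundary(photo_repr))  # (B, N, 1) scene-change probabilities
        return photo_repr, scene_repr, change_prob


class AttentiveDecoder(nn.Module):
    """Dynamically attends over photo and scene representations to build one
    album representation (and hidden state) per sentence of the story."""
    def __init__(self, hid_dim=512):
        super().__init__()
        self.hid_dim = hid_dim
        self.attn = nn.Linear(hid_dim * 3, 1)        # scores (photo, scene, decoder-state) triples
        self.proj = nn.Linear(hid_dim * 2, hid_dim)
        self.rnn = nn.GRUCell(hid_dim, hid_dim)      # word-level decoding omitted for brevity

    def forward(self, photo_repr, scene_repr, steps=5):
        B, N, _ = photo_repr.shape
        mem = torch.cat([photo_repr, scene_repr], dim=-1)              # (B, N, 2*hid)
        h = photo_repr.new_zeros(B, self.hid_dim)
        album_reprs, hiddens = [], []
        for _ in range(steps):
            # attention weights depend on the current decoder state, so the
            # summary of the album changes dynamically from sentence to sentence
            query = h.unsqueeze(1).expand(B, N, self.hid_dim)
            scores = self.attn(torch.cat([mem, query], dim=-1)).squeeze(-1)  # (B, N)
            alpha = F.softmax(scores, dim=-1).unsqueeze(-1)
            album_t = self.proj((alpha * mem).sum(dim=1))              # (B, hid) summarized album repr.
            h = self.rnn(album_t, h)
            album_reprs.append(album_t)
            hiddens.append(h)
        return torch.stack(album_reprs, dim=1), torch.stack(hiddens, dim=1)


class Reconstructor(nn.Module):
    """Reproduces the summarized album representations from decoder hidden states."""
    def __init__(self, hid_dim=512):
        super().__init__()
        self.rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, decoder_hiddens, album_reprs):
        recon, _ = self.rnn(decoder_hiddens)
        return F.mse_loss(recon, album_reprs)        # auxiliary reconstruction loss
```

Read this way, the pieces mirror the described pipeline: album features flow through the two stacked encoders, the decoder summarizes them into per-sentence album representations, and the reconstruction loss pushes the decoder's hidden states to retain the album's semantic information.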
Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling
Visual storytelling is the task of creating a short story based on photo
streams. Unlike existing visual captioning, storytelling aims to contain not
only factual descriptions, but also human-like narration and semantics.
However, the VIST dataset consists only of a small, fixed number of photos per
story. Therefore, the main challenge of visual storytelling is to fill in the
visual gap between photos with a narrative and imaginative story. In this paper,
we propose to explicitly learn to imagine a storyline that bridges the visual
gap. During training, one or more photos are randomly omitted from the input
stack, and we train the network to produce a full plausible story even with
missing photo(s). Furthermore, we propose a hide-and-tell model for visual
storytelling, which is designed to learn non-local relations across the photo
streams and to refine and improve conventional RNN-based models. In
experiments, we show that our hide-and-tell scheme and network design are
indeed effective at storytelling, and that our model outperforms previous
state-of-the-art methods in automatic metrics. Finally, we qualitatively show
the learned ability to interpolate a storyline over visual gaps.
Comment: AAAI 2020 paper
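A small sketch of how the "hide" step could be implemented as a training-time augmentation. Masking by zeroing feature vectors, and the function and parameter names (hide_photos, max_hidden), are assumptions for illustration, not the authors' code.

```python
# Illustrative training-time "hide" augmentation; zero-masking and all names are
# assumptions, not the authors' implementation.
import torch


def hide_photos(photo_feats: torch.Tensor, max_hidden: int = 2):
    """Randomly hide (zero out) one or more photos in each album of a batch.

    photo_feats: (B, N, D) image features of a photo stream.
    Returns the masked features and a (B, N) boolean mask marking hidden photos.
    """
    B, N, _ = photo_feats.shape
    masked = photo_feats.clone()
    hidden_mask = torch.zeros(B, N, dtype=torch.bool)
    for b in range(B):
        k = torch.randint(1, max_hidden + 1, (1,)).item()  # hide 1..max_hidden photos
        idx = torch.randperm(N)[:k]
        masked[b, idx] = 0.0                                # "hide" by zeroing the features
        hidden_mask[b, idx] = True
    return masked, hidden_mask


# The network is still trained to produce the full story from the masked stream,
# so it must imagine (interpolate) the content of the hidden photos, e.g.:
#   masked_feats, _ = hide_photos(feats)
#   loss = story_loss(model(masked_feats), full_story_tokens)
```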
Keep it Consistent: Topic-Aware Storytelling from an Image Stream via Iterative Multi-agent Communication
Visual storytelling aims to generate a narrative paragraph from a sequence of
images automatically. Existing approaches construct a text description
independently for each image and roughly concatenate these descriptions into a
story, which leads to semantically incoherent content. In this
paper, we propose a new approach to visual storytelling by introducing a topic
description task to detect the global semantic context of an image stream. A
story is then constructed with the guidance of the topic description. In order
to combine the two generation tasks, we propose a multi-agent communication
framework that regards the topic description generator and the story generator
as two agents and learns them simultaneously via an iterative updating mechanism.
We validate our approach on the VIST dataset, where quantitative results,
ablations, and human evaluation demonstrate our method's ability to generate
higher-quality stories than state-of-the-art methods.
Comment: Accepted to COLING 2020
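A schematic sketch of what the iterative updating between the two agents could look like in a single training step. The agent interfaces (topic_agent, story_agent returning a loss and a message representation), the number of communication rounds, and the joint-loss structure are assumptions for illustration, not the paper's exact procedure.

```python
# Schematic sketch of iterative multi-agent communication; agent interfaces, the
# number of rounds, and loss names are illustrative assumptions.
import torch


def train_step(topic_agent, story_agent, optimizer,
               images, topic_gt, story_gt, rounds: int = 3):
    """One step in which the topic-description agent and the story agent
    exchange their latest representations for a few rounds and are updated jointly."""
    optimizer.zero_grad()
    total_loss = torch.zeros(())
    topic_repr, story_repr = None, None
    for _ in range(rounds):
        # Topic agent reads the image stream plus the story agent's latest message
        # and predicts the global topic description of the stream.
        topic_loss, topic_repr = topic_agent(images, story_repr, target=topic_gt)
        # Story agent generates the narrative under the guidance of the topic.
        story_loss, story_repr = story_agent(images, topic_repr, target=story_gt)
        total_loss = total_loss + topic_loss + story_loss
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```

The design intent, as described in the abstract, is that each agent's output conditions the other's next update, so topic and story are refined together rather than generated once in a fixed order.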