Incorporating Textual Evidence in Visual Storytelling
Previous work on visual storytelling has mainly focused on exploiting the
image sequence as evidence for storytelling and has neglected textual
evidence that could guide story generation. Motivated by the human
storytelling process, which recalls stories of familiar images, we exploit
textual evidence from similar images to help generate coherent and
meaningful stories. To select the images that may provide such textual
experience, we propose a two-step ranking method based on image object
recognition techniques. To utilize the textual information, we design an
extended Seq2Seq model with a two-channel encoder and attention. Experiments
on the VIST dataset show that our method outperforms state-of-the-art
baseline models without heavy engineering.
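
To make the two-channel idea concrete, below is a minimal PyTorch sketch of
an encoder with one channel for the image-sequence features and one for the
retrieved textual evidence, plus a decoder step that attends over both
channels. The dimensions, module names, and the dot-product attention are
illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn


class TwoChannelEncoder(nn.Module):
    """Hypothetical two-channel encoder: image sequence + retrieved text."""
    def __init__(self, vis_dim=2048, txt_vocab=10000, emb_dim=300, hid_dim=512):
        super().__init__()
        self.vis_rnn = nn.GRU(vis_dim, hid_dim, batch_first=True)   # image-sequence channel
        self.txt_emb = nn.Embedding(txt_vocab, emb_dim)
        self.txt_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)   # textual-evidence channel

    def forward(self, vis_feats, txt_tokens):
        # vis_feats: (B, n_images, vis_dim); txt_tokens: (B, n_words)
        vis_out, vis_h = self.vis_rnn(vis_feats)
        txt_out, _ = self.txt_rnn(self.txt_emb(txt_tokens))
        return vis_out, txt_out, vis_h


class AttentionDecoderStep(nn.Module):
    """One decoding step that attends over both encoder channels."""
    def __init__(self, vocab=10000, emb_dim=300, hid_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)
        self.cell = nn.GRUCell(emb_dim + 2 * hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def attend(self, query, keys):
        # Dot-product attention: query (B, H), keys (B, T, H) -> context (B, H)
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)
        weights = torch.softmax(scores, dim=1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)

    def forward(self, prev_word, hidden, vis_out, txt_out):
        vis_ctx = self.attend(hidden, vis_out)    # context from the visual channel
        txt_ctx = self.attend(hidden, txt_out)    # context from the textual channel
        x = torch.cat([self.emb(prev_word), vis_ctx, txt_ctx], dim=1)
        hidden = self.cell(x, hidden)
        return self.out(hidden), hidden


# Toy usage with random stand-in features
enc = TwoChannelEncoder()
dec = AttentionDecoderStep()
vis = torch.randn(2, 5, 2048)                       # 5 images per story
txt = torch.randint(0, 10000, (2, 40))              # retrieved textual evidence tokens
vis_out, txt_out, h = enc(vis, txt)
logits, h = dec(torch.zeros(2, dtype=torch.long), h.squeeze(0), vis_out, txt_out)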
BERT-hLSTMs: BERT and Hierarchical LSTMs for Visual Storytelling
Visual storytelling is a creative and challenging task, aiming to
automatically generate a story-like description for a sequence of images. The
descriptions generated by previous visual storytelling approaches lack
coherence because they use word-level sequence generation methods and do not
adequately consider sentence-level dependencies. To tackle this problem, we
propose a novel hierarchical visual storytelling framework which separately
models sentence-level and word-level semantics. We use the transformer-based
BERT to obtain embeddings for sentences and words. We then employ a
hierarchical LSTM network: the bottom LSTM takes the BERT sentence vectors
as input and learns the dependencies between the sentences corresponding to
the images, while the top LSTM takes the bottom LSTM's output as input and
generates the corresponding word vector representations. Experimental
results demonstrate that our model outperforms the most closely related
baselines on the automatic evaluation metrics BLEU and CIDEr, and human
evaluation further confirms the effectiveness of our method.
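
As a rough illustration of this hierarchy, the PyTorch sketch below feeds
pre-computed BERT sentence embeddings (768-d) to a sentence-level (bottom)
LSTM and conditions a word-level (top) LSTM on its states. The layer sizes,
vocabulary size, and teacher-forced word inputs are assumptions made for the
example, not the authors' reported configuration.

import torch
import torch.nn as nn


class HierarchicalStoryDecoder(nn.Module):
    def __init__(self, bert_dim=768, sent_hid=512, word_hid=512,
                 vocab=30522, emb_dim=300):
        super().__init__()
        # Bottom (sentence-level) LSTM: models dependencies between the
        # sentences that correspond to the images in the sequence.
        self.sent_lstm = nn.LSTM(bert_dim, sent_hid, batch_first=True)
        # Top (word-level) LSTM: generates word representations for each
        # sentence, conditioned on the bottom LSTM's sentence state.
        self.word_lstm = nn.LSTM(emb_dim + sent_hid, word_hid, batch_first=True)
        self.word_emb = nn.Embedding(vocab, emb_dim)
        self.proj = nn.Linear(word_hid, vocab)

    def forward(self, bert_sent_vecs, word_inputs):
        # bert_sent_vecs: (B, n_sents, bert_dim) BERT sentence embeddings
        # word_inputs:    (B, n_sents, n_words) gold words for teacher forcing
        sent_states, _ = self.sent_lstm(bert_sent_vecs)          # (B, n_sents, sent_hid)
        logits = []
        for i in range(bert_sent_vecs.size(1)):
            emb = self.word_emb(word_inputs[:, i])               # (B, n_words, emb_dim)
            # Broadcast the sentence state to every word position.
            ctx = sent_states[:, i:i + 1].expand(-1, emb.size(1), -1)
            out, _ = self.word_lstm(torch.cat([emb, ctx], dim=-1))
            logits.append(self.proj(out))                        # (B, n_words, vocab)
        return torch.stack(logits, dim=1)                        # (B, n_sents, n_words, vocab)


# Toy usage: 2 stories, 5 images/sentences each, 20 words per sentence
model = HierarchicalStoryDecoder()
sent_vecs = torch.randn(2, 5, 768)                  # stand-ins for BERT sentence embeddings
words = torch.randint(0, 30522, (2, 5, 20))
print(model(sent_vecs, words).shape)                # torch.Size([2, 5, 20, 30522])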