Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs that are temporally
aligned to full-length movies. In addition, we collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015.
A Dataset for Movie Description
Descriptive video service (DVS) provides linguistic descriptions of movies
and allows visually impaired people to follow a movie along with their peers.
Such descriptions are by design mainly visual and thus naturally form an
interesting data source for computer vision and computational linguistics. In
this work we propose a novel dataset which contains transcribed DVS that is
temporally aligned to full-length HD movies. In addition, we collected the
aligned movie scripts which have been used in prior work and compare the two
different sources of descriptions. In total the Movie Description dataset
contains a parallel corpus of over 54,000 sentences and video snippets from 72
HD movies. We characterize the dataset by benchmarking different approaches for
generating video descriptions. Comparing DVS to scripts, we find that DVS is
far more visual and describes precisely what is shown rather than what should
happen according to the scripts created prior to movie production.
The Long-Short Story of Movie Description
Generating descriptions for videos has many applications including assisting
blind people and human-robot interaction. The recent advances in image
captioning as well as the release of large-scale movie description datasets
such as MPII Movie Description allow this task to be studied in more depth. Many of
the proposed methods for image captioning rely on pre-trained object classifier
CNNs and Long Short-Term Memory recurrent networks (LSTMs) for generating
descriptions. While image description focuses on objects, we argue that it is
important to distinguish verbs, objects, and places in the challenging setting
of movie description. In this work we show how to learn robust visual
classifiers from the weak annotations of the sentence descriptions. Based on
these visual classifiers we learn how to generate a description using an LSTM.
We explore different design choices to build and train the LSTM and achieve the
best performance to date on the challenging MPII-MD dataset. We compare and
analyze our approach and prior work along various dimensions to better
understand the key challenges of the movie description task.
AutoAD: Movie Description in Context
The objective of this paper is an automatic Audio Description (AD) model that
ingests movies and outputs AD in text form. Generating high-quality movie AD is
challenging due to the dependency of the descriptions on context, and the
limited amount of training data available. In this work, we leverage the power
of pretrained foundation models, such as GPT and CLIP, and only train a mapping
network that bridges the two models for visually-conditioned text generation.
In order to obtain high-quality AD, we make the following four contributions:
(i) we incorporate context from the movie clip, AD from previous clips, as well
as the subtitles; (ii) we address the lack of training data by pretraining on
large-scale datasets, where visual or contextual information is unavailable,
e.g. text-only AD without movies or visual captioning datasets without context;
(iii) we improve on the currently available AD datasets, by removing label
noise in the MAD dataset, and adding character naming information; and (iv) we
obtain strong results on the movie AD task compared with previous methods.
Comment: CVPR 2023 Highlight. Project page:
https://www.robots.ox.ac.uk/~vgg/research/autoad
Hierarchical Boundary-Aware Neural Encoder for Video Captioning
The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and leverage the hierarchical structure of the video. Unlike the classical encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose a novel LSTM cell, which can identify discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We evaluate our approach on three large-scale datasets: the Montreal Video Annotation dataset, the MPII Movie Description dataset and the Microsoft Video Description Corpus. Experiments show that our approach can discover appropriate hierarchical representations of input videos and improve the state-of-the-art results on movie description datasets.
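As a rough illustration of the boundary-aware encoding idea in the abstract above, the toy sketch below resets a recurrent state whenever a discontinuity between consecutive frames is detected, so each segment gets its own summary vector. It is not the paper's LSTM cell. The abstract describes a cell that identifies discontinuity points itself, whereas here a fixed distance threshold stands in for the detector and an exponential moving average stands in for the LSTM update. The names `encode_with_boundaries`, `threshold` and `decay` are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def encode_with_boundaries(frames, threshold=1.0, decay=0.5):
    """Toy recurrent encoder that resets its state at detected boundaries.

    frames: list of equal-length feature vectors (lists of floats).
    threshold: assumed stand-in for a learned boundary detector; a jump
        larger than this between consecutive frames counts as a boundary.
    decay: weight kept from the previous state in the moving-average update.
    Returns one summary vector per detected segment.
    """
    segments = []
    state = None
    prev = None
    for frame in frames:
        is_boundary = prev is not None and l2_distance(prev, frame) > threshold
        if is_boundary:
            # Close the current segment and cut the temporal connection.
            segments.append(state)
            state = None
        if state is None:
            state = list(frame)
        else:
            state = [decay * s + (1 - decay) * x for s, x in zip(state, frame)]
        prev = frame
    if state is not None:
        segments.append(state)
    return segments


if __name__ == "__main__":
    toy_frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
    for i, summary in enumerate(encode_with_boundaries(toy_frames)):
        print(i, summary)
```

The per-segment summaries could then be passed to a second recurrent layer to form the higher level of the hierarchy, which this sketch leaves out.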