Audio Description (AD) is the task of generating descriptions of visual
content, at suitable time intervals, for the benefit of visually impaired
audiences. For movies, this presents notable challenges -- AD must occur only
during existing pauses in dialogue, should refer to characters by name, and
ought to aid understanding of the storyline as a whole. To this end, we develop
a new model for automatically generating movie AD, given CLIP visual features
of the frames, the cast list, and the temporal locations of the speech,
addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we
introduce a character bank that stores, for the principal cast of each movie,
each character's name, the actor who played the part, and a CLIP feature of
the actor's face, and demonstrate how this can be used to improve naming in the
generated AD; (ii) when -- we investigate several models for determining
whether an AD should be generated for a time interval or not, based on the
visual content of the interval and its neighbours; and (iii) what -- we
implement a new vision-language model for this task that can ingest the
proposals from the character bank, whilst conditioning on the visual features
using cross-attention, and demonstrate how this improves over previous
architectures for AD text generation in an apples-to-apples comparison.

Comment: ICCV 2023. Project page:
https://www.robots.ox.ac.uk/vgg/research/autoad
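
To make the character-bank idea concrete, the following is a minimal sketch of one way to build such a bank from CLIP face features. It is illustrative only: the names CharacterEntry and build_character_bank, and the choice of the ViT-B/32 CLIP model, are our assumptions, not details from the paper.

    from dataclasses import dataclass
    from typing import List, Tuple

    import clip   # OpenAI CLIP, https://github.com/openai/CLIP
    import torch
    from PIL import Image

    @dataclass
    class CharacterEntry:
        character_name: str          # the role's name, as it should appear in the AD
        actor_name: str              # the actor who played the part
        face_feature: torch.Tensor   # CLIP embedding of the actor's face

    def build_character_bank(cast: List[Tuple[str, str, str]],
                             device: str = "cpu") -> List[CharacterEntry]:
        """cast: (character_name, actor_name, face_image_path) per principal cast member."""
        model, preprocess = clip.load("ViT-B/32", device=device)
        bank = []
        with torch.no_grad():
            for character_name, actor_name, face_path in cast:
                image = preprocess(Image.open(face_path)).unsqueeze(0).to(device)
                feature = model.encode_image(image).squeeze(0)
                feature = feature / feature.norm()   # unit norm for cosine matching
                bank.append(CharacterEntry(character_name, actor_name, feature))
        return bank

At inference time, faces detected in a movie interval can be embedded the same way and matched to bank entries by cosine similarity, yielding character-name proposals for the generator.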
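
For the 'when' question, one simple instantiation of an interval classifier is sketched below: it takes pooled CLIP features of a candidate interval and its two temporal neighbours and scores whether an AD should be generated there. The abstract says several models are investigated; this particular MLP, and names such as ADIntervalClassifier, are hypothetical rather than the paper's architecture.

    import torch
    import torch.nn as nn

    class ADIntervalClassifier(nn.Module):
        def __init__(self, clip_dim: int = 512, hidden: int = 256):
            super().__init__()
            # features of the previous, current, and next intervals are concatenated
            self.mlp = nn.Sequential(
                nn.Linear(3 * clip_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, prev_feat, cur_feat, next_feat):
            # each input: (batch, clip_dim), e.g. mean-pooled CLIP frame features
            x = torch.cat([prev_feat, cur_feat, next_feat], dim=-1)
            return self.mlp(x).squeeze(-1)   # logit; sigmoid > threshold => emit AD

Including the neighbouring intervals gives the classifier the temporal context the abstract mentions, e.g. to avoid proposing AD where the surrounding shots already carry dialogue.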
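
For the 'what' question, the sketch below shows the generic mechanism the abstract describes: a text decoder whose input sequence can include the character-name proposals from the bank, while the visual features are injected through cross-attention. This is a standard pre-norm transformer block, not the paper's exact model; all names are illustrative, and the visual features are assumed to be already projected to the decoder width.

    import torch
    import torch.nn as nn

    class VisualCrossAttentionBlock(nn.Module):
        def __init__(self, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
            self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

        def forward(self, text_tokens, visual_feats, causal_mask=None):
            # text_tokens: (B, T, d) embeddings of [character proposals; AD text so far]
            # visual_feats: (B, F, d) projected CLIP features of the movie frames
            x = text_tokens
            a, _ = self.self_attn(self.ln1(x), self.ln1(x), self.ln1(x),
                                  attn_mask=causal_mask)
            x = x + a
            a, _ = self.cross_attn(self.ln2(x), visual_feats, visual_feats)
            x = x + a   # conditioning on the visual stream via cross-attention
            return x + self.ffn(self.ln3(x))

In such a design, the bank's proposals can simply be verbalised as a short prompt of candidate names prepended to the AD tokens, so the decoder can copy the right names while the cross-attention layers attend to the frames.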