HLVU : A New Challenge to Test Deep Understanding of Movies the Way Humans do
In this paper we propose a new evaluation challenge and direction in the area
of High-level Video Understanding. The challenge we are proposing is designed
to test automatic video analysis and understanding: how accurately systems
can comprehend a movie in terms of actors, entities, events and their
relationships to each other. A pilot High-Level Video Understanding (HLVU)
dataset of open-source movies was collected, and human assessors built a
knowledge graph representing each movie. A set of queries will be derived
from the knowledge graph to test systems on retrieving relationships among
actors, as well as on reasoning about and retrieving non-visual concepts. The
objective is to benchmark whether a computer system can "understand"
non-explicit but obvious relationships the same way humans do when they watch
the same movies. This is a long-standing problem that is being addressed in
the text domain, and this project moves similar research to the video domain.
Work of this nature is foundational to future video analytics and video
understanding technologies. This work can be of interest to streaming services
and broadcasters hoping to provide more intuitive ways for their customers to
interact with and consume video content.
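To make the idea concrete, a knowledge graph of a movie can be stored as (subject, relation, object) triples, and a relationship query then reduces to a lookup over those triples. The sketch below is a hypothetical minimal illustration; the entity names, relation labels, and functions are invented for this example and are not taken from the HLVU dataset.

```python
# Hypothetical sketch: a movie knowledge graph as (subject, relation, object)
# triples, with a simple relationship query over it. All names and relations
# below are invented examples, not actual HLVU annotations.

from collections import defaultdict

def build_graph(triples):
    """Index each triple under its subject for fast lookup."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph

def query(graph, subject, relation):
    """Return all objects linked to `subject` by `relation`."""
    return [obj for rel, obj in graph[subject] if rel == relation]

# Tiny invented graph for one fictional movie
triples = [
    ("Alice", "married_to", "Bob"),
    ("Alice", "works_at", "Diner"),
    ("Bob", "brother_of", "Carol"),
]
graph = build_graph(triples)
print(query(graph, "Alice", "married_to"))  # ['Bob']
```

A real benchmark system would have to infer such triples from the video itself before any query could be answered; the lookup step shown here is only the final, trivial part of the task.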
An annotated video dataset for computing video memorability
Using a collection of publicly available links to short-form video clips with an average duration of 6 seconds each, 1,275 users manually annotated each video multiple times to indicate both long-term and short-term memorability. The annotations were gathered as part of an online memory game and measured a participant's ability to recall having seen a video previously when shown a collection of videos. The recognition tasks were performed on videos seen within the previous few minutes for short-term memorability and within the previous 24 to 72 hours for long-term memorability. The data includes the reaction times for each recognition of each video. Associated with each video are text descriptions (captions) as well as a collection of image-level features computed on 3 frames extracted from each video (start, middle and end). Video-level features are also provided. The dataset was used in the Video Memorability task as part of the MediaEval benchmark in 2020.
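Annotations of this kind are typically aggregated into a per-video memorability score: the fraction of annotators who correctly recognised the video on second viewing. The sketch below is a hypothetical illustration of that aggregation; the record layout and field names ("video_id", "recognised") are assumptions for this example, not the dataset's actual schema.

```python
# Hypothetical sketch: aggregating per-annotation recognition outcomes into a
# per-video memorability score (fraction of correct recognitions).
# The field names below are assumed, not the dataset's actual schema.

def memorability_scores(annotations):
    """Map each video id to its fraction of correct recognitions."""
    counts = {}  # video_id -> (total annotations, correct recognitions)
    for row in annotations:
        seen, correct = counts.get(row["video_id"], (0, 0))
        counts[row["video_id"]] = (seen + 1, correct + int(row["recognised"]))
    return {vid: correct / seen for vid, (seen, correct) in counts.items()}

# Invented example annotations from three game rounds
annotations = [
    {"video_id": "v1", "recognised": True},
    {"video_id": "v1", "recognised": False},
    {"video_id": "v2", "recognised": True},
]
print(memorability_scores(annotations))  # {'v1': 0.5, 'v2': 1.0}
```

Separate short-term and long-term scores would be obtained by running the same aggregation over the two annotation pools, and reaction times could be averaged per video in the same single pass.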