2,393 research outputs found
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
Computer vision has a great potential to help our daily lives by searching
for lost keys, watering flowers or reminding us to take a pill. To succeed with
such tasks, computer vision methods need to be trained from real and diverse
examples of our daily dynamic scenes. While most of such scenes are not
particularly exciting, they typically do not appear on YouTube, in movies or TV
broadcasts. So how do we collect sufficiently many diverse but boring samples
representing our lives? We propose a novel Hollywood in Homes approach to
collect such data. Instead of shooting videos in the lab, we ensure diversity
by distributing and crowdsourcing the whole process of video creation from
script writing to video recording and annotation. Following this procedure we
collect a new dataset, Charades, with hundreds of people recording videos in
their own homes, acting out casual everyday activities. The dataset is composed
of 9,848 annotated videos with an average length of 30 seconds, showing
activities of 267 people from three continents. Each video is annotated by
multiple free-text descriptions, action labels, action intervals and classes of
interacted objects. In total, Charades provides 27,847 video descriptions,
66,500 temporally localized intervals for 157 action classes and 41,104 labels
for 46 object classes. Using this rich data, we evaluate and provide baseline
results for several tasks including action recognition and automatic
description generation. We believe that the realism, diversity, and casual
nature of this dataset will present unique challenges and new opportunities for
computer vision community
CLAD: A Complex and Long Activities Dataset with Rich Crowdsourced Annotations
This paper introduces a novel activity dataset which exhibits real-life and
diverse scenarios of complex, temporally-extended human activities and actions.
The dataset presents a set of videos of actors performing everyday activities
in a natural and unscripted manner. The dataset was recorded using a static
Kinect 2 sensor which is commonly used on many robotic platforms. The dataset
comprises of RGB-D images, point cloud data, automatically generated skeleton
tracks in addition to crowdsourced annotations. Furthermore, we also describe
the methodology used to acquire annotations through crowdsourcing. Finally some
activity recognition benchmarks are presented using current state-of-the-art
techniques. We believe that this dataset is particularly suitable as a testbed
for activity recognition research but it can also be applicable for other
common tasks in robotics/computer vision research such as object detection and
human skeleton tracking
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
This paper introduces a video dataset of spatio-temporally localized Atomic
Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual
actions in 430 15-minute video clips, where actions are localized in space and
time, resulting in 1.58M action labels with multiple labels per person
occurring frequently. The key characteristics of our dataset are: (1) the
definition of atomic visual actions, rather than composite actions; (2) precise
spatio-temporal annotations with possibly multiple annotations for each person;
(3) exhaustive annotation of these atomic actions over 15-minute video clips;
(4) people temporally linked across consecutive segments; and (5) using movies
to gather a varied set of action representations. This departs from existing
datasets for spatio-temporal action recognition, which typically provide sparse
annotations for composite actions in short video clips. We will release the
dataset publicly.
AVA, with its realistic scene and action complexity, exposes the intrinsic
difficulty of action recognition. To benchmark this, we present a novel
approach for action localization that builds upon the current state-of-the-art
methods, and demonstrates better performance on JHMDB and UCF101-24 categories.
While setting a new state of the art on existing datasets, the overall results
on AVA are low at 15.6% mAP, underscoring the need for developing new
approaches for video understanding.Comment: To appear in CVPR 2018. Check dataset page
https://research.google.com/ava/ for detail
- …