Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework that highlights the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypotheses assumed and thus the constraints imposed on the type of video
that each technique is able to address. Making the hypotheses and
constraints explicit makes the framework particularly useful for selecting a method, given
an application. Another advantage of the proposed organization is that it
allows the newest approaches to be categorized seamlessly alongside traditional ones, while
providing an insightful perspective of the evolution of the action recognition
task up to now. That perspective is the basis for the discussion at the end of
the paper, where we also present the main open issues in the area.
Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables
A Dataset for Movie Description
Descriptive video service (DVS) provides linguistic descriptions of movies
and allows visually impaired people to follow a movie along with their peers.
Such descriptions are by design mainly visual and thus naturally form an
interesting data source for computer vision and computational linguistics. In
this work we propose a novel dataset which contains transcribed DVS, which is
temporally aligned to full-length HD movies. In addition, we collected the
aligned movie scripts which have been used in prior work and compare the two
different sources of descriptions. In total the Movie Description dataset
contains a parallel corpus of over 54,000 sentences and video snippets from 72
HD movies. We characterize the dataset by benchmarking different approaches for
generating video descriptions. Comparing DVS to scripts, we find that DVS is
far more visual and describes precisely what is shown rather than what should
happen according to the scripts created prior to movie production.
Learning activity progression in LSTMs for activity detection and early detection
In this work we improve training of temporal deep models to better learn activity progression for activity detection and early detection tasks. Conventionally, when training a Recurrent Neural Network, specifically a Long Short-Term Memory (LSTM) model, the training loss only considers classification error. However, we argue that the detection score of the correct activity category, or the detection score margin between the correct and incorrect categories, should be monotonically non-decreasing as the model observes more of the activity. We design novel ranking losses that directly penalize the model on violation of such monotonicities, which are used together with classification loss in training of LSTM models. Evaluation on ActivityNet shows significant benefits of the proposed ranking losses in both activity detection and early detection tasks.
https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Ma_Learning_Activity_Progression_CVPR_2016_paper.html
Published version
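The monotonicity idea in this abstract can be illustrated with a small sketch. The abstract does not give the loss formulas, so the code below is an assumption, not the paper's formulation: it uses a simple hinge penalty that charges each time step by how far the correct-class score falls below the best score seen so far. The function names `monotonicity_penalty` and `combined_loss`, and the `weight` parameter, are hypothetical.

```python
def monotonicity_penalty(scores):
    """Hinge-style penalty for a sequence of correct-class detection scores.

    Assumption for illustration: each time step is penalized by the amount
    its score drops below the running maximum of earlier scores. A sequence
    that never decreases incurs zero penalty.
    """
    penalty = 0.0
    running_max = float("-inf")
    for score in scores:
        if score < running_max:
            # Score dropped below an earlier value: monotonicity is violated.
            penalty += running_max - score
        else:
            running_max = score
    return penalty


def combined_loss(classification_loss, scores, weight=1.0):
    """Add the weighted monotonicity penalty to a classification loss.

    `weight` is a hypothetical trade-off parameter, not taken from the paper.
    """
    return classification_loss + weight * monotonicity_penalty(scores)
```

In this sketch a score sequence that keeps rising as more of the activity is observed adds nothing to the loss, while any dip is charged in proportion to its depth. In actual training the same expression would be written with differentiable tensor operations so that gradients reach the model.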
The Evolution of First Person Vision Methods: A Survey
The emergence of new wearable technologies such as action cameras and
smart-glasses has increased the interest of computer vision scientists in the
First Person perspective. Nowadays, this field is attracting attention and
investments of companies aiming to develop commercial devices with First Person
Vision recording capabilities. Due to this interest, an increasing demand for
methods to process these videos, possibly in real-time, is expected. Current
approaches present particular combinations of different image features and
quantitative methods to accomplish specific objectives like object detection,
activity recognition, user-machine interaction and so on. This paper summarizes
the evolution of the state of the art in First Person Vision video analysis
between 1997 and 2014, highlighting, among others, the most commonly used features,
methods, challenges and opportunities within the field.
Comment: First Person Vision, Egocentric Vision, Wearable Devices, Smart
Glasses, Computer Vision, Video Analytics, Human-machine Interaction
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
Computer vision has great potential to help our daily lives by searching
for lost keys, watering flowers or reminding us to take a pill. To succeed with
such tasks, computer vision methods need to be trained from real and diverse
examples of our daily dynamic scenes. While most of such scenes are not
particularly exciting, they typically do not appear on YouTube, in movies or TV
broadcasts. So how do we collect sufficiently many diverse but boring samples
representing our lives? We propose a novel Hollywood in Homes approach to
collect such data. Instead of shooting videos in the lab, we ensure diversity
by distributing and crowdsourcing the whole process of video creation from
script writing to video recording and annotation. Following this procedure we
collect a new dataset, Charades, with hundreds of people recording videos in
their own homes, acting out casual everyday activities. The dataset is composed
of 9,848 annotated videos with an average length of 30 seconds, showing
activities of 267 people from three continents. Each video is annotated by
multiple free-text descriptions, action labels, action intervals and classes of
interacted objects. In total, Charades provides 27,847 video descriptions,
66,500 temporally localized intervals for 157 action classes and 41,104 labels
for 46 object classes. Using this rich data, we evaluate and provide baseline
results for several tasks including action recognition and automatic
description generation. We believe that the realism, diversity, and casual
nature of this dataset will present unique challenges and new opportunities for
the computer vision community.