LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning
We present a novel procedural framework to generate an arbitrary number of
labeled crowd videos (LCrowdV). The resulting crowd video datasets are used to
design accurate algorithms or training models for crowded scene understanding.
Our overall approach is composed of two components: a procedural simulation
framework for generating crowd movements and behaviors, and a procedural
rendering framework to generate different videos or images. Each video or image
is automatically labeled based on the environment, number of pedestrians,
density, behavior, flow, lighting conditions, viewpoint, noise, etc.
Furthermore, we can increase the realism by combining synthetically-generated
behaviors with real-world background videos. We demonstrate the benefits of
LCrowdV over prior labeled crowd datasets by improving the accuracy of
pedestrian detection and crowd behavior classification algorithms. LCrowdV
will be released on the WWW.
Simple yet efficient real-time pose-based action recognition
Recognizing human actions is a core challenge for autonomous systems as they
directly share the same space with humans. Systems must be able to recognize
and assess human actions in real-time. In order to train corresponding
data-driven algorithms, a significant amount of annotated training data is
required. We demonstrate a pipeline to detect humans, estimate their pose,
track them over time and recognize their actions in real-time with standard
monocular camera sensors. For action recognition, we encode the human pose into
a new data format called Encoded Human Pose Image (EHPI) that can then be
classified using standard methods from the computer vision community. With this
simple procedure we achieve competitive state-of-the-art performance in
pose-based action detection and can ensure real-time performance. In addition,
we show a use case in the context of autonomous driving to demonstrate how such
a system can be trained to recognize human actions using simulation data.
Comment: Submitted to IEEE Intelligent Transportation Systems Conference
(ITSC) 2019. Code will be available soon at
https://github.com/noboevbo/ehpi_action_recognitio
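The abstract above says only that the human pose is encoded into an image-like format (EHPI) that standard image classifiers can consume. As a minimal sketch of that idea, the function below maps each joint to an image row and each frame to a column, and stores the normalized x and y coordinates in the first two color channels. The layout, the min-max normalization, and the name `encode_pose_sequence` are assumptions made for illustration, not details taken from the abstract.

```python
def encode_pose_sequence(frames):
    """Encode a pose sequence as an image-like nested list.

    frames: list of frames, each a list of (x, y) joint coordinates.
    Returns image[joint][frame] = (r, g, b) with values in [0, 1].
    Assumed layout (hypothetical): rows are joints, columns are frames,
    r holds normalized x, g holds normalized y, b is unused.
    """
    xs = [x for frame in frames for x, _ in frame]
    ys = [y for frame in frames for _, y in frame]
    x_min, y_min = min(xs), min(ys)
    # Guard against a zero range when all coordinates are identical.
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0

    num_joints = len(frames[0])
    image = []
    for j in range(num_joints):
        row = []
        for frame in frames:
            x, y = frame[j]
            row.append(((x - x_min) / x_span, (y - y_min) / y_span, 0.0))
        image.append(row)
    return image
```

A sequence of T frames with J joints yields a J x T x 3 array, which could then be passed to any off-the-shelf image classifier.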
What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision
We present a novel method for aligning a sequence of instructions to a video
of someone carrying out a task. In particular, we focus on the cooking domain,
where the instructions correspond to the recipe. Our technique relies on an HMM
to align the recipe steps to the (automatically generated) speech transcript.
We then refine this alignment using a state-of-the-art visual food detector,
based on a deep convolutional neural network. We show that our technique
outperforms simpler techniques based on keyword spotting. It also enables
interesting applications, such as automatically illustrating recipes with
keyframes, and searching within a video for events of interest.
Comment: To appear in NAACL 201
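The HMM alignment of recipe steps to a speech transcript described above can be illustrated with a small Viterbi decoder over a left-to-right state chain. This is a toy sketch under stated assumptions, not the authors' model: the word-overlap emission score, the `stay_prob` parameter, and the function names are hypothetical, and the visual refinement stage is omitted.

```python
import math


def emission_score(step, segment):
    """Toy log-score: smoothed word overlap between a step and a segment."""
    step_words = set(step.lower().split())
    seg_words = set(segment.lower().split())
    overlap = len(step_words & seg_words)
    return math.log((overlap + 1) / (len(seg_words) + 1))


def align_steps(steps, segments, stay_prob=0.5):
    """Return, for each transcript segment, the index of its aligned step.

    Assumes a monotonic alignment: the path starts at step 0 and, at each
    segment, either stays on the current step or advances to the next one.
    """
    neg_inf = float("-inf")
    n_steps, n_segs = len(steps), len(segments)
    log_stay, log_move = math.log(stay_prob), math.log(1.0 - stay_prob)

    score = [[neg_inf] * n_steps for _ in range(n_segs)]
    back = [[0] * n_steps for _ in range(n_segs)]
    score[0][0] = emission_score(steps[0], segments[0])

    for t in range(1, n_segs):
        for s in range(n_steps):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_move if s > 0 else neg_inf
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + emission_score(steps[s], segments[t])

    # Backtrack from the best-scoring final state.
    state = max(range(n_steps), key=lambda s: score[-1][s])
    path = [state]
    for t in range(n_segs - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]
```

Calling `align_steps(recipe_steps, transcript_segments)` returns one step index per segment; a real system would replace the overlap score with a learned emission model.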
Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions
Can we teach a robot to recognize and make predictions for activities that it
has never seen before? We tackle this problem by learning models for video from
text. This paper presents a hierarchical model that generalizes instructional
knowledge from large-scale text corpora and transfers the knowledge to video.
Given a portion of an instructional video, our model recognizes and predicts
coherent and plausible actions multiple steps into the future, all in rich
natural language. To demonstrate the capabilities of our model, we introduce
the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot
learning, recognition and anticipation. Extensive experiments with various
evaluation metrics demonstrate the potential of our method for generalization,
given limited video data for training models.
Comment: TPAMI 2022. arXiv admin note: text overlap with arXiv:1812.0250
Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes
The success of deep learning in computer vision is based on the availability of
large annotated datasets. To lower the need for hand-labeled images, virtually
rendered 3D worlds have recently gained popularity. Creating realistic 3D
content is challenging on its own and requires significant human effort. In
this work, we propose an alternative paradigm which combines real and synthetic
data for learning semantic instance segmentation and object detection models.
Exploiting the fact that not all aspects of the scene are equally important for
this task, we propose to augment real-world imagery with virtual objects of the
target category. Capturing real-world images at large scale is easy and cheap,
and directly provides real background appearances without the need for creating
complex 3D models of the environment. We present an efficient procedure to
augment real images with virtual objects. This allows us to create realistic
composite images which exhibit both realistic background appearance and a large
number of complex object arrangements. In contrast to modeling complete 3D
environments, our augmentation approach requires only a few user interactions
in combination with 3D shapes of the target object. Through extensive
experimentation, we determine the right set of parameters to produce augmented
data which can maximally enhance the performance of instance segmentation
models. Further, we demonstrate the utility of our approach on training
standard deep models for semantic instance segmentation and object detection of
cars in outdoor driving scenes. We test the models trained on our augmented
data on the KITTI 2015 dataset, which we have annotated with pixel-accurate
ground truth, and on the Cityscapes dataset. Our experiments demonstrate that
models trained on augmented imagery generalize better than those trained on
synthetic data or models trained on a limited amount of annotated real data.
Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations
The abundance of instructional videos and their narrations over the Internet
offers an exciting avenue for understanding procedural activities. In this
work, we propose to learn a video representation that encodes both action steps
and their temporal ordering, based on a large-scale dataset of web
instructional videos and their narrations, without using human annotations. Our
method jointly learns a video representation to encode individual step
concepts, and a deep probabilistic model to capture both temporal dependencies
and immense individual variations in the step ordering. We empirically
demonstrate that learning temporal ordering not only enables new capabilities
for procedure reasoning, but also reinforces the recognition of individual
steps. Our model significantly advances the state-of-the-art results on step
classification (+2.8% / +3.3% on COIN / EPIC-Kitchens) and step forecasting
(+7.4% on COIN). Moreover, our model attains promising results in zero-shot
inference for step classification and forecasting, as well as in predicting
diverse and plausible steps for incomplete procedures. Our code is available at
https://github.com/facebookresearch/ProcedureVRL.
Comment: Accepted to CVPR 202