Learning Human Optical Flow
The optical flow of humans is well known to be useful for the analysis of
human action. Given this, we devise an optical flow algorithm specifically for
human motion and show that it is superior to generic flow methods. Designing a
method by hand is impractical, so we develop a new training database of image
sequences with ground truth optical flow. For this we use a 3D model of the
human body and motion capture data to synthesize realistic flow fields. We then
train a convolutional neural network to estimate human flow fields from pairs
of images. Since many applications in human motion analysis depend on speed,
and we anticipate mobile applications, we base our method on SPyNet with
several modifications. We demonstrate that our trained network is more accurate
than a wide range of top methods on held-out test data and that it generalizes
well to real image sequences. When combined with a person detector/tracker, the
approach provides a full solution to the problem of 2D human flow estimation.
Both the code and the dataset are available for research.
Comment: British Machine Vision Conference 2018 (Oral)
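The accuracy claims in abstracts like this one are conventionally measured by average endpoint error (EPE). A minimal NumPy sketch of that metric, with illustrative stand-in arrays rather than the paper's data:

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    # Average endpoint error (EPE): mean Euclidean distance between
    # predicted and ground-truth flow vectors at every pixel.
    diff = flow_pred - flow_gt
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())

# Illustrative 4x4 flow fields (2 channels: horizontal and vertical).
pred = np.zeros((4, 4, 2))
gt = np.full((4, 4, 2), [3.0, 4.0])   # every ground-truth vector is (3, 4)
epe = endpoint_error(pred, gt)        # each per-pixel distance is 5.0
```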
Im2Flow: Motion Hallucination from Static Images for Action Recognition
Existing methods to recognize actions in static images take the images at
their face value, learning the appearances---objects, scenes, and body
poses---that distinguish each action class. However, such models are deprived
of the rich dynamic structure and motions that also define human activity. We
propose an approach that hallucinates the unobserved future motion implied by a
single snapshot to help static-image action recognition. The key idea is to
learn a prior over short-term dynamics from thousands of unlabeled videos,
infer the anticipated optical flow on novel static images, and then train
discriminative models that exploit both streams of information. Our main
contributions are twofold. First, we devise an encoder-decoder convolutional
neural network and a novel optical flow encoding that can translate a static
image into an accurate flow map. Second, we show the power of hallucinated flow
for recognition, successfully transferring the learned motion into a standard
two-stream network for activity recognition. On seven datasets, we demonstrate
the power of the approach. It not only achieves state-of-the-art accuracy for
dense optical flow prediction, but also consistently enhances recognition of
actions and dynamic scenes.
Comment: Published in CVPR 2018, project page: http://vision.cs.utexas.edu/projects/im2flow
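The "both streams of information" above can be combined by late fusion of per-stream class scores. A minimal sketch assuming a simple weighted average of softmax probabilities; the logits and fusion weight are illustrative, not the paper's trained values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def two_stream_fuse(appearance_logits, motion_logits, w=0.5):
    # Late fusion: weighted average of per-stream class probabilities.
    # In Im2Flow's setting the motion stream would consume hallucinated
    # flow rather than true optical flow.
    return w * softmax(appearance_logits) + (1 - w) * softmax(motion_logits)

# Illustrative logits for three action classes (not trained values).
appearance = np.array([2.0, 0.5, 0.1])
motion = np.array([0.2, 3.0, 0.1])
probs = two_stream_fuse(appearance, motion)
```

Late score fusion is the standard way a hallucinated-motion stream plugs into an existing two-stream recognition pipeline without retraining the appearance stream.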
Towards Geometric Understanding of Motion
The motion of the world is inherently dependent on the spatial structure of the world and its geometry. Therefore, classical optical flow methods try to model this geometry to solve for the motion. However, recent deep learning methods take a completely different approach. They try to predict optical flow by learning from labelled data. Although deep networks have shown state-of-the-art performance on classification problems in computer vision, they have not been as effective in solving optical flow. The key reason is that deep learning methods do not explicitly model the structure of the world in a neural network, and instead expect the network to learn about the structure from data. We hypothesize that it is difficult for a network to learn about motion without any constraint on the structure of the world. Therefore, we explore several approaches to explicitly model the geometry of the world and its spatial structure in deep neural networks.
The spatial structure in images can be captured by representing it at multiple scales. To represent multiple scales of images in deep neural nets, we introduce the Spatial Pyramid Network (SPyNet). Such a network can leverage global information for estimating large motions and local information for estimating small motions. We show that SPyNet significantly improves over previous optical flow networks while also being the smallest and fastest neural network for motion estimation. SPyNet achieves a 97% reduction in model parameters over previous methods and is more accurate.
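The coarse-to-fine idea behind SPyNet can be sketched as residual accumulation across pyramid levels. In this toy version the per-level networks are stubbed out as precomputed residual arrays, and nearest-neighbour upsampling stands in for whatever interpolation an implementation would use:

```python
import numpy as np

def upsample_flow(flow):
    # Nearest-neighbour 2x upsampling; displacement values are doubled
    # because pixel offsets scale with image resolution.
    return flow.repeat(2, axis=0).repeat(2, axis=1) * 2.0

def coarse_to_fine(residuals):
    # SPyNet-style accumulation: each pyramid level predicts only a
    # residual flow, which is added to the upsampled coarser-level
    # estimate. Residuals are ordered from coarsest to finest.
    flow = None
    for r in residuals:
        flow = r if flow is None else upsample_flow(flow) + r
    return flow

coarse = np.ones((2, 2, 2))   # 2x2 level predicts a 1-pixel motion
fine = np.zeros((4, 4, 2))    # 4x4 level adds no correction
flow = coarse_to_fine([coarse, fine])
```

Because each level only refines an already-warped estimate, the per-level networks can stay small, which is where the parameter savings come from.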
The spatial structure of the world extends to people and their motion. Humans have a very well-defined structure, and this information is useful in estimating optical flow for humans. To leverage this information, we create a synthetic dataset for human optical flow using a statistical human body model and motion capture sequences. We use this dataset to train deep networks and see significant improvement in the ability of the networks to estimate human optical flow.
The structure and geometry of the world affects the motion. Therefore, learning about the structure of the scene together with the motion can benefit both problems. To facilitate this, we introduce Competitive Collaboration, where several neural networks are constrained by geometry and can jointly learn about structure and motion in the scene without any labels. To this end, we show that jointly learning single view depth prediction, camera motion, optical flow and motion segmentation using Competitive Collaboration achieves state-of-the-art results among unsupervised approaches.
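The label-free supervision signal in such joint methods is typically photometric: warp one frame by the estimated flow and penalize its difference from the other frame. A toy sketch with nearest-neighbour warping; a real system would use differentiable bilinear sampling and handle occlusion:

```python
import numpy as np

def warp_nearest(img, flow):
    # Backward warp with nearest-neighbour sampling, clipped at borders:
    # out[y, x] = img[y + flow_v, x + flow_u].
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    sy = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[sy, sx]

def photometric_loss(frame1, frame2, flow):
    # Label-free proxy loss: if the flow is right, frame2 warped back by
    # the flow should match frame1 wherever the scene is unoccluded.
    return float(np.abs(frame1 - warp_nearest(frame2, flow)).mean())

img = np.arange(16.0).reshape(4, 4)
zero_flow = np.zeros((4, 4, 2))
loss = photometric_loss(img, img, zero_flow)  # 0.0: static scene, zero flow
```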
Our findings support our hypothesis that explicit constraints on the structure and geometry of the world lead to better methods for motion estimation.
Optical Flow Estimation in the Deep Learning Age
Akin to many subareas of computer vision, the recent advances in deep
learning have also significantly influenced the literature on optical flow.
Previously, the literature had been dominated by classical energy-based models,
which formulate optical flow estimation as an energy minimization problem.
However, as the practical benefits of Convolutional Neural Networks (CNNs) over
conventional methods have become apparent in numerous areas of computer vision
and beyond, they have also seen increased adoption in the context of motion
estimation to the point where the current state of the art in terms of accuracy
is set by CNN approaches. We first review this transition as well as the
developments from early work to the current state of CNNs for optical flow
estimation. Alongside, we discuss some of their technical details and compare
them to recapitulate which technical contribution led to the most significant
accuracy improvements. Then we provide an overview of the various optical flow
approaches introduced in the deep learning age, including those based on
alternative learning paradigms (e.g., unsupervised and semi-supervised methods)
as well as the extension to the multi-frame case, which is able to yield
further accuracy improvements.
Comment: To appear as a book chapter in Modelling Human Motion, N. Noceti, A. Sciutti and F. Rea, Eds., Springer, 202
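The classical energy-based formulation the survey starts from pairs a brightness-constancy data term with a smoothness regularizer, as in Horn-Schunck. A toy evaluation of that energy; the linearized data term and simple finite-difference gradients are illustrative, not any specific method's discretization:

```python
import numpy as np

def hs_energy(I1, I2, u, v, alpha=1.0):
    # Linearized brightness-constancy data term (Ix*u + Iy*v + It)^2,
    # plus an alpha-weighted first-order smoothness term on the flow
    # field (u, v). Classical methods minimize this energy over (u, v).
    Ix = np.gradient(I1, axis=1)
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1
    data = ((Ix * u + Iy * v + It) ** 2).sum()
    smooth = sum((np.gradient(f, axis=a) ** 2).sum()
                 for f in (u, v) for a in (0, 1))
    return float(data + alpha * smooth)

I = np.ones((4, 4))
zero = np.zeros((4, 4))
```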
Learning Optical Flow, Depth, and Scene Flow without Real-World Labels
Self-supervised monocular depth estimation enables robots to learn 3D
perception from raw video streams. This scalable approach leverages projective
geometry and ego-motion to learn via view synthesis, assuming the world is
mostly static. Dynamic scenes, which are common in autonomous driving and
human-robot interaction, violate this assumption. Therefore, they require
modeling dynamic objects explicitly, for instance via estimating pixel-wise 3D
motion, i.e. scene flow. However, the simultaneous self-supervised learning of
depth and scene flow is ill-posed, as there are infinitely many combinations
that result in the same 3D point. In this paper we propose DRAFT, a new method
capable of jointly learning depth, optical flow, and scene flow by combining
synthetic data with geometric self-supervision. Building upon the RAFT
architecture, we learn optical flow as an intermediate task to bootstrap depth
and scene flow learning via triangulation. Our algorithm also leverages
temporal and geometric consistency losses across tasks to improve multi-task
learning. Our DRAFT architecture simultaneously establishes a new state of the
art in all three tasks in the self-supervised monocular setting on the standard
KITTI benchmark. Project page: https://sites.google.com/tri.global/draft
Comment: Accepted to RA-L + ICRA 202
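The triangulation step that lets flow bootstrap depth can be illustrated in the simplest case of a pure sideways camera translation, where a static point's horizontal flow is inversely proportional to its depth. This is a toy model, not DRAFT's actual multi-task formulation:

```python
import numpy as np

def depth_from_translation(flow_x, fx, tx):
    # Toy triangulation: for a pure sideways translation tx (metres) of
    # a camera with focal length fx (pixels), a static point's
    # horizontal flow is fx * tx / depth, so depth follows from flow
    # alone. This only illustrates why flow is a useful intermediate
    # task for depth; the clamp guards against division by zero.
    return float(fx * tx / np.maximum(np.abs(flow_x), 1e-6))

# A point 10 m away, seen with fx = 100 px and tx = 0.5 m, moves 5 px.
depth = depth_from_translation(5.0, 100.0, 0.5)
```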