1,949 research outputs found
Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks
This paper presents to the best of our knowledge the first end-to-end object
tracking approach which directly maps from raw sensor input to object tracks in
sensor space without requiring any feature engineering or system identification
in the form of plant or sensor models. Specifically, our system accepts a
stream of raw sensor data at one end and, in real-time, produces an estimate of
the entire environment state at the output including even occluded objects. We
achieve this by framing the problem as a deep learning task and exploit
sequence models in the form of recurrent neural networks to learn a mapping
from sensor measurements to object tracks. In particular, we propose a learning
method based on a form of input dropout which allows learning in an
unsupervised manner, only based on raw, occluded sensor data without access to
ground-truth annotations. We demonstrate our approach using a synthetic dataset
designed to mimic the task of tracking objects in 2D laser data -- as commonly
encountered in robotics applications -- and show that it learns to track many
dynamic objects despite occlusions and the presence of sensor noise.Comment: Published in The Thirtieth AAAI Conference on Artificial Intelligence
(AAAI-16), Video: https://youtu.be/cdeWCpfUGWc, Code:
http://mrg.robots.ox.ac.uk/mrg_people/peter-ondruska
Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in the video. Several attempts have
been made to capture frame-level salient aspects through attention but they
lack the capacity to effectively map the temporal relationships in between the
frames as they only capture a limited span of temporal dependencies. To this
end we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention
framework analyses the input video at multiple temporal scales, to form
embeddings at frame level and segment level, and perform fine-grained action
segmentation. This generates a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams and has multiple
application domains. We evaluate our system on multiple challenging public
benchmark datasets, including MERL Shopping, 50 salads, and Georgia Tech
Egocentric datasets, and achieves state-of-the-art performance. The evaluated
datasets encompass numerous video capture settings which are inclusive of
static overhead camera views and dynamic, ego-centric head-mounted camera
views, demonstrating the direct applicability of the proposed framework in a
variety of settings.Comment: Published in Pattern Recognition Letter
Deep Visual Foresight for Planning Robot Motion
A key challenge in scaling up robot learning to many skills and environments
is removing the need for human supervision, so that robots can collect their
own data and improve their own performance without being limited by the cost of
requesting human feedback. Model-based reinforcement learning holds the promise
of enabling an agent to learn to predict the effects of its actions, which
could provide flexible predictive models for a wide range of tasks and
environments, without detailed human supervision. We develop a method for
combining deep action-conditioned video prediction models with model-predictive
control that uses entirely unlabeled training data. Our approach does not
require a calibrated camera, an instrumented training set-up, nor precise
sensing and actuation. Our results show that our method enables a real robot to
perform nonprehensile manipulation -- pushing objects -- and can handle novel
objects not seen during training.Comment: ICRA 2017. Supplementary video:
https://sites.google.com/site/robotforesight
Neural Expectation Maximization
Many real world tasks such as reasoning and physical interaction require
identification and manipulation of conceptual entities. A first step towards
solving these tasks is the automated discovery of distributed symbol-like
representations. In this paper, we explicitly formalize this problem as
inference in a spatial mixture model where each component is parametrized by a
neural network. Based on the Expectation Maximization framework we then derive
a differentiable clustering method that simultaneously learns how to group and
represent individual entities. We evaluate our method on the (sequential)
perceptual grouping task and find that it is able to accurately recover the
constituent objects. We demonstrate that the learned representations are useful
for next-step prediction.Comment: Accepted to NIPS 201
- …