Self-Supervised Relative Depth Learning for Urban Scene Understanding
As an agent moves through the world, the apparent motion of scene elements is
(usually) inversely proportional to their depth. It is natural for a learning
agent to associate image patterns with the magnitude of their displacement over
time: as the agent moves, faraway mountains don't move much; nearby trees move
a lot. This natural relationship between the appearance of objects and their
motion is a rich source of information about the world. In this work, we start
by training a deep network, using fully automatic supervision, to predict
relative scene depth from single images. The relative depth training images are
automatically derived from simple videos of cars moving through a scene, using
recent motion segmentation techniques, and no human-provided labels. This proxy
task of predicting relative depth from a single image induces features in the
network that result in large improvements in a set of downstream tasks
including semantic segmentation, joint road segmentation and car detection, and
monocular (absolute) depth estimation, over a network trained from scratch. The
improvement on the semantic segmentation task is greater than that produced by
any other automatically supervised method. Moreover, for monocular depth
estimation, our unsupervised pre-training method even outperforms supervised
pre-training with ImageNet. In addition, we demonstrate benefits from learning
to predict (unsupervised) relative depth in the specific videos associated with
various downstream tasks. We adapt to the specific scenes in those tasks in an
unsupervised manner to improve performance. In summary, for semantic
segmentation, we present state-of-the-art results among methods that do not use
supervised pre-training, and we even exceed the performance of supervised
ImageNet pre-trained models for monocular depth estimation, achieving results
that are comparable with state-of-the-art methods.
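As a rough illustration of the proxy task described above, the sketch below (assumed PyTorch, not the authors' code) derives ordinal depth labels from optical-flow magnitude, using the premise that pixels with larger apparent motion are closer, and trains a single-image network with a ranking loss. RelDepthNet, ranking_loss, and all tensor shapes are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): ordinal relative-depth labels
# from flow magnitude, plus a ranking loss on a single-image network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelDepthNet(nn.Module):
    """Tiny stand-in for the paper's depth-prediction backbone."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),   # per-pixel relative depth
        )
    def forward(self, x):
        return self.net(x)

def ranking_loss(pred, flow_mag, n_pairs=512):
    """For random pixel pairs, the pixel with larger apparent motion is
    assumed closer (smaller depth); penalize predictions that disagree."""
    b, _, h, w = pred.shape
    idx_a = torch.randint(0, h * w, (b, n_pairs))
    idx_b = torch.randint(0, h * w, (b, n_pairs))
    p = pred.view(b, -1)
    m = flow_mag.view(b, -1)
    pa, pb = p.gather(1, idx_a), p.gather(1, idx_b)
    ma, mb = m.gather(1, idx_a), m.gather(1, idx_b)
    # target: +1 if pixel a moves more than b (a is closer, smaller depth)
    sign = torch.sign(ma - mb)
    # hinge-style ordinal loss on the predicted depth difference
    return F.relu(1.0 + sign * (pa - pb)).mean()

model = RelDepthNet()
images = torch.randn(4, 3, 64, 64)
flow_mag = torch.rand(4, 1, 64, 64)   # placeholder; see note below
loss = ranking_loss(model(images), flow_mag)
loss.backward()
```

In the pipeline the abstract describes, flow_mag would come from motion segmentation over videos of cars moving through a scene, rather than from random tensors as here.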
Identification of Invariant Sensorimotor Structures as a Prerequisite for the Discovery of Objects
Perceiving the surrounding environment in terms of objects is useful for any
general purpose intelligent agent. In this paper, we investigate a fundamental
mechanism making object perception possible, namely the identification of
spatio-temporally invariant structures in the sensorimotor experience of an
agent. We take inspiration from the Sensorimotor Contingencies Theory to define
a computational model of this mechanism through a sensorimotor, unsupervised
and predictive approach. Our model is based on processing the unsupervised
interaction of an artificial agent with its environment. We show how
spatio-temporally invariant structures in the environment induce regularities
in the sensorimotor experience of an agent, and how this agent, while building
a predictive model of its sensorimotor experience, can capture them as densely
connected subgraphs in a graph of sensory states connected by motor commands.
Our approach is focused on elementary mechanisms, and is illustrated with a set
of simple experiments in which an agent interacts with an environment. We show
how the agent can build an internal model of moving but spatio-temporally
invariant structures by performing a Spectral Clustering of the graph modeling
its overall sensorimotor experiences. We systematically examine properties of
the model, shedding light on what distinguishes this paradigm from methods
based on the supervised processing of collections of static images.
Comment: 24 pages, 10 figures, published in Frontiers in Robotics and AI
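To make the graph-based mechanism concrete, here is a small sketch (assumed Python with scikit-learn, not the paper's implementation) that applies spectral clustering to a graph of sensory states, linked when a motor command transitions between them, so that densely connected subgraphs emerge as candidate invariant structures. The transition data are synthetic placeholders.

```python
# Minimal sketch (not the paper's implementation): invariant structures
# as densely connected subgraphs of a sensorimotor transition graph.
import numpy as np
from sklearn.cluster import SpectralClustering

n_states = 12
A = np.zeros((n_states, n_states))   # affinity over sensory states

# Transitions (s, s') observed when issuing motor commands; states 0-5
# and 6-11 each form a dense block, mimicking two invariant structures.
rng = np.random.default_rng(0)
for block in (range(0, 6), range(6, 12)):
    for s in block:
        for t in block:
            if s != t and rng.random() < 0.7:
                A[s, t] = A[t, s] = 1.0
A[5, 6] = A[6, 5] = 1.0              # sparse link between the structures

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(A)
print(labels)  # e.g. [0 0 0 0 0 0 1 1 1 1 1 1]: one label per structure
```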
Ego-motion and Surrounding Vehicle State Estimation Using a Monocular Camera
Understanding ego-motion and surrounding vehicle state is essential to enable
automated driving and advanced driving assistance technologies. Typical
approaches to solve this problem use fusion of multiple sensors such as LiDAR,
camera, and radar to recognize surrounding vehicle state, including position,
velocity, and orientation. Such sensor suites are too complex and costly for
production personal-use vehicles. In this paper, we propose a
novel machine learning method to estimate ego-motion and surrounding vehicle
state using a single monocular camera. Our approach is based on a combination
of three deep neural networks to estimate the 3D vehicle bounding box, depth,
and optical flow from a sequence of images. The main contribution of this paper
is a new framework and algorithm that integrates these three networks in order
to estimate the ego-motion and surrounding vehicle state. To realize more
accurate 3D position estimation, we address ground plane correction in
real-time. The efficacy of the proposed method is demonstrated through
experimental evaluations that compare our results to ground truth data
available from other sensors, including the CAN bus and LiDAR.
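For a concrete sense of how the network outputs could be combined, the sketch below (illustrative only; the intrinsics, pixel coordinates, and camera height are assumptions, not values from the paper) back-projects the bottom-center of a detected vehicle's box using a monocular depth estimate, then applies a simple ground-plane correction.

```python
# Minimal sketch (illustrative, not the paper's algorithm): fusing a 2D
# detection and a depth estimate into a ground-plane-corrected 3D position.
import numpy as np

K = np.array([[721.5, 0.0, 609.6],   # assumed pinhole intrinsics
              [0.0, 721.5, 172.9],
              [0.0,   0.0,   1.0]])
cam_height = 1.65                     # assumed camera height above road (m)

def backproject(u, v, depth, K):
    """Lift pixel (u, v) at the given depth into camera coordinates."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray * depth                # (X, Y, Z) with Z along the optical axis

# Bottom-center of the vehicle's 2D box + depth-network output there.
u, v, depth = 700.0, 250.0, 18.3
p = backproject(u, v, depth, K)

# Ground-plane correction: the box bottom should sit on the road, so
# rescale the ray until its Y (downward) coordinate equals the camera
# height, snapping the point onto the estimated ground plane.
scale = cam_height / p[1]
p_corrected = p * scale
print(p_corrected)   # corrected 3D position of the vehicle's base
```

In the full framework the abstract describes, optical flow between consecutive frames would then supply the motion component of the surrounding vehicle's state.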