What Can I Do Around Here? Deep Functional Scene Understanding for Cognitive Robots
For robots that have the capability to interact with the physical environment
through their end effectors, understanding the surrounding scenes is not merely
a task of image classification or object recognition. To perform actual tasks,
it is critical for the robot to have a functional understanding of the visual
scene. Here, we address the problem of localizing and recognizing functional
areas from an arbitrary indoor scene, formulated as a two-stage deep learning
based detection pipeline. A new scene functionality test bed, which is
compiled from two publicly available indoor scene datasets, is used for
evaluation. Our method is evaluated quantitatively on the new dataset,
demonstrating the ability to perform efficient recognition of functional areas
from arbitrary indoor scenes. We also demonstrate that our detection model can
be generalized onto novel indoor scenes by cross validating it with the images
from two different datasets.
Action-oriented Scene Understanding
To allow robots to act autonomously, it is crucial that they not only describe their environment accurately but also identify how to interact with their surroundings.
While we witnessed tremendous progress in descriptive computer vision, approaches that explicitly target action are scarcer.
This cumulative dissertation approaches the goal of interpreting visual scenes “in the wild” with respect to actions implied by the scene. We call this approach action-oriented scene understanding. It involves identifying and judging opportunities for interaction with constituents of the scene (e.g. objects and their parts) as well as understanding object functions and how interactions will impact the future. All of these aspects are addressed on three levels of abstraction: elements, perception and reasoning.
On the elementary level, we investigate semantic and functional grouping of objects by analyzing annotated natural image scenes. We compare object label-based and visual context definitions with respect to their suitability for generating meaningful object class representations. Our findings suggest that representations generated from visual context are on par, in terms of semantic quality, with those generated from large quantities of text.
The perceptive level concerns action identification. We propose a system to identify possible interactions for robots and humans with the environment (affordances) on a pixel level using state-of-the-art machine learning methods. Pixel-wise part annotations of images are transformed into 12 affordance maps. Using these maps, a convolutional neural network is trained to densely predict affordance maps from unknown RGB images. In contrast to previous work, this approach operates exclusively on RGB images during both training and testing, and yet achieves state-of-the-art performance.
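The step of turning pixel-wise part annotations into affordance maps can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the part labels, the affordance names, the `PART_TO_AFFORDANCES` table and the function `parts_to_affordance_maps` are all hypothetical, and only three affordances are used here instead of the 12 mentioned above.

```python
# Hypothetical part labels and affordance names, chosen for illustration only.
PART_TO_AFFORDANCES = {
    "chair_seat": {"sittable"},
    "mug_handle": {"graspable"},
    "mug_body": {"graspable", "pourable"},
}
AFFORDANCES = ["graspable", "pourable", "sittable"]


def parts_to_affordance_maps(part_mask, part_names):
    """Turn a 2D grid of part ids into one binary map per affordance.

    part_mask:  list of rows, each a list of integer part ids.
    part_names: dict mapping part id -> part label; ids missing from
                this dict are treated as background.
    """
    height = len(part_mask)
    width = len(part_mask[0]) if height else 0
    # One all-zero map per affordance, same size as the input mask.
    maps = {a: [[0] * width for _ in range(height)] for a in AFFORDANCES}
    for y, row in enumerate(part_mask):
        for x, part_id in enumerate(row):
            label = part_names.get(part_id)
            # A pixel may carry several affordances at once.
            for affordance in PART_TO_AFFORDANCES.get(label, ()):
                maps[affordance][y][x] = 1
    return maps


part_mask = [[0, 1],
             [2, 3]]
part_names = {1: "chair_seat", 2: "mug_handle", 3: "mug_body"}
maps = parts_to_affordance_maps(part_mask, part_names)
```

Because one part can imply several affordances, the maps are kept as independent binary channels rather than a single label image; a network trained on them would then predict each channel per pixel.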
At the reasoning level, we extend the question from asking what actions are possible to what actions are plausible. For this, we gathered a dataset of household images associated with human ratings of the likelihoods of eight different actions. Based on the judgement provided by the human raters, we train convolutional neural networks to generate plausibility scores from unseen images.
Furthermore, having considered only static scenes previously in this thesis, we propose a system that takes video input and predicts plausible future actions. Since this requires careful identification of relevant features in the video sequence, we analyze this particular aspect in detail using a synthetic dataset for several state-of-the-art video models. We identify feature learning as a major obstacle for anticipation in natural video data.
The presented projects analyze the role of action in scene understanding from various angles and in multiple settings while highlighting the advantages of assuming an action-oriented perspective.
We conclude that action-oriented scene understanding can augment classic computer vision in many real-life applications, in particular robotics.
Context-aware Human Motion Prediction
The problem of predicting human motion given a sequence of past observations
is at the core of many applications in robotics and computer vision. Current
state-of-the-art methods formulate this problem as a sequence-to-sequence task, in
which a history of 3D skeletons feeds a Recurrent Neural Network (RNN) that
predicts future movements, typically in the order of 1 to 2 seconds. However,
one aspect that has been overlooked so far is the fact that human motion is
inherently driven by interactions with objects and/or other humans in the
environment. In this paper, we explore this scenario using a novel
context-aware motion prediction architecture. We use a semantic-graph model
where the nodes parameterize the human and objects in the scene and the edges
their mutual interactions. These interactions are iteratively learned through a
graph attention layer, fed with the past observations, which now include both
object and human body motions. Once this semantic graph is learned, we inject
it into a standard RNN to predict future movements of the humans and objects.
We consider two variants of our architecture, either freezing the contextual
interactions in the future or updating them. A thorough evaluation on the
"Whole-Body Human Motion Database" shows that in both cases, our context-aware
networks clearly outperform baselines in which the context information is not
considered. Comment: Accepted at CVPR2
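The idea of weighting interactions between scene nodes with attention can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's architecture: it scores each neighbor by a plain dot product and applies a softmax, whereas a learned graph attention layer would add trainable projections. The function `attention_aggregate`, the node names and the feature vectors are all hypothetical.

```python
import math


def attention_aggregate(features, edges):
    """One simplified attention step over a scene graph.

    features: dict node -> list of floats (all vectors the same length).
    edges:    dict node -> list of neighbor nodes (missing = no neighbors).
    Returns dict node -> (attention weights per neighbor, aggregated vector).
    """
    out = {}
    for node, h_i in features.items():
        # Each node also attends to itself (self-loop).
        neighbors = edges.get(node, []) + [node]
        # Dot-product score between the node and each neighbor.
        scores = [sum(a * b for a, b in zip(h_i, features[n])) for n in neighbors]
        # Numerically stable softmax over the scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of neighbor features.
        aggregated = [
            sum(w * features[n][k] for w, n in zip(weights, neighbors))
            for k in range(len(h_i))
        ]
        out[node] = (dict(zip(neighbors, weights)), aggregated)
    return out


features = {"human": [1.0, 0.0], "cup": [1.0, 0.0], "table": [0.0, 1.0]}
edges = {"human": ["cup", "table"]}
result = attention_aggregate(features, edges)
human_weights, human_context = result["human"]
```

In this toy graph the human node attends more strongly to the cup than to the table because their feature vectors are more similar. In a pipeline like the one described, such context vectors would be computed from past observations and passed on to the recurrent predictor.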
Deep Reinforcement Learning on a Budget: 3D Control and Reasoning Without a Supercomputer
An important goal of research in Deep Reinforcement Learning in mobile
robotics is to train agents capable of solving complex tasks, which require a
high level of scene understanding and reasoning from an egocentric perspective.
When trained from simulations, optimal environments should satisfy a currently
unobtainable combination of high-fidelity photographic observations, massive
amounts of different environment configurations and fast simulation speeds. In
this paper we argue that research on training agents capable of complex
reasoning can be simplified by decoupling from the requirement of high fidelity
photographic observations. We present a suite of tasks requiring complex
reasoning and exploration in continuous, partially observable 3D environments.
The objective is to provide challenging scenarios and a robust baseline agent
architecture that can be trained on mid-range consumer hardware in under 24h.
Our scenarios combine two key advantages: (i) they are based on a simple but
highly efficient 3D environment (ViZDoom) which allows high speed simulation
(12,000 fps); (ii) the scenarios provide the user with a range of difficulty
settings, in order to identify the limitations of current state-of-the-art
algorithms and network architectures. We aim to increase accessibility to the
field of Deep-RL by providing baselines for challenging scenarios where new
ideas can be iterated on quickly. We argue that the community should be able to
address challenging problems in reasoning of mobile agents without the need for
a large compute infrastructure.