Data-Driven Grasp Synthesis - A Survey
We review the work on data-driven grasp synthesis and the methodologies for
sampling and ranking candidate grasps. We divide the approaches into three
groups based on whether they synthesize grasps for known, familiar or unknown
objects. This structure allows us to identify common object representations and
perceptual processes that facilitate the employed data-driven grasp synthesis
technique. In the case of known objects, we concentrate on the approaches that
are based on object recognition and pose estimation. In the case of familiar
objects, the techniques use some form of similarity matching to a set of
previously encountered objects. Finally, for the approaches dealing with unknown
objects, the core step is the extraction of specific features that are
indicative of good grasps. Our survey provides an overview of the different
methodologies and discusses open problems in the area of robot grasping. We
also draw a parallel to the classical approaches that rely on analytic
formulations.
Comment: 20 pages, 30 figures, submitted to IEEE Transactions on Robotics
Action-oriented Scene Understanding
In order to allow robots to act autonomously, it is crucial that they not only describe their environment accurately but also identify how to interact with their surroundings.
While we witnessed tremendous progress in descriptive computer vision, approaches that explicitly target action are scarcer.
This cumulative dissertation approaches the goal of interpreting visual scenes “in the wild” with respect to actions implied by the scene. We call this approach action-oriented scene understanding. It involves identifying and judging opportunities for interaction with constituents of the scene (e.g. objects and their parts) as well as understanding object functions and how interactions will impact the future. All of these aspects are addressed on three levels of abstraction: elements, perception and reasoning.
On the elementary level, we investigate semantic and functional grouping of objects by analyzing annotated natural image scenes. We compare object label-based and visual context definitions with respect to their suitability for generating meaningful object class representations. Our findings suggest that representations generated from visual context are on par, in terms of semantic quality, with those generated from large quantities of text.
The perceptive level concerns action identification. We propose a system to identify possible interactions for robots and humans with the environment (affordances) on a pixel level using state-of-the-art machine learning methods. Pixel-wise part annotations of images are transformed into 12 affordance maps. Using these maps, a convolutional neural network is trained to densely predict affordance maps from unknown RGB images. In contrast to previous work, this approach operates exclusively on RGB images during both training and testing, yet achieves state-of-the-art performance.
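The first step above, turning pixel-wise part annotations into affordance maps, can be sketched as a per-pixel lookup. This is a minimal sketch assuming a hypothetical part-to-affordance table; the dissertation's actual part taxonomy and its 12 affordance classes are not reproduced here.

```python
import numpy as np

# Hypothetical lookup from part labels to afforded actions (illustrative
# subset only; the real system derives 12 affordance maps from its own
# part taxonomy).
AFFORDANCES = ["sit", "grasp", "support", "open"]
PART_TO_AFFORDANCES = {
    1: ["sit", "support"],  # e.g. a seat surface
    2: ["grasp"],           # e.g. a handle
    3: ["open"],            # e.g. a lid
}

def parts_to_affordance_maps(part_labels: np.ndarray) -> np.ndarray:
    """Turn an HxW map of integer part labels into a CxHxW stack of
    binary affordance maps, one channel per affordance."""
    h, w = part_labels.shape
    maps = np.zeros((len(AFFORDANCES), h, w), dtype=np.uint8)
    for part_id, affs in PART_TO_AFFORDANCES.items():
        mask = part_labels == part_id
        for aff in affs:
            maps[AFFORDANCES.index(aff)][mask] = 1
    return maps

# Tiny example: a 2x2 annotation with a seat (label 1) and a handle (label 2).
labels = np.array([[1, 1], [2, 0]])
maps = parts_to_affordance_maps(labels)
print(maps[AFFORDANCES.index("sit")])    # [[1 1] [0 0]]
print(maps[AFFORDANCES.index("grasp")])  # [[0 0] [1 0]]
```

Maps built this way would serve as the dense training targets for a convolutional network that predicts them directly from RGB input.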
At the reasoning level, we extend the question from asking what actions are possible to what actions are plausible. For this, we gathered a dataset of household images associated with human ratings of the likelihoods of eight different actions. Based on the judgements provided by the human raters, we train convolutional neural networks to generate plausibility scores from unseen images.
Furthermore, moving beyond the static scenes considered so far in this thesis, we propose a system that takes video input and predicts plausible future actions. Since this requires careful identification of relevant features in the video sequence, we analyze this particular aspect in detail using a synthetic dataset for several state-of-the-art video models. We identify feature learning as a major obstacle for anticipation in natural video data.
The presented projects analyze the role of action in scene understanding from various angles and in multiple settings while highlighting the advantages of assuming an action-oriented perspective.
We conclude that action-oriented scene understanding can augment classic computer vision in many real-life applications, in particular robotics.
Is That a Chair? Imagining Affordances Using Simulations of an Articulated Human Body
For robots to exhibit a high level of intelligence in the real world, they
must be able to assess objects for which they have no prior knowledge.
Therefore, it is crucial for robots to perceive object affordances by reasoning
about physical interactions with the object. In this paper, we propose a novel
method to provide robots with an ability to imagine object affordances using
physical simulations. The class of chair is chosen here as an initial category
of objects to illustrate a more general paradigm. In our method, the robot
"imagines" the affordance of an arbitrarily oriented object as a chair by
simulating a physical sitting interaction between an articulated human body and
the object. This object affordance reasoning is used as a cue for object
classification (chair vs non-chair). Moreover, if an object is classified as a
chair, the affordance reasoning can also predict the upright pose of the object
which allows the sitting interaction to take place. We call such poses
functional poses. We demonstrate our method in chair classification on
synthetic 3D CAD models. Although our method uses only 30 models for training,
it outperforms appearance-based deep learning methods, which require a large
amount of training data, when the upright orientation is not assumed to be
known a priori. In addition, we showcase that the functional pose predictions
of our method align well with human judgments on both synthetic models and real
objects scanned by a depth camera.
Comment: 7 pages, 6 figures. Accepted to ICRA202
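The imagination procedure described above can be sketched as a search over sampled object orientations, each scored by a simulated sitting interaction. In this sketch the physics step is replaced by a stand-in scoring function, since the actual method drops an articulated human body onto the object in a physics simulator; the threshold and the sampling grid are likewise illustrative assumptions.

```python
import math

def sitting_score(orientation_deg: float) -> float:
    """Stand-in for the physical sitting simulation. The real method
    simulates an articulated human body sitting on the object and measures
    the quality of the resulting pose; here, a toy score that peaks when a
    hypothetical seat surface faces up (orientation 0 degrees)."""
    return max(0.0, math.cos(math.radians(orientation_deg)))

def imagine_chair(orientations, threshold=0.8):
    """Classify an object as a chair if any sampled orientation affords
    sitting; the best-scoring orientation is the predicted functional pose."""
    scores = {o: sitting_score(o) for o in orientations}
    best = max(scores, key=scores.get)
    is_chair = scores[best] >= threshold
    return is_chair, (best if is_chair else None)

samples = range(0, 360, 30)  # coarse sampling of candidate orientations
is_chair, functional_pose = imagine_chair(samples)
print(is_chair, functional_pose)  # True 0 under the toy score
```

The same loop yields both outputs reported in the abstract: the binary chair/non-chair decision and, for positives, the functional (upright) pose in which sitting is possible.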