Real-time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor
We present an approach for real-time, robust, and accurate hand pose estimation from moving egocentric RGB-D cameras in cluttered real environments. Existing methods typically fail for hand-object interactions in cluttered scenes imaged from egocentric viewpoints, which are common in virtual or augmented reality applications. Our approach uses two subsequently applied Convolutional Neural Networks (CNNs) to localize the hand and regress 3D joint locations. Hand localization is achieved by using a CNN to estimate the 2D position of the hand center in the input, even in the presence of clutter and occlusions. The localized hand position, together with the corresponding input depth value, is used to generate a normalized cropped image that is fed into a second CNN to regress relative 3D hand joint locations in real time. For added accuracy, robustness and temporal stability, we refine the pose estimates using a kinematic pose tracking energy. To train the CNNs, we introduce a new photorealistic dataset that uses a merged reality approach to capture and synthesize large amounts of annotated data of natural hand interaction in cluttered scenes. Through quantitative and qualitative evaluation, we show that our method is robust to self-occlusion and occlusions by objects, particularly in moving egocentric perspectives.
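The two-stage localize-then-regress design can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the layer sizes, the 21-joint hand model, the 128-pixel crop, and the names HandLocNet, JointRegNet, and crop_around_center are assumptions made here, and the kinematic pose-tracking refinement energy is omitted.

```python
# Minimal sketch of a two-stage localize-then-regress hand pose pipeline (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HandLocNet(nn.Module):
    """Stage 1: predict a coarse heatmap of the 2D hand-center location in the full depth image."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),                      # 1-channel heatmap logits
        )

    def forward(self, depth):                         # depth: (B, 1, H, W)
        return self.features(depth)                   # heatmap logits: (B, 1, H/4, W/4)

class JointRegNet(nn.Module):
    """Stage 2: regress relative 3D joint positions from a normalized depth crop."""
    def __init__(self, num_joints=21):
        super().__init__()
        self.num_joints = num_joints
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(64 * 4 * 4, num_joints * 3)

    def forward(self, crop):                          # crop: (B, 1, 128, 128)
        x = self.conv(crop).flatten(1)
        return self.fc(x).view(-1, self.num_joints, 3)  # joint offsets relative to the hand center

def crop_around_center(depth, heatmap, crop=128):
    """Take the heatmap argmax as the hand center and cut a fixed-size, depth-normalized crop
    (assumes the input image is larger than the crop)."""
    b, _, h, w = depth.shape
    hm = F.interpolate(heatmap, size=(h, w), mode="bilinear", align_corners=False)
    idx = hm.flatten(1).argmax(dim=1)
    cy, cx = idx // w, idx % w
    crops = []
    for i in range(b):
        y0 = int(cy[i].clamp(crop // 2, h - crop // 2)) - crop // 2
        x0 = int(cx[i].clamp(crop // 2, w - crop // 2)) - crop // 2
        patch = depth[i:i + 1, :, y0:y0 + crop, x0:x0 + crop]
        center_d = depth[i, 0, cy[i], cx[i]]
        crops.append(patch - center_d)                # normalize depth relative to the hand center
    return torch.cat(crops, dim=0)
```

In this layout the first network only has to solve a coarse localization problem, so the second network can work on a small, depth-normalized crop; temporal refinement would then operate on its per-frame joint predictions.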
PointNet++ Grasping: Learning An End-to-end Spatial Grasp Generation Algorithm from Sparse Point Clouds
Grasping novel objects is important for robot manipulation in unstructured environments. Most current works require a grasp sampling process to obtain grasp candidates, combined with a local feature extractor based on deep learning. This pipeline is time-consuming, especially when grasp points are sparse, such as at the edge of a bowl. In this paper, we propose an end-to-end approach to directly predict the poses, categories, and scores (qualities) of all grasps. It takes the whole sparse point cloud as input and requires no sampling or search process. Moreover, to generate training data for multi-object scenes, we propose a fast multi-object grasp detection algorithm based on the Ferrari-Canny metric. A single-object dataset (79 objects from the YCB object set, 23.7k grasps) and a multi-object dataset (20k point clouds with annotations and masks) are generated. A PointNet++-based network combined with a multi-mask loss is introduced to deal with different training points. The whole weight size of our network is only about 11.6M, and a full prediction takes about 102ms on a GeForce 840M GPU. Our experiments show that our method achieves a 71.43% success rate and a 91.60% completion rate, outperforming current state-of-the-art works.

Comment: Accepted at the International Conference on Robotics and Automation (ICRA) 202
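A rough sketch of the per-point, sampling-free prediction idea with a multi-mask-style loss follows. The shared per-point MLP below merely stands in for a real PointNet++ backbone, and the 7-DoF pose parameterization, head widths, and names PointGraspNet and masked_grasp_loss are assumptions, not the paper's implementation.

```python
# Illustrative per-point grasp head for an end-to-end, sampling-free predictor (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointGraspNet(nn.Module):
    def __init__(self, num_categories=2, feat_dim=128):
        super().__init__()
        # Shared per-point MLP (placeholder for PointNet++ set abstraction + feature propagation).
        self.backbone = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 1), nn.ReLU(),
        )
        self.pose_head = nn.Conv1d(feat_dim, 7, 1)               # per-point grasp pose (position + quaternion)
        self.score_head = nn.Conv1d(feat_dim, 1, 1)              # per-point grasp quality
        self.cls_head = nn.Conv1d(feat_dim, num_categories, 1)   # per-point category logits

    def forward(self, xyz):                                      # xyz: (B, N, 3) sparse point cloud
        f = self.backbone(xyz.transpose(1, 2))                   # (B, C, N)
        return {
            "pose": self.pose_head(f).transpose(1, 2),           # (B, N, 7)
            "score": self.score_head(f).squeeze(1),              # (B, N)
            "logits": self.cls_head(f).transpose(1, 2),          # (B, N, num_categories)
        }

def masked_grasp_loss(pred, target_pose, target_score, target_cls, mask):
    """Multi-mask-style loss: only points flagged in `mask` contribute to each term."""
    m = mask.float()
    denom = m.sum().clamp(min=1.0)
    pose_l = (torch.abs(pred["pose"] - target_pose).sum(-1) * m).sum() / denom
    score_l = (F.mse_loss(pred["score"], target_score, reduction="none") * m).sum() / denom
    cls_l = (F.cross_entropy(pred["logits"].flatten(0, 1), target_cls.flatten(),
                             reduction="none").view_as(m) * m).sum() / denom
    return pose_l + score_l + cls_l
```

Because every input point carries its own pose, score, and category prediction, no candidate sampling or search step is needed at inference time; the mask simply selects which points are supervised during training.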
A Robotic Visual Grasping Design: Rethinking Convolution Neural Network with High-Resolutions
High-resolution representations are important for vision-based robotic grasping problems. Existing works generally encode the input images into low-resolution representations via sub-networks and then recover high-resolution representations. This loses spatial information, and the errors introduced by the decoder become more serious when multiple types of objects are considered or objects are far away from the camera. To address these issues, we revisit the design paradigm of CNNs for robotic perception tasks. We demonstrate that using parallel branches, as opposed to serially stacked convolutional layers, is a more powerful design for robotic visual grasping tasks. In particular, we provide guidelines for neural network design for robotic perception tasks, e.g., high-resolution representation and lightweight design, which respond to the challenges of different manipulation scenarios. We then develop a novel visual grasping architecture, referred to as HRG-Net, a parallel-branch structure that always maintains a high-resolution representation and repeatedly exchanges information across resolutions. Extensive experiments validate that these two designs effectively enhance the accuracy of vision-based grasping and accelerate network training. We show a series of comparative experiments in real physical environments on YouTube: https://youtu.be/Jhlsp-xzHFY
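The core design idea, keeping a full-resolution branch alive and repeatedly exchanging information with lower-resolution branches, can be sketched as below. The two-branch layout, channel widths, and the name TwoBranchExchangeBlock are illustrative choices, not the exact HRG-Net architecture.

```python
# Minimal sketch of a resolution-preserving, parallel-branch block (HRNet-style exchange).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchExchangeBlock(nn.Module):
    """Keeps a full-resolution branch alive and repeatedly fuses it with a half-resolution branch."""
    def __init__(self, hi_ch=32, lo_ch=64):
        super().__init__()
        self.hi_conv = nn.Sequential(nn.Conv2d(hi_ch, hi_ch, 3, padding=1), nn.BatchNorm2d(hi_ch), nn.ReLU())
        self.lo_conv = nn.Sequential(nn.Conv2d(lo_ch, lo_ch, 3, padding=1), nn.BatchNorm2d(lo_ch), nn.ReLU())
        self.hi_to_lo = nn.Conv2d(hi_ch, lo_ch, 3, stride=2, padding=1)   # downsample high-res features
        self.lo_to_hi = nn.Conv2d(lo_ch, hi_ch, 1)                        # project before upsampling

    def forward(self, hi, lo):
        hi, lo = self.hi_conv(hi), self.lo_conv(lo)
        lo_up = F.interpolate(self.lo_to_hi(lo), size=hi.shape[-2:], mode="bilinear", align_corners=False)
        hi_down = self.hi_to_lo(hi)
        return hi + lo_up, lo + hi_down                                   # cross-resolution exchange

# Stacking such blocks keeps a high-resolution representation end to end, so a grasp head
# (e.g. per-pixel grasp quality/angle/width maps) can be attached directly to the `hi` branch.
hi = torch.randn(1, 32, 224, 224)
lo = torch.randn(1, 64, 112, 112)
hi, lo = TwoBranchExchangeBlock()(hi, lo)
```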
The State of Lifelong Learning in Service Robots: Current Bottlenecks in Object Perception and Manipulation
Service robots are appearing more and more in our daily life. The development of service robots combines multiple fields of research, from object perception to object manipulation, and the state of the art continues to improve the coupling between the two. This coupling is necessary for service robots not only to perform various tasks in a reasonable amount of time but also to continually adapt to new environments and safely interact with non-expert human users. Nowadays, robots are able to recognize various objects and quickly plan a collision-free trajectory to grasp a target object in predefined settings. However, in most cases these approaches rely on large amounts of training data, so the knowledge of such robots is fixed after the training phase, and any change in the environment requires complicated, time-consuming, and expensive robot re-programming by human experts. Such approaches are therefore still too rigid for real-life applications in unstructured environments, where a significant portion of the environment is unknown and cannot be directly sensed or controlled. In such environments, no matter how extensive the training data used for batch learning, a robot will always face new objects. Apart from batch learning, the robot should therefore be able to continually learn new object categories and grasp affordances from very few training examples on-site. Moreover, apart from robot self-learning, non-expert users should be able to interactively guide the process of experience acquisition by teaching new concepts or by correcting insufficient or erroneous concepts. In this way, the robot constantly learns how to help humans in everyday tasks by gaining more and more experience, without the need for re-programming.
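As a purely hypothetical illustration of this kind of on-site, open-ended learning (not taken from the paper), a nearest-class-mean memory over object feature vectors can be taught new categories from single examples and corrected by a non-expert user:

```python
# Hypothetical open-ended category memory: teach, correct, and recognize from very few examples.
import numpy as np

class OpenEndedCategoryMemory:
    def __init__(self):
        self.prototypes = {}   # category name -> (mean feature vector, example count)

    def teach(self, category, feature):
        """A user (or the robot itself) adds one labeled example; the prototype is updated incrementally."""
        feature = np.asarray(feature, dtype=float)
        if category not in self.prototypes:
            self.prototypes[category] = (feature.copy(), 1)
        else:
            mean, n = self.prototypes[category]
            self.prototypes[category] = ((mean * n + feature) / (n + 1), n + 1)

    def correct(self, right_category, feature):
        """Non-expert correction: reinforce the right category with the misclassified example."""
        self.teach(right_category, feature)

    def recognize(self, feature):
        """Nearest-prototype classification; returns None until at least one category has been taught."""
        if not self.prototypes:
            return None
        feature = np.asarray(feature, dtype=float)
        return min(self.prototypes, key=lambda c: np.linalg.norm(self.prototypes[c][0] - feature))
```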
From Form to Function: Detecting the Affordance of Tool Parts using Geometric Features and Material Cues
With recent advances in robotics, general-purpose robots like Baxter are quickly becoming a reality. As robots begin to collaborate with humans in everyday workspaces, they will need to understand the functions of objects and their parts. To cut an apple or hammer a nail, robots need to know not just a tool's name; they must also find its parts and identify their potential functions, or affordances. As Gibson remarked, "If you know what can be done with a[n] object, what it can be used for, you can call it whatever you please."

We hypothesize that the geometry of a part is closely related to its affordance, since its geometric properties govern the possible physical interactions with the environment. In the first part of this thesis, we investigate how the affordances of tool parts can be predicted using geometric features from RGB-D sensors like the Kinect. We develop several approaches to learn affordances from geometric features: superpixel-based hierarchical sparse coding, structured random forests, and convolutional neural networks. To evaluate the proposed methods, we construct a large RGB-D dataset in which parts are labeled with multiple affordances. Experiments over sequences containing clutter, occlusions, and viewpoint changes show that the approaches provide precise predictions that can be used in robotics applications.

In addition to geometry, the material properties of a part also determine its potential functions. In the second part of this thesis, we investigate how material cues can be integrated into a deep learning framework for affordance prediction. We propose a modular approach for combining high-level material information, or other mid-level cues, in order to improve affordance predictions. We present experiments that demonstrate the efficacy of our approach on an expanded RGB-D dataset, which includes data from non-tool objects and multiple depth sensors. The work presented in this thesis lays a foundation for the development of robots that can predict the potential functions of tool parts, and provides a basis for higher-level reasoning about affordance.
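As an illustration of how geometric and material cues might be combined modularly for per-pixel affordance prediction, the sketch below uses a simple two-stream network with late fusion; the channel counts, the number of affordance and material classes, and the name AffordanceFusionNet are assumptions, not the thesis' architecture.

```python
# Illustrative modular fusion of geometric and material cues for per-pixel affordance prediction.
import torch
import torch.nn as nn

class AffordanceFusionNet(nn.Module):
    def __init__(self, num_affordances=7, num_materials=10):
        super().__init__()
        # Geometry stream: depth + surface normals (4 channels) -> per-pixel geometric features.
        self.geom = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        # Material stream: a precomputed per-pixel material probability map used as a mid-level cue.
        self.material = nn.Sequential(nn.Conv2d(num_materials, 16, 1), nn.ReLU())
        # Modular late fusion: either stream can be swapped or retrained independently.
        self.head = nn.Conv2d(32 + 16, num_affordances, 1)

    def forward(self, depth_normals, material_probs):
        g = self.geom(depth_normals)                      # (B, 32, H, W)
        m = self.material(material_probs)                 # (B, 16, H, W)
        return self.head(torch.cat([g, m], dim=1))        # per-pixel affordance logits

logits = AffordanceFusionNet()(torch.randn(1, 4, 120, 160), torch.rand(1, 10, 120, 160))
per_pixel_affordance = logits.argmax(dim=1)               # one affordance label per pixel
```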