ViHOPE: Visuotactile In-Hand Object 6D Pose Estimation with Shape Completion
In this letter, we introduce ViHOPE, a novel framework for estimating the 6D
pose of an in-hand object using visuotactile perception. Our key insight is
that the accuracy of the 6D object pose estimate can be improved by explicitly
completing the shape of the object. To this end, we introduce a novel
visuotactile shape completion module that uses a conditional Generative
Adversarial Network to complete the shape of an in-hand object based on
a volumetric representation. This approach improves over prior works that
directly regress visuotactile observations to a 6D pose. By explicitly
completing the shape of the in-hand object and jointly optimizing the shape
completion and pose estimation tasks, we improve the accuracy of the 6D object
pose estimate. We train and test our model on a synthetic dataset and compare
it with the state-of-the-art. In the visuotactile shape completion task, we
outperform the state-of-the-art by 265% on the Intersection over Union metric
and achieve 88% lower Chamfer Distance. In the visuotactile pose estimation
task, we present results that suggest our framework reduces position and
angular errors by 35% and 64%, respectively. Furthermore, we ablate our
framework to confirm the gain on the 6D object pose estimate from explicitly
completing the shape. Ultimately, we show that our framework produces models
that are robust to sim-to-real transfer on a real-world robot platform.

Comment: Accepted by RA-L.
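To make the idea of jointly optimizing shape completion and pose estimation more concrete, the sketch below shows one way such a module could be wired up in PyTorch. It is an illustration based only on the abstract, not the authors' released code: the voxel resolution (32^3), the layer layout, the 9-dimensional pose parametrization (translation plus a 6D rotation representation), and the loss weights are all assumptions.

```python
# Hedged sketch, not the ViHOPE implementation: a conditional GAN that completes
# a partial voxel grid of an in-hand object, with a pose head trained jointly
# with the completion task. Grid size, layer widths, and loss weights are
# illustrative placeholders.
import torch
import torch.nn as nn

class VoxelCompletionGenerator(nn.Module):
    """Encode a partial 32^3 occupancy grid, decode a completed grid,
    and regress a pose (3D translation + 6D rotation representation)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),   # 32 -> 16
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 16 -> 8
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 8 -> 4
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1), nn.Sigmoid() # 16 -> 32
        )
        self.pose_head = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 4 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 9),  # assumed: 3 translation + 6D rotation parameters
        )

    def forward(self, partial_voxels):
        z = self.encoder(partial_voxels)
        return self.decoder(z), self.pose_head(z)

class VoxelDiscriminator(nn.Module):
    """Judge completed grids conditioned on the partial input (2 channels)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Flatten(), nn.Linear(32 * 8 * 8 * 8, 1),
        )

    def forward(self, completed, partial):
        return self.net(torch.cat([completed, partial], dim=1))

def joint_loss(completed, pose, gt_voxels, gt_pose, d_score_fake):
    """Joint objective: voxel reconstruction + pose regression + adversarial term.
    The relative weights are placeholders, not values from the paper."""
    recon = nn.functional.binary_cross_entropy(completed, gt_voxels)
    pose_err = nn.functional.mse_loss(pose, gt_pose)
    adv = nn.functional.binary_cross_entropy_with_logits(
        d_score_fake, torch.ones_like(d_score_fake))
    return recon + pose_err + 0.01 * adv
```

The point of the sketch is the shared encoder: the completion decoder and the pose head both read the same latent volume, so gradients from both tasks shape the features, which is one plausible reading of the joint optimization described above.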
Hierarchical Graph Neural Networks for Proprioceptive 6D Pose Estimation of In-hand Objects
Robotic manipulation, in particular in-hand object manipulation, often
requires an accurate estimate of the object's 6D pose. To improve the accuracy
of the estimated pose, state-of-the-art approaches in 6D object pose estimation
use observational data from one or more modalities, e.g., RGB images, depth,
and tactile readings. However, existing approaches make limited use of the
underlying geometric structure of the object captured by these modalities,
thereby increasing their reliance on visual features. This results in poor
performance when presented with objects that lack such visual features or when
visual features are simply occluded. Furthermore, current approaches do not
take advantage of the proprioceptive information embedded in the position of
the fingers. To address these limitations, in this paper: (1) we introduce a
hierarchical graph neural network architecture for combining multimodal (vision
and touch) data that enables geometrically informed 6D object pose
estimation, (2) we introduce a hierarchical message passing operation that
propagates information within and across modalities to learn a graph-based
object representation, and (3) we introduce a method that incorporates
proprioceptive information into the in-hand object representation. We evaluate our
model on a diverse subset of objects from the YCB Object and Model Set, and
show that our method substantially outperforms existing state-of-the-art work
in accuracy and robustness to occlusion. We also deploy our proposed framework
on a real robot and qualitatively demonstrate successful transfer to real
settings.
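The hierarchical message-passing idea can be illustrated with a small sketch: one round of message passing inside each modality graph (vision, touch, proprioception), pooling to per-modality summary nodes, a cross-modality round, and a pose readout. This is an assumption-laden illustration of the abstract, not the paper's architecture; the layer sizes, the two-level design, and the 9-dimensional pose output are placeholders.

```python
# Hedged sketch of hierarchical message passing: intra-modality rounds, then a
# cross-modality round over summary nodes, then a pose readout. Not the
# authors' architecture; sizes and structure are illustrative.
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """One round of mean-aggregation message passing over a dense adjacency."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, x, adj):
        # x: (N, dim) node features, adj: (N, N) 0/1 adjacency matrix
        n = x.size(0)
        pairs = torch.cat([x.unsqueeze(1).expand(n, n, -1),
                           x.unsqueeze(0).expand(n, n, -1)], dim=-1)
        messages = torch.relu(self.msg(pairs)) * adj.unsqueeze(-1)
        agg = messages.sum(dim=1) / adj.sum(dim=1, keepdim=True).clamp(min=1)
        return self.update(agg, x)

class HierarchicalPoseGNN(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.intra = nn.ModuleDict({m: MessagePassing(dim)
                                    for m in ("vision", "touch", "proprio")})
        self.cross = MessagePassing(dim)
        self.readout = nn.Linear(dim, 9)  # assumed: 3 translation + 6D rotation

    def forward(self, feats, adjs):
        # feats/adjs: dicts keyed by modality with node features and adjacency
        summaries = []
        for m, mp in self.intra.items():
            h = mp(feats[m], adjs[m])          # messages within one modality
            summaries.append(h.mean(dim=0))    # pool to one summary node
        s = torch.stack(summaries)             # small cross-modal graph (3 nodes)
        full = torch.ones(s.size(0), s.size(0), device=s.device)
        s = self.cross(s, full)                # messages across modalities
        return self.readout(s.mean(dim=0))     # pose estimate
```

In this reading, the "proprio" graph would carry finger-joint positions as nodes, which is how proprioception enters the representation without relying on visual features.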
Interactive Multi-Modal Robot Programming
This paper was presented at the 2002 IEEE International Conference on Robotics and Automation, Washington, DC. The definitive paper is located at http://ieeexplore.ieee.org (DOI: 10.1109/ROBOT.2002.1013355). © IEEE.

As robots enter the human environment and come in contact with inexperienced users, they need to be able to interact with users in a multi-modal fashion; keyboard and mouse are no longer acceptable as the only input
modalities. This paper introduces a novel approach to program a robot interactively through a multi-modal interface. The key characteristic of this approach is that the user can provide feedback interactively at any time—during both the programming and the execution phase. The framework takes a three-step approach to the problem:
multi-modal recognition, intention interpretation, and prioritized task execution. The multi-modal recognition
module translates hand gestures and spontaneous speech into a structured symbolic data stream without abstracting away the user's intent. The intention interpretation module selects the appropriate primitives to generate a task based on the user's input, the system's
current state, and robot sensor data. Finally, the prioritized task execution module selects and executes skill primitives based on the system's current state, sensor inputs, and prior tasks. The framework is demonstrated by interactively controlling and programming a vacuum-cleaning robot.
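The three-step structure (multi-modal recognition, intention interpretation, prioritized task execution) can be sketched as a small pipeline. The symbol vocabulary, primitive names, and priority scheme below are illustrative placeholders rather than the original system's implementation; the sketch only shows how interactive feedback can preempt queued work through priorities.

```python
# Hedged sketch of the three-step pipeline described above. All symbols,
# primitives, and priorities are made up for illustration.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                      # lower value = more urgent
    name: str = field(compare=False)
    primitives: list = field(compare=False, default_factory=list)

def recognize(gesture: str, speech: str) -> dict:
    """Multi-modal recognition: turn raw gesture/speech into structured symbols
    without discarding the user's intent."""
    return {"gesture": gesture.strip().lower(), "speech": speech.strip().lower()}

def interpret(symbols: dict, robot_state: dict) -> Task:
    """Intention interpretation: choose primitives from user input, system state,
    and sensor data. A 'stop'-style command outranks everything else."""
    if "stop" in symbols["speech"]:
        return Task(priority=0, name="halt", primitives=["stop_motors"])
    if symbols["gesture"] == "point" and "clean" in symbols["speech"]:
        target = robot_state.get("pointed_region", "here")
        return Task(priority=5, name=f"vacuum {target}",
                    primitives=["go_to_region", "run_vacuum"])
    return Task(priority=9, name="idle", primitives=["wait"])

def execute(queue: list) -> None:
    """Prioritized task execution: always run the most urgent task first;
    newly interpreted feedback can be pushed onto the queue at any time."""
    while queue:
        task = heapq.heappop(queue)
        for primitive in task.primitives:
            print(f"[{task.name}] executing primitive: {primitive}")

# Example interaction: the user points at a spot and asks for cleaning,
# then interrupts with a stop command; the stop task runs first.
queue: list = []
state = {"pointed_region": "kitchen corner"}
heapq.heappush(queue, interpret(recognize("point", "clean this up"), state))
heapq.heappush(queue, interpret(recognize("none", "stop"), state))
execute(queue)
```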