Learning Any-View 6DoF Robotic Grasping in Cluttered Scenes via Neural Surface Rendering
Robotic manipulation is critical for enabling robotic agents to operate in various
application domains, such as intelligent assistance. A major challenge therein is
the effective 6DoF grasping of objects in cluttered environments from any
viewpoint, without requiring additional scene exploration. We introduce
NeuGraspNet, a novel method for 6DoF grasp detection that leverages
recent advances in neural volumetric representations and surface rendering. Our
approach learns both global (scene-level) and local (grasp-level) neural
surface representations, enabling effective and fully implicit 6DoF grasp
quality prediction, even in unseen parts of the scene. Further, we reinterpret
grasping as a local neural surface rendering problem, allowing the model to
encode the interaction between the robot's end-effector and the object's
surface geometry. NeuGraspNet operates on single viewpoints and can sample
grasp candidates in occluded scenes, outperforming existing implicit and
semi-implicit baseline methods in the literature. We demonstrate the real-world
applicability of NeuGraspNet with a mobile manipulator robot, grasping in open
spaces with clutter by rendering the scene, reasoning about graspable areas of
different objects, and selecting grasps likely to succeed without colliding
with the environment. Visit our project website:
https://sites.google.com/view/neugraspnet
Comment: Preprint
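The abstract gives no implementation details; purely to illustrate the flavor of implicit grasp-quality prediction from a neural surface representation, here is a minimal PyTorch sketch. The network sizes, the mean-pooling, and the idea of querying points sampled in the gripper frame are assumptions for illustration, not NeuGraspNet's architecture.

```python
import torch
import torch.nn as nn

class ImplicitSceneNet(nn.Module):
    """Toy stand-in for a neural scene representation: maps a 3D query
    point (plus a global scene code) to a local feature vector."""
    def __init__(self, scene_dim=64, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + scene_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )

    def forward(self, points, scene_code):
        # points: (B, N, 3), scene_code: (B, scene_dim)
        code = scene_code.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.mlp(torch.cat([points, code], dim=-1))  # (B, N, feat_dim)

class GraspQualityHead(nn.Module):
    """Pools per-point features sampled around a grasp pose and predicts
    a grasp-quality logit."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, point_feats):
        return self.head(point_feats.mean(dim=1))  # (B, 1)

# Example usage with random data (shapes are illustrative only).
scene_net, quality_head = ImplicitSceneNet(), GraspQualityHead()
scene_code = torch.randn(4, 64)            # global scene embedding
gripper_points = torch.randn(4, 256, 3)    # points sampled in each gripper's frame
logits = quality_head(scene_net(gripper_points, scene_code))
print(logits.shape)  # torch.Size([4, 1])
```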
An information-theoretic approach to unsupervised keypoint representation learning
Extracting informative representations from videos is fundamental for the
effective learning of various downstream tasks. Inspired by classical works on
saliency, we present a novel information-theoretic approach to discover
meaningful representations from videos in an unsupervised fashion. We argue
that local entropy of pixel neighborhoods and its evolution in a video stream
is a valuable intrinsic supervisory signal for learning to attend to salient
features. We thus abstract visual features into a concise representation of
keypoints that serve as dynamic information transporters. We discover, in an
unsupervised fashion, spatio-temporally consistent keypoint representations that
carry the salient information across video frames, thanks to two original
information-theoretic losses: one that maximizes the information covered by the
keypoints in a frame, and one that encourages optimized keypoint transport over
time, imposing consistency of the information flow. We evaluate our
keypoint-based representation against
state-of-the-art baselines in different downstream tasks such as learning
object dynamics. To evaluate the expressivity and consistency of the keypoints,
we propose a new set of metrics. Our empirical results showcase the superior
performance of our information-driven keypoints, which resolve challenges such as
attending to both static and dynamic objects and to objects abruptly entering
and leaving the scene.
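To make the intuition behind local entropy as an intrinsic supervisory signal concrete, the following small NumPy sketch (not the authors' losses or implementation) computes a per-pixel local entropy map from intensity histograms; a coverage loss of the kind the abstract describes could then reward keypoints placed on high-entropy regions.

```python
import numpy as np

def local_entropy(gray, win=9, bins=16):
    """Per-pixel Shannon entropy of intensity histograms in a win x win
    neighborhood. gray: 2D float array in [0, 1]."""
    pad = win // 2
    padded = np.pad(gray, pad, mode="reflect")
    H = np.zeros_like(gray)
    for i in range(gray.shape[0]):
        for j in range(gray.shape[1]):
            patch = padded[i:i + win, j:j + win]
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
            p = hist / hist.sum()
            p = p[p > 0]
            H[i, j] = -(p * np.log(p)).sum()
    return H

# A synthetic frame: flat background with a textured square.
rng = np.random.default_rng(0)
frame = np.zeros((64, 64))
frame[20:40, 20:40] = rng.random((20, 20))   # "salient" textured object
H = local_entropy(frame)

# A keypoint placed on the textured region "covers" more information than
# one on the flat background -- the kind of signal a coverage loss could maximize.
print(H[30, 30] > H[5, 5])  # True
```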
Learning to reason over scene graphs: a case study of finetuning GPT-2 into a robot language model for grounded task planning
Long-horizon task planning is essential for the development of intelligent assistive and service robots. In this work, we investigate the applicability of a smaller class of large language models (LLMs), specifically GPT-2, in robotic task planning by learning to decompose tasks into subgoal specifications for a planner to execute sequentially. Our method grounds the LLM's input in the domain, represented as a scene graph, enabling it to translate human requests into executable robot plans, thereby learning to reason over long-horizon tasks, as encountered in the ALFRED benchmark. We compare our approach with classical planning and baseline methods to examine the applicability and generalizability of LLM-based planners. Our findings suggest that the knowledge stored in an LLM can be effectively grounded to perform long-horizon task planning, demonstrating promising potential for future applications of neuro-symbolic planning methods in robotics.
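As a rough illustration of grounding an LLM on a scene graph, here is a minimal sketch using the Hugging Face transformers API. The prompt format, the subgoal syntax, and the scene-graph serialization are invented for illustration and are not the paper's; the base "gpt2" weights shown here would need to be fine-tuned on plan data as the abstract describes before the continuation is meaningful.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def serialize_scene_graph(nodes, edges):
    """Flatten a scene graph into a text sequence the LLM can condition on."""
    node_str = " ; ".join(f"{n}:{attrs}" for n, attrs in nodes.items())
    edge_str = " ; ".join(f"{s} -{rel}-> {t}" for s, rel, t in edges)
    return f"<scene> nodes: {node_str} | edges: {edge_str} </scene>"

nodes = {"apple": "on_counter", "fridge": "closed", "robot": "at_counter"}
edges = [("apple", "on", "counter"), ("robot", "near", "counter")]
prompt = (
    serialize_scene_graph(nodes, edges)
    + " <task> put the apple in the fridge </task> <plan>"
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # would be fine-tuned on plan data

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs, max_new_tokens=40, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# A fine-tuned model would continue with subgoals such as
# "pickup(apple) ; open(fridge) ; put(apple, fridge) ; close(fridge)".
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```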
Active-Perceptive Motion Generation for Mobile Manipulation
Mobile Manipulation (MoMa) systems combine the benefits of mobility and
dexterity, thanks to the enlarged space in which they can move and interact with
their environment. However, even when equipped with onboard sensors, e.g., an
embodied camera, extracting task-relevant visual information in unstructured
and cluttered environments, such as households, remains challenging. In this
work, we introduce an active perception pipeline for mobile manipulators to
generate motions that are informative toward manipulation tasks, such as
grasping in unknown, cluttered scenes. Our proposed approach, ActPerMoMa,
generates robot paths in a receding horizon fashion by sampling paths and
computing path-wise utilities. These utilities trade off maximizing the visual
Information Gain (IG) for scene reconstruction against a task-oriented objective,
e.g., grasp success, expressed through grasp reachability. We show the efficacy of
our method in simulated experiments with a dual-arm TIAGo++ MoMa robot
performing mobile grasping in cluttered scenes with obstacles. We empirically
analyze the contribution of various utilities and parameters, and compare
against representative baselines both with and without active perception
objectives. Finally, we demonstrate the transfer of our mobile grasping
strategy to the real world, indicating a promising direction for
active-perceptive MoMa.
Comment: ICRA 2024. Project page: https://sites.google.com/view/actpermom
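To illustrate the receding-horizon, utility-based path selection described above, here is a toy NumPy sketch. The utility form (discounted IG plus a reachability term minus motion cost), the weights, and all names are inventions for illustration, not ActPerMoMa's formulation.

```python
import numpy as np

def path_utility(info_gains, reach_scores, costs, lam=0.5, gamma=0.9):
    """Toy path-wise utility: discounted sum of per-view information gain,
    blended with a task term (grasp reachability), minus execution cost."""
    disc = gamma ** np.arange(len(info_gains))
    ig_term = np.sum(disc * info_gains)
    task_term = np.max(reach_scores)   # best reachable-grasp score along the path
    return (1 - lam) * ig_term + lam * task_term - 0.1 * np.sum(costs)

rng = np.random.default_rng(1)
n_paths, horizon = 8, 5
candidates = [
    dict(
        info_gains=rng.random(horizon),    # expected IG of each viewpoint
        reach_scores=rng.random(horizon),  # grasp-reachability along the path
        costs=rng.random(horizon),         # motion cost per segment
    )
    for _ in range(n_paths)
]

# Receding horizon: score all sampled paths, execute only the first step
# of the best one, then replan from the new robot state.
utilities = [path_utility(**c) for c in candidates]
best = int(np.argmax(utilities))
print(f"executing first segment of path {best} (utility {utilities[best]:.3f})")
```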
SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion
Multi-objective optimization problems are ubiquitous in robotics, e.g., the
optimization of a robot manipulation task requires a joint consideration of
grasp pose configurations, collisions, and joint limits. While some costs can
be easily hand-designed, e.g., the smoothness of a trajectory, several
task-specific objectives need to be learned from data. This work introduces a
method for learning data-driven SE(3) cost functions as diffusion models.
Diffusion models can represent highly expressive multimodal distributions and
exhibit proper gradients over the entire space due to their score-matching
training objective. Learning costs as diffusion models allows their seamless
integration with other costs into a single differentiable objective function,
enabling joint gradient-based motion optimization. In this work, we focus on
learning SE(3) diffusion models for 6DoF grasping, giving rise to a novel
framework for joint grasp and motion optimization without needing to decouple
grasp selection from trajectory generation. We evaluate the representation
power of our SE(3) diffusion models w.r.t. classical generative models, and we
showcase the superior performance of our proposed optimization framework in a
series of simulated and real-world robotic manipulation tasks against
representative baselines.
Comment: diffusion models, SE(3), grasping
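The key idea of folding a learned, differentiable cost into a single gradient-based objective can be sketched as follows. This is not the paper's method: the pose is naively parameterized as a flat 6D vector rather than a proper SE(3) element, the "learned cost" is an untrained MLP stand-in for the SE(3) diffusion model, and the mapping from the last waypoint to the end-effector pose is assumed for brevity.

```python
import torch
import torch.nn as nn

# Stand-in for a learned, differentiable cost over end-effector poses
# (position + axis-angle as a 6D vector, purely for illustration).
learned_grasp_cost = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))

def smoothness_cost(traj):
    """Hand-designed cost: penalize large differences between waypoints."""
    return ((traj[1:] - traj[:-1]) ** 2).sum()

# Trajectory of waypoints; assume (for illustration only) that the last
# waypoint maps directly to the end-effector pose.
traj = torch.zeros(16, 6, requires_grad=True)
opt = torch.optim.Adam([traj], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    # Single differentiable objective: learned grasp cost + smoothness.
    # Gradients flow through both terms jointly, so grasp selection and
    # trajectory shape are optimized together rather than decoupled.
    cost = learned_grasp_cost(traj[-1]).squeeze() + 0.1 * smoothness_cost(traj)
    cost.backward()
    opt.step()

print(f"final joint cost: {cost.item():.4f}")
```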
Accelerating Motion Planning via Optimal Transport
Motion planning remains an open problem in many disciplines, e.g., robotics and
autonomous driving, due to the high computational resources it demands, which
hinder real-time, efficient decision-making. A class of methods
striving to provide smooth solutions is gradient-based trajectory optimization.
However, those methods usually suffer from bad local minima, while for many
settings, they may be inapplicable due to the absence of easy-to-access
gradients of the optimization objectives. In response to these issues, we
introduce Motion Planning via Optimal Transport (MPOT) -- a
\textit{gradient-free} method that optimizes a batch of smooth trajectories
over highly nonlinear costs, even for high-dimensional tasks, while imposing
smoothness through a Gaussian Process dynamics prior via the
planning-as-inference perspective. To facilitate batch trajectory optimization,
we introduce an original zero-order and highly parallelizable update rule: the
Sinkhorn Step, which uses the regular polytope family for its search
directions. Each regular polytope, centered on trajectory waypoints, serves as
a local cost-probing neighborhood, acting as a \textit{trust region} where the
Sinkhorn Step "transports" local waypoints toward low-cost regions. We
theoretically show that Sinkhorn Step guides the optimizing parameters toward
local minima regions of non-convex objective functions. We then show the
efficiency of MPOT in a range of problems from low-dimensional point-mass
navigation to high-dimensional whole-body robot motion planning, evincing its
superiority compared to popular motion planners, paving the way for new
applications of optimal transport in motion planning.
Comment: Published as a conference paper at NeurIPS 2023. Project website: https://sites.google.com/view/sinkhorn-step
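The following NumPy sketch is a heavily simplified caricature of probing costs at regular-polytope vertices and "transporting" waypoints toward low-cost regions. The actual Sinkhorn Step solves an entropic optimal-transport problem coupling all waypoints and vertices, and MPOT additionally imposes a Gaussian Process smoothness prior; neither is reproduced here, and all parameter values are illustrative.

```python
import numpy as np

def probe_directions(dim):
    """Vertices of the cross-polytope (+/- unit vectors), a simple member of
    the regular polytope family used as search directions."""
    eye = np.eye(dim)
    return np.concatenate([eye, -eye], axis=0)         # (2*dim, dim)

def zero_order_step(traj, cost_fn, radius=0.2, temp=0.1):
    """Probe the cost at polytope vertices around each waypoint and move it
    toward low-cost vertices (softmin-weighted average of directions)."""
    dirs = probe_directions(traj.shape[1])              # (K, dim)
    new_traj = traj.copy()
    for i, w in enumerate(traj):
        probes = w + radius * dirs                      # candidate points around w
        costs = np.array([cost_fn(p) for p in probes])
        weights = np.exp(-(costs - costs.min()) / temp)
        weights /= weights.sum()
        new_traj[i] = w + radius * weights @ dirs       # weighted "transport"
    return new_traj

# Toy cost: distance to a goal plus a circular obstacle penalty.
goal, obstacle = np.array([2.0, 2.0]), np.array([1.0, 1.0])
def cost_fn(p):
    return np.linalg.norm(p - goal) + 5.0 * max(0.0, 0.5 - np.linalg.norm(p - obstacle))

traj = np.linspace([0.0, 0.0], [2.0, 2.0], num=20)      # straight-line initialization
for _ in range(100):
    traj = zero_order_step(traj, cost_fn)
print("final cost:", sum(cost_fn(w) for w in traj))
```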
Graph-based Reinforcement Learning meets Mixed Integer Programs: An application to 3D robot assembly discovery
Robot Assembly Discovery (RAD) is a challenging problem that lies at the
intersection of resource allocation and motion planning. The goal is to combine
a predefined set of objects to form something new while considering task
execution with the robot-in-the-loop. In this work, we tackle the problem of
building arbitrary, predefined target structures entirely from scratch using a
set of Tetris-like building blocks and a robotic manipulator. Our novel
hierarchical approach aims at efficiently decomposing the overall task into
three feasible levels that benefit mutually from each other. On the high level,
we run a classical mixed-integer program for global optimization of block-type
selection and the blocks' final poses to recreate the desired shape. Its output
is then exploited to efficiently guide the exploration of an underlying
reinforcement learning (RL) policy. This RL policy draws its generalization
properties from a flexible graph-based representation that is learned through
Q-learning and can be refined with search. Moreover, it accounts for the
necessary conditions of structural stability and robotic feasibility that
cannot be effectively reflected in the previous layer. Lastly, a grasp and
motion planner transforms the desired assembly commands into robot joint
movements. We demonstrate our proposed method's performance on a set of
competitive simulated RAD environments, showcase real-world transfer, and
report performance and robustness gains compared to an unstructured end-to-end
approach. Videos are available at https://sites.google.com/view/rl-meets-milp
A Deep Learning Approach for Multi-View Engagement Estimation of Children in a Child-Robot Joint Attention Task
In this work we tackle the problem of child engagement estimation while children freely interact with a robot in a friendly, room-like environment. We propose a deep learning-based multi-view solution that takes advantage of recent developments in human pose detection. We extract the child's pose from different RGB-D cameras placed regularly in the room, fuse the results, and feed them to a deep neural network trained to classify engagement levels. The network contains a recurrent layer in order to exploit the rich temporal information contained in the pose data. The resulting method outperforms a number of baseline classifiers and provides a promising tool for better automatic understanding of a child's attitude, interest, and attention while cooperating with a robot. The goal is to integrate this model into next-generation social robots as an attention-monitoring tool during various Child-Robot Interaction (CRI) tasks, both for Typically Developed (TD) children and children affected by autism spectrum disorder (ASD).
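A minimal PyTorch sketch of a recurrent classifier over fused multi-view pose sequences is shown below. The joint count, the fusion-by-averaging of camera views, the three engagement levels, and the LSTM layout are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EngagementClassifier(nn.Module):
    """Toy recurrent classifier over sequences of fused 3D skeleton keypoints."""
    def __init__(self, n_joints=18, n_levels=3, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_joints * 3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_levels)

    def forward(self, pose_seq):
        # pose_seq: (batch, time, n_joints * 3) fused 3D keypoints
        _, (h_n, _) = self.lstm(pose_seq)
        return self.head(h_n[-1])                 # logits over engagement levels

# Fuse per-camera 3D poses by averaging the views (one simple option).
batch, time, views, joints = 2, 30, 3, 18
multi_view = torch.randn(batch, time, views, joints, 3)
fused = multi_view.mean(dim=2).flatten(2)         # (batch, time, joints * 3)

model = EngagementClassifier()
logits = model(fused)
print(logits.shape)  # torch.Size([2, 3])
```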