STAP: Sequencing Task-Agnostic Policies
Advances in robotic skill acquisition have made it possible to build
general-purpose libraries of learned skills for downstream manipulation tasks.
However, naively executing these skills one after the other is unlikely to
succeed without accounting for dependencies between actions prevalent in
long-horizon plans. We present Sequencing Task-Agnostic Policies (STAP), a
scalable framework for training manipulation skills and coordinating their
geometric dependencies at planning time to solve long-horizon tasks never seen
by any skill during training. Given that Q-functions encode a measure of skill
feasibility, we formulate an optimization problem to maximize the joint success
of all skills sequenced in a plan, which we estimate by the product of their
Q-values. Our experiments indicate that this objective function approximates
ground truth plan feasibility and, when used as a planning objective, reduces
myopic behavior and thereby promotes long-horizon task success. We further
demonstrate how STAP can be used for task and motion planning by estimating the
geometric feasibility of skill sequences provided by a task planner. We
evaluate our approach in simulation and on a real robot. Qualitative results
and code are made available at https://sites.google.com/stanford.edu/stap/home
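Below is a minimal, illustrative sketch of the planning objective described above: joint plan feasibility is estimated as the product of per-skill Q-values, and the action parameters of the whole sequence are optimized by random shooting. The interface names (q_functions, dynamics, sample_actions) and the shooting optimizer are assumptions made for illustration, not STAP's published implementation.

import numpy as np

# Assumed interfaces: each skill exposes a learned Q-function q(state, action) in [0, 1]
# and a dynamics model f(state, action) -> next_state. These names are illustrative.

def plan_feasibility(state, actions, q_functions, dynamics):
    """Estimate joint plan success as the product of per-skill Q-values."""
    score = 1.0
    for q, f, a in zip(q_functions, dynamics, actions):
        score *= q(state, a)   # feasibility of executing this skill from the current state
        state = f(state, a)    # roll the state forward for the next skill in the sequence
    return score

def shooting_planner(initial_state, q_functions, dynamics, sample_actions, n_samples=1024):
    """Random-shooting optimization over the action parameters of the whole skill sequence."""
    best_score, best_actions = -np.inf, None
    for _ in range(n_samples):
        actions = sample_actions()  # one candidate action parameterization per skill
        score = plan_feasibility(initial_state, actions, q_functions, dynamics)
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions, best_score

Because the objective is a product, a single infeasible skill drives the whole plan's score toward zero, which is how the objective discourages myopically greedy choices early in the sequence.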
S3CNet: A Sparse Semantic Scene Completion Network for LiDAR Point Clouds
With the increasing reliance of self-driving and similar robotic systems on
robust 3D vision, the processing of LiDAR scans with deep convolutional neural
networks has become a trend in academia and industry alike. Prior attempts on
the challenging Semantic Scene Completion task - which entails the inference of
dense 3D structure and associated semantic labels from "sparse" representations
- have been, to a degree, successful in small indoor scenes when provided with
dense point clouds or dense depth maps often fused with semantic segmentation
maps from RGB images. However, the performance of these systems drops
drastically when applied to large outdoor scenes characterized by dynamic and
exponentially sparser conditions. Likewise, processing the entire sparse
volume becomes infeasible due to memory limitations, and workarounds introduce
computational inefficiency, as practitioners are forced to divide the overall
volume into multiple equal segments and infer on each individually, rendering
real-time performance impossible. In this work, we formulate a method that
subsumes the sparsity of large-scale environments and present S3CNet, a sparse
convolution based neural network that predicts the semantically completed scene
from a single, unified LiDAR point cloud. We show that our proposed method
outperforms all counterparts on the 3D task, achieving state-of-the-art results
on the SemanticKITTI benchmark. Furthermore, we propose a 2D variant of S3CNet
with a multi-view fusion strategy to complement our 3D network, providing
robustness to occlusions and extreme sparsity in distant regions. We conduct
experiments for the 2D semantic scene completion task and compare the results
of our sparse 2D network against several leading LiDAR segmentation models
adapted for bird's eye view segmentation on two open-source datasets.
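As a rough illustration of the sparse input representation such a network consumes, the sketch below voxelizes a raw LiDAR cloud into the set of occupied voxel coordinates with averaged features; sparse-convolution libraries store and convolve only these occupied sites. The voxel size and mean-feature choice are assumptions for illustration, not S3CNet's published preprocessing.

import numpy as np

def voxelize_sparse(points, voxel_size=0.2):
    """Map an (N, 4) LiDAR cloud (x, y, z, intensity) to unique occupied voxels.

    Returns integer voxel coordinates and per-voxel mean features, the typical
    input format of sparse-convolution networks (only occupied sites are stored).
    """
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int32)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    counts = np.bincount(inverse, minlength=len(uniq)).astype(np.float32)
    feats = np.zeros((len(uniq), points.shape[1]), dtype=np.float32)
    for d in range(points.shape[1]):
        feats[:, d] = np.bincount(inverse, weights=points[:, d], minlength=len(uniq))
    feats /= counts[:, None]  # average the features of all points falling in each voxel
    return uniq, feats

# Example: a random 100k-point cloud collapses to a far smaller set of occupied voxels.
cloud = np.random.uniform(-50, 50, size=(100_000, 4)).astype(np.float32)
voxel_coords, voxel_feats = voxelize_sparse(cloud)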
Text2Motion: From Natural Language Instructions to Feasible Plans
We propose Text2Motion, a language-based planning framework enabling robots
to solve sequential manipulation tasks that require long-horizon reasoning.
Given a natural language instruction, our framework constructs both a task- and
motion-level plan that is verified to reach inferred symbolic goals.
Text2Motion uses feasibility heuristics encoded in Q-functions of a library of
skills to guide task planning with Large Language Models. Whereas previous
language-based planners only consider the feasibility of individual skills,
Text2Motion actively resolves geometric dependencies spanning skill sequences
by performing geometric feasibility planning during its search. We evaluate our
method on a suite of problems that require long-horizon reasoning,
interpretation of abstract goals, and handling of partial affordance
perception. Our experiments show that Text2Motion can solve these challenging
problems with a success rate of 82%, while prior state-of-the-art
language-based planning methods only achieve 13%. Text2Motion thus provides
promising generalization characteristics to semantically diverse sequential
manipulation tasks with geometric dependencies between skills.
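The sketch below illustrates the overall search loop under stated assumptions: a language model proposes candidate skill sequences for the instruction, and each candidate is scored with a STAP-style geometric feasibility objective (product of Q-values along the rolled-out sequence). The helper names, skill attributes, and feasibility threshold are hypothetical; the LLM query and the action optimizer are passed in as callables rather than implemented.

def sequence_feasibility(state, skills, actions):
    """Product of per-skill Q-values along the rolled-out skill sequence."""
    score = 1.0
    for skill, action in zip(skills, actions):
        score *= skill.q_function(state, action)   # assumed skill attribute
        state = skill.dynamics(state, action)      # assumed skill attribute
    return score

def plan_from_instruction(instruction, state, skill_library,
                          llm_propose_sequences, optimize_actions,
                          feasibility_threshold=0.5):  # assumed cutoff, not a published value
    """Return the most feasible (skill sequence, actions, score) among LLM proposals."""
    best = None
    for names in llm_propose_sequences(instruction, list(skill_library)):
        skills = [skill_library[name] for name in names]
        actions = optimize_actions(state, skills)       # e.g. shooting/CEM over action parameters
        score = sequence_feasibility(state, skills, actions)
        if best is None or score > best[2]:
            best = (names, actions, score)
    return best if best is not None and best[2] >= feasibility_threshold else None

Scoring whole sequences, rather than individual skills in isolation, is what lets the planner reject instructions whose steps are each plausible on their own but geometrically incompatible when chained together.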