HoME: a Household Multimodal Environment
We introduce HoME: a Household Multimodal Environment for artificial agents
to learn from vision, audio, semantics, physics, and interaction with objects
and other agents, all within a realistic context. HoME integrates over 45,000
diverse 3D house layouts based on the SUNCG dataset, a scale which may
facilitate learning, generalization, and transfer. HoME is an open-source,
OpenAI Gym-compatible platform extensible to tasks in reinforcement learning,
language grounding, sound-based navigation, robotics, multi-agent learning, and
more. We hope HoME better enables artificial agents to learn as humans do: in
an interactive, multimodal, and richly contextualized setting.
Comment: Presented at NIPS 2017's Visually-Grounded Interaction and Language Workshop
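As a rough illustration of the OpenAI Gym-compatible interface the abstract describes, a minimal agent-environment loop might look as follows; the environment id "Home-Nav-v0" and the random placeholder policy are hypothetical, not taken from the HoME release.

import gym

# Hypothetical environment id; consult the HoME repository for the actual registration names.
env = gym.make("Home-Nav-v0")
obs = env.reset()                       # multimodal observation (e.g., RGB, audio, semantics)
done = False
while not done:
    action = env.action_space.sample()  # random placeholder policy
    obs, reward, done, info = env.step(action)
env.close()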
Inferring Fluid Dynamics via Inverse Rendering
Humans have a strong intuitive understanding of physical processes such as
falling fluid from just a glimpse of a scene picture, drawing quickly on
immersive visual experiences stored in memory. This work achieves such
photo-to-fluid-dynamics reconstruction, learned from unannotated videos
without any supervision from ground-truth fluid dynamics. In a nutshell,
a differentiable Euler simulator, modeled with a ConvNet-based pressure
projection solver, is integrated with a volumetric renderer, supporting
end-to-end, coherent differentiable dynamic simulation and rendering. By
endowing each sampled point with a fluid volume value, we derive a NeRF-like
differentiable renderer dedicated to fluid data; thanks to this
volume-augmented representation, fluid dynamics can be inversely inferred
from the error signal between the rendered result and the ground-truth video
frame (i.e., inverse rendering). Experiments on our generated Fluid Fall
datasets and the DPI Dam Break dataset demonstrate both the effectiveness and
the generalization ability of our method.
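A minimal PyTorch-style sketch of the inverse-rendering loop this abstract describes: a differentiable simulation step followed by a differentiable renderer, with the photometric error against unannotated video frames driving the inferred fluid state. The tiny placeholder networks and random frames below are illustrative stand-ins, not the paper's ConvNet pressure solver or NeRF-like renderer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySimStep(nn.Module):                  # stand-in for the differentiable Euler step
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, 3, padding=1)
    def forward(self, state):                  # state: (1, 4, H, W) fluid fields
        return state + self.net(state)

class TinyRenderer(nn.Module):                 # stand-in for the volumetric renderer
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 3, 3, padding=1)
    def forward(self, state):                  # returns an RGB image (1, 3, H, W)
        return torch.sigmoid(self.net(state))

sim, renderer = TinySimStep(), TinyRenderer()
frames = [torch.rand(1, 3, 32, 32) for _ in range(5)]        # placeholder video frames
state0 = nn.Parameter(torch.zeros(1, 4, 32, 32))             # initial fluid fields to infer
opt = torch.optim.Adam([state0, *sim.parameters(), *renderer.parameters()], lr=1e-3)

for it in range(100):
    state, loss = state0, 0.0
    for frame in frames:
        state = sim(state)                     # simulate one differentiable step
        loss = loss + F.mse_loss(renderer(state), frame)   # photometric error
    opt.zero_grad(); loss.backward(); opt.step()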
Assemble Them All: Physics-Based Planning for Generalizable Assembly by Disassembly
Assembly planning is the core of automating product assembly, maintenance,
and recycling for modern industrial manufacturing. Despite its importance and
long history of research, planning for mechanical assemblies when given the
final assembled state remains a challenging problem. This is due to the
complexity of dealing with arbitrary 3D shapes and the highly constrained
motion required for real-world assemblies. In this work, we propose a novel
method to efficiently plan physically plausible assembly motion and sequences
for real-world assemblies. Our method leverages the assembly-by-disassembly
principle and physics-based simulation to efficiently explore a reduced search
space. To evaluate the generality of our method, we define a large-scale
dataset consisting of thousands of physically valid industrial assemblies
requiring a variety of assembly motions. Our experiments on this new benchmark
demonstrate that our method achieves a state-of-the-art success rate and the
highest computational efficiency compared to other baseline algorithms. Our
method also generalizes to rotational assemblies (e.g., screws and puzzles) and
solves 80-part assemblies within several minutes.
Comment: Accepted by SIGGRAPH Asia 2022. Project website: http://assembly.csail.mit.edu
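A schematic sketch of the assembly-by-disassembly idea: recursively find a part that can be removed from the current state, then reverse the removal order to obtain an assembly sequence. Here can_remove() uses a toy precomputed "blocked_by" table as a hypothetical stand-in for the paper's physics-based motion search.

def can_remove(part, remaining, blocked_by):
    """Toy check: `part` is removable if none of the parts that block it remain."""
    return not (blocked_by.get(part, set()) & set(remaining))

def plan_disassembly(parts, blocked_by):
    """Return a removal order for `parts`, or None if no order is found."""
    if len(parts) <= 1:
        return list(parts)
    for part in parts:
        rest = [p for p in parts if p != part]
        if can_remove(part, rest, blocked_by):
            tail = plan_disassembly(rest, blocked_by)
            if tail is not None:
                return [part] + tail
    return None

# Example: the bolt must come out before the bracket, the bracket before the base.
blocked_by = {"bracket": {"bolt"}, "base": {"bracket", "bolt"}}
order = plan_disassembly(["base", "bracket", "bolt"], blocked_by)
assembly_sequence = list(reversed(order))   # ['base', 'bracket', 'bolt']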
SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation
Humans excel at transferring manipulation skills across diverse object
shapes, poses, and appearances due to their understanding of semantic
correspondences between different instances. To endow robots with a similar
high-level understanding, we develop a Distilled Feature Field (DFF) for 3D
scenes, leveraging large 2D vision models to distill semantic features from
multiview images. While current research demonstrates strong performance in
reconstructing DFFs from dense views, learning a DFF from sparse views remains
relatively nascent, despite the prevalence of sparse-view setups in
manipulation tasks with fixed cameras. In this work, we introduce SparseDFF, a
novel method for acquiring view-consistent 3D DFFs from sparse RGBD
observations, enabling one-shot learning of dexterous manipulations that are
transferable to novel scenes. Specifically, we map the image features to the 3D
point cloud, allowing for propagation across the 3D space to establish a dense
feature field. At the core of SparseDFF is a lightweight feature refinement
network, optimized with a contrastive loss between pairwise views after
back-projecting the image features onto the 3D point cloud. Additionally, we
implement a point-pruning mechanism to augment feature continuity within each
local neighborhood. By establishing coherent feature fields on both source and
target scenes, we devise an energy function that minimizes feature
discrepancies with respect to the end-effector parameters between the
demonstration and the target manipulation. We evaluate our approach using a
dexterous hand, mastering real-world manipulations of both rigid and deformable
objects, and showcase robust generalization under object and scene-context
variations.
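As a rough sketch of the back-projection step the abstract refers to (mapping image features to the 3D point cloud), per-pixel 2D features can be lifted to 3D with depth and camera intrinsics before being aggregated into a field. The pinhole model and variable names below are generic assumptions, not the SparseDFF implementation.

import numpy as np

def backproject_features(depth, feats, K):
    """depth: (H, W) in metres; feats: (H, W, C) per-pixel features; K: (3, 3) intrinsics.
    Returns points (N, 3) in the camera frame and their features (N, C)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    z = depth.reshape(-1)
    valid = z > 0                                    # drop pixels with missing depth
    u, v, z = u.reshape(-1)[valid], v.reshape(-1)[valid], z[valid]
    x = (u - K[0, 2]) * z / K[0, 0]                  # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=1)
    return points, feats.reshape(H * W, -1)[valid]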
Robo360: A 3D Omnispective Multi-Material Robotic Manipulation Dataset
Building robots that can automate labor-intensive tasks has long been a core
motivation behind advances in the computer vision and robotics communities.
Recent interest in leveraging 3D algorithms, particularly neural
fields, has led to advancements in robot perception and physical understanding
in manipulation scenarios. However, the real world's complexity poses
significant challenges. To tackle these challenges, we present Robo360, a
dataset featuring robotic manipulation with dense view coverage, which enables
high-quality 3D neural representation learning, together with a diverse set of
objects with various physical and optical properties, facilitating research in
object manipulation and physical-world modeling. We confirm the effectiveness
of our dataset using existing dynamic NeRF methods and evaluate its potential
for learning multi-view policies. We hope that Robo360 will open new research
directions yet to be explored at the intersection of understanding the
physical world in 3D and robot control.
SURFSUP: Learning Fluid Simulation for Novel Surfaces
Modeling the mechanics of fluid in complex scenes is vital to applications in
design, graphics, and robotics. Learning-based methods provide fast and
differentiable fluid simulators; however, most prior work is unable to
accurately model how fluids interact with genuinely novel surfaces not seen
during training. We introduce SURFSUP, a framework that represents objects
implicitly using signed distance functions (SDFs), rather than an explicit
representation of meshes or particles. This continuous representation of
geometry enables more accurate simulation of fluid-object interactions over
long time periods while simultaneously making computation more efficient.
Moreover, SURFSUP trained on simple shape primitives generalizes considerably
out-of-distribution, even to complex real-world scenes and objects. Finally, we
show we can invert our model to design simple objects to manipulate fluid flow.
Comment: Website: https://surfsup.cs.columbia.edu
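A small sketch of why an implicit SDF representation is convenient for fluid-object coupling: each sample or particle can query the signed distance and its gradient (the surface normal) to detect and resolve penetration. The analytic sphere SDF and naive projection below are illustrative only, not the SURFSUP model.

import numpy as np

def sphere_sdf(p, center=np.zeros(3), radius=0.5):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(p - center, axis=-1) - radius

def sdf_normal(sdf, p, eps=1e-4):
    """Finite-difference gradient of the SDF, i.e. the outward surface normal."""
    grad = np.stack([
        sdf(p + np.eye(3)[i] * eps) - sdf(p - np.eye(3)[i] * eps)
        for i in range(3)], axis=-1) / (2 * eps)
    return grad / np.linalg.norm(grad, axis=-1, keepdims=True)

def resolve_collisions(particles, sdf=sphere_sdf):
    """Push any particle that penetrates the object (sdf < 0) back to the surface."""
    d = sdf(particles)
    inside = d < 0
    if np.any(inside):
        n = sdf_normal(sdf, particles[inside])
        particles[inside] -= d[inside, None] * n    # project back along the normal
    return particles

particles = np.random.uniform(-1.0, 1.0, size=(1000, 3))
particles = resolve_collisions(particles)           # no particle remains inside the sphere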
EquivAct: SIM(3)-Equivariant Visuomotor Policies beyond Rigid Object Manipulation
If a robot masters folding a kitchen towel, we would also expect it to master
folding a beach towel. However, existing policy-learning works that rely on
dataset augmentation are still limited in achieving this level of
generalization. Our insight is to add equivariance to both the visual object
representation and the policy architecture. We propose EquivAct, which utilizes
SIM(3)-equivariant network structures that guarantee generalization across all
possible object translations, 3D rotations, and scales by construction.
Training of EquivAct is done in two phases. We first pre-train a
SIM(3)-equivariant visual representation on simulated scene point clouds. Then,
we learn a SIM(3)-equivariant visuomotor policy on top of the pre-trained
visual representation using a small amount of source task demonstrations. We
demonstrate that after training, the learned policy directly transfers to
objects that substantially differ in scale, position, and orientation from the
source demonstrations. In simulation, we evaluate our method in three
manipulation tasks involving deformable and articulated objects thereby going
beyond the typical rigid object manipulation tasks that prior works considered.
We show that our method outperforms prior works that do not use equivariant
architectures or do not use our contrastive pre-training procedure. We also
show quantitative and qualitative experiments on three real robot tasks, where
the robot watches twenty demonstrations of a tabletop task and transfers
zero-shot to a mobile manipulation task in a much larger setup. Project
website: https://equivact.github.io
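A tiny numerical illustration of what SIM(3) equivariance means for a point-based output: if the input point cloud is rotated, scaled, and translated, an equivariant prediction (here the centroid, as a toy stand-in for a predicted end-effector position) transforms the same way. This is an explanatory check, not the EquivAct architecture.

import numpy as np

def predict(points):
    return points.mean(axis=0)        # toy SIM(3)-equivariant "policy output"

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))

# Random similarity transform: rotation R (orthonormal, possibly with a reflection),
# scale s, translation t.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
s, t = 2.5, rng.normal(size=3)

out_transformed = predict(pts @ R.T * s + t)      # transform the input, then predict
transformed_out = predict(pts) @ R.T * s + t      # predict, then transform the output
assert np.allclose(out_transformed, transformed_out)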