6 research outputs found
OpenMask3D: Open-Vocabulary 3D Instance Segmentation
We introduce the task of open-vocabulary 3D instance segmentation.
Traditional approaches for 3D instance segmentation largely rely on existing 3D
annotated datasets, which are restricted to a closed-set of object categories.
This is an important limitation for real-life applications where one might need
to perform tasks guided by novel, open-vocabulary queries related to a wide
variety of objects. Recently, open-vocabulary 3D scene understanding methods
have emerged to address this problem by learning queryable features for each
point in the scene. While such a representation can be directly employed to
perform semantic segmentation, existing methods have limitations in their
ability to identify object instances. In this work, we address this limitation,
and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D
instance segmentation. Guided by predicted class-agnostic 3D instance masks,
our model aggregates per-mask features via multi-view fusion of CLIP-based
image embeddings. We conduct experiments and ablation studies on the ScanNet200
dataset to evaluate the performance of OpenMask3D, and provide insights about
the open-vocabulary 3D instance segmentation task. We show that our approach
outperforms other open-vocabulary counterparts, particularly on the long-tail
distribution. Furthermore, OpenMask3D goes beyond the limitations of
close-vocabulary approaches, and enables the segmentation of object instances
based on free-form queries describing object properties such as semantics,
geometry, affordances, and material properties.
Project page: https://openmask3d.github.io
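The per-mask feature aggregation described above could be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the visibility weighting scheme and the assumption that each view contributes one precomputed CLIP embedding of a mask crop are simplifications introduced here.

```python
import numpy as np

def aggregate_mask_feature(view_embeddings, view_weights):
    """Fuse per-view CLIP image embeddings into one feature for a 3D mask.

    view_embeddings: (V, D) array of CLIP embeddings of image crops around
        the mask in V camera views (assumed precomputed).
    view_weights: (V,) weights, e.g. the fraction of mask points visible in
        each view (an illustrative choice, not the paper's exact weighting).
    Returns an L2-normalised fused feature of shape (D,).
    """
    w = np.asarray(view_weights, dtype=np.float64)
    w = w / w.sum()
    fused = (w[:, None] * np.asarray(view_embeddings, dtype=np.float64)).sum(axis=0)
    return fused / np.linalg.norm(fused)

def score_masks(mask_features, text_embedding):
    """Cosine similarity between fused mask features and a text-query embedding.

    Because the fused features live in CLIP space, an open-vocabulary query is
    answered by ranking masks against the CLIP text embedding of the query.
    """
    t = np.asarray(text_embedding, dtype=np.float64)
    t = t / np.linalg.norm(t)
    return np.asarray(mask_features) @ t
```

With class-agnostic instance masks fixed in advance, this keeps the pipeline zero-shot: no 3D training on the query categories is needed, only a ranking of fused features against the text embedding.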
Unsupervised Monocular Depth Reconstruction of Non-Rigid Scenes
Monocular depth reconstruction of complex and dynamic scenes is a highly
challenging problem. While for rigid scenes learning-based methods have been
offering promising results even in unsupervised cases, there exists little to
no literature addressing the same for dynamic and deformable scenes. In this
work, we present an unsupervised monocular framework for dense depth estimation
of dynamic scenes, which jointly reconstructs rigid and non-rigid parts without
explicitly modelling the camera motion. Using dense correspondences, we derive
a training objective that aims to opportunistically preserve pairwise distances
between reconstructed 3D points. In this process, the dense depth map is
learned implicitly using the as-rigid-as-possible hypothesis. Our method
provides promising results, demonstrating its capability of reconstructing 3D
from challenging videos of non-rigid scenes. Furthermore, the proposed method
also provides unsupervised motion segmentation results as an auxiliary output.
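The training objective described above, which opportunistically preserves pairwise distances between reconstructed 3D points, could be sketched as below. The L1 penalty and the uniform averaging over point pairs are assumptions made for illustration; the paper's objective may weight or select pairs differently.

```python
import numpy as np

def pairwise_distance_loss(points_t, points_t1, pairs):
    """As-rigid-as-possible objective sketch.

    points_t, points_t1: (N, 3) reconstructed 3D points in two frames,
        matched via dense correspondences.
    pairs: (P, 2) integer indices of point pairs whose mutual distance
        should be preserved across frames.
    Rigid motion leaves all pairwise distances unchanged, so the loss is
    zero for rigid parts and penalises non-rigid deformation.
    """
    i, j = pairs[:, 0], pairs[:, 1]
    d_t = np.linalg.norm(points_t[i] - points_t[j], axis=1)
    d_t1 = np.linalg.norm(points_t1[i] - points_t1[j], axis=1)
    return float(np.mean(np.abs(d_t - d_t1)))
```

Because rigid transformations (rotation plus translation) preserve all pairwise distances, minimising this term lets depth be learned without explicitly modelling camera motion, while residual distance changes localise the non-rigid regions.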
3D Segmentation of Humans in Point Clouds with Synthetic Data
Segmenting humans in 3D indoor scenes has become increasingly important with
the rise of human-centered robotics and AR/VR applications. To this end, we
propose the task of joint 3D human semantic segmentation, instance segmentation
and multi-human body-part segmentation. Few works have attempted to directly
segment humans in cluttered 3D scenes, which is largely due to the lack of
annotated training data of humans interacting with 3D scenes. We address this
challenge and propose a framework for generating training data of synthetic
humans interacting with real 3D scenes. Furthermore, we propose a novel
transformer-based model, Human3D, which is the first end-to-end model for
segmenting multiple human instances and their body-parts in a unified manner.
The key advantage of our synthetic data generation framework is its ability to
generate diverse and realistic human-scene interactions, with highly accurate
ground truth. Our experiments show that pre-training on synthetic data improves
performance on a wide variety of 3D human segmentation tasks. Finally, we
demonstrate that Human3D outperforms even task-specific state-of-the-art 3D
segmentation methods.
Project page: https://human-3d.github.io