MASK4D: Mask Transformer for 4D Panoptic Segmentation
Accurately perceiving and tracking instances over time is essential for the
decision-making processes of autonomous agents interacting safely in dynamic
environments. With this intention, we propose Mask4D for the challenging task
of 4D panoptic segmentation of LiDAR point clouds. Mask4D is the first
transformer-based approach unifying semantic instance segmentation and tracking
of sparse and irregular sequences of 3D point clouds into a single joint model.
Our model directly predicts semantic instances and their temporal associations
without relying on any hand-crafted non-learned association strategies such as
probabilistic clustering or voting-based center prediction. Instead, Mask4D
introduces spatio-temporal instance queries which encode the semantic and
geometric properties of each semantic tracklet in the sequence. In an in-depth
study, we find that it is critical to promote spatially compact instance
predictions as spatio-temporal instance queries tend to merge multiple
semantically similar instances, even if they are spatially distant. To this
end, we regress 6-DOF bounding box parameters from spatio-temporal instance
queries, which is used as an auxiliary task to foster spatially compact
predictions. Mask4D achieves a new state-of-the-art on the SemanticKITTI test
set with a score of 68.4 LSTQ, improving upon published top-performing methods
by at least +4.5%. Comment: Project page: https://vision.rwth-aachen.de/mask4
Point2Vec for Self-Supervised Representation Learning on Point Clouds
Recently, the self-supervised learning framework data2vec has shown inspiring
performance for various modalities using a masked student-teacher approach.
However, it remains open whether such a framework generalizes to the unique
challenges of 3D point clouds. To answer this question, we extend data2vec to
the point cloud domain and report encouraging results on several downstream
tasks. In an in-depth analysis, we discover that the leakage of positional
information reveals the overall object shape to the student even under heavy
masking and thus keeps data2vec from learning strong representations for point
clouds. We address this 3D-specific shortcoming by proposing point2vec, which
unleashes the full potential of data2vec-like pre-training on point clouds. Our
experiments show that point2vec outperforms other self-supervised methods on
shape classification and few-shot learning on ModelNet40 and ScanObjectNN,
while achieving competitive results on part segmentation on ShapeNetParts.
These results suggest that the learned representations are strong and
transferable, highlighting point2vec as a promising direction for
self-supervised learning of point cloud representations.
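The masked student-teacher setup that the abstract above builds on can be sketched roughly as follows. This is a minimal illustration and not the point2vec implementation: the function names, the `tau` decay value, and the use of one scalar per token are assumptions made for brevity, and a plain squared error stands in for whatever regression loss is actually used. The teacher's weights track an exponential moving average of the student's, and the student is scored only on the positions that were masked from its input.

```python
def ema_update(teacher, student, tau=0.999):
    """Move each teacher weight a small step toward the matching student weight.

    `teacher` and `student` are dicts mapping a parameter name to a float.
    `tau` is a hypothetical decay rate chosen for illustration.
    """
    return {
        name: tau * teacher[name] + (1.0 - tau) * student[name]
        for name in teacher
    }


def masked_regression_loss(student_pred, teacher_target, masked_indices):
    """Mean squared error computed over the masked positions only.

    The teacher sees the full input, the student sees only unmasked tokens,
    so the loss asks the student to predict what it could not observe.
    """
    errors = [
        (student_pred[i] - teacher_target[i]) ** 2 for i in masked_indices
    ]
    return sum(errors) / len(errors)
```

In this toy form, each training step would compute `masked_regression_loss` to update the student and then call `ema_update` to refresh the teacher.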
AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation
During interactive segmentation, a model and a user work together to
delineate objects of interest in a 3D point cloud. In an iterative process, the
model assigns each data point to an object (or the background), while the user
corrects errors in the resulting segmentation and feeds them back into the
model. The current best practice formulates the problem as binary
classification and segments objects one at a time. The model expects the user
to provide positive clicks to indicate regions wrongly assigned to the
background and negative clicks on regions wrongly assigned to the object.
Sequentially visiting objects is wasteful since it disregards synergies between
objects: a positive click for a given object can, by definition, serve as a
negative click for nearby objects. Moreover, a direct competition between
adjacent objects can speed up the identification of their common boundary. We
introduce AGILE3D, an efficient, attention-based model that (1) supports
simultaneous segmentation of multiple 3D objects, (2) yields more accurate
segmentation masks with fewer user clicks, and (3) offers faster inference. Our
core idea is to encode user clicks as spatial-temporal queries and enable
explicit interactions between click queries as well as between them and the 3D
scene through a click attention module. Every time new clicks are added, we
only need to run a lightweight decoder that produces updated segmentation
masks. In experiments with four different 3D point cloud datasets, AGILE3D sets
a new state-of-the-art. Moreover, we also verify its practicality in real-world
setups with real user studies. Comment: Project page: https://ywyue.github.io/AGILE3
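The observation that a positive click for one object can double as a negative click for others can be illustrated with a small sketch. It is not AGILE3D's click attention module: the function name, the `(position, object_id)` click format, and the convention that `object_id` 0 means background are all assumptions, and for simplicity every click on a different object counts as a negative, without the proximity condition the abstract mentions.

```python
def per_object_clicks(clicks):
    """Derive positive and negative click sets for every clicked object.

    `clicks` is a list of (position, object_id) pairs, where object_id 0
    denotes background (an assumed convention for this sketch).
    Returns {object_id: (positives, negatives)}.
    """
    object_ids = sorted({obj for _, obj in clicks if obj != 0})
    views = {}
    for target in object_ids:
        # Clicks placed on the target object are its positives.
        positives = [pos for pos, obj in clicks if obj == target]
        # Clicks on any other object, or on background, act as negatives.
        negatives = [pos for pos, obj in clicks if obj != target]
        views[target] = (positives, negatives)
    return views
```

With clicks on two objects and one on the background, each object gets its own click as a positive and the other two as negatives, so no click is spent on a single object alone.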
3D Segmentation of Humans in Point Clouds with Synthetic Data
Segmenting humans in 3D indoor scenes has become increasingly important with
the rise of human-centered robotics and AR/VR applications. To this end, we
propose the task of joint 3D human semantic segmentation, instance segmentation
and multi-human body-part segmentation. Few works have attempted to directly
segment humans in cluttered 3D scenes, which is largely due to the lack of
annotated training data of humans interacting with 3D scenes. We address this
challenge and propose a framework for generating training data of synthetic
humans interacting with real 3D scenes. Furthermore, we propose a novel
transformer-based model, Human3D, which is the first end-to-end model for
segmenting multiple human instances and their body-parts in a unified manner.
The key advantage of our synthetic data generation framework is its ability to
generate diverse and realistic human-scene interactions, with highly accurate
ground truth. Our experiments show that pre-training on synthetic data improves
performance on a wide variety of 3D human segmentation tasks. Finally, we
demonstrate that Human3D outperforms even task-specific state-of-the-art 3D
segmentation methods. Comment: Project page: https://human-3d.github.io
Mix3D: Out-of-Context Data Augmentation for 3D Scenes
We present Mix3D, a data augmentation technique for segmenting large-scale 3D scenes. Since scene context helps reasoning about object semantics, current works focus on models with large capacity and receptive fields that can fully capture the global context of an input 3D scene. However, strong contextual priors can have detrimental implications like mistaking a pedestrian crossing the street for a car. In this work, we focus on the importance of balancing global scene context and local geometry, with the goal of generalizing beyond the contextual priors in the training set. In particular, we propose a “mixing” technique which creates new training samples by combining two augmented scenes. By doing so, object instances are implicitly placed into novel out-of-context environments, therefore making it harder for models to rely on scene context alone, and instead infer semantics from local structure as well. We perform detailed analysis to understand the importance of global context, local structures and the effect of mixing scenes. In experiments, we show that models trained with Mix3D profit from a significant performance boost on indoor (ScanNet, S3DIS) and outdoor datasets (SemanticKITTI). Mix3D can be trivially used with any existing method, e.g., trained with Mix3D, MinkowskiNet outperforms all prior state-of-the-art methods by a significant margin on the ScanNet test benchmark (78.1% mIoU). Code is available at: https://nekrasov.dev/mix3d
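The "mixing" step described above, combining two augmented scenes into one training sample, might look roughly like the following sketch. It is an illustration under stated assumptions rather than the released Mix3D code: the function names are hypothetical, a random rotation about the z-axis stands in for the unspecified augmentations, and centering each scene at the origin is an assumed way of making the two scenes overlap.

```python
import math
import random


def augment(points, rng):
    """Rotate a scene about the z-axis by a random angle (assumed augmentation)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def center(points):
    """Shift a scene so its centroid sits at the origin (assumed, for overlap)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in points]


def mix_scenes(points_a, labels_a, points_b, labels_b, seed=0):
    """Build one training sample as the union of two augmented scenes.

    Per-point labels are carried over unchanged, so objects from one scene
    end up surrounded by the context of the other.
    """
    rng = random.Random(seed)
    mixed_points = center(augment(points_a, rng)) + center(augment(points_b, rng))
    mixed_labels = list(labels_a) + list(labels_b)
    return mixed_points, mixed_labels
```

Because the output is just a larger labeled point cloud, a sketch like this can sit in front of any segmentation model without changing the model itself.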