8 research outputs found
Mean Shift Mask Transformer for Unseen Object Instance Segmentation
Segmenting unseen objects is a critical task in many different domains. For
example, a robot may need to grasp an unseen object, which means it needs to
visually separate this object from the background and/or other objects. Mean
shift clustering is a common method in object segmentation tasks. However, the
traditional mean shift clustering algorithm is not easily integrated into an
end-to-end neural network training pipeline. In this work, we propose the Mean
Shift Mask Transformer (MSMFormer), a new transformer architecture that
simulates the von Mises-Fisher (vMF) mean shift clustering algorithm, allowing
for the joint training and inference of both the feature extractor and the
clustering. Its central component is a hypersphere attention mechanism, which
updates object queries on a hypersphere. To illustrate the effectiveness of our
method, we apply MSMFormer to Unseen Object Instance Segmentation, which yields
a new state-of-the-art of 87.3 Boundary F-measure on the real-world Object
Clutter Indoor Dataset (OCID). Code is available at
https://github.com/YoungSean/UnseenObjectsWithMeanShift
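The vMF mean shift update that the hypersphere attention mechanism simulates can be sketched in NumPy. This is an illustrative sketch only, not the paper's implementation: the function name, concentration value `kappa`, and iteration count are our own choices, and MSMFormer realizes this update with learned attention layers rather than an explicit loop.

```python
import numpy as np

def vmf_mean_shift(features, queries, kappa=10.0, iters=10):
    """Mean shift on the unit hypersphere with a von Mises-Fisher
    kernel: each query moves to the kappa-weighted mean of the
    features and is re-normalized back onto the sphere."""
    # Project features and queries onto the unit hypersphere.
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    for _ in range(iters):
        # vMF kernel weights: exp(kappa * cosine similarity), shape (Q, N).
        w = np.exp(kappa * queries @ features.T)
        # Weighted mean of features, renormalized onto the sphere.
        queries = w @ features
        queries /= np.linalg.norm(queries, axis=1, keepdims=True)
    return queries
```

With a large enough `kappa`, each query converges to a nearby mode of the feature distribution, so queries initialized near different objects end up as distinct cluster centers.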
Recurrent Pixel Embedding for Instance Grouping
We introduce a differentiable, end-to-end trainable framework for
pixel-level grouping problems such as instance segmentation, built from two
novel components. First, we regress pixels into a hyperspherical embedding
space so that pixels from the same group have high cosine similarity while
those from different groups have similarity below a specified margin. We
analyze the choice of embedding dimension and margin, relating them to
theoretical results on the problem of distributing points uniformly on the
sphere. Second, to group instances, we utilize a variant of mean-shift
clustering, implemented as a recurrent neural network parameterized by kernel
bandwidth. This recurrent grouping module is differentiable, has convergent
dynamics, and admits a probabilistic interpretation. Backpropagating the group-weighted
loss through this module lets learning focus on correcting only those
embedding errors that subsequent clustering cannot resolve. Our framework,
while conceptually simple and theoretically grounded, is also practically
effective and computationally efficient. We demonstrate substantial
improvements over state-of-the-art instance segmentation for object proposal
generation, as well as demonstrating the benefits of grouping loss for
classification tasks such as boundary detection and semantic segmentation.
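The first component above, a hyperspherical embedding in which same-group pixels have high cosine similarity and cross-group pairs are pushed below a margin, can be illustrated with a minimal pairwise loss. This is our own simplified formulation for illustration; the paper's actual loss, pair weighting, and margin choice differ.

```python
import numpy as np

def pairwise_cosine_margin_loss(embed, labels, margin=0.5):
    """Average pairwise penalty over pixel embeddings: same-group
    pairs are penalized for similarity below 1, different-group
    pairs for similarity above `margin`."""
    # Normalize embeddings onto the unit hypersphere.
    e = embed / np.linalg.norm(embed, axis=1, keepdims=True)
    sim = e @ e.T                                  # pairwise cosine sims
    same = labels[:, None] == labels[None, :]      # same-group mask
    # Same-group pairs: pull cosine similarity toward 1.
    pos = np.where(same, 1.0 - sim, 0.0)
    # Different-group pairs: hinge at the margin.
    neg = np.where(~same, np.maximum(0.0, sim - margin), 0.0)
    n = embed.shape[0]
    return (pos.sum() + neg.sum()) / (n * n)
```

A perfectly separated embedding (orthogonal groups, identical within-group vectors) incurs zero loss, while embedding errors that clustering cannot fix, such as one group split across distant directions, are penalized.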
SA6D: Self-Adaptive Few-Shot 6D Pose Estimator for Novel and Occluded Objects
6D pose estimation is critical for meaningful robotic manipulation of
objects in the real world. Most existing approaches struggle to extend
predictions to scenarios where novel object instances are continuously
introduced, especially under heavy occlusion. In this work, we
propose a few-shot pose estimation (FSPE) approach called SA6D, which uses a
self-adaptive segmentation module to identify the novel target object and
construct a point cloud model of the target object using only a small number of
cluttered reference images. Unlike existing methods, SA6D does not require
object-centric reference images or any additional object information, making it
a more generalizable and scalable solution across categories. We evaluate SA6D
on real-world tabletop object datasets and demonstrate that SA6D outperforms
existing FSPE methods, particularly in cluttered scenes with occlusions, while
requiring fewer reference images.