Unseen Object Instance Segmentation with Fully Test-time RGB-D Embeddings Adaptation
Segmenting unseen objects is a crucial ability for robots, since they may
encounter new environments during operation. Recently, a popular solution
is to train on RGB-D features from large-scale synthetic data and directly
apply the model to unseen real-world scenarios. However, even though depth
data generalize fairly well, the domain shift caused by the Sim2Real gap
is inevitable, which presents a key challenge to the unseen object instance
segmentation (UOIS) model. To tackle this problem, we re-emphasize the
adaptation process across Sim2Real domains in this paper. Specifically, we
propose a framework to conduct the Fully Test-time RGB-D Embeddings Adaptation
(FTEA) based on parameters of the BatchNorm layer. To construct the learning
objective for test-time back-propagation, we propose a novel non-parametric
entropy objective that can be implemented without explicit classification
layers. Moreover, we design a cross-modality knowledge distillation module to
encourage information transfer during test time. The proposed method runs
efficiently on test-time images alone, without requiring annotations or
revisiting the large-scale synthetic training data. Besides significant time
savings, the proposed method consistently improves segmentation results on both
overlap and boundary metrics, achieving state-of-the-art performances on two
real-world RGB-D image datasets. We hope our work could draw attention to the
test-time adaptation and reveal a promising direction for robot perception in
unseen environments.
Comment: 10 pages, 6 figures
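The abstract describes its non-parametric entropy objective only at a high level. A minimal numpy sketch of one plausible form is the mean entropy of soft assignments of pixel embeddings to prototype vectors, which needs no explicit classification layer; the cosine-similarity form, the prototypes, and the temperature `tau` are illustrative assumptions, not the paper's exact objective. Minimizing such an objective over the BatchNorm affine parameters would push test-time embeddings toward confident assignments.

```python
import numpy as np

def soft_assignment_entropy(embeddings, prototypes, tau=0.1):
    """Mean entropy of soft assignments of embeddings to prototype vectors.

    A classifier-free entropy objective: low when each embedding sits close
    to a single prototype, high when assignments are ambiguous.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = e @ p.T / tau                        # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return ent.mean()
```

In a test-time adaptation loop, this scalar would serve as the loss that back-propagation minimizes while only the BatchNorm scale and shift parameters are left trainable.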
Mean Shift Mask Transformer for Unseen Object Instance Segmentation
Segmenting unseen objects is a critical task in many different domains. For
example, a robot may need to grasp an unseen object, which means it needs to
visually separate this object from the background and/or other objects. Mean
shift clustering is a common method in object segmentation tasks. However, the
traditional mean shift clustering algorithm is not easily integrated into an
end-to-end neural network training pipeline. In this work, we propose the Mean
Shift Mask Transformer (MSMFormer), a new transformer architecture that
simulates the von Mises-Fisher (vMF) mean shift clustering algorithm, allowing
for the joint training and inference of both the feature extractor and the
clustering. Its central component is a hypersphere attention mechanism, which
updates object queries on a hypersphere. To illustrate the effectiveness of our
method, we apply MSMFormer to Unseen Object Instance Segmentation, which yields
a new state-of-the-art of 87.3 Boundary F-measure on the real-world Object
Clutter Indoor Dataset (OCID). Code is available at
https://github.com/YoungSean/UnseenObjectsWithMeanShift
Comment: 10 figures
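MSMFormer learns the shift step with hypersphere attention, but the classical von Mises-Fisher mean shift it simulates can be sketched directly: each unit-norm point moves toward the vMF-kernel-weighted mean of the data and is re-projected onto the unit sphere. The concentration `kappa` and iteration count below are illustrative choices, not values from the paper.

```python
import numpy as np

def vmf_mean_shift(points, kappa=20.0, iters=10):
    """Cluster unit vectors by iterated vMF-kernel mean shift.

    Each shifted point moves to the kappa-weighted mean direction of the
    original data, then is renormalized back onto the hypersphere.
    """
    x = points / np.linalg.norm(points, axis=1, keepdims=True)
    shifted = x.copy()
    for _ in range(iters):
        w = np.exp(kappa * (shifted @ x.T))   # vMF kernel weights
        m = w @ x                             # weighted mean of the data
        shifted = m / np.linalg.norm(m, axis=1, keepdims=True)
    return shifted
```

Points in the same basin of attraction converge to the same mode, so instance labels fall out of a final grouping of nearly identical shifted vectors; replacing the fixed kernel update with learned attention is what makes the paper's version end-to-end trainable.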
3D-BEVIS: Bird's-Eye-View Instance Segmentation
Recent deep learning models achieve impressive results on 3D scene analysis
tasks by operating directly on unstructured point clouds. Much progress has
been made in object classification and semantic segmentation. However,
the task of instance segmentation is less explored. In this work, we present
3D-BEVIS, a deep learning framework for 3D semantic instance segmentation on
point clouds. Following the idea of previous proposal-free instance
segmentation approaches, our model learns a feature embedding and groups the
obtained feature space into semantic instances. Current point-based methods
scale linearly with the number of points by processing local sub-parts of a
scene individually. However, to perform instance segmentation by clustering,
globally consistent features are required. Therefore, we propose to combine
local point geometry with global context information from an intermediate
bird's-eye view representation.
Comment: camera-ready version for GCPR '1
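The intermediate bird's-eye-view representation the abstract relies on can be sketched as a simple top-down scatter of the point cloud into a 2D grid; the grid resolution, scene extent, and the max-height cell statistic here are illustrative assumptions, not the paper's exact encoding.

```python
import numpy as np

def points_to_bev(points, grid_size=32, extent=1.0):
    """Scatter (x, y, z) points into a top-down max-height grid.

    x and y in [-extent, extent] index the grid; each cell keeps the
    highest z value that falls into it.
    """
    ij = ((points[:, :2] + extent) / (2 * extent) * grid_size).astype(int)
    ij = np.clip(ij, 0, grid_size - 1)
    bev = np.zeros((grid_size, grid_size))
    for (i, j), z in zip(ij, points[:, 2]):
        bev[i, j] = max(bev[i, j], z)   # keep the highest point per cell
    return bev
```

A 2D CNN over such a grid sees the whole scene at once, which is the source of the globally consistent features that per-point processing of local sub-parts cannot provide.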
Straight to Shapes: Real-time Detection of Encoded Shapes
Current object detection approaches predict bounding boxes, but these provide
little instance-specific information beyond location, scale and aspect ratio.
In this work, we propose to directly regress to objects' shapes in addition to
their bounding boxes and categories. It is crucial to find an appropriate shape
representation that is compact and decodable, and in which objects can be
compared for higher-order concepts such as view similarity, pose variation and
occlusion. To achieve this, we use a denoising convolutional auto-encoder to
establish an embedding space, and place the decoder after a fast end-to-end
network trained to regress directly to the encoded shape vectors. This yields
what to the best of our knowledge is the first real-time shape prediction
network, running at ~35 FPS on a high-end desktop. With higher-order shape
reasoning well-integrated into the network pipeline, the network shows the
useful practical quality of generalising to unseen categories similar to the
ones in the training set, something that most existing approaches fail to
handle.
Comment: 16 pages including appendix; Published at CVPR 201
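The key property of the shape embedding is that it is compact and decodable, so that shapes can be compared in code space. As a linear stand-in for the learned denoising auto-encoder, a PCA basis over flattened binary masks already has both properties; the basis dimension and 16x16 mask resolution below are illustrative assumptions.

```python
import numpy as np

def fit_shape_basis(masks, dim=8):
    """PCA stand-in for a denoising auto-encoder's shape embedding.

    Learns a compact, decodable linear basis over flattened binary masks.
    """
    X = masks.reshape(len(masks), -1).astype(float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:dim]

def encode(mask, mean, basis):
    # project a mask into the low-dimensional shape-code space
    return basis @ (mask.ravel() - mean)

def decode(code, mean, basis, shape=(16, 16)):
    # reconstruct a (soft) mask from its shape code
    return (basis.T @ code + mean).reshape(shape)
```

A detector trained to regress such codes directly, with the decoder appended afterward, mirrors the pipeline the abstract describes: distances between codes reflect shape similarity, which is what enables higher-order comparisons like view similarity.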
3D Model-based Zero-Shot Pose Estimation Pipeline
Most existing learning-based pose estimation methods are typically developed
for non-zero-shot scenarios, where they can only estimate the poses of objects
present in the training dataset. This setting restricts their applicability to
unseen objects in the training phase. In this paper, we introduce a fully
zero-shot pose estimation pipeline that leverages the 3D models of objects as
clues. Specifically, we design a two-step pipeline consisting of 3D model-based
zero-shot instance segmentation and a zero-shot pose estimator. For the first
step, there is a novel way to perform zero-shot instance segmentation based on
the 3D models instead of text descriptions, which can handle complex properties
of unseen objects. For the second step, we utilize a hierarchical geometric
structure matching mechanism to perform zero-shot pose estimation which is 10
times faster than the current render-based method. Extensive experimental
results on the seven core datasets on the BOP challenge show that the proposed
method outperforms the zero-shot state-of-the-art method with higher speed and
lower computation cost.
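The hierarchical matching mechanism itself is the paper's contribution, but the last step it feeds into is standard: once model-to-scene point correspondences are available, the 6D pose follows from a closed-form least-squares rigid alignment (the Kabsch algorithm). The sketch below shows that closed form, not the paper's matcher.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) with dst ~= src @ R.T + t.

    src, dst: (N, 3) corresponding model and scene points.
    """
    sc, dc = src.mean(axis=0), dst.mean(axis=0)
    H = (src - sc).T @ (dst - dc)               # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dc - R @ sc
    return R, t
```

Because this step is closed-form and cheap, the run-time cost of such a pipeline is dominated by how correspondences are found, which is where the claimed speedup over render-based methods would come from.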