Occlusion-Aware Instance Segmentation via BiLayer Network Architectures
Segmenting highly overlapping image objects is challenging because there is
typically no distinction between real object contours and occlusion boundaries
in images. Unlike previous instance segmentation methods, we model image
formation as a composition of two overlapping layers, and propose Bilayer
Convolutional Network (BCNet), where the top layer detects occluding objects
(occluders) and the bottom layer infers partially occluded instances
(occludees). The explicit modeling of occlusion relationship with bilayer
structure naturally decouples the boundaries of both the occluding and occluded
instances, and considers the interaction between them during mask regression.
We investigate the efficacy of bilayer structure using two popular
convolutional network designs, namely, Fully Convolutional Network (FCN) and
Graph Convolutional Network (GCN). Further, we formulate bilayer decoupling
using the vision transformer (ViT), by representing instances in the image as
separate learnable occluder and occludee queries. Large and consistent
improvements using one/two-stage and query-based object detectors with various
backbones and network layer choices validate the generalization ability of
bilayer decoupling, as shown by extensive experiments on image instance
segmentation benchmarks (COCO, KINS, COCOA) and video instance segmentation
benchmarks (YTVIS, OVIS, BDD100K MOTS), especially for heavy occlusion cases.
Code and data are available at https://github.com/lkeab/BCNet.
Comment: Extended version of "Deep Occlusion-Aware Instance Segmentation with
Overlapping BiLayers", CVPR 2021 (arXiv:2103.12340).
Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation
Video amodal segmentation is a particularly challenging task in computer
vision, which requires deducing the full shape of an object from its visible
parts. Recently, some studies have achieved promising performance by using
motion flow to integrate information across frames under a self-supervised
setting. However, motion flow is clearly limited by two factors: moving cameras
and object deformation. This paper rethinks these previous works. In
particular, we leverage supervised signals with an object-centric
representation in real-world scenarios. The underlying idea is that the
supervision signal of a specific object and the features from different views
can mutually benefit the deduction of the full mask in any specific frame. We
thus propose Efficient object-centric Representation amodal Segmentation
(EoRaS). Specifically, beyond solely relying on supervision signals, we design
a translation module to project image features into the Bird's-Eye View (BEV),
which introduces 3D information to improve current feature quality.
Furthermore, we propose a temporal module built on multi-view fusion layers,
equipped with a set of object slots that interact with features from different
views through an attention mechanism to complete the object representations.
As a result, the full mask of the
object can be decoded from image features updated by object slots. Extensive
experiments on both real-world and synthetic benchmarks demonstrate the
superiority of our proposed method, achieving state-of-the-art performance. Our
code will be released at https://github.com/kfan21/EoRaS.
Comment: Accepted by ICCV 2023.
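The following is a hedged sketch of the slot-based multi-view fusion described above: learnable object slots attend over front-view and BEV features, then write back into the image features used to decode the full mask. The module names, dimensions, and the use of standard multi-head attention are assumptions for illustration, not the released EoRaS implementation.

```python
# Hedged sketch of slot-based multi-view fusion; not the official EoRaS code.
import torch
import torch.nn as nn

class SlotFusion(nn.Module):
    def __init__(self, dim=256, num_slots=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # learnable object slots
        self.slot_from_feat = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.feat_from_slot = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, front_feats, bev_feats):
        # front_feats, bev_feats: (B, N_tokens, dim) flattened front-view / BEV features
        b = front_feats.size(0)
        slots = self.slots.unsqueeze(0).expand(b, -1, -1)
        views = torch.cat([front_feats, bev_feats], dim=1)
        # Slots aggregate object evidence from both views ...
        slots, _ = self.slot_from_feat(slots, views, views)
        # ... and in turn update the image features from which the full mask is decoded.
        updated, _ = self.feat_from_slot(front_feats, slots, slots)
        return updated, slots

if __name__ == "__main__":
    fuse = SlotFusion()
    feats, slots = fuse(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
    print(feats.shape, slots.shape)
```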
Amodal Segmentation through Out-of-Task and Out-of-Distribution Generalization with a Bayesian Model
Amodal completion is a visual task that humans perform easily but which is
difficult for computer vision algorithms. The aim is to segment those object
boundaries which are occluded and hence invisible. This task is particularly
challenging for deep neural networks because data is difficult to obtain and
annotate. Therefore, we formulate amodal segmentation as an out-of-task and
out-of-distribution generalization problem. Specifically, we replace the fully
connected classifier in neural networks with a Bayesian generative model of the
neural network features. The model is trained from non-occluded images using
bounding box annotations and class labels only, but is applied to generalize
out-of-task to object segmentation and to generalize out-of-distribution to
segment occluded objects. We demonstrate how such Bayesian models can naturally
generalize beyond the training task labels when they learn a prior that models
the object's background context and shape. Moreover, by leveraging an outlier
process, Bayesian models can further generalize out-of-distribution to segment
partially occluded objects and to predict their amodal object boundaries. Our
algorithm outperforms alternative methods that use the same supervision by a
large margin, and even outperforms methods where annotated amodal segmentations
are used during training, when the amount of occlusion is large. Code is
publicly available at https://github.com/YihongSun/Bayesian-Amodal.
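As a toy illustration of swapping a fully connected classifier for a generative model of features, the sketch below fits a class-conditional Gaussian to features from non-occluded training crops and adds a flat outlier component, so positions are scored by likelihood rather than by a discriminative head. It is a simplified stand-in under assumed feature shapes, not the paper's Bayesian model or its outlier process.

```python
# Toy class-conditional Gaussian "generative head" over network features,
# with a flat outlier component; illustrative only, not the paper's model.
import torch

class GaussianGenerativeHead:
    def __init__(self, feats_per_class, outlier_logprob=-20.0):
        # feats_per_class: {class_id: (N_c, D) features from non-occluded training crops}
        self.means = {c: f.mean(0) for c, f in feats_per_class.items()}
        self.vars = {c: f.var(0) + 1e-4 for c, f in feats_per_class.items()}
        self.outlier_logprob = outlier_logprob  # flat likelihood for "none of the classes"

    def log_likelihood(self, feats):
        # feats: (M, D) features at M spatial positions; returns (M, C+1) log-likelihoods
        scores = []
        for c in sorted(self.means):
            mu, var = self.means[c], self.vars[c]
            ll = -0.5 * (((feats - mu) ** 2) / var + var.log()).sum(-1)
            scores.append(ll)
        scores.append(torch.full((feats.size(0),), self.outlier_logprob))
        return torch.stack(scores, dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    head = GaussianGenerativeHead({0: torch.randn(100, 32) + 2.0,
                                   1: torch.randn(100, 32) - 2.0})
    # Positions whose best class likelihood beats the outlier component count as
    # object evidence, which is the basis for segmenting without amodal labels.
    print(head.log_likelihood(torch.randn(5, 32)).shape)  # (5, 3): two classes + outlier
```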
Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions
Perceiving and manipulating 3D articulated objects in diverse environments is
essential for home-assistant robots. Recent studies have shown that point-level
affordance provides actionable priors for downstream manipulation tasks.
However, existing works primarily focus on single-object scenarios with
homogeneous agents, overlooking the realistic constraints imposed by the
environment and the agent's morphology, e.g., occlusions and physical
limitations. In this paper, we propose an environment-aware affordance
framework that incorporates both object-level actionable priors and environment
constraints. Unlike object-centric affordance approaches, learning
environment-aware affordance faces the challenge of combinatorial explosion due
to the complexity of various occlusions, characterized by their quantities,
geometries, positions and poses. To address this and enhance data efficiency,
we introduce a novel contrastive affordance learning framework capable of
training on scenes containing a single occluder and generalizing to scenes with
complex occluder combinations. Experiments demonstrate the effectiveness of our
proposed approach in learning affordance considering environment constraints.
Project page at https://chengkaiacademycity.github.io/EnvAwareAfford/
Comment: In the 37th Conference on Neural Information Processing Systems
(NeurIPS 2023).
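One possible shape for the contrastive objective is sketched below: per-point affordance embeddings from a single-occluder scene are pulled toward embeddings of the same points in scenes where the occluder does not block the action, and pushed away from embeddings where it does. The pairing rule, encoder, and InfoNCE-style form are assumptions for illustration, not the paper's exact loss.

```python
# Hedged sketch of a contrastive objective for environment-aware point affordance.
import torch
import torch.nn.functional as F

def contrastive_affordance_loss(anchor_emb, positive_emb, negative_emb, temperature=0.1):
    """anchor_emb:   (N, D) per-point embeddings in a scene with one occluder
       positive_emb: (N, D) same points where the occluder leaves the action feasible
       negative_emb: (N, D) same points where the occluder makes the action infeasible"""
    anchor = F.normalize(anchor_emb, dim=-1)
    pos = F.normalize(positive_emb, dim=-1)
    neg = F.normalize(negative_emb, dim=-1)
    pos_sim = (anchor * pos).sum(-1, keepdim=True) / temperature   # (N, 1)
    neg_sim = anchor @ neg.t() / temperature                       # (N, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)         # positives sit in column 0
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    n, d = 128, 64
    loss = contrastive_affordance_loss(torch.randn(n, d), torch.randn(n, d), torch.randn(n, d))
    print(loss.item())
```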
3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal
Estimating 3D interacting hand pose from a single RGB image is essential for
understanding human actions. Unlike most previous works that directly predict
the 3D poses of two interacting hands simultaneously, we propose to decompose
the challenging interacting hand pose estimation task and estimate the pose of
each hand separately. In this way, it is straightforward to take advantage of
the latest research progress in single-hand pose estimation.
However, hand pose estimation in interacting scenarios is very challenging, due
to (1) severe hand-hand occlusion and (2) ambiguity caused by the homogeneous
appearance of hands. To tackle these two challenges, we propose a novel Hand
De-occlusion and Removal (HDR) framework to perform hand de-occlusion and
distractor removal. We also propose the first large-scale synthetic amodal hand
dataset, termed Amodal InterHand Dataset (AIH), to facilitate model training
and promote the development of the related research. Experiments show that the
proposed method significantly outperforms previous state-of-the-art interacting
hand pose estimation approaches. Code and data are available at
https://github.com/MengHao666/HDR.
Comment: ECCV 2022.
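The decomposition can be pictured as the pipeline sketch below, where a de-occlusion network, a distractor-removal network, and an off-the-shelf single-hand estimator are chained per hand crop. The sub-networks and their interface are placeholders assumed for illustration; this is not the released HDR code.

```python
# Pipeline-level sketch of the de-occlusion-then-single-hand idea; the modules
# here are placeholders you would swap for real models, not the HDR networks.
import torch
import torch.nn as nn

class HDRStylePipeline(nn.Module):
    def __init__(self, deoccluder: nn.Module, remover: nn.Module, single_hand_net: nn.Module):
        super().__init__()
        self.deoccluder = deoccluder              # inpaints parts hidden by the other hand
        self.remover = remover                    # erases the distractor hand
        self.single_hand_net = single_hand_net    # off-the-shelf single-hand pose estimator

    def forward(self, image, hand_boxes):
        poses = []
        for x0, y0, x1, y1 in hand_boxes:
            crop = image[..., y0:y1, x0:x1]
            crop = self.deoccluder(crop)          # recover the occluded target hand
            crop = self.remover(crop)             # remove the homogeneous-looking distractor
            poses.append(self.single_hand_net(crop))
        return poses

if __name__ == "__main__":
    identity = nn.Identity()
    pose_net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 21 * 3))
    pipe = HDRStylePipeline(identity, identity, pose_net)
    out = pipe(torch.randn(1, 3, 256, 256), [(32, 32, 160, 160), (96, 96, 224, 224)])
    print([o.shape for o in out])
```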
Coarse-to-Fine Amodal Segmentation with Shape Prior
Amodal object segmentation is a challenging task that involves segmenting
both visible and occluded parts of an object. In this paper, we propose a novel
approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this
problem by progressively modeling the amodal segmentation. C2F-Seg initially
reduces the learning space from the pixel-level image space to the
vector-quantized latent space. This enables us to better handle long-range
dependencies and learn a coarse-grained amodal segment from visual features and
visible segments. However, this latent space lacks detailed information about
the object, which makes it difficult to provide a precise segmentation
directly. To address this issue, we propose a convolution refine module to
inject fine-grained information and provide a more precise amodal object
segmentation based on visual features and coarse-predicted segmentation. To
help studies of amodal object segmentation, we create a synthetic amodal
dataset, named MOViD-Amodal (MOViD-A), which can be used for both image and
video amodal object segmentation. We extensively evaluate our model on two
benchmark datasets: KINS and COCO-A. Our empirical results demonstrate the
superiority of C2F-Seg. Moreover, we exhibit the potential of our approach for
video amodal object segmentation tasks on FISHBOWL and our proposed MOViD-A.
Project page at: http://jianxgao.github.io/C2F-Seg.
Comment: Accepted to ICCV 2023.
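The coarse-to-fine step can be pictured roughly as below: a low-resolution amodal prediction from the latent space is upsampled and refined by a small convolutional module conditioned on image features and the visible mask. Layer sizes and inputs are assumptions made for the sketch, not C2F-Seg's actual refine module.

```python
# Sketch of a convolutional refinement step for a coarse amodal prediction;
# illustrative sizes, not the C2F-Seg architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvRefineModule(nn.Module):
    def __init__(self, feat_channels=256):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_channels + 2, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1),
        )

    def forward(self, feats, coarse_amodal, visible_mask):
        # feats: (B, C, H, W) image features; coarse_amodal: (B, 1, h, w) latent-space prediction
        coarse_up = F.interpolate(coarse_amodal, size=feats.shape[-2:],
                                  mode="bilinear", align_corners=False)
        x = torch.cat([feats, coarse_up, visible_mask], dim=1)
        return self.refine(x)  # fine-grained amodal mask logits

if __name__ == "__main__":
    m = ConvRefineModule()
    out = m(torch.randn(2, 256, 64, 64), torch.rand(2, 1, 16, 16), torch.rand(2, 1, 64, 64))
    print(out.shape)  # (2, 1, 64, 64)
```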