Unifying Training and Inference for Panoptic Segmentation
We present an end-to-end network to bridge the gap between training and
inference pipelines for panoptic segmentation, a task that seeks to partition an
image into semantic regions for "stuff" and object instances for "things". In
contrast to recent works, our network exploits a parametrised, yet lightweight
panoptic segmentation submodule, powered by an end-to-end learnt dense instance
affinity, to capture the probability that any pair of pixels belong to the same
instance. This panoptic submodule gives rise to a novel propagation mechanism
for panoptic logits and enables the network to output a coherent panoptic
segmentation map for both "stuff" and "thing" classes, without any
post-processing. Reaping the benefits of end-to-end training, our full system
sets new records on the popular street scene dataset, Cityscapes, achieving
61.4 PQ with a ResNet-50 backbone using only the fine annotations. On the
challenging COCO dataset, our ResNet-50-based network also delivers
state-of-the-art accuracy of 43.4 PQ. Moreover, our network works flexibly both
with and without object mask cues, performing competitively in both settings,
which is of interest for applications with limited computation budgets.
Comment: CVPR 2020
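As a rough illustration of the propagation mechanism described above, the sketch below computes refined panoptic logits as an affinity-weighted average over all pixels. This is a simplified sketch, not the authors' code: the embedding-based affinity, the row-softmax normalization, and all shapes are assumptions.

```python
import torch

def propagate_panoptic_logits(logits, embeddings):
    """logits: (N, C) per-pixel panoptic logits; embeddings: (N, D) pixel features."""
    # Dense instance affinity: likelihood that two pixels belong to the same
    # instance, modeled here as a row-softmax over embedding dot products.
    affinity = torch.softmax(embeddings @ embeddings.t(), dim=-1)  # (N, N)
    # Each pixel's logits become an affinity-weighted average over all pixels,
    # so pixels likely to share an instance converge to consistent predictions.
    return affinity @ logits  # (N, C)

# Toy usage: a 16x16 image flattened to N = 256 pixels, C = 5 classes, D = 8 dims.
N, C, D = 256, 5, 8
refined = propagate_panoptic_logits(torch.randn(N, C), torch.randn(N, D))
print(refined.shape)  # torch.Size([256, 5])
```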
DFormer: Diffusion-guided Transformer for Universal Image Segmentation
This paper introduces an approach, named DFormer, for universal image
segmentation. The proposed DFormer views the universal image segmentation task as a
denoising process using a diffusion model. DFormer first adds various levels of
Gaussian noise to ground-truth masks, and then learns a model to predict
denoising masks from corrupted masks. Specifically, we take deep pixel-level
features along with the noisy masks as inputs to generate mask features and
attention masks, employing a diffusion-based decoder to perform mask prediction
gradually. At inference, our DFormer directly predicts the masks and
corresponding categories from a set of randomly generated masks. Extensive
experiments reveal the merits of our proposed contributions on different image
segmentation tasks: panoptic segmentation, instance segmentation, and semantic
segmentation. Our DFormer outperforms the recent diffusion-based panoptic
segmentation method Pix2Seq-D with a gain of 3.6% on the MS COCO val2017 set.
Further, DFormer achieves promising semantic segmentation performance,
outperforming the recent diffusion-based method by 2.2% on the ADE20K val set.
Our source code and models will be publicly available at https://github.com/cp3wan/DFormer
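To make the denoising-training idea concrete, here is a minimal sketch of one training step under the description above: corrupt the ground-truth masks with Gaussian noise at a random diffusion timestep, then regress the clean masks from the corrupted ones plus pixel features. The noise schedule, the tiny convolutional stand-in for the transformer decoder, and the mask regression loss are all assumptions, not the released DFormer code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                               # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)  # standard linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Stand-in for the diffusion-guided transformer decoder: takes the noisy mask
# concatenated with pixel-level features and predicts the denoised mask.
denoiser = nn.Sequential(
    nn.Conv2d(1 + 3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

def training_step(gt_masks, pixel_features):
    """gt_masks: (B, 1, H, W) in [0, 1]; pixel_features: (B, 3, H, W)."""
    B = gt_masks.size(0)
    t = torch.randint(0, T, (B,))                   # random timestep per sample
    a = alpha_bar[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * gt_masks + (1 - a).sqrt() * torch.randn_like(gt_masks)
    pred = denoiser(torch.cat([noisy, pixel_features], dim=1))
    return F.mse_loss(pred, gt_masks)               # regress the clean masks

loss = training_step(torch.rand(2, 1, 32, 32), torch.randn(2, 3, 32, 32))
```

At inference, the same denoiser would be applied iteratively, starting from randomly generated masks, as the abstract describes.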
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
Open-vocabulary segmentation is a challenging task that requires segmenting and
recognizing objects from an open set of categories. One way to address this
challenge is to leverage multi-modal models, such as CLIP, to provide image and
text features in a shared embedding space, which bridges the gap between
closed-vocabulary and open-vocabulary recognition. Hence, existing methods
often adopt a two-stage framework to tackle the problem, where the inputs first
go through a mask generator and then through the CLIP model along with the
predicted masks. This process involves extracting features from images multiple
times, which can be ineffective and inefficient. By contrast, we propose to
build everything into a single-stage framework using a shared frozen
convolutional CLIP backbone, which not only significantly simplifies the
current two-stage pipeline but also yields a remarkably better accuracy-cost
trade-off. The proposed FC-CLIP benefits from the following observations: the
frozen CLIP backbone maintains the ability of open-vocabulary classification
and can also serve as a strong mask generator, and the convolutional CLIP
generalizes well to a larger input resolution than the one used during
contrastive image-text pretraining. When training on COCO panoptic data only
and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1
mIoU on ADE20K, 18.2 PQ, 27.9 mIoU on Mapillary Vistas, 44.0 PQ, 26.8 AP, 56.2
mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, +4.2 mIoU
on ADE20K, +4.0 PQ on Mapillary Vistas and +20.1 PQ on Cityscapes,
respectively. Additionally, the training and testing times of FC-CLIP are 7.5x
and 6.6x faster than those of the same prior art, while using 5.9x fewer
parameters. FC-CLIP also sets a new state of the art across various
open-vocabulary semantic segmentation datasets. Code at
https://github.com/bytedance/fc-clip
Comment: code and model available at https://github.com/bytedance/fc-clip
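The single-stage design can be sketched schematically as follows: one frozen convolutional backbone is run once per image, and its features feed both a class-agnostic mask head and open-vocabulary classification by cosine similarity against text embeddings. Everything below is a toy stand-in (the convolutional layer in place of the CLIP vision encoder, the query-based mask head, the mask pooling), not the FC-CLIP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStageOpenVocabSeg(nn.Module):
    def __init__(self, embed_dim=64, num_queries=10):
        super().__init__()
        # Stand-in for the frozen convolutional CLIP vision encoder.
        self.backbone = nn.Conv2d(3, embed_dim, 3, padding=1)
        for p in self.backbone.parameters():
            p.requires_grad = False            # kept frozen, as in the paper
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))

    def forward(self, image, text_embeds):
        feats = self.backbone(image)           # (B, D, H, W), extracted once
        # Class-agnostic masks from query/pixel dot products (simplified head).
        masks = torch.einsum("qd,bdhw->bqhw", self.queries, feats)
        # Open-vocabulary classification: mask-pooled features scored against
        # (frozen) CLIP text embeddings by cosine similarity.
        pooled = torch.einsum("bqhw,bdhw->bqd", masks.sigmoid(), feats)
        logits = F.normalize(pooled, dim=-1) @ F.normalize(text_embeds, dim=-1).t()
        return masks, logits                   # (B, Q, H, W), (B, Q, num_classes)

model = SingleStageOpenVocabSeg()
masks, logits = model(torch.randn(1, 3, 32, 32), torch.randn(7, 64))  # 7 class prompts
```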
Part-aware Panoptic Segmentation
In this work, we introduce the new scene understanding task of Part-aware
Panoptic Segmentation (PPS), which aims to understand a scene at multiple
levels of abstraction, and unifies the tasks of scene parsing and part parsing.
For this novel task, we provide consistent annotations on two commonly used
datasets: Cityscapes and Pascal VOC. Moreover, we present a single metric to
evaluate PPS, called Part-aware Panoptic Quality (PartPQ). For this new task,
using the metric and annotations, we set multiple baselines by merging results
of existing state-of-the-art methods for panoptic segmentation and part
segmentation. Finally, we conduct several experiments that evaluate the
importance of the different levels of abstraction in this single task.
Comment: CVPR 2021. Code and data: https://github.com/tue-mps/panoptic_parts
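Since PartPQ extends Panoptic Quality (PQ), a small sketch of PQ-style scoring helps make the metric concrete. PQ sums the IoUs of matched (IoU > 0.5) segment pairs and divides by |TP| + 0.5|FP| + 0.5|FN|; per the paper, PartPQ replaces the IoU term for classes with parts by a part-level IoU inside the matched segment. The flat data layout and single-class setting below are simplifying assumptions.

```python
def iou(a, b):
    """IoU of two segments given as sets of pixel indices."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def pq(pred_segments, gt_segments):
    """Segments are dicts id -> set of pixels; one class assumed for brevity."""
    tp, iou_sum = 0, 0.0
    for g in gt_segments.values():
        for p in pred_segments.values():
            score = iou(g, p)
            if score > 0.5:          # IoU > 0.5 guarantees a unique match
                tp += 1
                iou_sum += score     # PartPQ would use a part-level IoU here
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    total = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / total if total else 0.0

print(pq({1: {0, 1, 2, 3}}, {1: {1, 2, 3, 4}}))  # one matched pair, IoU = 0.6
```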
Combinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach
We propose a fully differentiable architecture for simultaneous semantic and
instance segmentation (a.k.a. panoptic segmentation) consisting of a
convolutional neural network and an asymmetric multiway cut problem solver. The
latter solves a combinatorial optimization problem that elegantly incorporates
semantic and boundary predictions to produce a panoptic labeling. Our
formulation allows us to directly maximize a smooth surrogate of the panoptic
quality metric by backpropagating the gradient through the optimization
problem. Experimental evaluation shows that backpropagating through the
optimization problem yields improvements over comparable approaches on the Cityscapes and COCO
datasets. Overall, our approach shows the utility of using combinatorial
optimization in tandem with deep learning in a challenging large scale
real-world problem and showcases benefits and insights into training such an
architecture.
Comment: To be presented at NeurIPS 2021
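As a sketch of how a gradient can be passed through a combinatorial solver at all, the snippet below wraps a blackbox solver in a custom autograd function in the spirit of blackbox-solver differentiation (Vlastelica et al.): the backward pass re-invokes the solver on perturbed costs and returns a finite-difference-style gradient. The trivial per-node argmin stands in for an asymmetric multiway cut solver, and whether this exact scheme matches the paper's differentiation strategy is an assumption.

```python
import torch
import torch.nn.functional as F

def solver(costs):
    # Stand-in solver: min-cost label per node as a one-hot indicator; a real
    # asymmetric multiway cut solver would also respect boundary terms.
    return F.one_hot(costs.argmin(dim=-1), costs.size(-1)).float()

class BlackboxSolver(torch.autograd.Function):
    LAMBDA = 10.0  # interpolation strength hyperparameter

    @staticmethod
    def forward(ctx, costs):
        y = solver(costs)
        ctx.save_for_backward(costs, y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        costs, y = ctx.saved_tensors
        # Re-solve with costs nudged in the direction of the incoming gradient
        # and return the (negated, scaled) change in the solution.
        y_perturbed = solver(costs + BlackboxSolver.LAMBDA * grad_output)
        return -(y - y_perturbed) / BlackboxSolver.LAMBDA

costs = torch.randn(4, 3, requires_grad=True)           # 4 nodes, 3 labels
loss = (BlackboxSolver.apply(costs) * torch.randn(4, 3)).sum()
loss.backward()
print(costs.grad.shape)                                  # torch.Size([4, 3])
```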