Zero-Shot Semantic Segmentation
Semantic segmentation models are limited in their ability to scale to large numbers of object classes. In this paper, we introduce the new task of zero-shot semantic segmentation: learning pixel-wise classifiers for never-seen object categories with zero training examples. To this end, we present a novel architecture, ZS3Net, combining a deep visual segmentation model with an approach to generate visual representations from semantic word embeddings. In this way, ZS3Net addresses pixel classification tasks where both seen and unseen categories are faced at test time (so-called "generalized" zero-shot classification). Performance is further improved by a self-training step that relies on automatic pseudo-labeling of pixels from unseen classes. On two standard segmentation datasets, Pascal-VOC and Pascal-Context, we propose zero-shot benchmarks and set competitive baselines. For complex scenes such as those in the Pascal-Context dataset, we extend our approach with a graph-context encoding to fully leverage the spatial context priors coming from class-wise segmentation maps. Code and models are available at: https://github.com/valeoai/zero_shot_semantic_segmentatio
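The generative idea behind ZS3Net can be sketched roughly as follows. All names, dimensions, and the linear generator below are hypothetical stand-ins for the paper's learned components: synthetic pixel features are generated for unseen classes from their word embeddings, and a simple nearest-prototype classifier then covers seen and unseen classes alike.

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim, feat_dim = 8, 16
seen, unseen = ["road", "car", "sidewalk"], ["bicycle"]
# Hypothetical semantic word embeddings (e.g., word2vec-style vectors).
word_emb = {c: rng.normal(size=emb_dim) for c in seen + unseen}

# Stand-in for the learned generator: maps a word embedding into the
# visual feature space of the segmentation backbone.
W = rng.normal(size=(feat_dim, emb_dim))

def generate_features(cls, n):
    """Sample n synthetic pixel features for a class from its word embedding."""
    mu = W @ word_emb[cls]
    return mu + 0.05 * rng.normal(size=(n, feat_dim))

# Classifier over seen + unseen classes: nearest (generated) class prototype.
classes = seen + unseen
prototypes = np.stack([generate_features(c, 1)[0] for c in classes])
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

def classify_pixels(feats):
    """Assign each pixel feature to the most similar class prototype."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return (feats @ prototypes.T).argmax(axis=1)  # indices into `classes`

pixels = generate_features("bicycle", 5)  # pretend these came from an image
print([classes[i] for i in classify_pixels(pixels)])
```

The real model trains a deep generator and a pixel-wise classifier; this toy version only illustrates how word embeddings let a classifier cover classes that contributed no training pixels.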
What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation
While semantic segmentation has seen tremendous improvements in the past,
significant labeling effort is still necessary, and generalization to classes
that have not been present during training remains limited.
To address this problem, zero-shot semantic segmentation makes use of large
self-supervised vision-language models, allowing zero-shot transfer to unseen
classes. In this work, we build a benchmark for Multi-domain Evaluation of
Semantic Segmentation (MESS), which allows a holistic analysis of performance
across a wide range of domain-specific datasets such as medicine, engineering,
earth monitoring, biology, and agriculture. To do this, we reviewed 120
datasets, developed a taxonomy, and classified the datasets according to the
developed taxonomy. We select a representative subset consisting of 22 datasets
and propose it as the MESS benchmark. We evaluate eight recently published
models on the proposed MESS benchmark and analyze which characteristics drive
the performance of zero-shot transfer models. The toolkit is available at
https://github.com/blumenstiel/MESS
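Benchmarks such as MESS typically compare models by mean intersection-over-union (mIoU), the standard semantic segmentation metric. A minimal reference implementation (not taken from the MESS toolkit) looks like this:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union, averaged over classes that occur
    in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
print(mean_iou(pred, gt, num_classes=2))  # averages IoU of 1/2 and 2/3
```

Averaging per class (rather than per pixel) is what makes the metric sensitive to rare classes, which matters on the domain-specific datasets the benchmark covers.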
ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts
Recent success of large-scale Contrastive Language-Image Pre-training (CLIP)
has led to great promise in zero-shot semantic segmentation by transferring
image-text aligned knowledge to pixel-level classification. However, existing
methods usually require an additional image encoder or retraining/tuning the
CLIP module. Here, we propose a novel Zero-shot segmentation with Optimal
Transport (ZegOT) method that matches multiple text prompts with frozen image
embeddings through optimal transport. In particular, we introduce a novel
Multiple Prompt Optimal Transport Solver (MPOT), which is designed to learn an
optimal mapping between multiple text prompts and visual feature maps of the
frozen image encoder's hidden layers. This mapping enables each text prompt to
focus on a distinct visual semantic attribute. Through extensive experiments on
benchmark datasets, we show that our method achieves state-of-the-art (SOTA)
performance over existing Zero-shot Semantic Segmentation (ZS3) approaches.
Comment: 18 pages, 8 figures.
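The optimal-transport matching at the core of ZegOT can be illustrated with a plain Sinkhorn iteration. The embeddings, dimensions, and cost below are toy assumptions, not the paper's MPOT solver; the sketch only shows how a transport plan softly assigns image locations to multiple text prompts.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=500):
    """Entropy-regularized optimal transport with uniform marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):  # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan, rows sum to a

rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 32))   # hypothetical text-prompt embeddings
feats = rng.normal(size=(10, 32))    # frozen per-location image features
prompts /= np.linalg.norm(prompts, axis=1, keepdims=True)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
cost = 1.0 - prompts @ feats.T       # cosine cost: prompts vs. locations
plan = sinkhorn(cost)
```

Because the plan's marginals are constrained, every prompt must claim some mass of the feature map, which is one intuition for why multiple prompts end up covering distinct visual attributes.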
Exploring Open-Vocabulary Semantic Segmentation without Human Labels
Semantic segmentation is a crucial task in computer vision that involves
segmenting images into semantically meaningful regions at the pixel level.
However, existing approaches often rely on expensive human annotations as
supervision for model training, limiting their scalability to large, unlabeled
datasets. To address this challenge, we present ZeroSeg, a novel method that
leverages the existing pretrained vision-language (VL) model (e.g. CLIP) to
train open-vocabulary zero-shot semantic segmentation models. Although these
VL models acquire extensive knowledge of visual concepts, it is non-trivial to
transfer that knowledge to the task of semantic segmentation, as they are
usually trained at the image level. ZeroSeg overcomes this by distilling the visual
concepts learned by VL models into a set of segment tokens, each summarizing a
localized region of the target image. We evaluate ZeroSeg on multiple popular
segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO,
in a zero-shot manner (i.e., no training or adaptation on target segmentation
datasets). Our approach achieves state-of-the-art performance when compared to
other zero-shot segmentation methods under the same training data, while also
performing competitively compared to strongly supervised methods. Finally, we
demonstrate the effectiveness of ZeroSeg on open-vocabulary segmentation
through both human studies and qualitative visualizations.
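A crude sketch of the segment-token idea, with mean pooling standing in for ZeroSeg's learned tokens and a toy distillation target standing in for the CLIP image embedding (both are assumptions, not the paper's architecture):

```python
import numpy as np

def segment_tokens(feats, seg_ids):
    """Summarize each region by mean-pooling its pixel features: a crude
    stand-in for learned segment tokens that each cover a localized region."""
    return np.stack([feats[seg_ids == s].mean(axis=0)
                     for s in np.unique(seg_ids)])

def distill_loss(tokens, clip_image_emb):
    """Pull every segment token toward the image-level embedding:
    1 - cosine similarity, averaged over tokens."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    g = clip_image_emb / np.linalg.norm(clip_image_emb)
    return float(np.mean(1.0 - t @ g))

feats = np.array([[2.0, 0.0], [2.0, 0.0], [4.0, 0.0]])  # 3 pixel features
seg_ids = np.array([0, 0, 1])                           # 2 regions
tokens = segment_tokens(feats, seg_ids)
print(distill_loss(tokens, np.array([1.0, 0.0])))
```

The point of pooling per region before distilling is that the image-level supervision from the VL model gets attached to localized tokens, which is what later enables pixel-level predictions without pixel labels.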
Decoupling Zero-Shot Semantic Segmentation
Zero-shot semantic segmentation (ZS3) aims to segment novel categories that have not been seen during training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only on text. While simple, the pixel-level ZS3 formulation has limited capability to integrate vision-language models that are often pre-trained with image-text pairs and currently demonstrate great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic grouping task that groups pixels into segments, and 2) a zero-shot classification task on segments. The former sub-task does not involve category information and can be directly transferred to group pixels of unseen classes. The latter sub-task operates at the segment level and provides a natural way to leverage large-scale vision-language models pre-trained with image-text pairs (e.g., CLIP) for ZS3. Based on this decoupled formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms previous methods on standard ZS3 benchmarks by large margins, e.g., 35 points on PASCAL VOC and 3 points on COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer
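The decoupling can be sketched in a few lines. The segment map, segment embeddings, and class text embeddings below are toy placeholders (the real model obtains them from a mask transformer and CLIP's text encoder); the sketch shows only the second sub-task, segment-level classification, and how its labels lift back to pixels.

```python
import numpy as np

def classify_segments(seg_embs, text_embs):
    """Segment-level zero-shot classification: cosine similarity between each
    segment embedding and the class text embeddings, then argmax."""
    s = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return (s @ t.T).argmax(axis=1)

def to_semantic_map(seg_map, seg_labels):
    """Lift per-segment class labels back to a pixel-wise semantic map."""
    return seg_labels[seg_map]

seg_map = np.array([[0, 0, 1],
                    [2, 2, 1]])   # output of the class-agnostic grouping step
seg_embs = np.eye(3)              # toy embeddings, one row per segment
text_embs = np.eye(3)             # toy class text embeddings
labels = classify_segments(seg_embs, text_embs)
semantic = to_semantic_map(seg_map, labels)
```

Because the grouping step never touches class labels, it transfers to unseen categories for free, and only the cheap per-segment classification has to handle the open vocabulary.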