Generalized Few-shot Semantic Segmentation
Training semantic segmentation models requires a large amount of finely
annotated data, making it hard to quickly adapt to novel classes not satisfying
this condition. Few-Shot Segmentation (FS-Seg) tackles this problem with many
constraints. In this paper, we introduce a new benchmark, called Generalized
Few-Shot Semantic Segmentation (GFS-Seg), to analyze the generalization ability
of simultaneously segmenting the novel categories with very few examples and
the base categories with sufficient examples. It is the first study showing
that previous representative state-of-the-art FS-Seg methods fall short in
GFS-Seg and the performance discrepancy mainly comes from the constrained
setting of FS-Seg. To make GFS-Seg tractable, we set up a GFS-Seg baseline that
achieves decent performance without structural change on the original model.
Then, since context is essential for semantic segmentation, we propose the
Context-Aware Prototype Learning (CAPL) that significantly improves performance
by 1) leveraging the co-occurrence prior knowledge from support samples, and 2)
dynamically enriching contextual information to the classifier, conditioned on
the content of each query image. Both contributions are experimentally
shown to have substantial practical merit. Extensive experiments on Pascal-VOC
and COCO demonstrate the effectiveness of CAPL, and CAPL generalizes well to
FS-Seg by achieving competitive performance. Code will be made publicly
available.
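As a rough illustration of the generalized setting described above, the following sketch classifies query pixels against a label set that mixes base-class prototypes with a novel-class prototype built from a few support pixels. It is not the CAPL method itself: masked average pooling and the cosine-similarity classifier are assumptions borrowed from common prototype-based few-shot segmentation practice, and all names and toy feature values are hypothetical.

```python
import math


def masked_average(features, mask):
    """Average the feature vectors of the pixels where mask == 1."""
    selected = [f for f, m in zip(features, mask) if m == 1]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def classify_pixels(features, prototypes):
    """Label each pixel with the name of its most similar prototype."""
    return [
        max(prototypes, key=lambda name: cosine(f, prototypes[name]))
        for f in features
    ]


# Hypothetical base-class prototypes, standing in for a trained classifier.
base_prototypes = {"background": [1.0, 0.0], "person": [0.0, 1.0]}

# Hypothetical support sample for a novel class: three pixels, two annotated.
support_features = [[1.0, 1.0], [0.9, 1.1], [1.0, 0.0]]
support_mask = [1, 1, 0]
novel_prototype = masked_average(support_features, support_mask)

# Generalized setting: base and novel classes are predicted together.
prototypes = dict(base_prototypes, bird=novel_prototype)

query_features = [[2.0, 0.1], [0.1, 3.0], [1.0, 1.0]]
predicted = classify_pixels(query_features, prototypes)
print(predicted)
```

The point of the sketch is only the label space: the query is scored against base and novel prototypes at once, rather than against the novel classes alone as in the constrained FS-Seg setting.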
Learning Expressive Prompting With Residuals for Vision Transformers
Prompt learning is an efficient approach to adapt transformers by inserting a
learnable set of parameters into the input and intermediate representations of
a pre-trained model. In this work, we present Expressive Prompts with Residuals
(EXPRES) which modifies the prompt learning paradigm specifically for effective
adaptation of vision transformers (ViT). Our method constructs downstream
representations via learnable "output" tokens that are akin to the learned
class tokens of the ViT. Further, to better steer the downstream
representation processed by the frozen transformer, we introduce residual
learnable tokens that are added to the output of various computations. We apply
EXPRES to image classification, few-shot learning, and semantic segmentation,
and show that our method achieves state-of-the-art prompt tuning on
3/3 categories of the VTAB benchmark. In addition to strong performance, we
observe that our approach is an order of magnitude more prompt efficient than
existing visual prompting baselines. We analytically show the computational
benefits of our approach over weight space adaptation techniques like
finetuning. Lastly, we systematically corroborate the architectural design of
our method via a series of ablation experiments.
Comment: Accepted at CVPR (2023)
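The idea of adding residual learnable tokens to the output of a frozen computation can be pictured with a toy example. This is a minimal sketch under stated assumptions, not the EXPRES implementation: the "frozen computation" is a single fixed linear map, the residual is applied to one designated token, and every name and value below is hypothetical.

```python
def frozen_layer(tokens, weight):
    """Apply a fixed (non-trainable) linear map to every token."""
    return [
        [sum(w * x for w, x in zip(row, tok)) for row in weight]
        for tok in tokens
    ]


def forward_with_residual(tokens, weight, residual, prompt_index):
    """Run the frozen layer, then add a learnable residual to one token.

    Only `residual` would be trained; `weight` stays untouched.
    """
    out = frozen_layer(tokens, weight)
    out[prompt_index] = [o + r for o, r in zip(out[prompt_index], residual)]
    return out


# Hypothetical frozen weights and inputs: token 0 plays the role of a
# learnable "output" token, token 1 is an ordinary input token.
weight = [[2.0, 0.0], [0.0, 2.0]]
tokens = [[1.0, 1.0], [0.5, -0.5]]
residual = [0.1, -0.2]

plain = frozen_layer(tokens, weight)
steered = forward_with_residual(tokens, weight, residual, prompt_index=0)
print(plain)
print(steered)
```

In this toy setup the trainable state is just the two numbers in `residual`, while the four entries of `weight` stay fixed, which mirrors the contrast the abstract draws between prompt-style adaptation and weight-space finetuning.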