FSS-1000: A 1000-Class Dataset for Few-Shot Segmentation
Over the past few years, we have witnessed the success of deep learning in
image recognition thanks to the availability of large-scale human-annotated
datasets such as PASCAL VOC, ImageNet, and COCO. Although these datasets have
covered a wide range of object categories, there are still a significant number
of objects that are not included. Can we perform the same task without a lot of
human annotations? In this paper, we are interested in few-shot object
segmentation, where the number of annotated training examples is limited to
only 5. To evaluate and validate the performance of our approach, we have built a
few-shot segmentation dataset, FSS-1000, which consists of 1000 object classes
with pixelwise annotation of ground-truth segmentation. Uniquely, FSS-1000
contains a significant number of objects that have never been seen or
annotated in previous datasets, such as tiny daily objects, merchandise,
cartoon characters, logos, etc. We build our baseline model using standard
backbone networks such as VGG-16, ResNet-101, and Inception. To our surprise,
we found that training our model from scratch using FSS-1000 achieves
results comparable to, and even better than, training with weights pre-trained on
ImageNet, which is more than 100 times larger than FSS-1000. Both our approach
and dataset are simple, effective, and easily extensible to learn segmentation
of new object classes given very few annotated training examples. The dataset is
available at https://github.com/HKUSTCV/FSS-1000
A Novel BiLevel Paradigm for Image-to-Image Translation
Image-to-image (I2I) translation is a pixel-level mapping that requires a
large amount of paired training data and often suffers from the problems of
high diversity and strong category bias in image scenes. In order to tackle
these problems, we propose a novel BiLevel (BiL) learning paradigm that
alternates between learning two models, one at an instance-specific (IS) level
and one at a general-purpose (GP) level. In each scene, the IS model learns to
maintain the specific scene attributes. It is initialized by the GP model that
learns from all the scenes to obtain the generalizable translation knowledge.
This GP initialization gives the IS model an efficient starting point, thus
enabling its fast adaptation to the new scene with scarce training data. We
conduct extensive I2I translation experiments on human face and street view
datasets. Quantitative results validate that our approach can significantly
boost the performance of classical I2I translation models, such as PG2 and
Pix2Pix. Our visualization results show both higher image quality and more
appropriate instance-specific details, e.g., the translated image of a person
looks more like that person in terms of identity.
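
The alternation between the GP and IS levels described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the GP model is updated, so the sketch assumes a Reptile-style step that moves the GP weights toward the adapted IS weights. The "scenes" here are hypothetical 1-D linear regression tasks standing in for image scenes, and every function name and hyperparameter is made up for illustration.

```python
import random


def mse(params, data):
    """Mean squared error of the linear model y ~ w*x + b on one scene."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def mse_grad(params, data):
    """Gradient of the mean squared error with respect to (w, b)."""
    w, b = params
    n = len(data)
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2 * (w * x + b - y) for x, y in data) / n
    return gw, gb


def adapt_is(gp_params, scene_data, steps=5, lr=0.1):
    """IS level: start from the GP weights, take a few gradient steps on one scene."""
    w, b = gp_params
    for _ in range(steps):
        gw, gb = mse_grad((w, b), scene_data)
        w, b = w - lr * gw, b - lr * gb
    return w, b


def train_bilevel(scenes, rounds=200, meta_lr=0.5, seed=0):
    """Alternate IS adaptation and GP updates over randomly drawn scenes."""
    rng = random.Random(seed)
    gp = (0.0, 0.0)
    for _ in range(rounds):
        scene = rng.choice(scenes)
        is_params = adapt_is(gp, scene)
        # GP level (ASSUMPTION: Reptile-style update, not taken from the paper):
        # move the shared weights part of the way toward the adapted IS weights.
        gp = tuple(g + meta_lr * (p - g) for g, p in zip(gp, is_params))
    return gp


def make_scene(slope, offset, xs=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Hypothetical scene: five samples of y = slope*x + offset."""
    return [(x, slope * x + offset) for x in xs]


# Training scenes share a slope (the generalizable knowledge) but differ in
# offset (the scene-specific attribute).
scenes = [make_scene(2.0, offset) for offset in (-0.5, 0.0, 0.5)]
gp = train_bilevel(scenes)

# A new scene with only five samples: compare adapting from GP vs. from zeros.
new_scene = make_scene(2.0, 0.3)
loss_from_gp = mse(adapt_is(gp, new_scene), new_scene)
loss_from_scratch = mse(adapt_is((0.0, 0.0), new_scene), new_scene)
print(loss_from_gp, loss_from_scratch)
```

In this toy setting the GP weights absorb the slope shared by all scenes, so the IS model only has to correct the scene-specific offset. That is the sense in which GP initialization gives the IS model an efficient starting point when training data is scarce.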