2 research outputs found
Exploring Open-Vocabulary Semantic Segmentation without Human Labels
Semantic segmentation is a crucial task in computer vision that involves
segmenting images into semantically meaningful regions at the pixel level.
However, existing approaches often rely on expensive human annotations as
supervision for model training, limiting their scalability to large, unlabeled
datasets. To address this challenge, we present ZeroSeg, a novel method that
leverages an existing pretrained vision-language (VL) model (e.g., CLIP) to
train open-vocabulary zero-shot semantic segmentation models. Although these VL
models have acquired extensive knowledge of visual concepts, it is non-trivial
to exploit that knowledge for the task of semantic segmentation, as such models
are usually trained at the image level. ZeroSeg overcomes this by distilling the visual
concepts learned by VL models into a set of segment tokens, each summarizing a
localized region of the target image. We evaluate ZeroSeg on multiple popular
segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO,
in a zero-shot manner (i.e., no training or adaptation on target segmentation
datasets). Our approach achieves state-of-the-art performance when compared to
other zero-shot segmentation methods under the same training data, while also
performing competitively compared to strongly supervised methods. Finally, we
demonstrate the effectiveness of ZeroSeg on open-vocabulary segmentation
through both human studies and qualitative visualizations.
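A minimal sketch of the distillation idea described above, not the authors' implementation: a set of learned segment tokens pools localized patch features from a student backbone, and each token is pulled toward the image embedding of a frozen VL model such as CLIP via a cosine loss. All names here (SegmentTokenHead, distillation_loss, num_segments) are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentTokenHead(nn.Module):
    # Pools patch features from a student backbone into a fixed set of
    # learned "segment tokens", each attending to a localized region.
    def __init__(self, dim: int, num_segments: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_segments, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, dim) features for N image patches
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, patch_feats, patch_feats)  # (B, S, dim)
        return tokens

def distillation_loss(segment_tokens, vl_image_embedding):
    # Pull every segment token toward the frozen VL model's image-level
    # embedding (e.g. a CLIP image feature) via cosine similarity.
    s = F.normalize(segment_tokens, dim=-1)                    # (B, S, dim)
    t = F.normalize(vl_image_embedding, dim=-1).unsqueeze(1)   # (B, 1, dim)
    return (1.0 - (s * t).sum(dim=-1)).mean()

At inference, each segment token would presumably be compared against text embeddings of candidate class names from the same VL model, which is what allows the open-vocabulary, zero-shot labeling of regions described above.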
EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding
Object understanding in egocentric visual data is arguably a fundamental
research topic in egocentric vision. However, existing object datasets are
either non-egocentric or have limitations in object categories, visual content,
and annotation granularities. In this work, we introduce EgoObjects, a
large-scale egocentric dataset for fine-grained object understanding. Its Pilot
version contains over 9K videos collected by 250 participants from 50+
countries using 4 wearable devices, and over 650K object annotations from 368
object categories. Unlike prior datasets containing only object category
labels, EgoObjects also annotates each object with an instance-level
identifier, and includes over 14K unique object instances. EgoObjects was
designed to capture the same object under diverse background complexities,
surrounding objects, distance, lighting and camera motion. In parallel to the
data collection, we conducted data annotation by developing a multi-stage
federated annotation process to accommodate the growing nature of the dataset.
To bootstrap research on EgoObjects, we present a suite of 4 benchmark tasks
around egocentric object understanding, including a novel instance-level and
the classical category-level object detection task. Moreover, we introduce 2
novel continual learning object detection tasks. The dataset and
API are available at https://github.com/facebookresearch/EgoObjects.
Comment: ICCV 2023 final version and supplement. See more details in project
page: https://github.com/facebookresearch/EgoObject
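As an illustration only (the field names below are assumptions, not the dataset's actual schema), a COCO-style record shows how a single annotated box can carry both a category label and an instance-level identifier, which is the distinction between the classical category-level and the novel instance-level detection tasks.

# Illustrative only: hypothetical annotation record, not the EgoObjects API.
annotation = {
    "image_id": 102345,
    "bbox": [120.0, 64.0, 200.0, 150.0],  # [x, y, width, height] in pixels
    "category_id": 37,                    # one of the 368 object categories
    "instance_id": "instance_00042",      # unique per physical object instance
}
# Category-level detection is evaluated against category_id alone, while the
# instance-level task additionally requires matching instance_id.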