MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation
Few-shot instance segmentation extends the few-shot learning paradigm to the
instance segmentation task, which tries to segment instance objects from a
query image with a few annotated examples of novel categories. Conventional
approaches have attempted to address the task via prototype learning, known as
point estimation. However, this mechanism depends on prototypes (e.g., the mean
of the shots) for prediction, leading to performance instability. To overcome the
disadvantage of the point estimation mechanism, we propose a novel approach,
dubbed MaskDiff, which models the underlying conditional distribution of a
binary mask, which is conditioned on an object region and shot information.
Inspired by augmentation approaches that perturb data with Gaussian noise for
populating low data density regions, we model the mask distribution with a
diffusion probabilistic model. We also propose to utilize classifier-free
guided mask sampling to integrate category information into the binary mask
generation process. Without bells and whistles, our proposed method
consistently outperforms state-of-the-art methods on both base and novel
classes of the COCO dataset while simultaneously being more stable than
existing methods. The source code is available at:
https://github.com/minhquanlecs/MaskDiff

Comment: Accepted at AAAI 2024 (oral presentation).
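To make the guided sampling step concrete, below is a minimal PyTorch sketch of classifier-free guided mask generation with a DDPM-style sampler. The `denoiser`, `region_feat`, `class_emb`, and `null_emb` names, the linear noise schedule, and the guidance scale are illustrative assumptions, not MaskDiff's actual implementation.

```python
import torch

@torch.no_grad()
def cfg_mask_sampling(denoiser, region_feat, class_emb, null_emb,
                      guidance_scale=2.0, steps=50, shape=(1, 1, 64, 64)):
    """Classifier-free guided DDPM-style sampling of a binary mask.

    `denoiser` is a hypothetical noise-prediction network conditioned on an
    object region and a (possibly null) category embedding.
    """
    betas = torch.linspace(1e-4, 0.02, steps)   # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                      # start from pure Gaussian noise
    for t in reversed(range(steps)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        # Two passes: with and without the category condition.
        eps_cond = denoiser(x, t_batch, region_feat, class_emb)
        eps_uncond = denoiser(x, t_batch, region_feat, null_emb)
        # Classifier-free guidance: extrapolate toward the conditional score.
        eps = (1 + guidance_scale) * eps_cond - guidance_scale * eps_uncond

        a_t, ab_t = alphas[t], alpha_bars[t]
        mean = (x - (1 - a_t) / torch.sqrt(1 - ab_t) * eps) / torch.sqrt(a_t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise

    return (x > 0).float()                      # threshold into a binary mask
```

Thresholding at zero is one simple way to binarize the continuous sample; the paper may use a different discretization.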
Robotic Interestingness via Human-Informed Few-Shot Object Detection
Interestingness recognition is crucial for decision making in autonomous
exploration for mobile robots. Previous work proposed an unsupervised online
learning approach that adapts to environments and detects interesting scenes
quickly, but it lacks the ability to adapt to human-informed interesting objects.
To solve this problem, we introduce a human-interactive framework,
AirInteraction, that can detect human-informed objects via few-shot online
learning. To reduce the communication bandwidth, we first apply an online
unsupervised learning algorithm on the unmanned vehicle for interestingness
recognition and then send only the potentially interesting scenes to a
base station for human inspection. The human operator is able to draw and
provide bounding box annotations for particular interesting objects, which are
sent back to the robot to detect similar objects via few-shot learning. Using
only a few human-labeled examples, the robot can learn novel interesting object
categories during the mission and detect interesting scenes that contain the
objects. We evaluate our method on various interesting scene recognition
datasets. To the best of our knowledge, it is the first human-informed few-shot
object detection framework for autonomous exploration.
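The control flow described above can be summarized in a short sketch; every name below (`onboard_scorer`, `uplink`, `downlink`, the threshold value) is an illustrative placeholder rather than the AirInteraction API.

```python
INTERESTINGNESS_THRESHOLD = 0.8  # hypothetical cutoff, not from the paper

def exploration_loop(frames, onboard_scorer, fewshot_detector, uplink, downlink):
    """Bandwidth-saving loop: score scenes onboard, send only candidates to
    the base station, and fold returned human box annotations into the
    few-shot detector as novel-category examples."""
    for frame_id, frame in enumerate(frames):
        # Stage 1: cheap unsupervised interestingness scoring on the robot.
        if onboard_scorer(frame) > INTERESTINGNESS_THRESHOLD:
            uplink.send(frame_id, frame)        # only candidates cross the link
        # Stage 2: human-drawn bounding boxes arrive asynchronously and become
        # few-shot training examples for novel interesting object categories.
        for annotated_id, boxes in downlink.poll():
            fewshot_detector.add_examples(annotated_id, boxes)
        # Stage 3: detect scenes containing the human-informed objects.
        yield frame_id, fewshot_detector.detect(frame)
```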
Closing the Generalization Gap in One-Shot Object Detection
Despite substantial progress in object detection and few-shot learning,
detecting objects based on a single example (one-shot object detection)
remains a challenge: trained models exhibit a substantial generalization gap,
where object categories used during training are detected much more reliably
than novel ones. Here we show that this generalization gap can be nearly closed
by increasing the number of object categories used during training. Our results
show that the models switch from memorizing individual categories to learning
object similarity over the category distribution, enabling strong
generalization at test time. Importantly, in this regime, standard ways to
improve object detection models, such as stronger backbones or longer training
schedules, also benefit novel categories, which was not the case for smaller
datasets like COCO. Our results suggest that the key to strong few-shot
detection models may not lie in sophisticated metric learning approaches, but
instead in scaling the number of categories. Future data annotation efforts
should therefore focus on wider datasets, annotating a larger number of
categories rather than gathering more images or instances per category.
Cos R-CNN for online few-shot object detection
We propose Cos R-CNN, a simple exemplar-based R-CNN formulation designed for online few-shot object detection: it localises and classifies novel object categories in images from few examples, without fine-tuning. Cos R-CNN frames detection as a learning-to-compare task: unseen classes are represented as exemplar images, and objects are detected based on their similarity to these exemplars. The cosine-based classification head allows for dynamic adaptation of the classification parameters to the exemplar embedding, and encourages the clustering of similar classes in embedding space without manual tuning of distance-metric hyperparameters. This simple formulation achieves the best results on the recently proposed 5-way ImageNet few-shot detection benchmark, beating the online 1/5/10-shot scenarios by more than 8/3/1%, and performs up to 20% better in online 20-way few-shot VOC across all shots on novel classes.
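A minimal sketch of such a cosine-based classification head follows; the feature dimension, the temperature `scale`, and the function names are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def cosine_head(roi_feats, exemplar_feats, scale=20.0):
    """Score each region proposal against each class exemplar by cosine
    similarity, so classification adapts dynamically to the exemplars.

    roi_feats:      (num_rois, d) pooled proposal features
    exemplar_feats: (num_classes, d) embeddings of the exemplar images
    """
    roi = F.normalize(roi_feats, dim=-1)      # unit-normalize both sides so the
    ex = F.normalize(exemplar_feats, dim=-1)  # dot product is cosine similarity
    return scale * roi @ ex.t()               # (num_rois, num_classes) logits

# Usage: class scores over 5 exemplar classes for 100 proposals.
logits = cosine_head(torch.randn(100, 256), torch.randn(5, 256))
probs = logits.softmax(dim=-1)
```

Because cosine similarity is bounded in [-1, 1], a single fixed temperature replaces per-class magnitude tuning, which is what removes the need for hand-tuned distance-metric hyperparameters.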
Few-Shot Object Detection in Unseen Domains
Few-shot object detection (FSOD) has thrived in recent years to learn novel
object classes with limited data by transferring knowledge gained on abundant
base classes. FSOD approaches commonly assume that both the few provided
examples of novel classes and the test-time data belong to the same domain.
However, this assumption does not hold in various industrial and robotics
applications, where a model can learn novel classes from a source domain while
inferring on classes from a target domain. In this work, we address the task of
zero-shot domain adaptation, also known as domain generalization, for FSOD.
Specifically, we assume that neither images nor labels of the novel classes in
the target domain are available during training. Our approach for solving the
domain gap is two-fold. First, we leverage a meta-training paradigm, where we
learn the domain shift on the base classes, then transfer the domain knowledge
to the novel classes. Second, we propose various data augmentation techniques
on the few shots of novel classes to account for all possible domain-specific
information. To constrain the network to encode only domain-agnostic,
class-specific representations, we propose a contrastive loss that maximizes
the mutual information between foreground proposals and class embeddings and
reduces the network's bias toward background information from the target domain. Our
experiments on the T-LESS, PASCAL-VOC, and ExDark datasets show that the
proposed approach succeeds in alleviating the domain gap considerably without
utilizing labels or images of novel categories from the target domain.
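The contrastive objective can be sketched as an InfoNCE-style loss between foreground proposal features and class embeddings; the names and the temperature below are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def proposal_class_contrastive_loss(fg_feats, class_embs, labels, tau=0.1):
    """InfoNCE-style loss pulling each foreground proposal toward its own
    class embedding and away from all others.

    fg_feats:   (n, d) features of foreground proposals
    class_embs: (c, d) one embedding per class
    labels:     (n,) class index of each proposal
    """
    fg = F.normalize(fg_feats, dim=-1)
    ce = F.normalize(class_embs, dim=-1)
    logits = fg @ ce.t() / tau   # (n, c) similarity to every class
    # The InfoNCE objective lower-bounds the mutual information between
    # proposal features and class identities, discouraging the network from
    # encoding domain-specific background cues.
    return F.cross_entropy(logits, labels)
```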
Multi-Faceted Distillation of Base-Novel Commonality for Few-shot Object Detection
Most existing methods for few-shot object detection follow the fine-tuning
paradigm, which potentially assumes that the class-agnostic generalizable
knowledge can be learned and transferred implicitly from base classes with
abundant samples to novel classes with limited samples via such a two-stage
training strategy. However, this assumption does not necessarily hold, since the
object detector can hardly distinguish between class-agnostic knowledge and
class-specific knowledge automatically without explicit modeling. In this work,
we propose to learn three types of class-agnostic commonalities between base
and novel classes explicitly: recognition-related semantic commonalities,
localization-related semantic commonalities and distribution commonalities. We
design a unified distillation framework based on a memory bank, which is able
to perform distillation of all three types of commonalities jointly and
efficiently. Extensive experiments demonstrate that our method can be readily
integrated into most existing fine-tuning-based methods and consistently
improves the performance by a large margin.
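One way to picture the memory-bank distillation is the sketch below: a bank of running class prototypes plus a similarity-distribution matching loss. This is an assumed, simplified form of one commonality term, not the paper's unified framework.

```python
import torch
import torch.nn.functional as F

class ClassMemoryBank:
    """Running per-class prototype store; the momentum update and all names
    here are assumptions, not the paper's exact design."""

    def __init__(self, num_classes, dim, momentum=0.99):
        self.protos = torch.zeros(num_classes, dim)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, feats, labels):
        # Exponential moving average of each class's mean feature.
        for c in labels.unique():
            mean_c = feats[labels == c].mean(dim=0)
            self.protos[c] = (self.momentum * self.protos[c]
                              + (1 - self.momentum) * mean_c)

def commonality_distillation_loss(student_feats, teacher_feats, bank, tau=0.5):
    """Match the student's similarity distribution over the banked prototypes
    to the teacher's, so base-novel commonalities rather than raw features
    are what gets transferred."""
    protos = F.normalize(bank.protos, dim=-1)
    s = F.log_softmax(F.normalize(student_feats, dim=-1) @ protos.t() / tau, -1)
    t = F.softmax(F.normalize(teacher_feats, dim=-1) @ protos.t() / tau, -1)
    return F.kl_div(s, t, reduction="batchmean")
```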