Context-Aware Zero-Shot Recognition
We present a novel problem setting in zero-shot learning: zero-shot object recognition and detection in context. Contrary to traditional zero-shot learning methods, which simply infer unseen categories by transferring knowledge from objects belonging to semantically similar seen categories, we aim to identify novel objects in an image that are surrounded by known objects, using an inter-object relation prior. Specifically, we leverage the visual context and the geometric relationships between all pairs of objects in a single image, and capture the information useful for inferring unseen categories. We seamlessly integrate our context-aware zero-shot learning framework into traditional zero-shot learning techniques using a Conditional Random Field (CRF). The proposed algorithm is evaluated on both zero-shot region classification and zero-shot detection tasks. Results on the Visual Genome (VG) dataset show that our model significantly boosts performance with the additional visual context compared to traditional methods.
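The CRF described above couples per-region zero-shot scores (unary terms) with inter-object relation priors (pairwise terms). As a rough illustration of how such context could be folded in, here is a minimal mean-field-style sketch in NumPy; the co-occurrence prior `pairwise`, the weight `alpha`, and the update rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def crf_context_inference(unary, pairwise, n_iters=10, alpha=0.5):
    """Mean-field-style inference over region labels.

    unary:    (R, C) zero-shot compatibility scores, one row per region,
              e.g. dot products between region features and class embeddings.
    pairwise: (C, C) inter-object relation prior, e.g. co-occurrence
              statistics of class pairs mined from seen-category data.
    Returns (R, C) marginal label distributions per region.
    """
    q = np.exp(unary - unary.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)          # initialize from unaries
    for _ in range(n_iters):
        # message from every other region, weighted by the relation prior
        ctx = (q.sum(axis=0, keepdims=True) - q) @ pairwise   # (R, C)
        logits = unary + alpha * ctx
        q = np.exp(logits - logits.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    return q
```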
Zero-Shot Object Detection by Hybrid Region Embedding
Object detection is considered one of the most challenging problems in computer vision, since it requires correct prediction of both the classes and the locations of objects in images. In this study, we define a more difficult scenario, namely zero-shot object detection (ZSD), where no visual training data is available for some of the target object classes. We present a novel approach to this ZSD problem, in which a convex combination of embeddings is used in conjunction with a detection framework. To evaluate ZSD methods, we propose a simple dataset constructed from Fashion-MNIST images as well as a custom zero-shot split for the Pascal VOC detection challenge. The experimental results suggest that our method yields promising results for ZSD.
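The convex-combination idea is in the spirit of ConSE: a region's embedding is built as a probability-weighted average of seen-class word vectors and then matched against unseen-class vectors. A hedged sketch of that component follows; the function name and the top-k truncation are our assumptions, not the paper's exact recipe.

```python
import numpy as np

def hybrid_region_embedding(seen_probs, seen_word_vecs, unseen_word_vecs, top_k=5):
    """Map a region to an unseen class via a convex combination of
    seen-class word embeddings (ConSE-style sketch).

    seen_probs:       (S,) detector softmax over seen classes for one region
    seen_word_vecs:   (S, D) word embeddings of seen classes
    unseen_word_vecs: (U, D) word embeddings of unseen classes
    """
    top = np.argsort(seen_probs)[-top_k:]          # most confident seen classes
    w = seen_probs[top] / seen_probs[top].sum()    # convex weights (sum to 1)
    region_emb = w @ seen_word_vecs[top]           # (D,) convex combination
    # cosine similarity against unseen class embeddings
    sims = (unseen_word_vecs @ region_emb) / (
        np.linalg.norm(unseen_word_vecs, axis=1) * np.linalg.norm(region_emb) + 1e-8)
    return int(np.argmax(sims)), sims
```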
Zero-Shot Object Detection with Textual Descriptions
Object detection is important in real-world applications. Existing methods mainly focus on object detection with sufficient labelled training data, or on zero-shot object detection with only concept names. In this paper, we address the challenging problem of zero-shot object detection with natural language descriptions, which aims to simultaneously detect and recognize novel concept instances from textual descriptions. We propose a novel deep learning framework to jointly learn visual units, visual-unit attention and word-level attention, which are combined to compute word-proposal affinity via element-wise multiplication. To the best of our knowledge, this is the first work on zero-shot object detection with textual descriptions. Since there is no directly related work in the literature, we investigate plausible solutions based on existing zero-shot object detection methods for a fair comparison. We conduct extensive experiments on three challenging benchmark datasets, and the results confirm the superiority of the proposed model.
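One plausible reading of the word-proposal affinity described above: attend over a proposal's visual units, couple the attended feature with each description word by element-wise multiplication, and pool with word-level attention. The sketch below follows that reading; the shapes and pooling choices are our assumptions, not the paper's exact architecture.

```python
import torch

def word_proposal_affinity(visual_units, unit_attn, word_feats, word_attn):
    """Score how well each description word matches a proposal's
    attended visual units, then pool over words.

    visual_units: (P, K, D)  K visual-unit features per proposal
    unit_attn:    (P, K)     attention over visual units (softmaxed)
    word_feats:   (W, D)     word embeddings of the description
    word_attn:    (W,)       word-level attention (softmaxed)
    Returns (P,) proposal-description affinity scores.
    """
    attended = torch.einsum('pk,pkd->pd', unit_attn, visual_units)        # (P, D)
    # element-wise multiplication couples visual and word channels
    affinity = (attended.unsqueeze(1) * word_feats.unsqueeze(0)).sum(-1)  # (P, W)
    return (affinity * word_attn.unsqueeze(0)).sum(-1)                    # (P,)
```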
Segment Any Change
Visual foundation models have achieved remarkable results in zero-shot image
classification and segmentation, but zero-shot change detection remains an open
problem. In this paper, we propose Segment Any Change (AnyChange), a new type of change detection model that supports zero-shot prediction and generalization to unseen change types and data distributions. AnyChange is built on the Segment Anything Model (SAM) via our training-free adaptation method, bitemporal latent matching. By revealing and exploiting intra-image and inter-image semantic similarities in SAM's latent space, bitemporal latent matching endows SAM with zero-shot change detection capabilities in a training-free way. We also propose a point query mechanism to enable AnyChange's zero-shot object-centric change detection capability. We perform extensive experiments to confirm the effectiveness of AnyChange for zero-shot change detection. AnyChange sets a new record on the SECOND benchmark for unsupervised change detection, exceeding the previous SOTA by up to 4.4% F1 score, and achieves comparable accuracy with negligible manual annotation (1 pixel per image) for supervised change detection.
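As a rough sketch of what bitemporal latent matching could look like, the snippet below pools a latent embedding for each SAM mask at both time points and flags masks whose embeddings disagree across time. The pooling, threshold, and interface are assumptions for illustration, not the released AnyChange implementation.

```python
import numpy as np

def bitemporal_latent_matching(masks_t1, emb_t1, emb_t2, thresh=0.5):
    """Training-free change-proposal scoring in the spirit of AnyChange.

    masks_t1: list of binary masks proposed by SAM on the image at time t1
    emb_t1:   (N, D) latent embedding per mask at t1 (pooled from SAM's
              image embedding inside each mask)
    emb_t2:   (N, D) embedding of the same mask region pooled at t2
    Returns the masks whose latents disagree across time (likely changes).
    """
    a = emb_t1 / (np.linalg.norm(emb_t1, axis=1, keepdims=True) + 1e-8)
    b = emb_t2 / (np.linalg.norm(emb_t2, axis=1, keepdims=True) + 1e-8)
    sim = (a * b).sum(axis=1)                 # cosine similarity per mask
    return [m for m, s in zip(masks_t1, sim) if s < thresh]
```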
Frustratingly Simple but Effective Zero-shot Detection and Segmentation: Analysis and a Strong Baseline
Methods for object detection and segmentation often require abundant
instance-level annotations for training, which are time-consuming and expensive
to collect. To address this, the task of zero-shot object detection (or
segmentation) aims at learning effective methods for identifying and localizing
object instances for the categories that have no supervision available.
Constructing architectures for these tasks requires choosing from a myriad of
design options, ranging from the form of the class encoding used to transfer
information from seen to unseen categories, to the nature of the function being
optimized for learning. In this work, we extensively study these design choices, and carefully construct a simple yet extremely effective zero-shot recognition method. Through extensive experiments on object detection and segmentation on the MSCOCO dataset, we highlight that our proposed method outperforms existing, considerably more complex, architectures. Our findings and method, which we propose as a competitive future baseline, point towards the need to revisit some of the recent design trends in zero-shot detection and segmentation.
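A representative instance of the simple design space the abstract alludes to is a linear projection of region features into a fixed semantic-embedding space, scored by cosine similarity against seen and unseen class vectors. The sketch below is illustrative of that design space, not the paper's exact head.

```python
import torch
import torch.nn.functional as F

class SemanticClassHead(torch.nn.Module):
    """Simple zero-shot classification head: project region features into
    the class-embedding space and score by similarity."""

    def __init__(self, feat_dim, emb_dim, class_embs):
        super().__init__()
        self.proj = torch.nn.Linear(feat_dim, emb_dim)
        # fixed semantic embeddings (e.g. word vectors), seen + unseen classes
        self.register_buffer('class_embs', F.normalize(class_embs, dim=1))

    def forward(self, region_feats, temperature=0.07):
        z = F.normalize(self.proj(region_feats), dim=1)
        return z @ self.class_embs.t() / temperature   # (R, C) logits
```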
ZiCo-BC: A Bias Corrected Zero-Shot NAS for Vision Tasks
Zero-Shot Neural Architecture Search (NAS) approaches propose novel training-free metrics, called zero-shot proxies, to substantially reduce search time compared to traditional training-based NAS. Despite their success on image classification, the effectiveness of zero-shot proxies is rarely evaluated on complex vision tasks such as semantic segmentation and object detection. Moreover, existing zero-shot proxies are shown to be biased towards certain model characteristics, which restricts their broad applicability. In this paper, we empirically study the bias of the state-of-the-art (SOTA) zero-shot proxy ZiCo across multiple vision tasks and observe that ZiCo is biased towards thinner and deeper networks, leading to sub-optimal architectures. To address this, we propose a novel bias correction for ZiCo, called ZiCo-BC. Our extensive experiments across various vision tasks (image classification, object detection and semantic segmentation) show that our approach can successfully search for architectures with higher accuracy and significantly lower latency on Samsung Galaxy S10 devices.
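For orientation, a ZiCo-style proxy rewards layers whose gradients have high mean and low variance across a few minibatches; a bias correction can then penalize the network shapes the proxy over-favors. The sketch below follows that recipe, with the `correction` term left as a hypothetical stand-in, since the paper's exact correction is not given in this abstract.

```python
import torch

def zico_proxy(model, loss_fn, batches, correction=None):
    """ZiCo-style zero-shot proxy: score a network by the mean/std ratio
    of its per-parameter gradients over a few minibatches (use >= 2).
    `correction` is a hypothetical per-network penalty (e.g. on a
    depth/width statistic) standing in for the paper's bias correction.
    """
    grads = {}  # parameter name -> list of |grad| tensors, one per batch
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                grads.setdefault(name, []).append(p.grad.detach().abs())
    score = 0.0
    for gs in grads.values():
        g = torch.stack(gs)                       # (B, *param_shape)
        mean, std = g.mean(0), g.std(0) + 1e-8
        score += torch.log((mean / std).sum() + 1e-8).item()
    return score if correction is None else score - correction(model)
```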
Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting
We study multimodal few-shot object detection (FSOD) in this paper, using
both few-shot visual examples and class semantic information for detection.
Most previous works focus on either few-shot or zero-shot object detection, ignoring the complementarity of visual and semantic information. We first show that meta-learning and prompt-based learning, the most commonly used methods for few-shot learning and for zero-shot transfer from pre-trained vision-language models to downstream tasks, are conceptually similar. Both reformulate the objective of downstream tasks to match that of the pre-training tasks, mostly without tuning the parameters of the pre-trained models. Based on this observation, we propose to combine meta-learning with prompt-based learning for multimodal FSOD without fine-tuning, by learning transferable class-agnostic multimodal FSOD models over many-shot base classes. Specifically, to better exploit the pre-trained vision-language models, we propose meta-learning-based cross-modal prompting to generate soft prompts, which are further used to extract the semantic prototype, conditioned on the few-shot visual examples. The extracted semantic prototype and the few-shot visual prototype are then fused to generate the multimodal prototype for detection. Our models can efficiently fuse visual and semantic information at both the token level and the feature level. We comprehensively evaluate the proposed multimodal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
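As one simple instance of the feature-level fusion described above, the few-shot visual prototype and the prompt-derived semantic prototype can be mixed with a learned per-class gate. The gating mechanism here is our assumption for illustration, not necessarily the paper's fusion module.

```python
import torch
import torch.nn.functional as F

def fuse_prototypes(visual_protos, semantic_protos, gate_mlp):
    """Fuse few-shot visual prototypes with semantic prototypes via a
    learned gate to obtain multimodal prototypes for detection.

    visual_protos, semantic_protos: (C, D) per-class prototypes
    gate_mlp: module mapping (C, 2D) -> (C, D) gating logits
    """
    g = torch.sigmoid(gate_mlp(torch.cat([visual_protos, semantic_protos], -1)))
    fused = g * visual_protos + (1 - g) * semantic_protos
    return F.normalize(fused, dim=-1)   # cosine-classifier-ready prototypes
```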
MAEDAY: MAE for few and zero shot AnomalY-Detection
We propose using a Masked Auto-Encoder (MAE), a transformer model trained in a self-supervised manner on image inpainting, for anomaly detection (AD), under the assumption that anomalous regions are harder to reconstruct than normal ones. MAEDAY is the first image-reconstruction-based anomaly detection method that utilizes a pre-trained model, enabling its use for Few-Shot Anomaly Detection (FSAD). We also show that the same method works surprisingly well for the novel tasks of Zero-Shot AD (ZSAD) and Zero-Shot Foreign Object Detection (ZSFOD), where no normal samples are available. Code is available at https://github.com/EliSchwartz/MAEDAY.
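The underlying scoring idea lends itself to a short sketch: mask the image several times, let a pre-trained MAE inpaint it, and read high reconstruction error as anomaly. The `mae(image, mask_ratio=...)` interface returning a reconstruction and a pixel mask is an assumed wrapper for illustration, not the repository's actual API.

```python
import torch

def mae_anomaly_map(mae, image, n_rounds=8, mask_ratio=0.75):
    """Reconstruction-error anomaly map in the spirit of MAEDAY: regions a
    pre-trained MAE reconstructs poorly are flagged as anomalous.
    Averaging over several random maskings covers the whole image.

    image: (1, 3, H, W) tensor; returns an (H, W) anomaly heat map.
    """
    err = torch.zeros_like(image[:, 0])            # (1, H, W) accumulators
    cnt = torch.zeros_like(err)
    with torch.no_grad():
        for _ in range(n_rounds):
            recon, mask = mae(image, mask_ratio=mask_ratio)  # mask: (1,1,H,W)
            e = ((recon - image) ** 2).mean(1)               # per-pixel error
            err += e * mask[:, 0]                  # score only masked pixels
            cnt += mask[:, 0]
    return (err / cnt.clamp(min=1)).squeeze(0)
```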