What Can Human Sketches Do for Object Detection?
Sketches are highly expressive, inherently capturing subjective and
fine-grained visual cues. The exploration of such innate properties of human
sketches has, however, been limited to that of image retrieval. In this paper,
for the first time, we cultivate the expressiveness of sketches but for the
fundamental vision task of object detection. The end result is a sketch-enabled
object detection framework that detects based on what \textit{you} sketch --
\textit{that} ``zebra'' (e.g., one that is eating the grass) in a herd of
zebras (instance-aware detection), and only the \textit{part} (e.g., ``head''
of a ``zebra'') that you desire (part-aware detection). We further dictate that our
model works (i) without knowing which category to expect at testing
(zero-shot), and (ii) without requiring additional bounding boxes (as per
fully supervised) or class labels (as per weakly supervised). Instead of devising a model from the
ground up, we show an intuitive synergy between foundation models (e.g., CLIP)
and existing sketch models built for sketch-based image retrieval (SBIR), which
can already elegantly solve the task -- CLIP to provide model generalisation,
and SBIR to bridge the (sketch → photo) gap. In particular, we first
perform independent prompting on both sketch and photo branches of an SBIR
model to build highly generalisable sketch and photo encoders on the back of
the generalisation ability of CLIP. We then devise a training paradigm to adapt
the learned encoders for object detection, such that the region embeddings of
detected boxes are aligned with the sketch and photo embeddings from SBIR.
Evaluated on standard object detection datasets like PASCAL-VOC and MS-COCO,
our framework outperforms both supervised (SOD) and weakly-supervised (WSOD)
object detectors in zero-shot setups. Project Page:
\url{https://pinakinathc.github.io/sketch-detect}
Comment: Accepted as one of the Top 12 Best Papers; will be presented in
special single-track plenary sessions to all attendees at Computer Vision and
Pattern Recognition (CVPR), 2023. Project Page: www.pinakinathc.me/sketch-detec
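The core alignment step above (region embeddings of detected boxes matched to
the sketch and photo embeddings from SBIR) can be illustrated with a minimal
PyTorch sketch. This assumes a symmetric InfoNCE-style contrastive objective
and hypothetical tensor names; it is an illustration of the general technique,
not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def region_sketch_alignment(region_embs, sketch_embs, temperature=0.07):
        # region_embs: (N, D) embeddings of detected boxes from the detector head
        # sketch_embs: (N, D) embeddings from the CLIP-prompted SBIR sketch branch;
        # row i of both tensors is assumed to describe the same object instance
        region_embs = F.normalize(region_embs, dim=-1)
        sketch_embs = F.normalize(sketch_embs, dim=-1)
        logits = region_embs @ sketch_embs.t() / temperature  # cosine similarities
        targets = torch.arange(region_embs.size(0), device=logits.device)
        # symmetric contrastive loss: matched (region, sketch) pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

In practice the same loss would also be computed against the photo-branch
embeddings, mirroring the joint sketch/photo alignment the abstract describes.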
Doodle to Search: Practical Zero-Shot Sketch-based Image Retrieval
In this paper, we investigate the problem of zero-shot sketch-based image
retrieval (ZS-SBIR), where human sketches are used as queries to conduct
retrieval of photos from unseen categories. Importantly, we advance prior art
by proposing a novel ZS-SBIR scenario that represents a firm step forward in
its practical application. The new setting uniquely recognizes two important
yet often neglected challenges of practical ZS-SBIR, (i) the large domain gap
between amateur sketch and photo, and (ii) the necessity for moving towards
large-scale retrieval. We first contribute to the community a novel ZS-SBIR
dataset, QuickDraw-Extended, that consists of 330,000 sketches and 204,000
photos spanning 110 categories. Highly abstract amateur human sketches
are purposefully sourced to maximize the domain gap, unlike the often
semi-photorealistic sketches found in existing datasets. We then formulate a
ZS-SBIR framework to jointly model sketches and photos into a common embedding
space. A novel strategy to mine the mutual information among domains is
specifically engineered to alleviate the domain gap. External semantic
knowledge is further embedded to aid semantic transfer. We show that, rather
surprisingly, retrieval performance that significantly outperforms the
state-of-the-art on existing datasets can already be achieved using a reduced
version of our model. We further demonstrate the superior performance of our
full model by comparing with a number of alternatives on the newly proposed
dataset. The new dataset, plus all training and testing code of our model,
will be publicly released to facilitate future research.
Comment: Oral paper at CVPR 2019
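As a rough illustration of the joint-embedding formulation above, the
following is a minimal sketch assuming a standard triplet ranking term plus a
semantic term that ties both modalities to an external word embedding of the
category. The paper's actual mutual-information mining between domains is more
involved, and all names here are hypothetical.

    import torch.nn.functional as F

    def zs_sbir_loss(sketch_emb, photo_pos, photo_neg, word_emb,
                     margin=0.2, lam=0.1):
        # sketch_emb, photo_pos, photo_neg: (B, D) vectors in the common space;
        # word_emb: (B, D) external semantic vectors (e.g. word2vec) for each
        # category, assumed pre-projected to the same dimensionality D
        rank = F.triplet_margin_loss(sketch_emb, photo_pos, photo_neg,
                                     margin=margin)
        # pull both modalities towards the semantic vector to aid zero-shot transfer
        sem = F.mse_loss(sketch_emb, word_emb) + F.mse_loss(photo_pos, word_emb)
        return rank + lam * sem

The ranking term alleviates the sketch-photo domain gap directly, while the
semantic term supplies the external knowledge needed to generalise to unseen
categories.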
Learning Cross-Modal Deep Embeddings for Multi-Object Image Retrieval using Text and Sketch
In this work we introduce a cross-modal image retrieval system that allows
both text and sketch as input modalities for the query. A cross-modal deep
network architecture is formulated to jointly model the sketch and text input
modalities as well as the image output modality, learning a common
embedding between text and images and between sketches and images. In addition,
an attention model is used to selectively focus the attention on the different
objects of the image, allowing for retrieval with multiple objects in the
query. Experiments show that the proposed method performs best in both
single- and multi-object image retrieval on standard datasets.
Comment: Accepted at ICPR 2018
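The attention model described above, which lets the image embedding focus on
different objects depending on the query, might look roughly like the
following attention-pooling layer over convolutional image features. This is a
hedged sketch with hypothetical module names, not the authors' exact
architecture.

    import torch
    import torch.nn as nn

    class AttentivePool(nn.Module):
        # pools a (B, C, H, W) feature map into a (B, dim) embedding, weighting
        # spatial locations so salient objects dominate the pooled vector
        def __init__(self, channels, dim):
            super().__init__()
            self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-location logit
            self.proj = nn.Linear(channels, dim)                # map to joint space

        def forward(self, feat):                    # feat: (B, C, H, W)
            w = self.score(feat)                    # (B, 1, H, W)
            w = torch.softmax(w.flatten(2), dim=-1) # normalise over H*W locations
            pooled = (feat.flatten(2) * w).sum(-1)  # (B, C) attention-weighted average
            return self.proj(pooled)                # (B, dim) image embedding

The pooled embedding would then be trained against text and sketch embeddings
with a common ranking loss, yielding the shared text-image and sketch-image
embedding spaces the abstract describes.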