10 research outputs found
Generalising Fine-Grained Sketch-Based Image Retrieval
Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of matching a specific photo instance using a free-hand sketch as the query modality. Existing models aim to learn an embedding space in which sketch and photo can be directly compared. While successful, they require instance-level pairing within each coarse-grained category as annotated training data. Since the learned embedding space is
domain-specific, these models do not generalise well across categories. This limits the practical applicability of FG-SBIR. In this paper, we identify cross-category generalisation for FG-SBIR as a domain generalisation problem and propose the first solution. Our key contribution is a novel unsupervised learning approach to model a universal manifold of prototypical visual sketch traits. This manifold can then be used to parameterise the learning of a sketch/photo representation. Model adaptation to novel categories then becomes automatic via embedding the novel sketch in the manifold and updating the representation and retrieval function accordingly. Experiments on the two largest FG-SBIR datasets, Sketchy and QMUL-Shoe-V2, demonstrate the efficacy of our approach in enabling cross-category generalisation of FG-SBIR.
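To make the conventional setup concrete: below is a minimal, illustrative PyTorch sketch of the instance-level triplet objective that embedding-space FG-SBIR models typically rely on (not the paper's category-generalisable model; `sketch_net` and `photo_net` are hypothetical encoders).

```python
# Illustrative only: a minimal cross-modal triplet objective of the kind
# conventional FG-SBIR pipelines use. `sketch_net` / `photo_net` are
# hypothetical encoders mapping inputs into a shared embedding space.
import torch.nn.functional as F

def fg_sbir_triplet_loss(sketch_emb, pos_photo_emb, neg_photo_emb, margin=0.2):
    """Pull a sketch towards its paired photo, push it from a non-matching photo."""
    d_pos = F.pairwise_distance(sketch_emb, pos_photo_emb)  # distance to the true photo
    d_neg = F.pairwise_distance(sketch_emb, neg_photo_emb)  # distance to a distractor
    return F.relu(d_pos - d_neg + margin).mean()

# Usage (hypothetical encoders):
# loss = fg_sbir_triplet_loss(sketch_net(sketch), photo_net(photo_pos), photo_net(photo_neg))
```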
Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval
Sketch as an image search query is an ideal alternative to text in capturing
the fine-grained visual details. Prior successes on fine-grained sketch-based
image retrieval (FG-SBIR) have demonstrated the importance of tackling the
unique traits of sketches as opposed to photos, e.g., temporal vs. static,
strokes vs. pixels, and abstract vs. pixel-perfect. In this paper, we study a
further trait of sketches that has been overlooked to date, that is, they are
hierarchical in terms of the levels of detail -- a person typically sketches to
varying extents of detail to depict an object. This hierarchical structure
is often visually distinct. In this paper, we design a novel network that is
capable of cultivating sketch-specific hierarchies and exploiting them to match
sketch with photo at corresponding hierarchical levels. In particular, features
from a sketch and a photo are enriched using cross-modal co-attention, coupled
with hierarchical node fusion at every level to form a better embedding space
to conduct retrieval. Experiments on common benchmarks show our method
outperforms the state of the art by a significant margin.
Comment: Accepted for ORAL presentation in BMVC 2020
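As a rough illustration of the cross-modal enrichment idea (a simplification, not the authors' hierarchical network), the sketch below lets sketch tokens attend over photo tokens and vice versa before pooling into retrieval embeddings; all module names and dimensions are assumptions.

```python
# Simplified co-attention sketch (not the authors' hierarchical network):
# each modality's tokens are enriched with context from the other modality.
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.s2p = nn.MultiheadAttention(dim, heads, batch_first=True)  # sketch attends to photo
        self.p2s = nn.MultiheadAttention(dim, heads, batch_first=True)  # photo attends to sketch

    def forward(self, sketch_tokens, photo_tokens):
        s_enriched, _ = self.s2p(sketch_tokens, photo_tokens, photo_tokens)
        p_enriched, _ = self.p2s(photo_tokens, sketch_tokens, sketch_tokens)
        # Pool to single embeddings for distance-based retrieval.
        return s_enriched.mean(dim=1), p_enriched.mean(dim=1)

# co_att = CoAttention()
# s_emb, p_emb = co_att(torch.randn(8, 50, 256), torch.randn(8, 196, 256))
```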
Sketch Less for More: On-the-Fly Fine-Grained Sketch Based Image Retrieval
Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of
retrieving a particular photo instance given a user's query sketch. Its
widespread applicability is however hindered by the fact that drawing a sketch
takes time, and most people struggle to draw a complete and faithful sketch. In
this paper, we reformulate the conventional FG-SBIR framework to tackle these
challenges, with the ultimate goal of retrieving the target photo with the
least number of strokes possible. We further propose an on-the-fly design that
starts retrieving as soon as the user starts drawing. To accomplish this, we
devise a reinforcement learning-based cross-modal retrieval framework that
directly optimizes the rank of the ground-truth photo over a complete sketch
drawing episode. Additionally, we introduce a novel reward scheme that
circumvents the problems related to irrelevant sketch strokes, and thus
provides us with a more consistent rank list during retrieval. We achieve
superior early-retrieval efficiency over state-of-the-art methods and
alternative baselines on two publicly available fine-grained sketch retrieval
datasets.
Comment: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020
[Oral Presentation] Code: https://github.com/AyanKumarBhunia/on-the-fly-FGSBIR
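A hedged sketch of the episode-level ranking signal described above (the paper's actual reward design differs in detail): after each partial sketch, the ground-truth photo is ranked against the gallery and earlier, higher ranks are rewarded. All function and variable names are illustrative.

```python
# Illustrative episode-level ranking reward (names and reward shape are assumptions).
import torch

def rank_of_ground_truth(sketch_emb, gallery_embs, gt_index):
    """1-based rank of the ground-truth photo for one partial-sketch embedding."""
    dists = torch.cdist(sketch_emb.unsqueeze(0), gallery_embs).squeeze(0)  # (num_gallery,)
    order = dists.argsort()                                                # ascending distance
    return (order == gt_index).nonzero(as_tuple=True)[0].item() + 1

def episode_reward(partial_sketch_embs, gallery_embs, gt_index):
    """Accumulate an inverse-rank reward over every step of the drawing episode."""
    return sum(1.0 / rank_of_ground_truth(e, gallery_embs, gt_index)
               for e in partial_sketch_embs)
```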
Data-Free Sketch-Based Image Retrieval
Rising concerns about privacy and anonymity preservation of deep learning
models have facilitated research in data-free learning (DFL). For the first
time, we identify that for data-scarce tasks like Sketch-Based Image Retrieval
(SBIR), where the difficulty in acquiring paired photos and hand-drawn sketches
limits data-dependent cross-modal learning algorithms, DFL can prove to be a
much more practical paradigm. We thus propose Data-Free (DF)-SBIR, where,
unlike existing DFL problems, pre-trained, single-modality classification
models have to be leveraged to learn a cross-modal metric-space for retrieval
without access to any training data. The widespread availability of pre-trained
classification models, along with the difficulty in acquiring paired
photo-sketch datasets for SBIR, justifies the practicality of this setting. We
present a methodology for DF-SBIR, which can leverage knowledge from models
independently trained to perform classification on photos and sketches. We
evaluate our model on the Sketchy, TU-Berlin, and QuickDraw benchmarks,
designing a variety of baselines based on state-of-the-art DFL literature, and
observe that our method surpasses all of them by significant margins. Our
method also achieves mAPs competitive with data-dependent approaches, all the
while requiring no training data. Implementation is available at
\url{https://github.com/abhrac/data-free-sbir}.
Comment: Computer Vision and Pattern Recognition (CVPR) 2023
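Purely as a schematic of the data-free setting (the paper's DF-SBIR methodology is more involved), the sketch below drives a cross-modal student with generator-synthesised inputs and distils from frozen photo and sketch classifiers; `generator`, `photo_teacher`, `sketch_teacher`, and `student` are hypothetical placeholders.

```python
# Schematic DF-SBIR training step (all modules are hypothetical placeholders;
# the paper's actual procedure is more involved).
import torch
import torch.nn.functional as F

def df_sbir_step(generator, photo_teacher, sketch_teacher, student, z_dim=128, batch=32):
    z = torch.randn(batch, z_dim)
    fake_photos, fake_sketches = generator(z)               # synthesised surrogate data

    with torch.no_grad():                                   # teachers stay frozen
        t_photo = photo_teacher(fake_photos).softmax(dim=-1)
        t_sketch = sketch_teacher(fake_sketches).softmax(dim=-1)

    # The student maps both modalities into one space; here its outputs double
    # as class logits for distillation and as embeddings for alignment.
    s_photo, s_sketch = student(fake_photos, fake_sketches)

    kd = F.kl_div(s_photo.log_softmax(dim=-1), t_photo, reduction="batchmean") \
       + F.kl_div(s_sketch.log_softmax(dim=-1), t_sketch, reduction="batchmean")
    align = (1 - F.cosine_similarity(s_photo, s_sketch)).mean()
    return kd + align
```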
What Can Human Sketches Do for Object Detection?
Sketches are highly expressive, inherently capturing subjective and
fine-grained visual cues. The exploration of such innate properties of human
sketches has, however, been limited to that of image retrieval. In this paper,
for the first time, we cultivate the expressiveness of sketches but for the
fundamental vision task of object detection. The end result is a sketch-enabled
object detection framework that detects based on what \textit{you} sketch --
\textit{that} ``zebra'' (e.g., one that is eating the grass) in a herd of
zebras (instance-aware detection), and only the \textit{part} (e.g., ``head'' of
a ``zebra'') that you desire (part-aware detection). We further dictate that our
model works (i) without knowing which category to expect at testing (zero-shot)
and (ii) without requiring additional bounding boxes (as per fully supervised) or
class labels (as per weakly supervised). Instead of devising a model from the
ground up, we show an intuitive synergy between foundation models (e.g., CLIP)
and existing sketch models built for sketch-based image retrieval (SBIR), which
can already elegantly solve the task -- CLIP to provide model generalisation,
and SBIR to bridge the (sketch-photo) gap. In particular, we first
perform independent prompting on both sketch and photo branches of an SBIR
model to build highly generalisable sketch and photo encoders on the back of
the generalisation ability of CLIP. We then devise a training paradigm to adapt
the learned encoders for object detection, such that the region embeddings of
detected boxes are aligned with the sketch and photo embeddings from SBIR.
Evaluated on standard object detection datasets such as PASCAL-VOC
and MS-COCO, our framework outperforms both supervised (SOD) and weakly-supervised
object detectors (WSOD) in zero-shot setups. Project Page:
\url{https://pinakinathc.github.io/sketch-detect}
Comment: Accepted as Top 12 Best Papers. Will be presented in special
single-track plenary sessions to all attendees at Computer Vision and Pattern
Recognition (CVPR), 2023. Project Page: www.pinakinathc.me/sketch-detect
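A minimal, assumption-heavy sketch of the alignment idea at inference time (not the paper's detector): score a detector's region embeddings against a sketch query embedding and keep the best-matching boxes. `sketch_encoder`, `region_embs`, and `boxes` are hypothetical.

```python
# Hypothetical scoring of detector regions against a sketch query embedding.
import torch
import torch.nn.functional as F

def sketch_guided_detection(sketch_emb, region_embs, boxes, top_k=5):
    """Rank candidate boxes by similarity between their embeddings and the sketch query."""
    sims = F.cosine_similarity(region_embs, sketch_emb.unsqueeze(0), dim=-1)  # (num_regions,)
    keep = sims.topk(min(top_k, sims.numel())).indices
    return boxes[keep], sims[keep]

# boxes_kept, scores = sketch_guided_detection(sketch_encoder(query_sketch),
#                                              region_embs, boxes)
```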
CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not
In this paper, we leverage CLIP for zero-shot sketch based image retrieval
(ZS-SBIR). We are largely inspired by recent advances on foundation models and
the unparalleled generalisation ability they seem to offer, but for the first
time tailor it to benefit the sketch community. We put forward novel designs on
how best to achieve this synergy, for both the category setting and the
fine-grained setting ("all"). At the very core of our solution is a prompt
learning setup. First, we show that just by factoring in sketch-specific prompts, we
already have a category-level ZS-SBIR system that surpasses all prior art by
a large margin (24.8%) - a strong testimony to the value of studying the CLIP and ZS-SBIR
synergy. Moving on to the fine-grained setup is, however, trickier, and requires a
deeper dive into this synergy. For that, we come up with two specific designs
to tackle the fine-grained matching nature of the problem: (i) an additional
regularisation loss to ensure the relative separation between sketches and
photos is uniform across categories, which is not the case for the gold
standard standalone triplet loss, and (ii) a clever patch shuffling technique
to help establish instance-level structural correspondences between
sketch-photo pairs. With these designs, we again observe significant
performance gains in the region of 26.9% over the previous state of the art. The
take-home message, if any, is that the proposed CLIP and prompt learning paradigm
carries great promise in tackling other sketch-related tasks (not limited to
ZS-SBIR) where data scarcity remains a great challenge. Project page:
https://aneeshan95.github.io/Sketch_LVM/
Comment: Accepted in CVPR 2023. Project page available at
https://aneeshan95.github.io/Sketch_LVM
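To illustrate the patch-shuffling idea only (details differ from the paper), the sketch below applies the same random patch permutation to a sketch and its paired photo, so a model can only match the pair by respecting instance-level structure; the grid size and input shapes are assumptions.

```python
# Illustrative paired patch shuffling: the same permutation is applied to a
# sketch and its photo (assumes H and W are divisible by `grid`).
import torch

def paired_patch_shuffle(sketch, photo, grid=3):
    b, c, h, w = sketch.shape
    ph, pw = h // grid, w // grid
    perm = torch.randperm(grid * grid)  # one permutation shared by both modalities

    def shuffle(img):
        patches = img.unfold(2, ph, ph).unfold(3, pw, pw)        # (b, c, grid, grid, ph, pw)
        patches = patches.reshape(b, c, grid * grid, ph, pw)[:, :, perm]
        patches = patches.reshape(b, c, grid, grid, ph, pw)
        return patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, grid * ph, grid * pw)

    return shuffle(sketch), shuffle(photo)
```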