Deep Sketch Hashing: Fast Free-hand Sketch-Based Image Retrieval
Free-hand sketch-based image retrieval (SBIR) is a specific cross-view
retrieval task, in which queries are abstract and ambiguous sketches while the
retrieval database is formed with natural images. Work in this area mainly
focuses on extracting representative and shared features for sketches and
natural images. However, these can neither cope well with the geometric
distortion between sketches and images nor be feasible for large-scale SBIR due
to the heavy continuous-valued distance computation. In this paper, we speed up
SBIR by introducing a novel binary coding method, named Deep Sketch
Hashing (DSH), where a semi-heterogeneous deep architecture is proposed and
incorporated into an end-to-end binary coding framework. Specifically, three
convolutional neural networks are utilized to encode free-hand sketches,
natural images and, especially, the auxiliary sketch-tokens which are adopted
as bridges to mitigate the sketch-image geometric distortion. The learned DSH
codes can effectively capture the cross-view similarities as well as the
intrinsic semantic correlations between different categories. To the best of
our knowledge, DSH is the first hashing work specifically designed for
category-level SBIR with an end-to-end deep architecture. The proposed DSH is
comprehensively evaluated on two large-scale datasets of TU-Berlin Extension
and Sketchy, and the experiments consistently show DSH's superior SBIR
accuracies over several state-of-the-art methods, while achieving significantly
reduced retrieval time and memory footprint.
Comment: This paper will appear as a spotlight paper in CVPR201
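The speed and memory argument for binary codes can be illustrated with a minimal sketch. It does not reproduce the DSH networks: the feature vectors are toy stand-ins for network outputs, and the helper names (`binarize`, `hamming`, `rank_by_hamming`) as well as the sign-thresholding rule are assumptions made for illustration. It only shows why comparing packed bit codes is cheaper than continuous-valued distance computation.

```python
def binarize(features):
    """Pack the signs of a real-valued vector into one int (1 bit per dimension)."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two packed codes (XOR, then popcount)."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code, database_codes):
    """Database indices sorted by Hamming distance to the query (ties keep index order)."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )


# Toy stand-ins for sketch and image features (hypothetical values).
sketch_feature = [0.9, -0.2, 0.4, -0.7]
image_features = [
    [-0.5, 0.3, -0.1, 0.8],  # every sign differs from the query
    [0.7, -0.1, 0.2, -0.9],  # same sign pattern as the query
    [0.6, 0.4, 0.1, -0.3],   # one sign differs
]

query = binarize(sketch_feature)
database = [binarize(f) for f in image_features]
ranking = rank_by_hamming(query, database)
```

Each database entry here occupies a single integer, and a comparison costs one XOR plus a bit count, which is the source of the reduced retrieval time and memory footprint that hashing methods aim for.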
Sketch-based Video Object Localization
We introduce Sketch-based Video Object Localization (SVOL), a new task aimed
at localizing spatio-temporal object boxes in video queried by the input
sketch. We first outline the challenges in the SVOL task and build the
Sketch-Video Attention Network (SVANet) with the following design principles:
(i) to consider temporal information of video and bridge the domain gap between
sketch and video; (ii) to accurately identify and localize multiple objects
simultaneously; (iii) to handle various styles of sketches; (iv) to be
classification-free. In particular, SVANet is equipped with a Cross-modal
Transformer that models the interaction between learnable object tokens, query
sketch, and video through attention operations, and learns upon a per-frame set
matching strategy that enables frame-wise prediction while utilizing global
video context. We evaluate SVANet on a newly curated SVOL dataset. By design,
SVANet successfully learns the mapping between the query sketches and video
objects, achieving state-of-the-art results on the SVOL benchmark. We further
confirm the effectiveness of SVANet via extensive ablation studies and
visualizations. Lastly, we demonstrate its transfer capability on unseen
datasets and novel categories, suggesting its high scalability in real-world
application
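As a rough illustration of per-frame set matching, the sketch below assigns each ground-truth box in a frame to a distinct predicted box so that the total cost is minimal, and repeats this independently for every frame. It is not SVANet's implementation: the L1 box cost, the brute-force search over permutations, and the function names are all assumptions chosen to keep the example small and dependency-free.

```python
from itertools import permutations


def l1_box_cost(pred, target):
    """L1 distance between two boxes given as (x1, y1, x2, y2). Assumed cost."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_frame(pred_boxes, target_boxes):
    """Assign each target to a distinct prediction with minimal total cost.

    Brute force over permutations, so only suitable for small sets.
    Requires len(pred_boxes) >= len(target_boxes).
    Returns (assignment, cost) where assignment[t] is the prediction index
    matched to target t.
    """
    best_cost, best_assignment = float("inf"), ()
    for perm in permutations(range(len(pred_boxes)), len(target_boxes)):
        cost = sum(
            l1_box_cost(pred_boxes[p], target_boxes[t])
            for t, p in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_assignment, best_cost


def match_video(pred_per_frame, target_per_frame):
    """Run the matching independently on every frame."""
    return [
        match_frame(preds, targets)[0]
        for preds, targets in zip(pred_per_frame, target_per_frame)
    ]


# Hypothetical single frame: three predictions, two ground-truth objects.
preds = [[0, 0, 1, 1], [5, 5, 6, 6], [2, 2, 3, 3]]
targets = [[5, 5, 6, 6], [0, 0, 1, 1]]
assignment, cost = match_frame(preds, targets)
```

In practice a polynomial-time assignment solver would replace the permutation search; the brute-force version is used here only because it makes the matching objective explicit.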
Sketch-an-Anchor: Sub-epoch Fast Model Adaptation for Zero-shot Sketch-based Image Retrieval
Sketch-an-Anchor is a novel method to train state-of-the-art Zero-shot
Sketch-based Image Retrieval (ZSSBIR) models in under an epoch. Most studies
break down the problem of ZSSBIR into two parts: domain alignment between
images and sketches, inherited from SBIR, and generalization to unseen data,
inherent to the zero-shot protocol. We argue one of these problems can be
considerably simplified and re-frame the ZSSBIR problem around the
already-stellar yet underexplored Zero-shot Image-based Retrieval performance
of off-the-shelf models. Our fast-converging model keeps the single-domain
performance while learning to extract similar representations from sketches. To
this end we introduce our Semantic Anchors -- guiding embeddings learned from
word-based semantic spaces and features from off-the-shelf models -- and
combine them with our novel Anchored Contrastive Loss. Empirical evidence shows
we can achieve state-of-the-art performance on all benchmark datasets while
training for 100x fewer iterations than other methods
Sketchformer: Transformer-based Representation for Sketched Structure
Sketchformer is a novel transformer-based representation for encoding
free-hand sketches input in a vector form, i.e. as a sequence of strokes.
Sketchformer effectively addresses multiple tasks: sketch classification,
sketch based image retrieval (SBIR), and the reconstruction and interpolation
of sketches. We report several variants exploring continuous and tokenized
input representations, and contrast their performance. Our learned embedding,
driven by a dictionary learning tokenization scheme, yields state of the art
performance in classification and image retrieval tasks, when compared against
baseline representations driven by LSTM sequence to sequence architectures:
SketchRNN and derivatives. We show that sketch reconstruction and interpolation
are improved significantly by the Sketchformer embedding for complex sketches
with longer stroke sequences.
Comment: Accepted for publication at CVPR 202
Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR
This paper advances the fine-grained sketch-based image retrieval (FG-SBIR)
literature by putting forward a strong baseline that overshoots prior
state-of-the-arts by ~11%. This is not via complicated design though, but by
addressing two critical issues facing the community: (i) the gold standard
triplet loss does not enforce holistic latent space geometry, and (ii) there
are never enough sketches to train a high accuracy model. For the former, we
propose a simple modification to the standard triplet loss, that explicitly
enforces separation amongst photos/sketch instances. For the latter, we put
forward a novel knowledge distillation module that can leverage photo data for model
training. Both modules are then plugged into a novel plug-n-playable training
paradigm that allows for more stable training. More specifically, for (i) we
employ an intra-modal triplet loss amongst sketches to bring sketches of the
same instance closer together and away from others, and one more amongst photos to push away
different photo instances while bringing closer a structurally augmented
version of the same photo (offering a gain of ~4-6%). To tackle (ii), we first
pre-train a teacher on the large set of unlabelled photos over the
aforementioned intra-modal photo triplet loss. Then we distill the contextual
similarity present amongst the instances in the teacher's embedding space to
that in the student's embedding space, by matching the distribution over
inter-feature distances of respective samples in both embedding spaces
(delivering a further gain of ~4-5%). Apart from outperforming prior arts
significantly, our model also yields satisfactory results on generalising to
new classes. Project page: https://aneeshan95.github.io/Sketch_PVT/
Comment: Accepted in CVPR 2023.
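The intra-modal photo triplet described in this abstract can be sketched with the standard triplet loss: the anchor is a photo, the positive is an augmented version of the same photo, and the negative is a different photo instance. This is a minimal illustration, not the authors' code: the Euclidean distance, the margin of 0.2, and the toy 2-D embeddings are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is an assumed hyperparameter.
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


# Hypothetical embeddings: a photo, its augmented view, and two other photos.
photo = [0.0, 0.0]
augmented_photo = [0.1, 0.0]
far_other_photo = [1.0, 0.0]
near_other_photo = [0.2, 0.0]

easy_loss = triplet_loss(photo, augmented_photo, far_other_photo)
hard_loss = triplet_loss(photo, augmented_photo, near_other_photo)
```

The loss is zero when the other photo is already farther from the anchor than the augmented view by at least the margin, and positive otherwise, which is what pushes different photo instances apart during training.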