Open Cross-Domain Visual Search
This paper addresses cross-domain visual search, where visual queries
retrieve category samples from a different domain. For example, we may want to
sketch an airplane and retrieve photographs of airplanes. Despite considerable
progress, the search occurs in a closed setting between two pre-defined
domains. In this paper, we take a step towards an open setting where multiple
visual domains are available. This notably translates into a search between any
pair of domains, from a combination of domains or within multiple domains. We
introduce a simple yet effective approach. We formulate the search as a
mapping from every visual domain to a common semantic space, where categories
are represented by hyperspherical prototypes. Open cross-domain visual search
is then performed by searching in the common semantic space, regardless of
which domains are used as source or target. Domains are combined in the common
space to search from or within multiple domains simultaneously. A separate
training of every domain-specific mapping function enables an efficient scaling
to any number of domains without affecting the search performance. We
empirically illustrate our capability to perform open cross-domain visual
search in three different scenarios. Our approach is competitive with respect
to existing closed settings, where we obtain state-of-the-art results on
several benchmarks for three sketch-based search tasks.
Comment: Accepted at Computer Vision and Image Understanding (CVIU)
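The search step described in this abstract can be illustrated with a minimal sketch. It assumes the domain-specific mappings have already been trained and simply shows retrieval by cosine similarity in a shared unit-sphere space. The function names (`l2_normalize`, `search`, `combine_queries`), the toy embeddings, and the mean-then-renormalize rule for combining domains are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def search(query_emb, gallery_emb, top_k=5):
    """Rank gallery items by cosine similarity to the query.

    Both inputs are assumed to already live in the common semantic
    space, so the source and target domains play no role here.
    """
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_emb)
    sims = g @ q
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]


def combine_queries(embs):
    """Assumed combination rule: average the normalized embeddings
    from several domains, then re-normalize."""
    return l2_normalize(np.mean(l2_normalize(embs), axis=0))


# Toy 3-D space with one axis per category (illustrative values only).
gallery = np.array([
    [0.90, 0.10, 0.00],   # photo, category 0
    [0.10, 0.95, 0.05],   # photo, category 1
    [0.05, 0.00, 0.90],   # photo, category 2
    [0.80, 0.00, 0.20],   # photo, category 0
])
sketch_query = np.array([0.7, 0.2, 0.1])    # sketch mapped to the space
clipart_query = np.array([0.6, 0.0, 0.3])   # same category, other domain

idx, scores = search(sketch_query, gallery, top_k=2)
multi_query = combine_queries(np.stack([sketch_query, clipart_query]))
multi_idx, _ = search(multi_query, gallery, top_k=2)
```

Because every domain is mapped independently into the same space, adding a new domain in this sketch only requires a new mapping; the `search` function stays unchanged.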
What's in a Name? Beyond Class Indices for Image Recognition
Existing machine learning models demonstrate excellent performance in image
object recognition after training on a large-scale dataset under full
supervision. However, these models only learn to map an image to a predefined
class index, without revealing the actual semantic meaning of the object in the
image. In contrast, vision-language models like CLIP are able to assign
semantic class names to unseen objects in a 'zero-shot' manner, although they
still rely on a predefined set of candidate names at test time. In this paper,
we reconsider the recognition problem and task a vision-language model to
assign class names to images given only a large and essentially unconstrained
vocabulary of categories as prior information. We use non-parametric methods to
establish relationships between images which allow the model to automatically
narrow down the set of possible candidate names. Specifically, we propose
iteratively clustering the data and voting on class names within them, showing
that this enables a roughly 50% improvement over the baseline on ImageNet.
Furthermore, we tackle this problem both in unsupervised and partially
supervised settings, as well as with a coarse-grained and fine-grained search
space as the unconstrained dictionary.
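The cluster-then-vote idea in this abstract can be sketched as follows. This is a rough illustration under strong assumptions, not the paper's algorithm: image and name embeddings are taken as given (in practice they might come from a vision-language model such as CLIP), clustering is a small spherical k-means with farthest-point initialization, and the helpers `cluster`, `vote_names`, and `name_clusters` as well as the toy data are hypothetical.

```python
import numpy as np


def normalize(x):
    """L2-normalize each row of a 2-D array."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cluster(x, k, n_iter=10):
    """Spherical k-means with deterministic farthest-point init."""
    centers = [x[0]]
    while len(centers) < k:
        sims = np.max(x @ np.stack(centers).T, axis=1)
        centers.append(x[np.argmin(sims)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        assign = np.argmax(x @ centers.T, axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
        centers = normalize(centers)
    return assign


def vote_names(image_emb, text_emb, assign, candidates, keep):
    """Each image votes for its most similar candidate name;
    every cluster keeps its `keep` most-voted names."""
    sims = image_emb @ text_emb[candidates].T
    votes = np.asarray(candidates)[np.argmax(sims, axis=1)]
    kept = {}
    for c in np.unique(assign):
        names, counts = np.unique(votes[assign == c], return_counts=True)
        kept[int(c)] = names[np.argsort(-counts)][:keep].tolist()
    return kept


def name_clusters(image_emb, text_emb, k, rounds=2, keep=2):
    """Alternate voting and vocabulary narrowing for a few rounds."""
    assign = cluster(image_emb, k)
    candidates = list(range(len(text_emb)))
    for r in range(rounds):
        last = r == rounds - 1
        kept = vote_names(image_emb, text_emb, assign, candidates,
                          keep=1 if last else keep)
        # Narrow the vocabulary to the names that survived the vote.
        candidates = sorted({n for ns in kept.values() for n in ns})
    return assign, kept


# Toy data: a 4-word vocabulary and six images (illustrative values).
vocabulary = ["cat", "dog", "car", "tree"]
text_emb = np.eye(4)
image_emb = normalize(np.array([
    [1.0, 0.2, 0.0, 0.0], [0.9, 0.3, 0.1, 0.0], [1.0, 0.1, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.1], [0.1, 0.0, 0.9, 0.2], [0.0, 0.0, 1.0, 0.0],
]))

assign, kept = name_clusters(image_emb, text_emb, k=2)
cluster_names = {c: vocabulary[ns[0]] for c, ns in kept.items()}
```

In this toy run the vocabulary shrinks from four names to the two that win votes, which is the narrowing effect the abstract describes, here at a trivially small scale.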