290 research outputs found
Spreading vectors for similarity search
Discretizing multi-dimensional data distributions is a fundamental step of
modern indexing methods. State-of-the-art techniques learn parameters of
quantizers on training data for optimal performance, thus adapting quantizers
to the data. In this work, we propose to reverse this paradigm and adapt the
data to the quantizer: we train a neural net which last layer forms a fixed
parameter-free quantizer, such as pre-defined points of a hyper-sphere. As a
proxy objective, we design and train a neural network that favors uniformity in
the spherical latent space, while preserving the neighborhood structure after
the mapping. We propose a new regularizer derived from the Kozachenko--Leonenko
differential entropy estimator to enforce uniformity and combine it with a
locality-aware triplet loss. Experiments show that our end-to-end approach
outperforms most learned quantization methods, and is competitive with the
state of the art on widely adopted benchmarks. Furthermore, we show that
training without the quantization step results in almost no difference in
accuracy, but yields a generic catalyzer that can be applied with any
subsequent quantizer.Comment: Published at ICLR 201
Unsupervised Graph-based Rank Aggregation for Improved Retrieval
This paper presents a robust and comprehensive graph-based rank aggregation
approach, used to combine results of isolated ranker models in retrieval tasks.
The method follows an unsupervised scheme, which is independent of how the
isolated ranks are formulated. Our approach is able to combine arbitrary
models, defined in terms of different ranking criteria, such as those based on
textual, image or hybrid content representations.
We reformulate the ad-hoc retrieval problem as a document retrieval based on
fusion graphs, which we propose as a new unified representation model capable
of merging multiple ranks and expressing inter-relationships of retrieval
results automatically. By doing so, we claim that the retrieval system can
benefit from learning the manifold structure of datasets, thus leading to more
effective results. Another contribution is that our graph-based aggregation
formulation, unlike existing approaches, allows for encapsulating contextual
information encoded from multiple ranks, which can be directly used for
ranking, without further computations and post-processing steps over the
graphs. Based on the graphs, a novel similarity retrieval score is formulated
using an efficient computation of minimum common subgraphs. Finally, another
benefit over existing approaches is the absence of hyperparameters.
A comprehensive experimental evaluation was conducted considering diverse
well-known public datasets, composed of textual, image, and multimodal
documents. Performed experiments demonstrate that our method reaches top
performance, yielding better effectiveness scores than state-of-the-art
baseline methods and promoting large gains over the rankers being fused, thus
demonstrating the successful capability of the proposal in representing queries
based on a unified graph-based model of rank fusions
Search in WikiImages using mobile phone
Demonstration will focus on the content based retrieval of Wikipedia images (Hungarian version). A mobile application for iOS will be used to gather images and send directly to the crossmodal processing framework. Searching is implemented in a high performance hybrid index tree with total ~500k entries. The hit list is converted to wikipages and ordered by the content based score
Building K-Anonymous User Cohorts with\\ Consecutive Consistent Weighted Sampling (CCWS)
To retrieve personalized campaigns and creatives while protecting user
privacy, digital advertising is shifting from member-based identity to
cohort-based identity. Under such identity regime, an accurate and efficient
cohort building algorithm is desired to group users with similar
characteristics. In this paper, we propose a scalable -anonymous cohort
building algorithm called {\em consecutive consistent weighted sampling}
(CCWS). The proposed method combines the spirit of the (-powered) consistent
weighted sampling and hierarchical clustering, so that the -anonymity is
ensured by enforcing a lower bound on the size of cohorts. Evaluations on a
LinkedIn dataset consisting of M users and ads campaigns demonstrate that
CCWS achieves substantial improvements over several hashing-based methods
including sign random projections (SignRP), minwise hashing (MinHash), as well
as the vanilla CWS
Packing bag-of-features
One of the main limitations of image search based on bag-of-features is the memory usage per image. Only a few million images can be handled on a single machine in reasonable response time. In this paper, we first evaluate how the memory usage is reduced by using lossless index compression. We then propose an approximate representation of bag-of-features obtained by projecting the corresponding histogram onto a set of pre-defined sparse projection functions, producing several image descriptors. Coupled with a proper indexing structure, an image is represented by a few hundred bytes. A distance expectation criterion is then used to rank the images. Our method is at least one order of magnitude faster than standard bag-of-features while providing excellent search quality. 1
- …