Consistent Weighted Sampling Made Fast, Small, and Easy
Document sketching using Jaccard similarity has been an effective
technique for reducing near-duplicates in Web page and image search results,
and has also proven useful in file system synchronization, compression and
learning applications.
Min-wise sampling can be used to derive an unbiased estimator for Jaccard
similarity, and taking a few hundred independent consistent samples leads to
compact sketches which provide good estimates of pairwise similarity.
Subsequent works extended this technique to weighted sets and showed how to
produce samples with only a constant number of hash evaluations for any
element, independent of its weight. Another improvement, by Li et al., shows
how to speed up sketch computation by computing many (near-)independent
samples in one shot. Unfortunately, this latter improvement works only for the
unweighted case.
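For concreteness, here is a minimal min-wise sampling (MinHash) estimator in
Python; the salted-hash construction, parameter names, and toy sets are
illustrative assumptions, not taken from the paper:

    import hashlib

    def minhash_signature(elements, num_hashes=128):
        # Simulate num_hashes independent hash functions by salting a
        # single hash; each coordinate keeps the minimum hash over the set.
        return [
            min(hashlib.blake2b(f"{seed}:{e}".encode(), digest_size=8).hexdigest()
                for e in elements)
            for seed in range(num_hashes)
        ]

    def estimate_jaccard(sig_a, sig_b):
        # Pr[the two minima agree] equals the Jaccard similarity, so the
        # fraction of agreeing coordinates is an unbiased estimator.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    a = {"the", "quick", "brown", "fox"}
    b = {"the", "quick", "red", "fox"}
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
    # true Jaccard similarity: 3/5 = 0.6

Note that this naive construction performs num_hashes hash evaluations per
element, which is precisely the per-element cost that the improvements
discussed above are designed to avoid.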
In this paper we give a simple, fast and accurate procedure which reduces
weighted sets to unweighted sets with small impact on the Jaccard similarity.
This leads to compact sketches consisting of many (near-)independent weighted
samples which can be computed with just a small constant number of hash
function evaluations per weighted element. The size of the produced unweighted
set is furthermore a tunable parameter which enables us to run the unweighted
scheme of Li et al. in the regime where it is most efficient. Even when the
sets involved are unweighted, our approach gives a simple solution to the
densification problem that other works attempted to address.
Unlike previously known schemes, ours does not result in an unbiased
estimator. However, we prove that the bias introduced by our reduction is
negligible and that the standard deviation is comparable to the unweighted
case. We also empirically evaluate our scheme and show that it gives
significant gains in computational efficiency, without any measurable loss in
accuracy.
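The paper's own reduction is not reproduced here, but the folklore
quantization below illustrates the general weighted-to-unweighted idea:
replicate each element in proportion to its (rounded-up) weight, then treat
the replicas as an ordinary unweighted set. The grid width delta and the
example sets are assumptions for illustration only:

    import math

    def quantize(weighted_set, delta=0.05):
        # Replace each (element, weight) pair with ceil(w / delta) unit
        # items (element, 0), (element, 1), ...; rounding weights up to the
        # grid perturbs the Jaccard similarity by an amount controlled by
        # delta.
        out = set()
        for x, w in weighted_set.items():
            out.update((x, i) for i in range(math.ceil(w / delta)))
        return out

    def jaccard(s, t):
        return len(s & t) / len(s | t)

    a = {"cat": 1.0, "dog": 0.5}
    b = {"cat": 0.5, "dog": 0.5, "fox": 0.3}
    print(jaccard(quantize(a), quantize(b)))
    # weighted Jaccard: (0.5 + 0.5) / (1.0 + 0.5 + 0.3) = 0.5556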
Max-sum diversity via convex programming
Diversity maximization is an important concept in information retrieval,
computational geometry and operations research. Usually, it is a variant of
the following problem: given a ground set, constraints, and a function
$f(\cdot)$ that measures the diversity of a subset, the task is to select a
feasible subset $S$ such that $f(S)$ is maximized. The \emph{sum-dispersion}
function $f(S) = \sum_{\{x,y\} \subseteq S} d(x,y)$, which is the sum of the
pairwise distances in $S$, is in this context a prominent diversification
measure. The corresponding diversity maximization problem is \emph{max-sum} or
\emph{sum-sum diversification}. Many recent results deal with the design of
constant-factor approximation algorithms for diversification problems
involving the sum-dispersion function under a matroid constraint. In this
paper, we present a PTAS for the max-sum diversification problem under a
matroid constraint for distances $d(\cdot,\cdot)$ of \emph{negative type}.
Distances of negative type are, for example, metric distances stemming from
the $\ell_1$ and $\ell_2$ norms, as well as the cosine, spherical, or Jaccard
distance, which are popular similarity measures in web and image search.
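A small Python sketch of the sum-dispersion objective, using the $\ell_1$
distance (which is of negative type) as the example; the point set is
illustrative:

    from itertools import combinations

    def sum_dispersion(S, d):
        # f(S) = sum of d(x, y) over all unordered pairs {x, y} in S.
        return sum(d(x, y) for x, y in combinations(S, 2))

    # l1 (Manhattan) distance between points given as coordinate tuples.
    l1 = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))

    S = [(0, 0), (1, 0), (0, 2)]
    print(sum_dispersion(S, l1))  # 1 + 2 + 3 = 6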
Preference Networks: Probabilistic Models for Recommendation Systems
Recommender systems are important for helping users select relevant and
personalised information from the massive amounts of data available. We
propose a unified framework called the Preference Network (PN) that jointly
models various types of domain knowledge for the task of recommendation. The
PN is a probabilistic model that systematically combines both content-based
filtering and collaborative filtering into a single conditional Markov random
field. Once estimated, it serves as a probabilistic database that supports
various useful queries such as rating prediction and top-$N$ recommendation.
To handle the challenging problem of learning large networks of users and
items, we employ a simple but effective pseudo-likelihood with regularisation.
Experiments on movie rating data demonstrate the merits of the PN.
Comment: In Proc. of 6th Australasian Data Mining Conference (AusDM), Gold
Coast, Australia, pages 195--202, 2007
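As a rough illustration of the estimation principle (the notation is generic,
not necessarily the paper's): pseudo-likelihood avoids the intractable
partition function of the joint Markov random field by maximizing a
regularised product of local conditionals,

    \hat{\theta} = \arg\max_{\theta} \sum_i \log p(x_i \mid x_{\mathcal{N}(i)}; \theta) - \lambda \lVert \theta \rVert_2^2 ,

where each $x_i$ is a user, item, or rating variable and $\mathcal{N}(i)$ is
its Markov blanket in the network.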