Consistent Weighted Sampling Made Fast, Small, and Easy

Authors: Haeupler, Bernhard; Manasse, Mark; Talwar, Kunal
Publication date: 15/10/2014

Document sketching using Jaccard similarity has been a workable and effective technique for reducing near-duplicates in Web page and image search results, and has also proven useful in file system synchronization, compression, and learning applications. Min-wise sampling can be used to derive an unbiased estimator for Jaccard similarity, and taking a few hundred independent consistent samples leads to compact sketches that provide good estimates of pairwise similarity. Subsequent works extended this technique to weighted sets and showed how to produce samples with only a constant number of hash evaluations for any element, independent of its weight. Another improvement by Li et al. shows how to speed up sketch computations by computing many (near-)independent samples in one shot. Unfortunately, this latter improvement works only for the unweighted case. In this paper we give a simple, fast, and accurate procedure that reduces weighted sets to unweighted sets with small impact on the Jaccard similarity. This leads to compact sketches consisting of many (near-)independent weighted samples, which can be computed with just a small constant number of hash function evaluations per weighted element. The size of the produced unweighted set is furthermore a tunable parameter, which enables us to run the unweighted scheme of Li et al. in the regime where it is most efficient. Even when the sets involved are unweighted, our approach gives a simple solution to the densification problem that other works attempted to address. Unlike previously known schemes, ours does not result in an unbiased estimator. However, we prove that the bias introduced by our reduction is negligible and that the standard deviation is comparable to the unweighted case. We also empirically evaluate our scheme and show that it gives significant gains in computational efficiency, without any measurable loss in accuracy.
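To make the min-wise sampling idea in this abstract concrete, here is a minimal Python sketch of the classic unweighted estimator it builds on: one min-wise sample per hash function, with the Jaccard similarity estimated as the fraction of matching samples. This is not the weighted reduction proposed in the paper, nor the one-shot scheme of Li et al. The function names, the use of seeded BLAKE2b as a stand-in for independent hash functions, and the default of 256 samples are all assumptions made for illustration.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of (seed, item).

    Assumption: a seeded BLAKE2b digest stands in for one random hash function.
    """
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, num_samples=256):
    """Return one min-wise sample per seed: the smallest hash value over the set."""
    items = list(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(num_samples)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of coordinates that agree."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Note that this naive version evaluates every hash function on every element, so its cost grows with the product of set size and sketch length. That per-sample cost is what the one-shot and reduction-based schemes mentioned in the abstract aim to avoid.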

arXiv.org e-Print Archive

CiteSeerX

Max-sum diversity via convex programming

Authors: Cevallos, Alfonso; Eisenbrand, Friedrich; Zenklusen, Rico
Publication date: 22/11/2015

Diversity maximization is an important concept in information retrieval, computational geometry, and operations research. Usually, it is a variant of the following problem: given a ground set, constraints, and a function $f(\cdot)$ that measures the diversity of a subset, the task is to select a feasible subset $S$ such that $f(S)$ is maximized. The sum-dispersion function $f(S) = \sum_{x,y \in S} d(x,y)$, which is the sum of the pairwise distances in $S$, is in this context a prominent diversification measure. The corresponding diversity maximization problem is known as max-sum or sum-sum diversification. Many recent results deal with the design of constant-factor approximation algorithms for diversification problems involving the sum-dispersion function under a matroid constraint. In this paper, we present a PTAS for the max-sum diversification problem under a matroid constraint for distances $d(\cdot,\cdot)$ of negative type. Distances of negative type are, for example, metric distances stemming from the $\ell_2$ and $\ell_1$ norms, as well as the cosine (or spherical) and Jaccard distances, which are popular similarity metrics in web and image search.
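As an illustration of the objective defined in this abstract, the following Python sketch evaluates the sum-dispersion function and finds a maximizer by exhaustive search under a plain cardinality constraint (the uniform matroid). It is a brute-force baseline for tiny inputs, not the PTAS the paper presents. The names `sum_dispersion` and `brute_force_max_sum` are hypothetical, the Euclidean distance is just one example of a negative-type distance, and each unordered pair is counted once, which is an assumed reading of the sum over $x,y \in S$.

```python
from itertools import combinations
import math


def sum_dispersion(points, dist):
    """f(S): sum of pairwise distances in S, counting each unordered pair once."""
    return sum(dist(x, y) for x, y in combinations(points, 2))


def euclidean(x, y):
    """l2 distance between two coordinate tuples."""
    return math.dist(x, y)


def brute_force_max_sum(ground_set, k, dist):
    """Return a size-k subset maximizing f(S) by trying every subset.

    Exponential in k; only meant to make the objective concrete.
    """
    return max(combinations(ground_set, k), key=lambda s: sum_dispersion(s, dist))


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
    best = brute_force_max_sum(pts, 2, euclidean)
    print(best, sum_dispersion(best, euclidean))
```

Replacing the cardinality constraint with a general matroid would only change which subsets are enumerated as feasible; the objective itself stays the same.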

arXiv.org e-Print Archive

Repository for Publications and Research Data

Dagstuhl Research Online Publication Server

Preference Networks: Probabilistic Models for Recommendation Systems

Authors: Phung, Dinh Q.; Truyen, Tran The; Venkatesh, Svetha
Publication date: 22/07/2014

Recommender systems help users select relevant and personalised information from the massive amounts of data available. We propose a unified framework called Preference Network (PN) that jointly models various types of domain knowledge for the task of recommendation. The PN is a probabilistic model that systematically combines both content-based filtering and collaborative filtering into a single conditional Markov random field. Once estimated, it serves as a probabilistic database that supports various useful queries such as rating prediction and top-$N$ recommendation. To handle the challenging problem of learning large networks of users and items, we employ a simple but effective pseudo-likelihood with regularisation. Experiments on the movie rating data demonstrate the merits of the PN. Comment: In Proc. of 6th Australasian Data Mining Conference (AusDM), Gold Coast, Australia, pages 195--202, 200

arXiv.org e-Print Archive

CiteSeerX