243 research outputs found
Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)
Split-Merge MCMC (Monte Carlo Markov Chain) is one of the essential and
popular variants of MCMC for problems when an MCMC state consists of an unknown
number of components. It is well known that state-of-the-art methods for
split-merge MCMC do not scale well. Strategies for rapid mixing requires smart
and informative proposals to reduce the rejection rate. However, all known
smart proposals involve expensive operations to suggest informative
transitions. As a result, the cost of each iteration is prohibitive for massive
scale datasets. It is further known that uninformative but computationally
efficient proposals, such as random split-merge, leads to extremely slow
convergence. This tradeoff between mixing time and per update cost seems hard
to get around.
In this paper, we show a sweet spot. We leverage some unique properties of
weighted MinHash, which is a popular LSH, to design a novel class of
split-merge proposals which are significantly more informative than random
sampling but at the same time efficient to compute. Overall, we obtain a
superior tradeoff between convergence and per update cost. As a direct
consequence, our proposals are around 6X faster than the state-of-the-art
sampling methods on two large real datasets KDDCUP and PubMed with several
millions of entities and thousands of clusters
SLIM : Scalable Linkage of Mobility Data
We present a scalable solution to link entities across mobility datasets using their spatio-temporal information. This is a fundamental problem in many applications such as linking user identities for security, understanding privacy limitations of location based services, or producing a unified dataset from multiple sources for urban planning. Such integrated datasets are also essential for service providers to optimise their services and improve business intelligence. In this paper, we first propose a mobility based representation and similarity computation for entities. An efficient matching process is then developed to identify the final linked pairs, with an automated mechanism to decide when to stop the linkage. We scale the process with a locality-sensitive hashing (LSH) based approach that significantly reduces candidate pairs for matching. To realize the effectiveness and efficiency of our techniques in practice, we introduce an algorithm called SLIM. In the experimental evaluation, SLIM outperforms the two existing state-of-the-art approaches in terms of precision and recall. Moreover, the LSH-based approach brings two to four orders of magnitude speedup
Diverse near neighbor problem
Motivated by the recent research on diversity-aware search, we investigate the k-diverse near neighbor reporting problem. The problem is defined as follows: given a query point q, report the maximum diversity set S of k points in the ball of radius r around q. The diversity of a set S is measured by the minimum distance between any pair of points in (the higher, the better). We present two approximation algorithms for the case where the points live in a d-dimensional Hamming space. Our algorithms guarantee query times that are sub-linear in n and only polynomial in the diversity parameter k, as well as the dimension d. For low values of k, our algorithms achieve sub-linear query times even if the number of points within distance r from a query is linear in . To the best of our knowledge, these are the first known algorithms of this type that offer provable guarantees.Charles Stark Draper LaboratoryNational Science Foundation (U.S.) (Award NSF CCF-1012042)David & Lucile Packard Foundatio
- …