Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)

Author: Luo Chen
Shrivastava Anshumali
Publication date: 12/10/2018

Split-Merge MCMC (Markov Chain Monte Carlo) is one of the essential and popular variants of MCMC for problems where an MCMC state consists of an unknown number of components. It is well known that state-of-the-art methods for split-merge MCMC do not scale well. Strategies for rapid mixing require smart and informative proposals to reduce the rejection rate. However, all known smart proposals involve expensive operations to suggest informative transitions. As a result, the cost of each iteration is prohibitive for massive-scale datasets. It is further known that uninformative but computationally efficient proposals, such as random split-merge, lead to extremely slow convergence. This tradeoff between mixing time and per-update cost seems hard to get around. In this paper, we show a sweet spot. We leverage some unique properties of weighted MinHash, a popular locality-sensitive hashing (LSH) scheme, to design a novel class of split-merge proposals that are significantly more informative than random sampling but at the same time efficient to compute. Overall, we obtain a superior tradeoff between convergence and per-update cost. As a direct consequence, our proposals are around 6X faster than the state-of-the-art sampling methods on two large real datasets, KDDCUP and PubMed, with several millions of entities and thousands of clusters.
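The abstract above relies on weighted MinHash as an LSH scheme. As a rough illustration of the underlying collision property only (not the paper's proposal mechanism), the following sketch builds a weighted MinHash signature for integer weights by replicating each element once per unit of weight, so that two signatures agree in a position with probability equal to the weighted Jaccard similarity. The function names, the replication trick, and the choice of BLAKE2b as the hash are all assumptions made for this example; practical systems typically use a dedicated weighted MinHash scheme that avoids replication.

```python
import hashlib


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed (illustrative choice)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def weighted_minhash(weights: dict, num_hashes: int = 512) -> list:
    """Signature for non-negative integer weights.

    Assumption: element x with weight w is expanded into w distinct
    tokens "x#0" ... "x#(w-1)", then ordinary MinHash is applied.
    """
    expanded = [f"{x}#{r}" for x, w in weights.items() for r in range(w)]
    if not expanded:
        raise ValueError("at least one element must have positive weight")
    # For each seed, keep the token with the smallest hash value.
    return [min(expanded, key=lambda t: _hash(seed, t)) for seed in range(num_hashes)]


def weighted_jaccard(a: dict, b: dict) -> float:
    """Exact weighted Jaccard similarity: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den


def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two sketches collide."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = {"x": 3, "y": 1}
b = {"x": 1, "y": 1, "z": 2}
exact = weighted_jaccard(a, b)
approx = estimate_similarity(weighted_minhash(a), weighted_minhash(b))
```

Here `approx` should be close to `exact`, with the gap shrinking as `num_hashes` grows. Items that collide often under such hashes are similar, which is the kind of signal an informative merge proposal could draw on.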

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

SLIM: Scalable Linkage of Mobility Data

Author: Atluri Gowtham
Basik Fuat
Corless Robert
E.
Goga Oana
Kieu Tung
Reynolds Douglas A.
Sharma Vishal
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2020

We present a scalable solution to link entities across mobility datasets using their spatio-temporal information. This is a fundamental problem in many applications such as linking user identities for security, understanding privacy limitations of location-based services, or producing a unified dataset from multiple sources for urban planning. Such integrated datasets are also essential for service providers to optimise their services and improve business intelligence. In this paper, we first propose a mobility-based representation and similarity computation for entities. An efficient matching process is then developed to identify the final linked pairs, with an automated mechanism to decide when to stop the linkage. We scale the process with a locality-sensitive hashing (LSH) based approach that significantly reduces the number of candidate pairs for matching. To realize the effectiveness and efficiency of our techniques in practice, we introduce an algorithm called SLIM. In the experimental evaluation, SLIM outperforms the two existing state-of-the-art approaches in terms of precision and recall. Moreover, the LSH-based approach brings a speedup of two to four orders of magnitude.
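The candidate-pair reduction mentioned in the abstract above can be illustrated with a generic MinHash banding scheme. This is a minimal sketch under stated assumptions, not the SLIM algorithm: each entity is assumed to be represented as a non-empty set of hypothetical "cell@time-window" tokens, and the names `minhash_signature`, `candidate_pairs`, `bands`, and `rows` are invented for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed (illustrative choice)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens: set, num_hashes: int) -> list:
    """Plain MinHash: the smallest hash value of the set under each seed."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(records: dict, bands: int = 8, rows: int = 2) -> set:
    """Return entity pairs that share at least one identical signature band.

    Only these pairs need an exact similarity computation; all other
    pairs are skipped, which is where the savings come from.
    """
    buckets = defaultdict(set)
    for entity_id, tokens in records.items():
        sig = minhash_signature(tokens, bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(entity_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical mobility histories: sets of "cell@time-window" tokens.
records = {
    "a": {"c12@t1", "c12@t2", "c40@t3"},
    "b": {"c12@t1", "c12@t2", "c40@t3"},
    "c": {"c77@t1", "c78@t2", "c90@t3"},
}
pairs = candidate_pairs(records)
```

In this toy input, "a" and "b" have identical histories and therefore collide in every band, while "c" shares no tokens with either and is never compared. Raising `rows` makes bands stricter (fewer candidates), and raising `bands` makes the filter more permissive.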

arXiv.org e-Print Archive

Crossref

Bilkent University Institutional Repository

Warwick Research Archives Portal Repository

Diverse near neighbor problem

Author: Abbar Sofiane
Amer-Yahia Sihem
Indyk Piotr
Mahabadi Sepideh
Varadarajan Kasturi R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013

Motivated by the recent research on diversity-aware search, we investigate the k-diverse near neighbor reporting problem. The problem is defined as follows: given a query point q, report the maximum diversity set S of k points in the ball of radius r around q. The diversity of a set S is measured by the minimum distance between any pair of points in S (the higher, the better). We present two approximation algorithms for the case where the points live in a d-dimensional Hamming space. Our algorithms guarantee query times that are sub-linear in n and only polynomial in the diversity parameter k, as well as the dimension d. For low values of k, our algorithms achieve sub-linear query times even if the number of points within distance r from a query q is linear in n. To the best of our knowledge, these are the first known algorithms of this type that offer provable guarantees.

Funding: Charles Stark Draper Laboratory; National Science Foundation (U.S.) (Award NSF CCF-1012042); David & Lucile Packard Foundation
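To make the problem definition above concrete, the following sketch computes the diversity of a set in Hamming space and selects k points from the ball of radius r with a greedy farthest-point heuristic. It is a linear-scan baseline for illustration only, not one of the paper's sub-linear algorithms; points are assumed to be equal-length bit strings, and all function names are hypothetical.

```python
from itertools import combinations


def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(u, v))


def diversity(points: list) -> int:
    """Minimum pairwise Hamming distance (requires at least two points)."""
    return min(hamming(u, v) for u, v in combinations(points, 2))


def greedy_diverse_near_neighbors(data: list, q: str, r: int, k: int) -> list:
    """Pick up to k distinct points within distance r of q.

    Greedy rule (an assumption of this sketch): repeatedly add the point
    whose distance to its closest already-chosen point is largest.
    """
    # Linear scan of the ball around q, dropping duplicates but keeping order.
    ball = list(dict.fromkeys(p for p in data if hamming(p, q) <= r))
    if not ball or k <= 0:
        return []
    chosen = [ball[0]]
    while len(chosen) < min(k, len(ball)):
        best = max(
            (p for p in ball if p not in chosen),
            key=lambda p: min(hamming(p, c) for c in chosen),
        )
        chosen.append(best)
    return chosen


data = ["0000", "0001", "0011", "1111", "1100"]
result = greedy_diverse_near_neighbors(data, q="0000", r=2, k=3)
```

With this input, "1111" lies outside the ball (distance 4 from q), and the greedy pass returns three points whose minimum pairwise distance is 2. The greedy answer is not always optimal: for k = 2 it returns a pair at distance 2 even though "0011" and "1100" are at distance 4.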

DSpace@MIT

Crossref

Hal - Université Grenoble Alpes