1,244 research outputs found

Sampled Weighted Min-Hashing for Large-Scale Topic Mining

Author: AZ Broder
DM Blei
G Fuentes Pineda
G Salton
GE Hinton
O Chum
YW Teh
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 07/09/2015

We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probability distribution over a vocabulary, SWMH topics are ordered subsets of that vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification. Comment: 10 pages, Proceedings of the Mexican Conference on Pattern Recognition 201
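The co-occurrence idea behind min-hashing can be illustrated with a minimal sketch. It uses plain (unweighted, unsampled) MinHash, not the SWMH procedure from the abstract, and it assumes each term is represented by the set of IDs of the documents it occurs in. The function names and the toy document IDs are hypothetical. Terms that co-occur often have overlapping document sets, so their signatures agree in a fraction of positions close to their Jaccard similarity.

```python
import random

def make_permutations(universe_size, num_hashes, seed=0):
    """Draw explicit random permutations of the document IDs 0..universe_size-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_hashes):
        ranks = list(range(universe_size))
        rng.shuffle(ranks)  # ranks[doc_id] is the position of doc_id in this permutation
        perms.append(ranks)
    return perms

def minhash_signature(doc_ids, perms):
    """For each permutation, keep the smallest rank among the given document IDs."""
    return [min(ranks[d] for d in doc_ids) for ranks in perms]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates the Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)

def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)

if __name__ == "__main__":
    # Hypothetical toy data: two terms that share half of their documents.
    term_a_docs = set(range(0, 100))
    term_b_docs = set(range(50, 150))
    perms = make_permutations(universe_size=200, num_hashes=400, seed=1)
    sig_a = minhash_signature(term_a_docs, perms)
    sig_b = minhash_signature(term_b_docs, perms)
    print("exact:", exact_jaccard(term_a_docs, term_b_docs))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Explicit permutations are only practical for small universes; they are used here because they make the estimator exactly unbiased and easy to check.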

arXiv.org e-Print Archive

Crossref

Hashing-Based-Estimators for Kernel Density in High Dimensions

Author: Charikar Moses
Siminelakis Paris
Publication date: 30/08/2018

Given a set of points $P\subset \mathbb{R}^{d}$ and a kernel $k$, the Kernel Density Estimate at a point $x\in\mathbb{R}^{d}$ is defined as $\mathrm{KDE}_{P}(x)=\frac{1}{|P|}\sum_{y\in P} k(x,y)$. We study the problem of designing a data structure that given a data set $P$ and a kernel function, returns *approximations to the kernel density* of a query point in *sublinear time*. We introduce a class of unbiased estimators for kernel density implemented through locality-sensitive hashing, and give general theorems bounding the variance of such estimators. These estimators give rise to efficient data structures for estimating the kernel density in high dimensions for a variety of commonly used kernels. Our work is the first to provide data-structures with theoretical guarantees that improve upon simple random sampling in high dimensions. Comment: A preliminary version of this paper appeared in FOCS 201

arXiv.org e-Print Archive

Crossref

Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)

Author: Luo Chen
Shrivastava Anshumali
Publication date: 12/10/2018

Split-Merge MCMC (Markov Chain Monte Carlo) is one of the essential and popular variants of MCMC for problems in which an MCMC state consists of an unknown number of components. It is well known that state-of-the-art methods for split-merge MCMC do not scale well. Strategies for rapid mixing require smart and informative proposals to reduce the rejection rate. However, all known smart proposals involve expensive operations to suggest informative transitions. As a result, the cost of each iteration is prohibitive for massive-scale datasets. It is further known that uninformative but computationally efficient proposals, such as random split-merge, lead to extremely slow convergence. This tradeoff between mixing time and per-update cost seems hard to get around. In this paper, we show a sweet spot. We leverage some unique properties of weighted MinHash, which is a popular LSH, to design a novel class of split-merge proposals that are significantly more informative than random sampling but at the same time efficient to compute. Overall, we obtain a superior tradeoff between convergence and per-update cost. As a direct consequence, our proposals are around 6x faster than the state-of-the-art sampling methods on two large real datasets, KDDCUP and PubMed, with several million entities and thousands of clusters.

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Hashing for Similarity Search: A Survey

Author: Ji Jianqiu
Shen Heng Tao
Song Jingkuan
Wang Jingdong
Publication date: 13/08/2014

Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently much effort has been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
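A data-independent hash family of the kind the abstract groups under locality sensitive hashing can be sketched with random hyperplanes. This is one well-known family for angular similarity, picked here as an assumption for illustration; the abstract itself does not name it. The function names and the toy vectors are hypothetical. Each bit records which side of a random hyperplane a vector falls on, and search in the hash coding space ranks database items by Hamming distance to the query code.

```python
import random

def make_hyperplanes(dim, num_bits, seed=0):
    """Draw random Gaussian normal vectors; no data is inspected (data-independent)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(w * v for w, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(code_a, code_b):
    """Number of bit positions in which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

def nearest_by_code(query, database, hyperplanes):
    """Index of the database vector whose code is closest to the query code."""
    query_code = hash_code(query, hyperplanes)
    return min(
        range(len(database)),
        key=lambda i: hamming(query_code, hash_code(database[i], hyperplanes)),
    )

if __name__ == "__main__":
    # Hypothetical toy database of 2-D vectors.
    planes = make_hyperplanes(dim=2, num_bits=64, seed=0)
    database = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
    print(nearest_by_code((0.9, 0.1), database, planes))
```

Vectors separated by a small angle differ in few bits on average, so Hamming distance between codes acts as a proxy for angular distance. A learning-to-hash method would instead fit the hyperplanes to the data.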

arXiv.org e-Print Archive

CiteSeerX