Sampled Weighted Min-Hashing for Large-Scale Topic Mining
We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to
automatically mine topics from large-scale corpora. SWMH generates multiple
random partitions of the corpus vocabulary based on term co-occurrence and
agglomerates highly overlapping inter-partition cells to produce the mined
topics. While other approaches define a topic as a probabilistic distribution
over a vocabulary, SWMH topics are ordered subsets of such vocabulary.
Interestingly, the topics mined by SWMH underlie themes from the corpus at
different levels of granularity. We extensively evaluate the meaningfulness of
the mined topics both qualitatively and quantitatively on the NIPS (1.7 K
documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora.
Additionally, we compare the quality of SWMH with Online LDA topics for
document representation in classification.
Comment: 10 pages, Proceedings of the Mexican Conference on Pattern
Recognition 201
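The partitioning idea in the abstract above can be illustrated with a small sketch. This is not the SWMH algorithm itself: it uses plain (unweighted, unsampled) min-hashing, and the function name `minhash_partition`, the `tuple_size` parameter, and the toy corpus are all assumptions made for illustration. Terms whose document sets receive the same tuple of min-hash values fall into the same cell, so terms that co-occur often tend to be grouped together.

```python
import random
from collections import defaultdict


def minhash_partition(term_docs, num_docs, tuple_size=2, seed=0):
    """Group terms whose document sets share the same min-hash tuple.

    term_docs maps each term to the set of document ids containing it.
    This is a simplified, unweighted illustration (hypothetical helper).
    """
    rng = random.Random(seed)
    # Each random permutation assigns a distinct rank to every document id.
    perms = []
    for _ in range(tuple_size):
        ranks = list(range(num_docs))
        rng.shuffle(ranks)
        perms.append(ranks)

    cells = defaultdict(list)
    for term, docs in term_docs.items():
        # The min-hash of a set under a permutation is its smallest rank.
        key = tuple(min(ranks[d] for d in docs) for ranks in perms)
        cells[key].append(term)
    return [sorted(cell) for cell in cells.values()]


if __name__ == "__main__":
    toy = {
        "neural": {0, 1, 2},
        "network": {0, 1, 2},
        "market": {3, 4},
        "stock": {3, 4},
    }
    print(minhash_partition(toy, num_docs=5))
```

Repeating this with different seeds gives multiple random partitions; the abstract describes merging highly overlapping cells across such partitions, a step this sketch omits.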
Hashing-Based-Estimators for Kernel Density in High Dimensions
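Given a set of points $P \subset \mathbb{R}^d$ and a kernel $k$, the Kernel
Density Estimate at a point $x \in \mathbb{R}^d$ is defined as
$\mathrm{KDE}_P(x) = \frac{1}{|P|} \sum_{y \in P} k(x, y)$. We study the problem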
of designing a data structure that given a data set and a kernel function,
returns *approximations to the kernel density* of a query point in *sublinear
time*. We introduce a class of unbiased estimators for kernel density
implemented through locality-sensitive hashing, and give general theorems
bounding the variance of such estimators. These estimators give rise to
efficient data structures for estimating the kernel density in high dimensions
for a variety of commonly used kernels. Our work is the first to provide
data-structures with theoretical guarantees that improve upon simple random
sampling in high dimensions.
Comment: A preliminary version of this paper appeared in FOCS 201
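For context, the simple random sampling baseline that the abstract compares against can be sketched as follows. The sketch assumes the usual definition of the kernel density as the average kernel value between the query and the data points, and it picks a Gaussian kernel purely for illustration; the function names and the `bandwidth` parameter are hypothetical. It does not implement the hashing-based estimators themselves.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two points (an assumed choice of kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def kde_exact(points, query, kernel=gaussian_kernel):
    """Exact kernel density: average kernel value over all points (linear time)."""
    return sum(kernel(query, p) for p in points) / len(points)


def kde_random_sampling(points, query, num_samples, rng, kernel=gaussian_kernel):
    """Unbiased estimate from points drawn uniformly with replacement."""
    sample = [rng.choice(points) for _ in range(num_samples)]
    return sum(kernel(query, p) for p in sample) / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
    q = (0.0, 0.0)
    print(kde_exact(data, q), kde_random_sampling(data, q, 100, rng))
```

Uniform sampling needs many samples when the true density at the query is small, because few sampled points contribute noticeably; that is the regime in which the abstract claims an improvement.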
Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)
Split-Merge MCMC (Markov Chain Monte Carlo) is one of the essential and
popular variants of MCMC for problems in which an MCMC state consists of an unknown
number of components. It is well known that state-of-the-art methods for
split-merge MCMC do not scale well. Strategies for rapid mixing require smart
and informative proposals to reduce the rejection rate. However, all known
smart proposals involve expensive operations to suggest informative
transitions. As a result, the cost of each iteration is prohibitive for massive
scale datasets. It is further known that uninformative but computationally
efficient proposals, such as random split-merge, lead to extremely slow
convergence. This tradeoff between mixing time and per-update cost seems hard
to get around.
In this paper, we show a sweet spot. We leverage some unique properties of
weighted MinHash, which is a popular LSH, to design a novel class of
split-merge proposals which are significantly more informative than random
sampling but at the same time efficient to compute. Overall, we obtain a
superior tradeoff between convergence and per update cost. As a direct
consequence, our proposals are around 6X faster than the state-of-the-art
sampling methods on two large real datasets, KDDCUP and PubMed, with several
millions of entities and thousands of clusters.
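The property of weighted MinHash that makes it useful as an LSH can be shown with a minimal sketch: two weighted vectors collide with probability equal to their weighted Jaccard similarity. The version below assumes non-negative integer weights and uses the simple expansion trick (each item of weight w becomes w distinct elements), which is only one of several known constructions and is not claimed to be the one used in the paper. All function names are hypothetical, and the split-merge proposals themselves are not implemented here.

```python
import random


def weighted_minhash(weights, seed):
    """Weighted MinHash for integer weights via element expansion.

    weights maps item -> non-negative integer weight. Each item is expanded
    into (item, 0), ..., (item, w - 1); the element with the smallest
    pseudo-random value under the given seed is returned.
    """
    best = None
    for item, w in weights.items():
        for k in range(w):
            # A string seed makes the value a deterministic function of
            # (seed, item, k), so it acts as a shared random hash.
            value = random.Random(f"{seed}:{item}:{k}").random()
            if best is None or value < best[0]:
                best = (value, item, k)
    return None if best is None else (best[1], best[2])


def weighted_jaccard(a, b):
    """Sum of coordinate-wise minima divided by sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


def estimate_similarity(a, b, num_hashes=200):
    """Fraction of seeds on which the two weighted MinHashes collide."""
    hits = sum(weighted_minhash(a, s) == weighted_minhash(b, s)
               for s in range(num_hashes))
    return hits / num_hashes


if __name__ == "__main__":
    u = {"a": 2, "b": 1}
    v = {"a": 1, "b": 1}
    print(weighted_jaccard(u, v), estimate_similarity(u, v))
```

Because similar items collide often, hash buckets offer a cheap way to pick candidate pairs that are more informative than uniformly random ones, which is the intuition the abstract points to.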
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is the problem of finding the data
items in a large database whose distances to a query item are the smallest.
Various methods have been developed to address this problem, and recently much
effort has been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work on locality sensitive hashing. We divide the hashing
algorithms into two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution, and learning to hash, which
learns hash functions according to the data distribution. We review them from
various aspects, including hash function design, distance measure, and search
scheme in the hash coding space.
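As an illustration of the first category, a data-independent hash function, here is a minimal sketch of random-hyperplane hashing for angular similarity. It is one well-known locality sensitive family chosen as an example, not a method singled out by the survey; the function names and parameters are assumptions. Each bit records on which side of a random hyperplane the vector lies, and search in the hash coding space then compares codes by Hamming distance.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw random Gaussian hyperplane normals, independent of the data."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(vec, planes):
    """Binary code: one bit per hyperplane, set by the sign of the dot product."""
    return tuple(
        1 if sum(v * w for v, w in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of positions at which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    planes = make_hyperplanes(dim=3, num_bits=16, seed=42)
    query = [1.0, 0.5, -0.2]
    near = [0.9, 0.6, -0.1]
    far = [-1.0, -0.4, 0.3]
    q = hash_code(query, planes)
    print(hamming(q, hash_code(near, planes)), hamming(q, hash_code(far, planes)))
```

A learning-to-hash method would instead fit the projection directions to the data distribution rather than drawing them at random.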