Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting
of the k elements that are smallest according to a given hash function h. With
this sample we can estimate the relative size f=|Y|/|X| of any subset Y as
|S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard
similarity f=|A intersect B|/|A union B| between sets A and B. Given the
bottom-k samples from A and B, we construct the bottom-k sample of their union
as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is
estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k.
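For concreteness, a minimal sketch of the bottom-k construction and the Jaccard estimate just described. The hash function below is a stand-in realized with explicit randomness; the point of the result above is that even a 2-independent h suffices.

```python
# Bottom-k sampling and the Jaccard estimate described above (sketch).
import random

def bottom_k(elements, h, k):
    """The k elements of `elements` that are smallest under hash h."""
    return set(sorted(elements, key=h)[:k])

def jaccard_estimate(A, B, h, k):
    SA, SB = bottom_k(A, h, k), bottom_k(B, h, k)
    # S_k(A union B) = S_k(S_k(A) union S_k(B)): the union sketch is
    # computable from the two per-set sketches alone.
    S_union = bottom_k(SA | SB, h, k)
    return len(S_union & SA & SB) / k

# Stand-in hash: a lazily filled table of uniform values (i.e., truly
# random hashing); a 2-independent h would do per the result above.
_table = {}
def h(x):
    return _table.setdefault(x, random.random())

A, B = set(range(1000)), set(range(500, 1500))
print(jaccard_estimate(A, B, h, k=100))  # true Jaccard is 500/1500 = 1/3
```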
We show here that even if the hash function is only 2-independent, the
expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a
constant factor of the expected relative error with truly random hashing.
For comparison, consider the classic approach of k×min-wise, where we use k
independent hash functions h_1,...,h_k, storing the smallest element with each
hash function. For k×min-wise there is at least a constant bias with constant
independence, and it is not reduced with larger k. Recently Feigenblat et al.
showed that bottom-k circumvents the bias if the hash function is 8-independent
and k is sufficiently large. We get down to 2-independence for any k. Our
result is based on a simple union bound, transferring generic concentration
bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger
probability error bounds with higher independence.
For weighted sets, we consider priority sampling which adapts efficiently to
the concrete input weights, e.g., benefiting strongly from heavy-tailed input.
This time, the analysis is much more involved, but again we show that generic
concentration bounds can be applied.
Comment: A short version appeared at STOC'13.
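As a rough illustration of priority sampling in the standard formulation of Duffield, Lund, and Thorup (item i gets priority w_i/u_i with u_i uniform in (0,1]; the k highest priorities are kept, and tau is the (k+1)-st highest priority). The weights and parameters below are illustrative only.

```python
# Priority sampling sketch: the sum of max(w_i, tau) over sampled items
# in a subset is an unbiased estimate of the subset's total weight.
import random

def priority_sample(weights, k):
    """weights: dict item -> weight. Returns (sampled weights, tau)."""
    # u in (0, 1]; using 1 - random() avoids a zero divisor.
    prio = {i: w / (1.0 - random.random()) for i, w in weights.items()}
    order = sorted(prio, key=prio.get, reverse=True)
    sample = {i: weights[i] for i in order[:k]}
    tau = prio[order[k]] if len(order) > k else 0.0
    return sample, tau

def subset_weight_estimate(sample, tau, predicate):
    return sum(max(w, tau) for i, w in sample.items() if predicate(i))

# Heavy-tailed input: a few items carry most of the weight.
weights = {i: (100.0 if i < 5 else 1.0) for i in range(1000)}
sample, tau = priority_sample(weights, k=50)
print(subset_weight_estimate(sample, tau, lambda i: i < 500))  # ~995
```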
Approximately Minwise Independence with Twisted Tabulation
A random hash function h is eps-minwise if for any set X, |X| = n, and
element x in X, Pr[h(x) = min h(X)] = (1 ± eps)/n.
Minwise hash functions with low bias have widespread applications
within similarity estimation.
Hashing from a universe [u], the twisted tabulation hashing of
Pătraşcu and Thorup [SODA'13] makes c = O(1) lookups in tables of size
u^{1/c}. Twisted tabulation was invented to get good concentration for
hashing based sampling. Here we show that twisted tabulation yields Õ(1/u^{1/c})-minwise hashing.
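As a rough, simplified sketch of the twisted tabulation idea (not the exact construction of Pătraşcu and Thorup; the 32-bit keys and 8-bit characters below are assumptions for illustration): the first c-1 characters are hashed by simple tabulation into a (twist, value) pair, and the twist perturbs the last character before its own table lookup.

```python
# Simplified twisted tabulation sketch: c = 4 eight-bit characters.
import random

C, CHAR_BITS, MASK = 4, 8, 0xFF
rng = random.Random(42)
# Each head table maps a character to an (8-bit twist, 32-bit value) pair.
head_tables = [[(rng.getrandbits(CHAR_BITS), rng.getrandbits(32))
                for _ in range(1 << CHAR_BITS)] for _ in range(C - 1)]
tail_table = [rng.getrandbits(32) for _ in range(1 << CHAR_BITS)]

def twisted_tabulation(x):
    twist, value = 0, 0
    for i in range(C - 1):                    # simple tabulation on the head
        t, v = head_tables[i][(x >> (i * CHAR_BITS)) & MASK]
        twist ^= t
        value ^= v
    last = (x >> ((C - 1) * CHAR_BITS)) & MASK
    return value ^ tail_table[last ^ twist]   # twisted lookup on the tail

print(hex(twisted_tabulation(0xDEADBEEF)))
```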
In the classic independence paradigm of Wegman and Carter [FOCS'79],
Õ(1/u^{1/c})-minwise hashing requires Omega(log u)-independence [Indyk
SODA'99]. Pătraşcu and Thorup [STOC'11] had shown that simple
tabulation, using the same space and lookups, yields Õ(1/n^{1/c})-minwise
independence, which is good for large sets, but useless for small sets. Our
analysis uses some of the same methods, but is much cleaner, bypassing a
complicated induction argument.
Comment: To appear in Proceedings of SWAT 2014.
Estimation for Monotone Sampling: Competitiveness and Customization
Random samples are lossy summaries which allow queries posed over the data to
be approximated by applying an appropriate estimator to the sample. The
effectiveness of sampling, however, hinges on estimator selection. The choice
of estimators is subject to global requirements, such as unbiasedness and
range restrictions on the estimate value, and ideally, we seek estimators that
are both efficient to derive and apply and admissible (not dominated, in
terms of variance, by other estimators). Nevertheless, for a given data domain,
sampling scheme, and query, there are many admissible estimators. We study the
choice of admissible nonnegative and unbiased estimators for monotone sampling
schemes. Monotone sampling schemes are implicit in many applications of massive
data set analysis. Our main contribution is general derivations of admissible
estimators with desirable properties. We present a construction of
order-optimal estimators, which minimize variance according to any
specified priorities over the data domain. Order-optimality allows us to
customize the derivation to common patterns that we can learn or observe in the
data. When we prioritize lower values (e.g., more similar data sets when
estimating difference), we obtain the L* estimator, which is the unique
monotone admissible estimator. We show that the L* estimator is
4-competitive and dominates the classic Horvitz-Thompson estimator. These
properties make the L* estimator a natural default choice. We also present
the U* estimator, which prioritizes large values (e.g., less similar data
sets). Our estimator constructions are both easy to apply and possess desirable
properties, allowing us to make the most of our summarized data.
Comment: 28 pages; improved write-up, presentation in the context of the more
general monotone sampling formulation (instead of coordinated sampling).
Bounds on the universal ratio removed to make the paper more focused, since it
is mainly of theoretical interest.
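For reference, a minimal sketch of the classic Horvitz-Thompson (inverse-probability) estimator that the L* estimator above is shown to dominate, assuming the inclusion probability of each sampled key is available to the estimator; the sample below is hypothetical.

```python
# Horvitz-Thompson: divide each sampled value by its inclusion probability.
def horvitz_thompson(sampled_values, inclusion_prob):
    """Unbiased estimate of a population total from a sample.
    sampled_values: dict key -> value; inclusion_prob(key) must be > 0."""
    return sum(v / inclusion_prob(i) for i, v in sampled_values.items())

# E.g., Poisson sampling where every key is included with probability 0.1.
sample = {3: 7.0, 42: 1.5}  # hypothetical sampled keys and values
print(horvitz_thompson(sample, lambda i: 0.1))  # estimates the total
```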
Practical Hash Functions for Similarity Estimation and Dimensionality Reduction
Hashing is a basic tool for dimensionality reduction employed in several
aspects of machine learning. However, the performance analysis is often carried
out under the abstract assumption that a truly random unit cost hash function
is used, without concern for which concrete hash function is employed. The
concrete hash function may work fine on sufficiently random input. The question
is if it can be trusted in the real world when faced with more structured
input.
In this paper we focus on two prominent applications of hashing, namely
similarity estimation with the one permutation hashing (OPH) scheme of Li et
al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of
which have found numerous applications, e.g., in approximate near-neighbour
search with LSH and large-scale classification with SVM.
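As a minimal illustration of the feature hashing idea (a simplified rendering, not the exact scheme or hash functions studied here; the md5-based hash is a stand-in): each feature is mapped to one of d buckets, and a second sign bit makes inner products of hashed vectors unbiased.

```python
# Feature hashing sketch: project a sparse feature map into d dimensions.
import hashlib

def feature_hash(features, d):
    """features: dict name -> value. Returns a dense vector of length d."""
    vec = [0.0] * d
    for name, value in features.items():
        digest = int(hashlib.md5(name.encode()).hexdigest(), 16)
        index = digest % d                           # bucket hash
        sign = 1.0 if (digest >> 100) & 1 else -1.0  # sign hash
        vec[index] += sign * value
    return vec

print(feature_hash({"the": 3.0, "cat": 1.0, "sat": 1.0}, d=8))
```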
We consider mixed tabulation hashing of Dahlgaard et al. [FOCS'15] which was
proved to perform like a truly random hash function in many applications,
including OPH. Here we first show improved concentration bounds for FH with
truly random hashing and then argue that mixed tabulation performs similarly for
sparse input. Our main contribution, however, is an experimental comparison of
different hashing schemes when used inside FH, OPH, and LSH.
We find that mixed tabulation hashing is almost as fast as the
multiply-mod-prime scheme ax+b mod p. Multiply-mod-prime is guaranteed to work
well on sufficiently random data, but we demonstrate that in the above
applications, it can lead to bias and poor concentration on both real-world and
synthetic data. We also compare with the popular MurmurHash3, which has no
proven guarantees. Mixed tabulation and MurmurHash3 both perform similarly to
truly random hashing in our experiments. However, mixed tabulation is 40%
faster than MurmurHash3, and it has the proven guarantee of good performance on
all possible input.
Comment: A preliminary version of this paper will appear at NIPS 2017.
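For reference, a sketch of the multiply-mod-prime baseline compared against above, instantiated with the Mersenne prime p = 2^61 - 1 (a common choice; the specific constants here are illustrative).

```python
# Multiply-mod-prime: h(x) = ((a*x + b) mod p) mod m is 2-independent
# over [p] (the final mod m adds only slight non-uniformity).
import random

P = (1 << 61) - 1              # Mersenne prime; assumes keys x < P
a = random.randrange(1, P)
b = random.randrange(P)

def multiply_mod_prime(x, m):
    return ((a * x + b) % P) % m

print(multiply_mod_prime(0xDEADBEEF, 1 << 20))
```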
Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates
Many datasets such as market basket data, text or hypertext documents, and
sensor observations recorded in different locations or time periods, are
modeled as a collection of sets over a ground set of keys. We are interested in
basic aggregates such as the weight or selectivity of keys that satisfy some
selection predicate defined over keys' attributes and membership in particular
sets. This general formulation includes basic aggregates such as the Jaccard
coefficient, Hamming distance, and association rules.
On massive data sets, exact computation can be inefficient or infeasible.
Sketches based on coordinated random samples are classic summaries that support
approximate query processing.
Queries are resolved by generating a sketch (sample) of the union of sets
used in the predicate from the sketches of these sets, and then applying an
estimator to this union-sketch.
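As a minimal sketch of this query pattern with coordinated bottom-k samples (a baseline only; the tighter estimators derived here additionally use the discarded keys, i.e., those in the union of the per-set sketches that miss the cut for the union sketch).

```python
# Union-sketch query resolution for coordinated bottom-k samples: all
# sets are sampled with one shared hash function h.
def union_sketch(sketch_a, sketch_b, h, k):
    # S_k(A union B) is computable from the two per-set sketches alone.
    return set(sorted(sketch_a | sketch_b, key=h)[:k])

def selectivity_estimate(sketch_union, k, predicate):
    # Baseline estimate: fraction of the sampled union keys satisfying
    # the selection predicate.
    return sum(1 for key in sketch_union if predicate(key)) / k
```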
We derive novel tighter (unbiased) estimators that leverage sampled keys that
are present in the union of applicable sketches but excluded from the union
sketch. We establish analytically that our estimators dominate estimators
applied to the union-sketch for all queries and data sets. Empirical
evaluation on synthetic and real data reveals that on typical applications we
can expect a 25% to 4-fold reduction in estimation error.
Comment: 16 pages.
Automated Reconstruction of Dendritic and Axonal Trees by Global Optimization with Geometric Priors
We present a novel probabilistic approach to fully automated delineation of tree structures in noisy 2D images and 3D image stacks. Unlike earlier methods that rely mostly on local evidence, ours builds a set of candidate trees over many different subsets of points likely to belong to the optimal tree and then chooses the best one according to a global objective function that combines image evidence with geometric priors. Since the best tree does not necessarily span all the points, the algorithm is able to eliminate false detections while retaining the correct tree topology. Manually annotated brightfield micrographs, retinal scans and the DIADEM challenge datasets are used to evaluate the performance of our method. We used the DIADEM metric to quantitatively evaluate the topological accuracy of the reconstructions and showed that the use of the geometric regularization yields a substantial improvement.