Approximately Minwise Independence with Twisted Tabulation
A random hash function $h$ is $\varepsilon$-minwise if for any set $S$, $|S| = n$,
and element $x \in S$, $\Pr[h(x) = \min h(S)] = (1 \pm \varepsilon)/n$.
Minwise hash functions with low bias $\varepsilon$ have widespread applications
within similarity estimation.
Hashing from a universe $[u]$, the twisted tabulation hashing of
P\v{a}tra\c{s}cu and Thorup [SODA'13] makes $c = O(1)$ lookups in tables of size
$u^{1/c}$. Twisted tabulation was invented to get good concentration for
hashing-based sampling. Here we show that twisted tabulation yields $\tilde{O}(1/u^{1/c})$-minwise hashing.
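As a rough, self-contained illustration of the table-lookup structure involved, here is a sketch of plain simple tabulation hashing in Python; the twisted variant analysed in the paper adds one extra lookup that "twists" the last character, and all parameters below (such as $c = 4$ characters of 8 bits) are illustrative assumptions, not values from the paper.

    import random

    # Sketch of simple tabulation hashing: a 32-bit key is split into c = 4
    # characters of 8 bits, each character indexes its own table of truly
    # random words, and the hash is the XOR of the c looked-up values.
    # Twisted tabulation additionally "twists" the last character using a
    # value looked up from the other characters; that step is omitted here.
    C, CHAR_BITS, MASK = 4, 8, 0xFF
    tables = [[random.getrandbits(32) for _ in range(256)] for _ in range(C)]

    def simple_tabulation(x):
        h = 0
        for i in range(C):
            h ^= tables[i][(x >> (i * CHAR_BITS)) & MASK]
        return h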
In the classic independence paradigm of Wegman and Carter [FOCS'79], $\varepsilon$-minwise hashing requires $\Omega(\log(1/\varepsilon))$-independence [Indyk
SODA'99]. P\v{a}tra\c{s}cu and Thorup [STOC'11] had shown that simple
tabulation, using the same space and lookups, yields $\tilde{O}(1/n^{1/c})$-minwise
independence, which is good for large sets, but useless for small sets. Our
analysis uses some of the same methods, but is much cleaner, bypassing a
complicated induction argument.
Comment: To appear in Proceedings of SWAT 201
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window
The past decade has witnessed many interesting algorithms for maintaining
statistics over a data stream. This paper initiates a theoretical study of
algorithms for monitoring distributed data streams over a time-based sliding
window (which contains a variable number of items and possibly out-of-order
items). The concern is how to minimize the communication between individual
streams and the root, while allowing the root, at any time, to be able to
report the global statistics of all streams within a given error bound. This
paper presents communication-efficient algorithms for three classical
statistics, namely, basic counting, frequent items and quantiles. The
worst-case communication cost over a window is $O(\frac{k}{\varepsilon} \log \frac{\varepsilon N}{k})$ bits for basic counting and $O(\frac{k}{\varepsilon} \log \frac{N}{k})$ words for the remaining two statistics, where $k$ is the number of distributed
data streams, $N$ is the total number of items in the streams that arrive or
expire in the window, and $\varepsilon < 1$ is the desired error bound. Matching
and nearly matching lower bounds are also obtained.
Comment: 12 pages, to appear in the 27th International Symposium on
Theoretical Aspects of Computer Science (STACS), 201
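To give a feel for how such communication savings arise, here is a much-simplified sketch of the general idea for basic counting only (an assumption-laden illustration, not the paper's algorithm, and it ignores sliding-window expiry): each stream reports to the root only when its local count has grown by a relative factor since its last report.

    # Hypothetical sketch: a site forwards its local count to the root only
    # when it has grown by a (1 + eps) factor since the last report, so the
    # root always knows every local count up to a relative error of eps.
    # Sliding-window expiry and the other two statistics are not modelled.
    class Root:
        def __init__(self):
            self.reported = {}

        def update(self, site_id, count):
            self.reported[site_id] = count

        def global_count(self):
            return sum(self.reported.values())

    class Site:
        def __init__(self, site_id, eps, root):
            self.site_id, self.eps, self.root = site_id, eps, root
            self.count, self.last_reported = 0, 0

        def arrival(self):
            self.count += 1
            if self.count > (1 + self.eps) * self.last_reported:
                self.root.update(self.site_id, self.count)   # one message
                self.last_reported = self.count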
Interval Selection in the Streaming Model
A set of intervals is independent when the intervals are pairwise disjoint.
In the interval selection problem we are given a set $I$ of intervals
and we want to find an independent subset of intervals of largest cardinality.
Let $\alpha(I)$ denote the cardinality of an optimal solution. We
discuss the estimation of $\alpha(I)$ in the streaming model, where we
only have one-time, sequential access to the input intervals, the endpoints of
the intervals lie in $\{1, \dots, n\}$, and the amount of memory is
constrained.
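For comparison with the streaming setting, the optimum $\alpha(I)$ is attained offline by the classical greedy that scans intervals by right endpoint; the following small Python sketch (not one of the paper's algorithms) makes the problem concrete, but it needs all intervals in memory at once.

    # Classical offline greedy for interval selection: sort by right endpoint
    # and repeatedly take the next interval disjoint from the last one chosen.
    # It returns an optimal independent subset, but uses linear memory.
    def interval_selection(intervals):
        chosen, last_right = [], float("-inf")
        for left, right in sorted(intervals, key=lambda iv: iv[1]):
            if left > last_right:            # disjoint from all chosen so far
                chosen.append((left, right))
                last_right = right
        return chosen

    # Example: the optimum here has cardinality 2.
    print(interval_selection([(1, 4), (2, 3), (5, 8), (6, 7)]))  # [(2, 3), (6, 7)]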
For intervals of different sizes, we provide an algorithm in the data stream
model that computes an estimate $\hat{\alpha}$ of $\alpha(I)$ that, with
probability at least $2/3$, satisfies $\frac{1}{2}(1 - \varepsilon)\alpha(I) \le \hat{\alpha} \le \alpha(I)$. For same-length
intervals, we provide another algorithm in the data stream model that computes
an estimate $\hat{\alpha}$ of $\alpha(I)$ that, with probability at
least $2/3$, satisfies $\frac{2}{3}(1 - \varepsilon)\alpha(I) \le \hat{\alpha} \le \alpha(I)$. The space used by our algorithms is bounded
by a polynomial in $\varepsilon^{-1}$ and $\log n$. We also show that no better
estimations can be achieved using $o(n)$ bits of storage.
We also develop new, approximate solutions to the interval selection problem,
where we want to report a feasible solution, that use $O(\alpha(I))$
space. Our algorithms for the interval selection problem match the optimal
results by Emek, Halld{\'o}rsson and Ros{\'e}n [Space-Constrained Interval
Selection, ICALP 2012], but are much simpler.
Comment: Minor correction
Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting
of the k elements that are smallest according to a given hash function h. With
this sample we can estimate the relative size f=|Y|/|X| of any subset Y as
|S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard
similarity f=|A intersect B|/|A union B| between sets A and B. Given the
bottom-k samples from A and B, we construct the bottom-k sample of their union
as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is
estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k.
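The estimator described above is short enough to state directly in code; here is a minimal Python sketch, with the hash function h left as a parameter since the point of the paper is how little independence h needs.

    # Bottom-k sample: the k elements of a set that are smallest under h.
    def bottom_k(elements, h, k):
        return set(sorted(elements, key=h)[:k])

    # Jaccard estimate from the two bottom-k samples, exactly as above:
    # |S_k(S_k(A) union S_k(B)) intersect S_k(A) intersect S_k(B)| / k.
    def jaccard_estimate(sample_a, sample_b, h, k):
        union_sample = bottom_k(sample_a | sample_b, h, k)
        return len(union_sample & sample_a & sample_b) / k

Here sample_a and sample_b would be bottom_k(A, h, k) and bottom_k(B, h, k), computed once per set and then combined.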
We show here that even if the hash function is only 2-independent, the
expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a
constant factor of the expected relative error with truly random hashing.
For comparison, consider the classic approach of kxmin-wise where we use k
independent hash functions h_1,...,h_k, storing the smallest element with each
hash function. For kxmin-wise there is at least a constant bias with constant
independence, and it is not reduced with larger k. Recently Feigenblat et al.
showed that bottom-k circumvents the bias if the hash function is 8-independent
and k is sufficiently large. We get down to 2-independence for any k. Our
result is based on a simple union bound, transferring generic concentration
bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger
probability error bounds with higher independence.
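For concreteness, here is a textbook 2-independent family to which this result applies (a sketch over integer keys; the prime modulus below is an arbitrary illustrative choice).

    import random

    # Pairwise (2-)independent hash family h(x) = (a*x + b) mod p over integer
    # keys smaller than the prime p, with a and b chosen at random per hash
    # function. By the result above, such an h already keeps the bottom-k
    # estimator's expected relative error at O(1/sqrt(fk)).
    P = (1 << 61) - 1          # a Mersenne prime, assumed larger than any key
    a, b = random.randrange(1, P), random.randrange(P)

    def h(x):
        return (a * x + b) % P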
For weighted sets, we consider priority sampling which adapts efficiently to
the concrete input weights, e.g., benefiting strongly from heavy-tailed input.
This time, the analysis is much more involved, but again we show that generic
concentration bounds can be applied.
Comment: A short version appeared at STOC'1
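For reference, here is a short Python sketch of priority sampling in its standard formulation (Duffield, Lund and Thorup); the paper's contribution concerns what happens when the truly random u_i below are replaced by values from a weak hash function, which this sketch does not model.

    import random

    # Priority sampling: item i with weight w_i gets priority q_i = w_i / u_i,
    # u_i uniform in (0, 1]. Keep the k highest-priority items; the (k+1)-st
    # priority is the threshold tau, and each sampled item estimates its own
    # weight as max(w_i, tau). Summing these over a subset's sampled items
    # gives an unbiased estimate of that subset's total weight.
    def priority_sample(weights, k):
        prioritized = sorted(((w / (1.0 - random.random()), i, w)
                              for i, w in enumerate(weights)), reverse=True)
        tau = prioritized[k][0] if len(prioritized) > k else 0.0
        return {i: max(w, tau) for _, i, w in prioritized[:k]}, tau

    estimates, tau = priority_sample([5.0, 1.0, 0.1, 0.1, 7.0], k=3)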