Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting
of the k elements that are smallest according to a given hash function h. With
this sample we can estimate the relative size f=|Y|/|X| of any subset Y as
|S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard
similarity f=|A intersect B|/|A union B| between sets A and B. Given the
bottom-k samples from A and B, we construct the bottom-k sample of their union
as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is
estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k.
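To make the procedure concrete, here is a minimal Python sketch of the
estimator, using a standard 2-independent family h(x) = (a*x + b) mod p as
the hash function; the constants and example sets are illustrative, not from
the paper.

    import random

    # A 2-independent family: h(x) = (a*x + b) mod p over a prime field.
    P = (1 << 61) - 1  # Mersenne prime, larger than any key below

    def make_hash():
        a, b = random.randrange(1, P), random.randrange(P)
        return lambda x: (a * x + b) % P

    def bottom_k(xs, h, k):
        # S_k(X): the k elements of X with the smallest hash values.
        return set(sorted(xs, key=h)[:k])

    def jaccard_estimate(A, B, h, k):
        SA, SB = bottom_k(A, h, k), bottom_k(B, h, k)
        S_union = bottom_k(SA | SB, h, k)      # = S_k(A union B)
        return len(S_union & SA & SB) / k

    random.seed(1)
    h, k = make_hash(), 64
    A, B = set(range(0, 1500)), set(range(500, 2000))
    print(jaccard_estimate(A, B, h, k))        # true Jaccard = 1000/2000 = 0.5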
We show here that even if the hash function is only 2-independent, the
expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a
constant factor of the expected relative error with truly random hashing.
For comparison, consider the classic approach of k×min-wise where we use k
independent hash functions h_1,...,h_k, storing the smallest element with each
hash function. For k×min-wise there is at least a constant bias with constant
independence, and the bias is not reduced with larger k. Recently Feigenblat et al.
showed that bottom-k circumvents the bias if the hash function is 8-independent
and k is sufficiently large. We get down to 2-independence for any k. Our
result is based on a simple union bound, transferring generic concentration
bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger
probability error bounds with higher independence.
For weighted sets, we consider priority sampling which adapts efficiently to
the concrete input weights, e.g., benefiting strongly from heavy-tailed input.
This time, the analysis is much more involved, but again we show that generic
concentration bounds can be applied.
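Priority sampling is only named above, so as a reminder of the scheme: each
item i of weight w_i draws a uniform u_i in (0,1] and gets priority
q_i = w_i/u_i; the sample is the k highest-priority items, and with tau the
(k+1)-st highest priority, each sampled item gets the unbiased weight estimate
max(w_i, tau). A minimal Python sketch, using Python's RNG where the paper
would use a 2-independent hash function:

    import random

    def priority_sample(weights, k):
        # Priorities q_i = w_i / u_i, u_i uniform in (0,1]
        # (the tiny lower bound avoids division by zero).
        qs = [(w / random.uniform(1e-12, 1.0), i, w)
              for i, w in enumerate(weights)]
        qs.sort(reverse=True)
        if len(qs) <= k:
            return {i: w for _, i, w in qs}    # small input: keep everything
        tau = qs[k][0]                         # (k+1)-st highest priority
        # Unbiased estimate per sampled item; a subset sum is estimated
        # by summing these estimates over the sampled items in the subset.
        return {i: max(w, tau) for _, i, w in qs[:k]}

    random.seed(7)
    weights = [1.0] * 990 + [1000.0] * 10      # heavy-tailed input
    est = priority_sample(weights, 50)
    print(sum(est.values()))                   # close to the true total 10990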
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of c characters and we have precomputed character tables
h_0, ..., h_{c-1} mapping characters to random hash values. A key
x=(x_0, ..., x_{c-1}) is hashed to h_0[x_0] xor ... xor h_{c-1}[x_{c-1}].
This scheme is very fast with character tables in cache. While simple
tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
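For concreteness, a minimal Python sketch of simple tabulation on 32-bit keys
viewed as c=4 byte characters (the sizes are illustrative choices, not
prescribed by the survey):

    import random

    C, ALPHABET = 4, 256   # 32-bit keys as c=4 characters of 8 bits each

    random.seed(42)
    # One table per character position, filled with random hash values.
    TABLES = [[random.getrandbits(32) for _ in range(ALPHABET)]
              for _ in range(C)]

    def simple_tabulation(x):
        # Hash of a 32-bit key: xor of one table lookup per character.
        h = 0
        for i in range(C):
            h ^= TABLES[i][(x >> (8 * i)) & 0xFF]
        return h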
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
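A hedged sketch of the twist: for all but one character, the tables store a
pair of a random twist character and a random hash value; the accumulated
twist is xor'ed into the remaining character before its own lookup. Sizes and
details here are illustrative; see the paper for the exact scheme.

    import random

    C, ALPHABET = 4, 256
    random.seed(43)

    # For the first c-1 characters, each entry is a (twist, hash) pair.
    HEAD = [[(random.getrandbits(8), random.getrandbits(32))
             for _ in range(ALPHABET)] for _ in range(C - 1)]
    TAIL = [random.getrandbits(32) for _ in range(ALPHABET)]

    def twisted_tabulation(x):
        # The last character is xor-twisted before its table lookup.
        twist, h = 0, 0
        for i in range(C - 1):
            t, v = HEAD[i][(x >> (8 * i)) & 0xFF]
            twist ^= t
            h ^= v
        return h ^ TAIL[((x >> (8 * (C - 1))) & 0xFF) ^ twist]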
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not.
Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence
Simple tabulation dates back to Zobrist in 1970. Keys are viewed as c
characters from some alphabet A. We initialize c tables h_0, ..., h_{c-1}
mapping characters to random hash values. A key x=(x_0, ..., x_{c-1}) is hashed
to h_0[x_0] xor...xor h_{c-1}[x_{c-1}]. The scheme is extremely fast when the
character hash tables h_i are in cache. Simple tabulation hashing is not
4-independent, but we show that if we apply it twice, then we get high
independence. First we hash to intermediate keys that are 6 times longer than
the original keys, and then we hash the intermediate keys to the final hash
values.
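In code, the composition is direct. A sketch with 8-bit characters, c=2 and
d=6c=12, where each first-level table entry is a whole intermediate key of d
characters (sizes again illustrative):

    import random

    C, D, ALPHABET = 2, 12, 256    # d = 6c intermediate characters
    random.seed(44)

    def random_table(width_bits):
        return [random.getrandbits(width_bits) for _ in range(ALPHABET)]

    # First level: each entry is a full intermediate key (8*d bits).
    FIRST = [random_table(8 * D) for _ in range(C)]
    # Second level: ordinary simple tabulation on the d intermediate chars.
    SECOND = [random_table(32) for _ in range(D)]

    def double_tabulation(x):
        y = 0
        for i in range(C):             # simple tabulation into d characters
            y ^= FIRST[i][(x >> (8 * i)) & 0xFF]
        h = 0
        for j in range(D):             # simple tabulation of the long key y
            h ^= SECOND[j][(y >> (8 * j)) & 0xFF]
        return h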
The intermediate keys have d=6c characters from A. We can view the hash
function as a degree d bipartite graph with keys on one side, each with edges
to d output characters. We show that this graph has nice expansion properties,
and from that we get that with another level of simple tabulation on the
intermediate keys, the composition is a highly independent hash function. The
independence we get is |A|^{Omega(1/c)}.
Our space is O(c|A|) and the hash function is evaluated in O(c) time. Siegel
[FOCS'89, SICOMP'04] proved that with this space, if the hash function is
evaluated in o(c) time, then the independence can only be o(c), so our
evaluation time is best possible for Omega(c) independence---our independence
is much higher if c=|A|^{o(1)}.
Siegel used O(c)^c evaluation time to get the same independence with similar
space. Siegel's main focus was c=O(1), but we are exponentially faster when
c=omega(1).
Applying our scheme recursively, we can increase our independence to
|A|^{Omega(1)} with o(c^{log c}) evaluation time. Compared with Siegel's
scheme, this is both faster and achieves higher independence.
Our scheme is easy to implement, and it does provide realistic
implementations of 100-independent hashing for, say, 32-bit and 64-bit keys.
Dynamic Integer Sets with Optimal Rank, Select, and Predecessor Search
We present a data structure representing a dynamic set S of w-bit integers on
a w-bit word RAM. With |S|=n and w > log n and space O(n), we support the
following standard operations in O(log n / log w) time:
- insert(x) sets S = S + {x}.
- delete(x) sets S = S - {x}.
- predecessor(x) returns max{y in S | y < x}.
- successor(x) returns min{y in S | y >= x}.
- rank(x) returns #{y in S | y < x}.
- select(i) returns y in S with rank(y) = i, if any.
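To pin down these semantics, here is a plain Python reference implementation
on a sorted list; it matches the interface above but costs O(n) per update,
unlike the paper's structure.

    from bisect import bisect_left, insort

    class SortedSet:
        # Reference semantics only, not the fast data structure.
        def __init__(self):
            self.a = []                       # sorted list of keys
        def insert(self, x):
            i = bisect_left(self.a, x)
            if i == len(self.a) or self.a[i] != x:
                insort(self.a, x)
        def delete(self, x):
            i = bisect_left(self.a, x)
            if i < len(self.a) and self.a[i] == x:
                self.a.pop(i)
        def predecessor(self, x):             # max{y in S | y < x}
            i = bisect_left(self.a, x)
            return self.a[i - 1] if i > 0 else None
        def successor(self, x):               # min{y in S | y >= x}
            i = bisect_left(self.a, x)
            return self.a[i] if i < len(self.a) else None
        def rank(self, x):                    # #{y in S | y < x}
            return bisect_left(self.a, x)
        def select(self, i):                  # y with rank(y) = i, if any
            return self.a[i] if 0 <= i < len(self.a) else None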
Our O(log n/log w) bound is optimal for dynamic rank and select, matching a
lower bound of Fredman and Saks [STOC'89]. When the word length is large, our
time bound is also optimal for dynamic predecessor, matching a static lower
bound of Beame and Fich [STOC'99] whenever log n/log w=O(log w/loglog w).
Technically, the most interesting aspect of our data structure is that it
supports all the above operations in constant time for sets of size n=w^{O(1)}.
This resolves a main open problem of Ajtai, Komlos, and Fredman [FOCS'83].
Ajtai et al. presented such a data structure in Yao's abstract cell-probe model
with w-bit cells/words, but pointed out that the functions used could not be
implemented. As a partial solution to the problem, Fredman and Willard
[STOC'90] introduced a fusion node that could handle queries in constant time,
but used polynomial time on the updates. We call our small set data structure a
dynamic fusion node as it does both queries and updates in constant time.
Comment: Presented with different formatting in Proceedings of the 55th IEEE
Symposium on Foundations of Computer Science (FOCS), 2014, pp. 166--175. The
new version fixes a bug in one of the bounds stated for predecessor search,
pointed out to me by Djamal Belazzougui.
Dynamic Ordered Sets with Exponential Search Trees
We introduce exponential search trees as a novel technique for converting
static polynomial space search structures for ordered sets into fully-dynamic
linear space data structures.
This leads to an optimal bound of O(sqrt(log n/loglog n)) for searching and
updating a dynamic set of n integer keys in linear space. Here searching an
integer y means finding the maximum key in the set which is smaller than or
equal to y. This problem is equivalent to the standard text book problem of
maintaining an ordered set (see, e.g., Cormen, Leiserson, Rivest, and Stein:
Introduction to Algorithms, 2nd ed., MIT Press, 2001).
The best previous deterministic linear space bound was O(log n/loglog n) due
to Fredman and Willard from STOC 1990. No better deterministic search bound was
known using polynomial space.
We also get the following worst-case linear space trade-offs between the
number n, the word length w, and the maximal key U < 2^w: O(min{loglog n+log
n/log w, (loglog n)(loglog U)/(logloglog U)}). These trade-offs are, however,
not likely to be optimal.
Our results are generalized to finger searching and string searching,
providing optimal results for both in terms of n.
Comment: Revision corrects some typos and states things better for
applications in a subsequent paper.
Integer priority queues with decrease key in constant time and the single source shortest paths problem
We consider Fibonacci heap style integer priority queues supporting find-min,
insert, and decrease key operations in constant time. We present a
deterministic linear space solution that with n integer keys supports delete
in O(loglog n) time. If the integers are in the range [0,N), we can also
support delete in O(loglog N) time.
Even for the special case of monotone priority queues, where the minimum has
to be non-decreasing, the best previous bounds on delete were
O((log n)^{1/(3-ε)}) and O((log N)^{1/(4-ε)}). These previous bounds used both
randomization and amortization. Our new bounds are deterministic, worst-case,
with no restriction to monotonicity, and exponentially faster.
As a classical application, for a directed graph with n nodes and m edges
with non-negative integer weights, we get single source shortest paths in
O(m + n loglog n) time, or O(m + n loglog C) if C is the maximal edge weight.
The latter solves an open problem of Ahuja, Mehlhorn, Orlin, and Tarjan from
1990.
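The application plugs such a queue into Dijkstra's algorithm. Below is a
sketch against the stated interface, with a naive dict-backed queue standing
in for the paper's structure (its delete-min is O(n), not O(loglog n); all
names are illustrative):

    import math

    class NaivePQ:
        # Stand-in for the paper's queue: insert and decrease-key are O(1),
        # but delete-min is O(n) here instead of O(loglog n).
        def __init__(self):
            self.key = {}
        def insert(self, v, k):
            self.key[v] = k
        def decrease_key(self, v, k):
            if k < self.key[v]:
                self.key[v] = k
        def delete_min(self):
            v = min(self.key, key=self.key.get)
            return v, self.key.pop(v)
        def __bool__(self):
            return bool(self.key)

    def dijkstra(adj, s):
        # adj: {u: [(v, w), ...]} with non-negative integer weights w.
        dist = {u: math.inf for u in adj}
        dist[s] = 0
        pq = NaivePQ()
        for u in adj:
            pq.insert(u, dist[u])
        while pq:
            u, d = pq.delete_min()
            for v, w in adj[u]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    pq.decrease_key(v, dist[v])
        return dist

    print(dijkstra({0: [(1, 2), (2, 5)], 1: [(2, 1)], 2: []}, 0))
    # {0: 0, 1: 2, 2: 3}

With the paper's queue in place of NaivePQ, the loop runs in O(m + n loglog n)
total, or O(m + n loglog C) in terms of the maximal edge weight C.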
Finding the Maximum Subset with Bounded Convex Curvature
We describe an algorithm for solving an important geometric problem arising in computer-aided manufacturing. When machining a pocket in a solid piece of material such as steel using a rough tool in a milling machine, sharp convex corners of the pocket cannot be done properly, but have to be left for finer tools that are more expensive to use. We want to determine a tool path that maximizes the use of the rough tool. Mathematically, this boils down to the following problem. Given a simply-connected set of points P in the plane such that the boundary of P is a curvilinear polygon consisting of n line segments and circular arcs of arbitrary radii, compute the maximum subset Q of P consisting of simply-connected sets where the boundary of each set is a curve with bounded convex curvature. A closed curve has bounded convex curvature if, when traversed in counterclockwise direction, it turns to the left with curvature at most 1. There is no bound on the curvature where it turns to the right. The difference in the requirement to left- and right-curvature is a natural consequence of different conditions when machining convex and concave areas of the pocket. We devise an algorithm to compute the unique maximum such set Q. The algorithm runs in O(n log n) time and uses O(n) space.
For the correctness of our algorithm, we prove a new generalization of the Pestov-Ionin Theorem. This is needed to show that the output Q of our algorithm is indeed maximum in the sense that if Q' is any subset of P with a boundary of bounded convex curvature, then Q' is a subset of Q.