Additive Spanners: A Simple Construction
We consider additive spanners of unweighted undirected graphs. Let $G$ be a graph and $H$ a subgraph of $G$. The most na\"ive way to construct an additive $k$-spanner of $G$ is the following: As long as $H$ is not an additive $k$-spanner repeat: Find a pair $(u,v)$ that violates the spanner-condition and a shortest path from $u$ to $v$ in $G$. Add the edges of this path to $H$.

We show that, with a very simple initial graph $H$, this na\"ive method gives additive $2$- and $6$-spanners of sizes matching the best known upper bounds. For additive $2$-spanners we start with $H = \emptyset$ and end with $O(n^{3/2})$ edges in the spanner. For additive $6$-spanners we start with $H$ containing $n^{1/3}$ arbitrary edges incident to each node and end with a spanner of size $O(n^{4/3})$.

Comment: To appear at proceedings of the 14th Scandinavian Symposium and Workshop on Algorithm Theory (SWAT 2014).
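The repeat loop above is easy to sketch; a minimal Python illustration (a toy rendering, not the paper's tuned construction), assuming the graph is given as an adjacency-list dict, with BFS supplying shortest paths:

```python
from collections import deque

def bfs_dist(adj, s):
    """Single-source distances by BFS; unreachable nodes are absent."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def bfs_path(adj, s, t):
    """One shortest s-t path by BFS with parent pointers."""
    parent = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    path, u = [], t
    while u is not None:
        path.append(u)
        u = parent[u]
    return path[::-1]

def naive_additive_spanner(adj, k):
    """Repeat: while some pair violates the additive-k condition,
    add a shortest G-path between the pair to the subgraph H."""
    nodes = list(adj)
    H = {u: set() for u in nodes}
    while True:
        violated = None
        for u in nodes:
            dg, dh = bfs_dist(adj, u), bfs_dist(H, u)
            for v, d in dg.items():
                if dh.get(v, float("inf")) > d + k:
                    violated = (u, v)
                    break
            if violated:
                break
        if not violated:
            return H
        p = bfs_path(adj, *violated)
        for a, b in zip(p, p[1:]):
            H[a].add(b)
            H[b].add(a)
```

With `k = 2` and the empty initial subgraph this is the first variant described above; the additive $6$-spanner variant would seed the starting subgraph differently.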
Linear Hashing is Awesome
We consider the hash function $h(x) = ((ax+b) \bmod p) \bmod n$ where $a, b$ are chosen uniformly at random from $\{0, 1, \ldots, p-1\}$. We prove that when we use $h(x)$ in hashing with chaining to insert $n$ elements into a table of size $n$ the expected length of the longest chain is $O(n^{1/3})$. The proof also generalises to give the same bound when we use the multiply-shift hash function by Dietzfelbinger et al. [Journal of Algorithms 1997].

Comment: A preliminary version appeared at FOCS'16.
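A quick simulation of hashing with chaining under this scheme (a sketch; the prime $p = 2^{31}-1$ is an arbitrary illustrative choice):

```python
import random

def linear_hash(p, n):
    """Return h(x) = ((a*x + b) mod p) mod n with a, b uniform in [p]."""
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % n

def longest_chain(keys, n, p=2**31 - 1):
    """Insert keys into n chains using one random linear hash function
    and report the length of the longest chain."""
    h = linear_hash(p, n)
    chains = [0] * n
    for x in keys:
        chains[h(x)] += 1
    return max(chains)
```

`longest_chain(range(1024), 1024)` reports the longest chain for one random draw of $a, b$.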
The Entropy of Backwards Analysis
Backwards analysis, first popularized by Seidel, is often the simplest and most elegant way of analyzing a randomized algorithm. It applies to incremental algorithms where elements are added incrementally, following some random permutation, e.g., incremental Delaunay triangulation of a point set, where points are added one by one, and where we always maintain the Delaunay triangulation of the points added thus far. For backwards analysis, we think of the permutation as generated backwards, implying that the $i$-th point in the permutation is picked uniformly at random from the $i$ points not picked yet in the backwards direction. Backwards analysis has also been applied elegantly by Chan to the randomized linear time minimum spanning tree algorithm of Karger, Klein, and Tarjan.
The question considered in this paper is how much randomness we need in order to trust the expected bounds obtained using backwards analysis, exactly and approximately. For the exact case, it turns out that a random permutation works if and only if it is minwise, that is, for any given subset, each element has the same chance of being first. Minwise permutations are known to have $\Theta(n)$ entropy, and this is then also what we need for exact backwards analysis.
However, when it comes to approximation, the two concepts diverge dramatically. To get backwards analysis to hold within a factor $1+\varepsilon$, the random permutation needs entropy $\Omega(n \log(1/\varepsilon))$. This contrasts with minwise permutations, where it is known that a $(1+\varepsilon)$-approximation only needs $\Theta(\log(n/\varepsilon))$ entropy. Our negative result for backwards analysis essentially shows that it is as abstract as any analysis based on full randomness.
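The minwise condition is easy to probe empirically; a small sketch (illustration only) that estimates, under uniformly random permutations, how often each element of a fixed subset comes first:

```python
import random
from collections import Counter

def first_of_subset(perm, subset):
    """Return the element of `subset` appearing earliest in `perm`."""
    pos = {x: i for i, x in enumerate(perm)}
    return min(subset, key=lambda x: pos[x])

def minwise_frequencies(n, subset, trials=20000, seed=1):
    """Estimate, over uniformly random permutations of range(n),
    the frequency with which each element of `subset` comes first."""
    rng = random.Random(seed)
    counts = Counter()
    items = list(range(n))
    for _ in range(trials):
        rng.shuffle(items)
        counts[first_of_subset(items, subset)] += 1
    return {x: counts[x] / trials for x in subset}
```

For a minwise permutation each frequency is close to $1/|S|$ for a subset $S$; a permutation family that fails this test cannot support exact backwards analysis.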
Finding Even Cycles Faster via Capped k-Walks
In this paper, we consider the problem of finding a cycle of length $2k$ (a $C_{2k}$) in an undirected graph $G$ with $n$ nodes and $m$ edges for constant $k \ge 2$. A classic result by Bondy and Simonovits [J.Comb.Th.'74] implies that if $m \ge 100k \cdot n^{1+1/k}$, then $G$ contains a $C_{2k}$, further implying that one needs to consider only graphs with $m = O(n^{1+1/k})$.

Previously the best known algorithms were an $O(n^2)$ algorithm due to Yuster and Zwick [J.Disc.Math'97] as well as an $O(m^{2-(1+\lceil k/2 \rceil^{-1})/(k+1)})$ algorithm by Alon et al. [Algorithmica'97].
We present an algorithm that uses $O(m^{2k/(k+1)})$ time and finds a $C_{2k}$ if one exists. This bound is exactly $O(n^2)$ when $m = \Theta(n^{1+1/k})$. For $4$-cycles our new bound coincides with Alon et al., while for every $k > 2$ our bound yields a polynomial improvement in $m$.

Yuster and Zwick noted that it is "plausible to conjecture that $O(n^2)$ is the best possible bound in terms of $n$". We show "conditional optimality": if this hypothesis holds then our algorithm is tight as well.
Furthermore, a folklore reduction implies that no combinatorial algorithm can determine if a graph contains a $6$-cycle in time $O(m^{3/2-\varepsilon})$ for any $\varepsilon > 0$ under the widely believed combinatorial BMM conjecture. Coupled with our main result, this gives tight bounds for finding $6$-cycles combinatorially and also separates the complexity of finding $4$- and $6$-cycles, giving evidence that the exponent of $m$ in the running time should indeed increase with $k$.

The key ingredient in our algorithm is a new notion of capped $k$-walks, which are walks of length $k$ that visit only nodes below a certain cap according to a fixed ordering. Our main technical contribution is an involved analysis proving several properties of such walks which may be of independent interest.

Comment: To appear at STOC'17.
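The capped-walk machinery is beyond an abstract, but the flavour of even-cycle detection can be seen in the classic neighbour-pair scan for $C_4$ (a standard folklore sketch, not the algorithm of this paper):

```python
def find_c4(adj):
    """Find a 4-cycle by scanning pairs of neighbours: if two distinct
    nodes both see the same pair (x, y) of neighbours, then x and y
    have two common neighbours and lie on a C4.
    Returns the cycle as a list of 4 nodes, or None."""
    seen = {}  # (x, y) -> a common neighbour already recorded
    for v in adj:
        nbrs = sorted(adj[v])
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                key = (nbrs[i], nbrs[j])
                if key in seen and seen[key] != v:
                    # cycle: nbrs[i] - seen[key] - nbrs[j] - v - nbrs[i]
                    return [nbrs[i], seen[key], nbrs[j], v]
                seen[key] = v
    return None
```

This scan stops after each pair is seen twice, which is the source of the well-known $O(\min(n^2, m^{3/2}))$-style bounds for $C_4$; the paper's capped $k$-walks push this idea to general $C_{2k}$.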
Fast Similarity Sketching
We consider the Similarity Sketching problem: Given a universe $[u] = \{0, \ldots, u-1\}$ we want a random function $S$ mapping subsets $A \subseteq [u]$ into vectors $S(A)$ of size $t$, such that similarity is preserved. More precisely: Given sets $A, B \subseteq [u]$, define $X_i = [S(A)[i] = S(B)[i]]$ and $X = \sum_{i \in [t]} X_i$. We want to have $E[X] = t \cdot J(A,B)$, where $J(A,B) = |A \cap B| / |A \cup B|$, and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for $X$. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called sketches.

The seminal MinHash algorithm uses $t$ random hash functions $h_1, \ldots, h_t$ and stores $(\min_{a \in A} h_1(a), \ldots, \min_{a \in A} h_t(a))$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t \cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH), which creates a sketch of size $t$ in $O(t + |A|)$ time, but with the drawback that possibly some of the entries are "empty" when $|A| = O(t)$. One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these "densification" schemes fail to provide good concentration bounds exactly in the case $|A| = O(t)$, where they are needed. (continued...)
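The seminal MinHash scheme described above can be sketched as follows (an illustration with random linear hash functions modulo a Mersenne prime; not the faster sketch this paper develops):

```python
import random

P = 2**61 - 1  # Mersenne prime, far above the keys used here

def minhash_sketch(A, t, seed=0):
    """Coordinate i of the sketch is min over a in A of h_i(a),
    with h_i(x) = (a_i * x + b_i) mod P drawn from the seed."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
    return [min((a * x + b) % P for x in A) for a, b in coeffs]

def estimate_jaccard(sA, sB):
    """Fraction of agreeing coordinates estimates J(A, B)."""
    return sum(x == y for x, y in zip(sA, sB)) / len(sA)
```

Using the same seed for both sets makes the $t$ hash functions shared, so `estimate_jaccard(minhash_sketch(A, t), minhash_sketch(B, t))` concentrates around $J(A,B)$ as $t$ grows; the $O(t \cdot |A|)$ cost of the inner loop is exactly the drawback discussed above.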
Practical Hash Functions for Similarity Estimation and Dimensionality Reduction
Hashing is a basic tool for dimensionality reduction employed in several
aspects of machine learning. However, the performance analysis is often carried
out under the abstract assumption that a truly random unit cost hash function
is used, without concern for which concrete hash function is employed. The
concrete hash function may work fine on sufficiently random input. The question
is if it can be trusted in the real world when faced with more structured
input.
In this paper we focus on two prominent applications of hashing, namely
similarity estimation with the one permutation hashing (OPH) scheme of Li et
al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of
which have found numerous applications, e.g. in approximate near-neighbour
search with LSH and large-scale classification with SVM.
We consider mixed tabulation hashing of Dahlgaard et al. [FOCS'15] which was
proved to perform like a truly random hash function in many applications,
including OPH. Here we first show improved concentration bounds for FH with
truly random hashing and then argue that mixed tabulation performs similarly for
sparse input. Our main contribution, however, is an experimental comparison of
different hashing schemes when used inside FH, OPH, and LSH.
We find that mixed tabulation hashing is almost as fast as the multiply-mod-prime scheme $(ax+b) \bmod p$. Multiply-mod-prime is guaranteed to work well on sufficiently random data, but we demonstrate that in the above applications, it can lead to bias and poor concentration on both real-world and synthetic data. We also compare with the popular MurmurHash3, which has no proven guarantees. Mixed tabulation and MurmurHash3 both perform similarly to truly random hashing in our experiments. However, mixed tabulation is 40% faster than MurmurHash3, and it has the proven guarantee of good performance on all possible input.

Comment: A preliminary version of this paper will appear at NIPS 2017.
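For concreteness, feature hashing maps a sparse vector to $d$ dimensions via an index hash and an independent sign hash; a minimal sketch (using `crc32` with two salts as a deterministic stand-in for the hash functions compared in the paper):

```python
import zlib

def feature_hash(features, d):
    """Map a sparse vector {feature_name: value} to a dense
    d-dimensional vector: each feature lands in bucket idx with a
    random sign, so inner products are preserved in expectation."""
    out = [0.0] * d
    for name, value in features.items():
        key = name.encode()
        idx = zlib.crc32(b"i" + key) % d          # index hash
        sign = 1.0 if zlib.crc32(b"s" + key) % 2 == 0 else -1.0  # sign hash
        out[idx] += sign * value
    return out
```

The quality of the index and sign hashes is precisely what the experiments above vary: structured input can defeat a weak hash choice even though the FH reduction itself is sound.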
Sublinear Distance Labeling
A distance labeling scheme labels the $n$ nodes of a graph with binary strings such that, given the labels of any two nodes, one can determine the distance in the graph between the two nodes by looking only at the labels. A $D$-preserving distance labeling scheme only returns precise distances between pairs of nodes that are at distance at least $D$ from each other. In this paper we consider distance labeling schemes for the classical case of unweighted graphs with both directed and undirected edges.

We present an $O(\frac{n}{D}\log^2 D)$ bit $D$-preserving distance labeling scheme, improving the previous bound by Bollob\'as et al. [SIAM J. Discrete Math. 2005]. We also give an almost matching lower bound of $\Omega(\frac{n}{D})$. With our $D$-preserving distance labeling scheme as a building block, we additionally achieve the following results:
1. We present the first distance labeling scheme of size $o(n)$ for sparse graphs (and hence bounded degree graphs). This addresses an open problem by Gavoille et al. [J. Algo. 2004], hereby separating the complexity from distance labeling in general graphs which require $\Theta(n)$ bits, Moon [Proc. of Glasgow Math. Association 1965].
2. For approximate $r$-additive labeling schemes, that return distances within an additive error of $r$, we show a scheme of size $o(\frac{n}{r})$ for $r \ge 2$. This improves on the current best bound of $O(\frac{n}{r})$ by Alstrup et al. [SODA 2016] for sub-polynomial $r$, and is a generalization of a result by Gawrychowski et al. [arXiv preprint 2015] who showed this for $r = 2$.

Comment: A preliminary version of this paper appeared at ESA'16.
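As a baseline for the interface, the trivial exact scheme stores each node's full distance vector, giving $\Theta(n \log n)$-bit labels (a toy sketch assuming nodes are numbered $0, \ldots, n-1$; the schemes above compress far below this):

```python
from collections import deque

def bfs_dist(adj, s):
    """Single-source BFS distances in an unweighted graph."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def make_labels(adj):
    """Label(u) = (u, full distance vector from u)."""
    nodes = sorted(adj)
    return {u: (u, [bfs_dist(adj, u).get(v) for v in nodes]) for u in nodes}

def decode(label_u, label_v):
    """Recover dist(u, v) from the two labels alone, as required."""
    u, dist_u = label_u
    v, _ = label_v
    return dist_u[v]
```

The point of the paper is how much of this $\Theta(n \log n)$-bit vector can be thrown away while still decoding exact distances (for far pairs) or $r$-additive approximations from labels alone.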
Power of Choices with Simple Tabulation
Suppose that we are to place $m$ balls into $n$ bins sequentially using the $d$-choice paradigm: For each ball we are given a choice of $d$ bins, according to $d$ hash functions $h_1, \ldots, h_d$, and we place the ball in the least loaded of these bins, breaking ties arbitrarily. Our interest is in the number of balls in the fullest bin after all balls have been placed.

Azar et al. [STOC'94] proved that when $m = O(n)$ and when the hash functions are fully random the maximum load is at most $\frac{\lg \lg n}{\lg d} + O(1)$ whp (i.e. with probability $1 - O(n^{-\gamma})$ for any choice of $\gamma$).
In this paper we suppose that the $h_1, \ldots, h_d$ are simple tabulation hash functions. Generalising a result by Dahlgaard et al. [SODA'16] we show that for an arbitrary constant $\gamma$ the maximum load is $O(\lg \lg n)$ whp, and that the expected maximum load is at most $\frac{\lg \lg n}{\lg d} + O(1)$. We further show that by using a simple tie-breaking algorithm introduced by V\"ocking [J.ACM'03] the expected maximum load drops to $\frac{\lg \lg n}{d \lg \varphi_d} + O(1)$, where $\varphi_d$ is the rate of growth of the $d$-ary Fibonacci numbers. Both of these expected bounds match those of the fully random setting.
The analysis by Dahlgaard et al. relies on a proof by P\u{a}tra\c{s}cu and Thorup [J.ACM'11] concerning the use of simple tabulation for cuckoo hashing. We need here a generalisation to $d$ hash functions, but the original proof is an 8-page tour de force of ad-hoc arguments that do not appear to generalise. Our main technical contribution is a shorter, simpler and more accessible proof of the result by P\u{a}tra\c{s}cu and Thorup, where the relevant parts generalise nicely to the analysis of $d$ choices.

Comment: Accepted at ICALP 2018.
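The $d$-choice process is simple to simulate; the sketch below uses fully random choices as a stand-in for the simple tabulation hash functions analysed in the paper:

```python
import random

def d_choice_max_load(m, n, d, seed=0):
    """Place m balls into n bins; each ball goes to the least loaded
    of d uniformly random candidate bins, ties broken arbitrarily.
    Returns the maximum load after all balls are placed."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(m):
        candidates = [rng.randrange(n) for _ in range(d)]
        best = min(candidates, key=lambda b: load[b])
        load[best] += 1
    return max(load)
```

Comparing `d=1` with `d=2` for, say, $m = n = 10^4$ exhibits the familiar drop from the single-choice maximum load of roughly $\frac{\ln n}{\ln \ln n}$ down to the doubly logarithmic regime.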