Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation
We introduce and study a new data sketch for processing massive datasets. It
addresses two common problems: 1) computing a sum given arbitrary filter
conditions and 2) identifying the frequent items or heavy hitters in a data
set. For the former, the sketch provides unbiased estimates with state of the
art accuracy. It handles the challenging scenario when the data is
disaggregated so that computing the per unit metric of interest requires an
expensive aggregation. For example, the metric of interest may be total clicks
per user while the raw data is a click stream with multiple rows per user. Thus
the sketch is suitable for use in a wide range of applications including
computing historical click through rates for ad prediction, reporting user
metrics from event streams, and measuring network traffic for IP flows.
We prove and empirically show the sketch has good properties for both the
disaggregated subset sum estimation and frequent item problems. On i.i.d. data,
it not only picks out the frequent items but gives strongly consistent
estimates for the proportion of each frequent item. The resulting sketch
asymptotically draws a probability proportional to size sample that is optimal
for estimating sums over the data. For non i.i.d. data, we show that it
typically does much better than random sampling for the frequent item problem
and never does worse. For subset sum estimation, we show that even for
pathological sequences, the variance is close to that of an optimal sampling
design. Empirically, despite the disadvantage of operating on disaggregated
data, our method matches or bests priority sampling, a state of the art method
for pre-aggregated data and performs orders of magnitude better on skewed data
compared to uniform sampling. We propose extensions to the sketch that allow it
to be used in combining multiple data sets, in distributed systems, and for
time-decayed aggregation.
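To make the two problems concrete, here is a minimal exact baseline rather than the sketch described in the abstract. It performs the full per-user aggregation that a sketch is meant to avoid. The row layout `(user, clicks)` and the function names `aggregate`, `subset_sum`, and `frequent_items` are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def aggregate(rows):
    """Collapse disaggregated (user, clicks) rows into per-user totals."""
    totals = defaultdict(int)
    for user, clicks in rows:
        totals[user] += clicks
    return dict(totals)

def subset_sum(rows, keep):
    """Exact sum of the per-user metric over users that pass the filter."""
    return sum(v for u, v in aggregate(rows).items() if keep(u))

def frequent_items(rows, min_share):
    """Users whose share of the grand total is at least min_share."""
    totals = aggregate(rows)
    grand_total = sum(totals.values())
    return sorted(u for u, v in totals.items() if v >= min_share * grand_total)

# Hypothetical click stream with several rows per user.
stream = [("alice", 1), ("bob", 1), ("alice", 1),
          ("carol", 1), ("alice", 1), ("bob", 1)]
```

Both functions need memory proportional to the number of distinct users, which is the cost a fixed-size sketch trades for estimation error.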
An O(n^3)-Time Algorithm for Tree Edit Distance
The edit distance between two ordered trees with vertex labels is the
minimum cost of transforming one tree into the other by a sequence of
elementary operations consisting of deleting and relabeling existing nodes, as
well as inserting new nodes. In this paper, we present a worst-case
O(n^3)-time algorithm for this problem, improving the previous best
O(n^3 log n)-time algorithm [Klein]. Our result requires a novel
adaptive strategy for deciding how a dynamic program divides into subproblems
(which is interesting in its own right), together with a deeper understanding
of the previous algorithms for the problem. We also prove the optimality of our
algorithm among the family of decomposition strategy algorithms, which
also includes the previous fastest algorithms, by tightening the known lower
bound of Ω(n^2 log^2 n) [Touzet] to Ω(n^3), matching our
algorithm's running time. Furthermore, we obtain matching upper and lower
bounds of Θ(n m^2 (1 + log(n/m))) when the two trees have
different sizes m and n, where m < n.
Comment: 10 pages, 5 figures, 5 .tex files where TED.tex is the main one
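The definition of edit distance above can be sketched with the textbook forest recursion on rightmost roots, memoized. This is not the O(n^3) algorithm of the paper and makes no attempt at its decomposition strategy. It assumes unit costs for delete, insert, and relabel, and a hypothetical tree encoding `(label, (child, child, ...))` using nested tuples.

```python
from functools import lru_cache

def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return size(g)          # insert every node of g
    if not g:
        return size(f)          # delete every node of f
    (v_label, v_children) = f[-1]
    (w_label, w_children) = g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(f[:-1], g[:-1])
             + forest_distance(v_children, w_children)
             + (v_label != w_label))
    return min(delete, insert, match)

def tree_edit_distance(t1, t2):
    """Edit distance between two trees given as (label, children) tuples."""
    return forest_distance((t1,), (t2,))
```

The memoized recursion can visit far more subforest pairs than the algorithms discussed in the abstract, so it is only suitable for small trees.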
Reconstructing David Huffman's Origami Tessellations
David A. Huffman (1925–1999) is best known in computer science for his work in information theory, particularly Huffman codes, and best known in origami as a pioneer of curved-crease folding. But during his early paper folding in the 1970s, he also designed and folded over 100 different straight-crease origami tessellations. Unlike most origami tessellations designed in the past 20 years, Huffman's straight-crease tessellations are mostly three-dimensional, rigidly foldable, and have no locking mechanism. In collaboration with Huffman's family, our goal is to document all of his designs by reverse-engineering his models into the corresponding crease patterns, or in some cases, matching his models with his sketches of crease patterns. Here, we describe several of Huffman's origami tessellations that are most interesting historically, mathematically, and artistically.
Funding: National Science Foundation (U.S.) (Origami Design for Integration of Self-assembling Systems for Engineering Innovation Grant EFRI-1240383); National Science Foundation (U.S.) (Expedition Grant CCF-1138967)
On the Structure of Equilibria in Basic Network Formation
We study network connection games where the nodes of a network perform edge
swaps in order to improve their communication costs. For the model proposed by
Alon et al. (2010), in which the selfish cost of a node is the sum of all
shortest path distances to the other nodes, we use the probabilistic method to
provide a new, structural characterization of equilibrium graphs. We show how
to use this characterization in order to prove upper bounds on the diameter of
equilibrium graphs in terms of the size of the largest k-vicinity (defined as
the set of vertices within distance k from a vertex), for any k,
and in terms of the number of edges, thus settling positively a conjecture of
Alon et al. in the cases of graphs of large k-vicinity size (including graphs
of large maximum degree) and of graphs which are dense enough.
Next, we present a new swap-based network creation game, in which selfish
costs depend on the immediate neighborhood of each node; in particular, the
profit of a node is defined as the sum of the degrees of its neighbors. We
prove that, in contrast to the previous model, this network creation game
admits an exact potential, and also that any equilibrium graph contains an
induced star. The existence of the potential function is exploited in order to
show that an equilibrium can be reached in expected polynomial time even in the
case where nodes can only acquire limited knowledge concerning non-neighboring
nodes.
Comment: 11 pages, 4 figures
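The sum-cost swap model from the first part of the abstract can be illustrated with a small brute-force equilibrium check. This is only a sketch: the graph encoding (a dict mapping each node to a set of neighbours) and the names `sum_cost`, `improving_swap`, and `is_swap_equilibrium` are assumptions made for illustration. A swap here means that node u replaces one incident edge (u, v) with an edge (u, w) to a current non-neighbour w.

```python
from collections import deque

def sum_cost(adj, source):
    """Sum of shortest-path distances from source (inf if disconnected)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(adj):
        return float("inf")
    return sum(dist.values())

def improving_swap(adj, u):
    """Return some (v, w) such that swapping edge (u, v) for (u, w)
    strictly lowers u's cost, or None if no such swap exists."""
    base = sum_cost(adj, u)
    for v in sorted(adj[u]):
        for w in sorted(adj):
            if w == u or w in adj[u]:
                continue
            # Copy the graph and apply the swap.
            swapped = {x: set(nbrs) for x, nbrs in adj.items()}
            swapped[u].remove(v)
            swapped[v].remove(u)
            swapped[u].add(w)
            swapped[w].add(u)
            if sum_cost(swapped, u) < base:
                return (v, w)
    return None

def is_swap_equilibrium(adj):
    """True if no node has an improving swap."""
    return all(improving_swap(adj, u) is None for u in adj)

# Two small hypothetical graphs: a path and a star on four nodes.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
```

On the path, an endpoint can shorten its distances by reattaching to an inner node, so the path is not an equilibrium. In the star no node benefits from any swap.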
Beyond Worst-Case Analysis for Joins with Minesweeper
We describe a new algorithm, Minesweeper, that is able to satisfy stronger
runtime guarantees than previous join algorithms (colloquially, "beyond
worst-case guarantees") for data in indexed search trees. Our first
contribution is developing a framework to measure this stronger notion of
complexity, which we call certificate complexity, that extends notions of
Barbay et al. and Demaine et al.; a certificate is a set of propositional
formulae that certifies that the output is correct. This notion captures a
natural class of join algorithms. In addition, the certificate allows us to
define a strictly stronger notion of runtime complexity than traditional
worst-case guarantees. Our second contribution is to develop a dichotomy
theorem for the certificate-based notion of complexity. Roughly, we show that
Minesweeper evaluates β-acyclic queries in time linear in the certificate
plus the output size, while for any β-cyclic query there is some instance
that takes superlinear time in the certificate (and for which the output is no
larger than the certificate size). We also extend our certificate-complexity
analysis to queries with bounded treewidth and the triangle query.
Comment: This is the full version of our PODS'2014 paper.
A PTAS for planar group Steiner tree via spanner bootstrapping and prize collecting
We present the first polynomial-time approximation scheme (PTAS), i.e., (1 + ϵ)-approximation algorithm for any constant ϵ > 0, for the planar group Steiner tree problem (in which each group lies on a boundary of a face). This result improves on the best previous approximation factor of O(log n (log log n)^O(1)). We achieve this result via a novel and powerful technique called spanner bootstrapping, which allows one to bootstrap from a superconstant approximation factor (even superpolynomial in the input size) all the way down to a PTAS. This is in contrast with the popular existing approach for planar PTASs of constructing lightweight spanners in one iteration, which notably requires a constant-factor approximate solution to start from. Spanner bootstrapping removes one of the main barriers for designing PTASs for problems which have no known constant-factor approximation (even on planar graphs), and thus can be used to obtain PTASs for several difficult-to-approximate problems. Our second major contribution required for the planar group Steiner tree PTAS is a spanner construction, which reduces the graph to have total weight within a factor of the optimal solution while approximately preserving the optimal solution. This is particularly challenging because group Steiner tree requires deciding which terminal in each group to connect by the tree, making it much harder than recent previous approaches to construct spanners for planar TSP by Klein [SIAM J. Computing 2008], subset TSP by Klein [STOC 2006], Steiner tree by Borradaile, Klein, and Mathieu [ACM Trans. Algorithms 2009], and Steiner forest by Bateni, Hajiaghayi, and Marx [J. ACM 2011] (and its improvement to an efficient PTAS by Eisenstat, Klein, and Mathieu [SODA 2012]).
The main conceptual contribution here is realizing that selecting which terminals may be relevant is essentially a complicated prize-collecting process: we have to carefully weigh the cost and benefits of reaching or avoiding certain terminals in the spanner. Via a sequence of involved prize-collecting procedures, we can construct a spanner that reaches a set of terminals that is sufficient for an almost-optimal solution. Our PTAS for planar group Steiner tree implies the first PTAS for geometric Euclidean group Steiner tree with obstacles, as well as a (2 + ϵ)-approximation algorithm for group TSP with obstacles, improving over the best previous constant-factor approximation algorithms. By contrast, we show that planar group Steiner forest, a slight generalization of planar group Steiner tree, is APX-hard on planar graphs of treewidth 3, even if the groups are pairwise disjoint and every group is a vertex or an edge.
Network Creation Games: Think Global - Act Local
We investigate a non-cooperative game-theoretic model for the formation of
communication networks by selfish agents. Each agent aims for a central
position at minimum cost for creating edges. In particular, the general model
(Fabrikant et al., PODC'03) became popular for studying the structure of the
Internet or social networks. Despite its significance, locality in this game
was first studied only recently (Bilò et al., SPAA'14), where a worst case
locality model was presented, which came with a high efficiency loss in terms
of quality of equilibria. Our main contribution is a new and more optimistic
view on locality: agents are limited in their knowledge and actions to their
local view ranges, but can probe different strategies and finally choose the
best. We study the influence of our locality notion on the hardness of
computing best responses, convergence to equilibria, and quality of equilibria.
Moreover, we compare the strength of local versus non-local strategy-changes.
Our results address the gap between the original model and the worst case
locality variant. On the bright side, our efficiency results are in line with
observations from the original model, yet we have a non-constant lower bound on
the price of anarchy.
Comment: An extended abstract of this paper has been accepted for publication
in the proceedings of the 40th International Conference on Mathematical
Foundations of Computer Science
Reflections on Tiles (in Self-Assembly)
We define the Reflexive Tile Assembly Model (RTAM), which is obtained from
the abstract Tile Assembly Model (aTAM) by allowing tiles to reflect across
their horizontal and/or vertical axes. We show that the class of directed
temperature-1 RTAM systems is not computationally universal, which is
conjectured but unproven for the aTAM, and like the aTAM, the RTAM is
computationally universal at temperature 2. We then show that at temperature 1,
when starting from a single tile seed, the RTAM is capable of assembling n x n
squares for n odd using only n tile types, but incapable of assembling n x n
squares for n even. Moreover, we show that n is a lower bound on the number of
tile types needed to assemble n x n squares for n odd in the temperature-1
RTAM. The conjectured lower bound for temperature-1 aTAM systems is 2n-1.
Finally, we give preliminary results toward the classification of which finite
connected shapes in Z^2 can be assembled (strictly or weakly) by a singly
seeded (i.e. seed of size 1) RTAM system, including a complete classification
of which finite connected shapes can be strictly assembled by a "mismatch-free"
singly seeded RTAM system.
Comment: New results which classify the types of shapes which can
self-assemble in the RTAM have been added
Augmenting graphs to minimize the diameter
We study the problem of augmenting a weighted graph by inserting edges of
bounded total cost while minimizing the diameter of the augmented graph. Our
main result is an FPT 4-approximation algorithm for the problem.
Comment: 15 pages, 3 figures
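The problem statement can be made concrete with a brute-force sketch for tiny instances. This is exhaustive search over candidate edge sets, not the FPT 4-approximation from the paper. It assumes nodes are numbered 0..n-1, existing edges are `(u, v, length)` triples, and each candidate edge is a `(u, v, length, cost)` tuple. These encodings and the function names are illustrative only.

```python
from itertools import combinations

def diameter(n, edges):
    """Weighted diameter via Floyd-Warshall; edges are (u, v, length)."""
    inf = float("inf")
    d = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
        d[v][u] = min(d[v][u], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return max(max(row) for row in d)

def best_augmentation(n, edges, candidates, budget):
    """Try every subset of candidate edges whose total cost fits the budget
    and return (smallest diameter found, chosen candidates)."""
    best = (diameter(n, edges), ())
    for r in range(1, len(candidates) + 1):
        for chosen in combinations(candidates, r):
            if sum(c[3] for c in chosen) > budget:
                continue
            added = [(u, v, w) for u, v, w, _ in chosen]
            diam = diameter(n, edges + added)
            if diam < best[0]:
                best = (diam, chosen)
    return best

# Hypothetical instance: a unit-length path on four nodes.
path_edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
shortcuts = [(0, 3, 1, 1), (0, 2, 1, 1)]
```

The search is exponential in the number of candidate edges, which is why approximation algorithms such as the one in the abstract are of interest.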
The Power of Duples (in Self-Assembly): It's Not So Hip To Be Square
In this paper we define the Dupled abstract Tile Assembly Model (DaTAM),
which is a slight extension to the abstract Tile Assembly Model (aTAM) that
allows for not only the standard square tiles, but also "duple" tiles which are
rectangles pre-formed by the joining of two square tiles. We show that the
addition of duples allows for powerful behaviors of self-assembling systems at
temperature 1, meaning systems which exclude the requirement of cooperative
binding by tiles (i.e., the requirement that a tile must be able to bind to at
least 2 tiles in an existing assembly if it is to attach). Cooperative binding
is conjectured to be required in the standard aTAM for Turing universal
computation and the efficient self-assembly of shapes, but we show that in the
DaTAM these behaviors can in fact be exhibited at temperature 1. We then show
that the DaTAM doesn't provide asymptotic improvements over the aTAM in its
ability to efficiently build thin rectangles. Finally, we present a series of
results which prove that the temperature-2 aTAM and temperature-1 DaTAM have
mutually exclusive powers. That is, each is able to self-assemble shapes that
the other can't, and each has systems which cannot be simulated by the other.
Beyond being of purely theoretical interest, these results have practical
motivation as duples have already proven to be useful in laboratory
implementations of DNA-based tiles.