11 research outputs found
An Improved Interactive Streaming Algorithm for the Distinct Elements Problem
The exact computation of the number of distinct elements (frequency moment
) is a fundamental problem in the study of data streaming algorithms. We
denote the length of the stream by where each symbol is drawn from a
universe of size . While it is well known that the moments can
be approximated by efficient streaming algorithms, it is easy to see that exact
computation of requires space . In previous work, Cormode
et al. therefore considered a model where the data stream is also processed by
a powerful helper, who provides an interactive proof of the result. They gave
such protocols with a polylogarithmic number of rounds of communication between
helper and verifier for all functions in NC. This number of rounds
can quickly make such
protocols impractical.
Cormode et al. also gave a protocol with rounds for the exact
computation of where the space complexity is but the total communication . They managed to give round protocols with
complexity for many other interesting problems
including , Inner product, and Range-sum, but computing exactly with
polylogarithmic space and communication and rounds remained open.
In this work, we give a streaming interactive protocol with rounds
for exact computation of using bits of space and the communication is . The update
time of the verifier per symbol received is .Comment: Submitted to ICALP 201
Semi-Streaming Algorithms for Annotated Graph Streams
Considerable effort has been devoted to the development of streaming
algorithms for analyzing massive graphs. Unfortunately, many results have been
negative, establishing that a wide variety of problems require
space to solve. One of the few bright spots has been the development of
semi-streaming algorithms for a handful of graph problems -- these algorithms
use space .
In the annotated data streaming model of Chakrabarti et al., a
computationally limited client wants to compute some property of a massive
input, but lacks the resources to store even a small fraction of the input, and
hence cannot perform the desired computation locally. The client therefore
accesses a powerful but untrusted service provider, who not only performs the
requested computation, but also proves that the answer is correct.
We put forth the notion of semi-streaming algorithms for annotated graph
streams (semi-streaming annotation schemes for short). These are protocols in
which both the client's space usage and the length of the proof are . We give evidence that semi-streaming annotation schemes
represent a substantially more robust solution concept than does the standard
semi-streaming model. On the positive side, we give semi-streaming annotation
schemes for two dynamic graph problems that are intractable in the standard
model: (exactly) counting triangles, and (exactly) computing maximum matchings.
The former scheme answers a question of Cormode. On the negative side, we
identify for the first time two natural graph problems (connectivity and
bipartiteness in a certain edge update model) that can be solved in the
standard semi-streaming model, but cannot be solved by annotation schemes of
"sub-semi-streaming" cost. That is, these problems are just as hard in the
annotations model as they are in the standard model.Comment: This update includes some additional discussion of the results
proven. The result on counting triangles was previously included in an ECCC
technical report by Chakrabarti et al. available at
http://eccc.hpi-web.de/report/2013/180/. That report has been superseded by
this manuscript, and the CCC 2015 paper "Verifiable Stream Computation and
Arthur-Merlin Communication" by Chakrabarti et a
Streaming Verification of Graph Properties
Streaming interactive proofs (SIPs) are a framework for outsourced
computation. A computationally limited streaming client (the verifier) hands
over a large data set to an untrusted server (the prover) in the cloud and the
two parties run a protocol to confirm the correctness of result with high
probability. SIPs are particularly interesting for problems that are hard to
solve (or even approximate) well in a streaming setting. The most notable of
these problems is finding maximum matchings, which has received intense
interest in recent years but has strong lower bounds even for constant factor
approximations.
In this paper, we present efficient streaming interactive proofs that can
verify maximum matchings exactly. Our results cover all flavors of matchings
(bipartite/non-bipartite and weighted). In addition, we also present streaming
verifiers for approximate metric TSP. In particular, these are the first
efficient results for weighted matchings and for metric TSP in any streaming
verification model.Comment: 26 pages, 2 figure, 1 tabl
Annotations for Sparse Data Streams
Motivated by cloud computing, a number of recent works have studied annotated
data streams and variants thereof. In this setting, a computationally weak
verifier (cloud user), lacking the resources to store and manipulate his
massive input locally, accesses a powerful but untrusted prover (cloud
service). The verifier must work within the restrictive data streaming
paradigm. The prover, who can annotate the data stream as it is read, must not
just supply the answer but also convince the verifier of its correctness.
Ideally, both the amount of annotation and the space used by the verifier
should be sublinear in the relevant input size parameters.
A rich theory of such algorithms -- which we call schemes -- has emerged.
Prior work has shown how to leverage the prover's power to efficiently solve
problems that have no non-trivial standard data stream algorithms. However,
while optimal schemes are now known for several basic problems, such optimality
holds only for streams whose length is commensurate with the size of the data
universe. In contrast, many real-world datasets are relatively sparse,
including graphs that contain only O(n^2) edges, and IP traffic streams that
contain much fewer than the total number of possible IP addresses, 2^128 in
IPv6.
We design the first schemes that allow both the annotation and the space
usage to be sublinear in the total number of stream updates rather than the
size of the data universe. We solve significant problems, including variations
of INDEX, SET-DISJOINTNESS, and FREQUENCY-MOMENTS, plus several natural
problems on graphs. On the other hand, we give a new lower bound that, for the
first time, rules out smooth tradeoffs between annotation and space usage for a
specific problem. Our technique brings out new nuances in Merlin-Arthur
communication complexity models, and provides a separation between online
versions of the MA and AMA models.Comment: 29 pages, 5 table
Doctor of Philosophy
dissertationThe contributions of this dissertation are centered around designing new algorithms in the general area of sublinear algorithms such as streaming, core sets and sublinear verification, with a special interest in problems arising from data analysis including data summarization, clustering, matrix problems and massive graphs. In the first part, we focus on summaries and coresets, which are among the main techniques for designing sublinear algorithms for massive data sets. We initiate the study of coresets for uncertain data and study coresets for various types of range counting queries on uncertain data. We focus mainly on the indecisive model of locational uncertainty since it comes up frequently in real-world applications when multiple readings of the same object are made. In this model, each uncertain point has a probability density describing its location, defined as distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by examining only this subset. For each type of query we provide coreset constructions with approximation-size trade-offs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based techniques on axis-aligned range queries. In the second part, we focus on designing sublinear-space algorithms for approximate computations on massive graphs. In particular, we consider graph MAXCUT and correlation clustering problems and develop sampling based approaches to construct truly sublinear () sized coresets for graphs that have polynomial (i.e., for any ) average degree. Our technique is based on analyzing properties of random induced subprograms of the linear program formulations of the problems. We demonstrate this technique with two examples. Firstly, we present a sublinear sized core set to approximate the value of the MAX CUT in a graph to a factor. To the best of our knowledge, all the known methods in this regime rely crucially on near-regularity assumptions. Secondly, we apply the same framework to construct a sublinear-sized coreset for correlation clustering. Our coreset construction also suggests 2-pass streaming algorithms for computing the MAX CUT and correlation clustering objective values which are left as future work at the time of writing this dissertation. Finally, we focus on streaming verification algorithms as another model for designing sublinear algorithms. We give the first polylog space and sublinear (in number of edges) communication protocols for any streaming verification problems in graphs. We present efficient streaming interactive proofs that can verify maximum matching exactly. Our results cover all flavors of matchings (bipartite/ nonbipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP and exact triangle counting, as well as for graph primitives such as the number of connected components, bipartiteness, minimum spanning tree and connectivity. In particular, these are the first results for weighted matchings and for metric TSP in any streaming verification model. Our streaming verifiers use only polylogarithmic space while exchanging only polylogarithmic communication with the prover in addition to the output size of the relevant solution. We also initiate a study of streaming interactive proofs (SIPs) for problems in data analysis and present efficient SIPs for some fundamental problems. We present protocols for clustering and shape fitting including minimum enclosing ball (MEB), width of a point set, -centers and -slab problem. We also present protocols for fundamental matrix analysis problems: We provide an improved protocol for rectangular matrix problems, which in turn can be used to verify (approximate) eigenvectors of an integer matrix . In general our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space
Communication-Efficient Probabilistic Algorithms: Selection, Sampling, and Checking
Diese Dissertation behandelt drei grundlegende Klassen von Problemen in Big-Data-Systemen, fĂŒr die wir kommunikationseffiziente probabilistische Algorithmen entwickeln. Im ersten Teil betrachten wir verschiedene Selektionsprobleme, im zweiten Teil das Ziehen gewichteter Stichproben (Weighted Sampling) und im dritten Teil die probabilistische KorrektheitsprĂŒfung von Basisoperationen in Big-Data-Frameworks (Checking). Diese Arbeit ist durch einen wachsenden Bedarf an Kommunikationseffizienz motiviert, der daher rĂŒhrt, dass der auf das Netzwerk und seine Nutzung zurĂŒckzufĂŒhrende Anteil sowohl der Anschaffungskosten als auch des Energieverbrauchs von Supercomputern und der Laufzeit verteilter Anwendungen immer weiter wĂ€chst. Ăberraschend wenige kommunikationseffiziente Algorithmen sind fĂŒr grundlegende Big-Data-Probleme bekannt. In dieser Arbeit schlieĂen wir einige dieser LĂŒcken.
ZunĂ€chst betrachten wir verschiedene Selektionsprobleme, beginnend mit der verteilten Version des klassischen Selektionsproblems, d. h. dem Auffinden des Elements von Rang in einer groĂen verteilten Eingabe. Wir zeigen, wie dieses Problem kommunikationseffizient gelöst werden kann, ohne anzunehmen, dass die Elemente der Eingabe zufĂ€llig verteilt seien. Hierzu ersetzen wir die Methode zur Pivotwahl in einem schon lange bekannten Algorithmus und zeigen, dass dies hinreichend ist. AnschlieĂend zeigen wir, dass die Selektion aus lokal sortierten Folgen â multisequence selection â wesentlich schneller lösbar ist, wenn der genaue Rang des Ausgabeelements in einem gewissen Bereich variieren darf. Dies benutzen wir anschlieĂend, um eine verteilte PrioritĂ€tswarteschlange mit Bulk-Operationen zu konstruieren. SpĂ€ter werden wir diese verwenden, um gewichtete Stichproben aus Datenströmen zu ziehen (Reservoir Sampling). SchlieĂlich betrachten wir das Problem, die global hĂ€ufigsten Objekte sowie die, deren zugehörige Werte die gröĂten Summen ergeben, mit einem stichprobenbasierten Ansatz zu identifizieren.
Im Kapitel ĂŒber gewichtete Stichproben werden zunĂ€chst neue Konstruktionsalgorithmen fĂŒr eine klassische Datenstruktur fĂŒr dieses Problem, sogenannte Alias-Tabellen, vorgestellt. Zu Beginn stellen wir den ersten Linearzeit-Konstruktionsalgorithmus fĂŒr diese Datenstruktur vor, der mit konstant viel Zusatzspeicher auskommt. AnschlieĂend parallelisieren wir diesen Algorithmus fĂŒr Shared Memory und erhalten so den ersten parallelen Konstruktionsalgorithmus fĂŒr Aliastabellen. Hiernach zeigen wir, wie das Problem fĂŒr verteilte Systeme mit einem zweistufigen Algorithmus angegangen werden kann. AnschlieĂend stellen wir einen ausgabesensitiven Algorithmus fĂŒr gewichtete Stichproben mit ZurĂŒcklegen vor. Ausgabesensitiv bedeutet, dass die Laufzeit des Algorithmus sich auf die Anzahl der eindeutigen Elemente in der Ausgabe bezieht und nicht auf die GröĂe der Stichprobe. Dieser Algorithmus kann sowohl sequentiell als auch auf Shared-Memory-Maschinen und verteilten Systemen eingesetzt werden und ist der erste derartige Algorithmus in allen drei Kategorien. Wir passen ihn anschlieĂend an das Ziehen gewichteter Stichproben ohne ZurĂŒcklegen an, indem wir ihn mit einem SchĂ€tzer fĂŒr die Anzahl der eindeutigen Elemente in einer Stichprobe mit ZurĂŒcklegen kombinieren. Poisson-Sampling, eine Verallgemeinerung des Bernoulli-Sampling auf gewichtete Elemente, kann auf ganzzahlige Sortierung zurĂŒckgefĂŒhrt werden, und wir zeigen, wie ein bestehender Ansatz parallelisiert werden kann. FĂŒr das Sampling aus Datenströmen passen wir einen sequentiellen Algorithmus an und zeigen, wie er in einem Mini-Batch-Modell unter Verwendung unserer im Selektionskapitel eingefĂŒhrten Bulk-PrioritĂ€tswarteschlange parallelisiert werden kann. Das Kapitel endet mit einer ausfĂŒhrlichen Evaluierung unserer Aliastabellen-Konstruktionsalgorithmen, unseres ausgabesensitiven Algorithmus fĂŒr gewichtete Stichproben mit ZurĂŒcklegen und unseres Algorithmus fĂŒr gewichtetes Reservoir-Sampling.
Um die Korrektheit verteilter Algorithmen probabilistisch zu verifizieren, schlagen wir Checker fĂŒr grundlegende Operationen von Big-Data-Frameworks vor. Wir zeigen, dass die ĂberprĂŒfung zahlreicher Operationen auf zwei âKernâ-Checker reduziert werden kann, nĂ€mlich die PrĂŒfung von Aggregationen und ob eine Folge eine Permutation einer anderen Folge ist. WĂ€hrend mehrere AnsĂ€tze fĂŒr letzteres Problem seit geraumer Zeit bekannt sind und sich auch einfach parallelisieren lassen, ist unser Summenaggregations-Checker eine neuartige Anwendung der gleichen Datenstruktur, die auch zĂ€hlenden Bloom-Filtern und dem Count-Min-Sketch zugrunde liegt. Wir haben beide Checker in Thrill, einem Big-Data-Framework, implementiert. Experimente mit absichtlich herbeigefĂŒhrten Fehlern bestĂ€tigen die von unserer theoretischen Analyse vorhergesagte Erkennungsgenauigkeit. Dies gilt selbst dann, wenn wir hĂ€ufig verwendete schnelle Hash-Funktionen mit in der Theorie suboptimalen Eigenschaften verwenden. Skalierungsexperimente auf einem Supercomputer zeigen, dass unsere Checker nur sehr geringen Laufzeit-Overhead haben, welcher im Bereich von liegt und dabei die Korrektheit des Ergebnisses nahezu garantiert wird