Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks
We show that randomization can lead to significant improvements for a few
fundamental problems in distributed tracking. Our basis is the
count-tracking problem, where there are k players, each holding a counter
that gets incremented over time, and the goal is to track an
ε-approximation of their sum continuously at all times,
using minimum communication. While the deterministic communication complexity
of the problem is Θ(k/ε · log N), where N is the final value of the sum
when the tracking finishes, we show that with randomization, the
communication cost can be reduced to Θ(√k/ε · log N). Our
algorithm is simple and uses only O(1) space at each player, while the lower
bound holds even assuming each player has infinite computing power. We then
extend our techniques to two related distributed tracking problems,
frequency-tracking and rank-tracking, and obtain similar improvements
over previous deterministic algorithms. Both problems are of central importance
in large data monitoring and analysis, and have been extensively studied in the
literature. (Comment: 19 pages, 1 figure)
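As a toy illustration of why randomization helps here (a simulation sketch of the unbiased-estimation idea only, not the paper's actual Θ(√k/ε · log N) protocol; the probability p below is an arbitrary choice for the demo):

```python
import random

def simulate_count_tracking(increments_per_player, p, seed=0):
    """Toy sketch: each player reports each increment to the coordinator
    independently with probability p; the coordinator estimates the global
    count as (#messages) / p, an unbiased estimator of the true sum."""
    rng = random.Random(seed)
    messages = 0
    true_total = 0
    for increments in increments_per_player:
        for _ in range(increments):
            true_total += 1
            if rng.random() < p:      # player pings the coordinator
                messages += 1
    estimate = messages / p           # unbiased estimate of the sum
    return true_total, estimate

# 10 players, 10,000 increments each; report each increment with p = 1%
total, est = simulate_count_tracking([10_000] * 10, p=0.01)
```

With p = 0.01 the coordinator sees only about 1% of the increments, yet the scaled count concentrates around the true sum; the paper's contribution is choosing when and with what probability to report so that the total communication is Θ(√k/ε · log N).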
Building Wavelet Histograms on Large Data in MapReduce
MapReduce is becoming the de facto framework for storing and processing
massive data, due to its excellent scalability, reliability, and elasticity. In
many MapReduce applications, obtaining a compact accurate summary of data is
essential. Among various data summarization tools, histograms have proven to be
particularly important and useful for summarizing data, and the wavelet
histogram is one of the most widely used histograms. In this paper, we
investigate the problem of building wavelet histograms efficiently on large
datasets in MapReduce. We measure the efficiency of the algorithms by both
end-to-end running time and communication cost. We demonstrate that
straightforward adaptations of existing exact and approximate methods for
building wavelet histograms to MapReduce clusters are highly inefficient. We
therefore design new algorithms for computing exact and approximate wavelet
histograms and discuss their implementation in MapReduce. We implement our
techniques in Hadoop, and compare them to baseline solutions with extensive
experiments performed on a heterogeneous Hadoop cluster of 16 nodes, using
large real and synthetic datasets of up to hundreds of gigabytes. The results
show significant (often orders-of-magnitude) performance improvements achieved
by our new algorithms. (Comment: VLDB201)
Doctor of Philosophy dissertation
We are living in an age where data are being generated faster than anyone previously imagined, across a broad range of application domains, including customer studies, social media, sensor networks, and the sciences, among many others. In some cases, data are generated in massive quantities, on the order of terabytes or petabytes. Numerous challenges emerge when dealing with massive data, including: (1) the explosion in the size of data; (2) increasingly complex data structures and rich semantics, such as representing temporal data as a piecewise linear representation; (3) the growing prevalence of uncertain data in numerous applications, e.g., scientific measurements or observations such as meteorological measurements; and (4) the increasingly distributed nature of data, e.g., data collected and integrated from distributed locations, as well as data stored in a distributed file system within a cluster. Due to the massive nature of modern data, it is oftentimes infeasible for computers to manage and query them exactly in an efficient manner. An attractive alternative is to use data summarization techniques to construct data summaries, although even constructing a summary efficiently is a challenging task given the enormous size of the data. The data summaries we focus on in this thesis are the histogram and the ranking operator. Both summaries let us reduce a massive dataset to a much more succinct representation, which can then make queries orders of magnitude more efficient while still providing approximation guarantees on query answers. Our study focuses on the critical task of designing efficient algorithms to summarize, query, and manage massive data.
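As a small illustration of the ranking-operator idea (a generic sampling sketch under assumed parameters, not a specific method from the dissertation): the rank of a value in a massive dataset can be estimated from a uniform random sample by scaling the sample rank by the sampling rate.

```python
import random

def estimate_rank(sample, n_total, x):
    """Estimate the rank of x (number of items <= x) in a dataset of size
    n_total, given a uniform random sample of it. Illustrative sketch only."""
    below = sum(1 for v in sample if v <= x)
    return below * n_total / len(sample)

rng = random.Random(42)
data = [rng.random() for _ in range(100_000)]    # stand-in for massive data
sample = rng.sample(data, 1_000)                 # a 1% summary of the data
est = estimate_rank(sample, len(data), 0.25)     # true rank is about 25,000
```

Queries against the 1,000-item sample are orders of magnitude cheaper than scanning all 100,000 items, at the price of a bounded approximation error; this trade-off is the common thread of the summaries studied in the thesis.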
Optimal Sampling Algorithms for Frequency Estimation in Distributed Data
Consider a distributed system with n nodes where each node holds a multiset of items. In this paper, we design sampling algorithms that allow us to estimate the global frequency of any item with a standard deviation of εN, where N denotes the total cardinality of all these multisets. Our algorithms have a communication cost of O(n + √n/ε), which is never worse than the O(n + 1/ε²) cost of uniform sampling, and could be much better when n ≪ 1/ε². In addition, we prove that one version of our algorithm is instance-optimal in a fairly general sampling framework. We also design algorithms that achieve optimality on the bit level, by combining Bloom filters of various granularities. Finally, we present some simulation results comparing our algorithms with previous techniques. Other than the performance improvement, our algorithms are also much simpler and easily implementable in a large-scale distributed system. © 2011 IEEE
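A minimal single-machine sketch of the uniform-sampling baseline the abstract compares against (sampling m = 1/ε² items gives standard deviation at most εN/2; all names here are illustrative assumptions, and the paper's own algorithms achieve lower communication than this baseline):

```python
import random
from collections import Counter

def sample_frequency_estimator(multisets, m, seed=0):
    """Uniform-sampling baseline: draw m items uniformly (with replacement)
    from the union of all nodes' multisets, and estimate any item's global
    frequency by scaling its sample count by N/m. The estimate's standard
    deviation is N*sqrt(p*(1-p)/m) <= N/(2*sqrt(m)) for an item of
    relative frequency p."""
    pool = [x for ms in multisets for x in ms]   # union of all multisets
    n_total = len(pool)                          # N in the abstract
    rng = random.Random(seed)
    counts = Counter(rng.choices(pool, k=m))     # m uniform samples
    return lambda item: counts[item] * n_total / m

# three "nodes"; item 'a' has true global frequency 3 out of N = 6
est = sample_frequency_estimator([['a', 'b'], ['a', 'c', 'c'], ['a']], m=600)
```

In a real distributed system even drawing these m samples costs communication at every node; reducing that cost from O(n + 1/ε²) to O(n + √n/ε) is the improvement the paper establishes.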