Taming Big Data By Streaming

Abstract

Data streams have emerged as a natural computational model for numerous applications of big data processing. In this model, algorithms are assumed to have access to a limited amount of memory and can only make a single pass (or a few passes) over the data, but need to produce sufficiently accurate answers for some objective functions on the dataset. This model captures various real-world applications and stimulates new scalable tools for solving important problems in the big data era. This dissertation focuses on the following two aspects of the streaming model. 1. Understanding the capability of the streaming model. For a vector aggregation stream, i.e., when the stream is a sequence of updates to an underlying nn-dimensional vector vv (for very large nn), we establish nearly tight space bounds on streaming algorithms of approximating functions of the form i=1ng(vi)\sum_{i=1}^n g(v_i) for nearly all functions gg of one-variable and l(v)l(v) for all symmetric norms ll. These results provide a deeper understanding of the streaming computation model. 2. Tighter upper bounds. We provide better streaming kk-median clustering algorithms in a dynamic points stream, i.e., a stream of insertion and deletion of points on a discrete Euclidean space ([Δ]d[\Delta]^d for sufficiently large Δ\Delta and dd). Our algorithms use k\cdot\poly(d \log \Delta) space/update time and maintain with high probability an approximate kk-median solution to the streaming dataset. All previous algorithms for computing an approximation for the kk-median problem over dynamic data streams required space and update time exponential in dd

    Similar works