9 research outputs found

    Stochastic Streams: Sample Complexity vs. Space Complexity

    We address the trade-off between the computational resources needed to process a large data set and the number of samples available from the data set. Specifically, we consider the following abstraction: we receive a potentially infinite stream of IID samples from some unknown distribution D, and are tasked with computing some function f(D). If the stream is observed for time t, how much memory, s, is required to estimate f(D)? We refer to t as the sample complexity and s as the space complexity. The main focus of this paper is investigating the trade-offs between the space and sample complexity. We study these trade-offs for several canonical problems studied in the data stream model: estimating the collision probability, i.e., the second moment of a distribution; deciding if a graph is connected; and approximating the dimension of an unknown subspace. Our results are based on techniques for simulating different classical sampling procedures in this model, emulating random walks given a sequence of IID samples, and leveraging a correspondence between communication-bounded protocols and statistical query algorithms.
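
    The collision-probability problem mentioned here has a classic low-memory estimator: compare disjoint pairs of IID samples and average the matches. A minimal Python sketch of that pairing idea (the names and toy distribution are illustrative, and this captures only the simplest point on the sample/space trade-off curve the paper studies):

    import random

    def collision_probability(sample_stream, num_pairs):
        # Compare disjoint pairs of IID samples; the fraction of matching
        # pairs is an unbiased estimate of sum_i p_i^2 (the second moment
        # of D), using O(1) memory beyond the pair counter.
        collisions = 0
        for _ in range(num_pairs):
            if next(sample_stream) == next(sample_stream):
                collisions += 1
        return collisions / num_pairs

    def sampler():
        # Toy distribution with true collision probability
        # 0.5^2 + 0.3^2 + 0.2^2 = 0.38.
        while True:
            yield random.choices("abc", weights=[0.5, 0.3, 0.2])[0]

    print(collision_probability(sampler(), 100_000))  # ~0.38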

    Space-Efficient Estimation of Statistics Over Sub-Sampled Streams

    In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to subsample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished through simply estimating them on the sampled stream, followed by a normalization. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.
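
    As a baseline for the kind of estimation this paper improves on, here is the naive scale-up estimator over a Bernoulli sample in Python: unbiased for individual item frequencies (hence usable for heavy hitters), but, as the abstract notes, insufficient on its own for moments, support size, and entropy. The names are illustrative, not the paper's:

    import random
    from collections import Counter

    def scaled_frequencies(stream, p, seed=0):
        # Bernoulli-sample the stream at rate p, count items in the
        # sample, and scale counts by 1/p. E[count / p] equals the true
        # frequency, so this works for heavy hitters; harder aggregates
        # need more careful estimators than this naive scale-up.
        rng = random.Random(seed)
        counts = Counter()
        for item in stream:
            if rng.random() < p:
                counts[item] += 1
        return {item: c / p for item, c in counts.items()}

    stream = ["a"] * 700 + ["b"] * 200 + ["c"] * 100
    print(scaled_frequencies(stream, p=0.1))  # roughly a: 700, b: 200, c: 100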

    A Survey on Concept Drift Adaptation

    Concept drift primarily refers to an online supervised learning scenario in which the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning, in this paper we characterize the adaptive learning process, categorize existing strategies for handling concept drift, discuss the most representative, distinct, and popular techniques and algorithms, discuss the evaluation methodology of adaptive algorithms, and present a set of illustrative applications. This introduction to concept drift adaptation presents state-of-the-art techniques and a collection of benchmarks for researchers, industry analysts, and practitioners. The survey aims to cover the different facets of concept drift in an integrated way and to reflect on the existing scattered state of the art.
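
    As one concrete instance of the error-monitoring strategies such surveys categorize, here is a minimal Python drift detector that signals when the recent windowed error rate exceeds the long-run error rate by a margin; the window size and margin are illustrative choices, not values from the paper:

    import random
    from collections import deque

    class DriftMonitor:
        # Signal drift when the error rate over a recent sliding window
        # exceeds the long-run error rate by a fixed margin.
        def __init__(self, window=100, margin=0.15):
            self.window = deque(maxlen=window)
            self.total_errors = 0
            self.total_seen = 0
            self.margin = margin

        def update(self, error):
            # error: True if the model misclassified the latest example.
            self.window.append(error)
            self.total_errors += error
            self.total_seen += 1
            if len(self.window) < self.window.maxlen:
                return False
            long_run = self.total_errors / self.total_seen
            recent = sum(self.window) / len(self.window)
            return recent > long_run + self.margin  # True => adapt/retrain

    rng = random.Random(1)
    monitor = DriftMonitor()
    for t in range(400):
        err_rate = 0.1 if t < 200 else 0.5  # abrupt drift at t = 200
        if monitor.update(rng.random() < err_rate):
            print("drift signalled at step", t)
            break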

    Parallel paradigms for Data Stream Processing

    The aim of this thesis is to address Data Stream Processing issues from the point of view of High Performance Computing. In particular, our work focused on the definition of parallel paradigms for DaSP problems. An implementation of a parallel scheme for the solution of the stream join problem is given, along with the tests performed and their analysis.
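
    To illustrate the kind of parallel scheme involved (the thesis's own design may differ), a common approach to the stream join is a key-partitioned symmetric hash join: tuples from both streams are routed by a hash of the join key so that matching keys meet at the same worker, each of which keeps one hash table per input. A sequential Python simulation, with windowing and eviction omitted for brevity:

    from collections import defaultdict

    def partitioned_stream_join(tagged_stream, num_workers):
        # tagged_stream yields ("R" or "S", key, value) tuples in arrival
        # order. Routing by hash(key) % num_workers guarantees matching
        # keys land on the same worker's pair of hash tables.
        tables = [{"R": defaultdict(list), "S": defaultdict(list)}
                  for _ in range(num_workers)]
        for side, key, value in tagged_stream:
            worker = tables[hash(key) % num_workers]
            other = "S" if side == "R" else "R"
            for match in worker[other][key]:
                yield (key, value, match) if side == "R" else (key, match, value)
            worker[side][key].append(value)

    pairs = [("R", 1, "r1"), ("S", 1, "s1"), ("S", 2, "s2"), ("R", 2, "r2")]
    print(list(partitioned_stream_join(pairs, num_workers=4)))
    # [(1, 'r1', 's1'), (2, 'r2', 's2')]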

    Streaming Algorithms for High Throughput Massive Datasets

    The field of streaming algorithms has enjoyed a great deal of focus from the theoretical computer science community over the last 20 years. Many great algorithms and mathematical results have been developed in this time, allowing a broad class of functions to be computed and problems to be solved in the streaming model. Over the same period, the amount of data being generated by practical computer systems has become simply staggering. In this thesis, we focus on solving problems in the streaming model with the unified goal of being relevant to practical problems outside of the theory community. The common technical thread throughout this work is an attention to runtime and the ability to handle large datasets that challenge not only the memory available but also the throughput of the data and the speed at which it must be processed. We provide these solutions in the form of both theoretical algorithms and practical systems, and demonstrate that using practice to drive theory, and vice versa, can generate powerful new approaches to difficult problems in the streaming model.

    Sketching sampled data streams

    Sampling is used as a universal method to reduce the running time of computations: the computation is performed on a much smaller sample and the result is then scaled to compensate for the difference in size. Sketches are a popular approximation method for data streams, and they have proved useful for estimating frequency moments and aggregates over joins. One possibility for further improving the time performance of sketches is to compute the sketch over a sample of the stream rather than the entire data stream. In this paper we analyze the behavior of the sketch estimator when computed over a sample of the stream, not the entire data stream, for the size-of-join and self-join size problems. Our analysis is developed for a generic sampling process. We instantiate the results of the analysis for the three major types of sampling: Bernoulli sampling, which is used for load shedding; sampling with replacement, which is used to generate i.i.d. samples from a distribution; and sampling without replacement, which is used by online aggregation engines. We then compare these particular results with the results for the basic sketch estimator. Our experimental results show that the accuracy of the sketch computed over a small sample of the data is, in general, close to the accuracy of the sketch estimator computed over the entire data, even when the sample size is only 10% or less of the dataset size. This is equivalent to a speed-up factor of at least 10 when updating the sketch.
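
    The combined scheme fits in a few lines. Below is a minimal Python sketch of the self-join size estimator over a Bernoulli sample, using the textbook AGMS estimator plus the standard bias correction for Bernoulli sampling; the names and the specific correction shown are a standard construction, not necessarily the paper's exact estimator:

    import random

    def ams_selfjoin_over_sample(stream, p, num_counters=64, seed=0):
        # AGMS/AMS sketch of the self-join size F2 = sum_i f_i^2, computed
        # over a Bernoulli(p) sample and then corrected for sampling:
        # under Bernoulli sampling E[g_i^2] = p^2 f_i^2 + p(1-p) f_i, so
        # (F2_of_sample - (1-p) * sample_size) / p^2 is unbiased for F2.
        # Python's hash() stands in for a pairwise-independent +/-1 family.
        rng = random.Random(seed)
        sketch = [0] * num_counters
        sample_size = 0
        for item in stream:
            if rng.random() >= p:
                continue  # load shedding: drop unsampled items
            sample_size += 1
            for j in range(num_counters):
                sketch[j] += 1 if hash((j, item)) & 1 else -1
        f2_of_sample = sum(z * z for z in sketch) / num_counters
        return (f2_of_sample - (1 - p) * sample_size) / (p * p)

    stream = ["a"] * 1000 + ["b"] * 500 + ["c"] * 100
    print(ams_selfjoin_over_sample(stream, p=0.1))  # true F2 = 1,260,000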