Stochastic Streams: Sample Complexity vs. Space Complexity
We address the trade-off between the computational resources needed to process a large data set and the number of samples available from the data set. Specifically, we consider the following abstraction: we receive a potentially infinite stream of IID samples from some unknown distribution D, and are tasked with computing some function f(D). If the stream is observed for time t, how much memory, s, is required to estimate f(D)? We refer to t as the sample complexity and s as the space complexity. The main focus of this paper is investigating the trade-offs between the space and sample complexity. We study these trade-offs for several canonical problems studied in the data stream model: estimating the collision probability, i.e., the second moment of a distribution, deciding if a graph is connected, and approximating the dimension of an unknown subspace. Our results are based on techniques for simulating different classical sampling procedures in this model, on emulating random walks given a sequence of IID samples, and on leveraging a characterization relating communication-bounded protocols to statistical query algorithms.
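The collision probability mentioned above admits a particularly simple sample-vs-space illustration: pairing up IID samples and counting equal pairs estimates the second moment in O(1) space. The sketch below is my own minimal illustration of that idea, not code from the paper; the function names and the uniform example distribution are assumptions.

```python
import random

def estimate_collision_probability(sample_iter, t):
    """Estimate the collision probability sum_i p_i^2 from t IID samples.

    Each disjoint pair (x, y) of IID samples collides (x == y) with
    probability exactly sum_i p_i^2, so the fraction of colliding pairs
    is an unbiased estimator. Working memory is O(1): one pair and two
    counters, regardless of how long the stream is observed.
    """
    collisions = 0
    pairs = t // 2
    for _ in range(pairs):
        x = next(sample_iter)
        y = next(sample_iter)
        if x == y:
            collisions += 1
    return collisions / pairs

# Hypothetical example: the uniform distribution over 4 items has
# collision probability 4 * (1/4)^2 = 1/4.
def uniform_samples():
    while True:
        yield random.randrange(4)
```

With t = 200,000 samples (100,000 pairs), the standard deviation of the estimate is about 0.0014, so the returned value concentrates tightly around 0.25 for the uniform example.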
Space-Efficient Estimation of Statistics Over Sub-Sampled Streams
In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to subsample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished by simply estimating them on the sampled stream and then normalizing. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.
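The second frequency moment F2 is a concrete case where naive normalization fails: under Bernoulli sampling with rate p, the sample's F2 scaled by 1/p² is biased, because the diagonal terms scale with p rather than p². Using the identity E[f'(f'-1)] = p² f(f-1) for a Binomial(f, p) sampled count f', an unbiased correction is straightforward. The sketch below is my own illustration of this correction, not an algorithm from the paper; the function name is an assumption.

```python
import random
from collections import Counter

def f2_from_sample(stream, p, seed=0):
    """Unbiased F2 estimate from a Bernoulli sample of the stream.

    Let F1' and F2' be the first and second moments of the sampled
    stream. Since E[F2' - F1'] = p^2 (F2 - F1) and E[F1'] = p F1,
        (F2' - F1') / p^2 + F1' / p
    has expectation exactly F2 of the original stream, whereas the
    naive F2' / p^2 is biased upward by roughly (1-p) F1 / p.
    """
    rng = random.Random(seed)
    counts = Counter(x for x in stream if rng.random() < p)
    f1s = sum(counts.values())                  # F1 of the sample
    f2s = sum(c * c for c in counts.values())   # F2 of the sample
    return (f2s - f1s) / (p * p) + f1s / p
```

For a stream where 10 distinct items each appear 100 times (F2 = 100,000), averaging this estimator over independent sampling runs at p = 0.1 recovers the true value.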
A Survey on Concept Drift Adaptation
Concept drift primarily refers to an online supervised learning scenario in which the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning, in this paper we characterize the adaptive learning process, categorize existing strategies for handling concept drift, discuss the most representative, distinct and popular techniques and algorithms, discuss the evaluation methodology of adaptive algorithms, and present a set of illustrative applications. This introduction to concept drift adaptation presents the state-of-the-art techniques and a collection of benchmarks for researchers, industry analysts and practitioners. The survey aims at covering the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art.
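One family of strategies the survey categorizes monitors the learner's error rate and raises an alarm when recent performance degrades relative to a reference period. The sketch below is a minimal window-comparison illustration of that idea, assuming a fixed threshold; it is not any specific detector from the survey (real detectors such as DDM or ADWIN use statistical tests rather than a hard-coded threshold).

```python
from collections import deque

def detect_drift(errors, window=100, threshold=0.15):
    """Flag concept drift when the error rate over the most recent
    `window` predictions exceeds the error rate over the preceding
    `window` predictions by more than `threshold`.

    `errors` is a sequence of 0/1 prediction mistakes. Returns the
    indices at which an alarm fires. A minimal illustration only:
    the threshold is fixed rather than derived from a statistical test.
    """
    recent = deque(maxlen=window)
    reference = deque(maxlen=window)
    alarms = []
    for i, err in enumerate(errors):
        if len(recent) == window:
            # the element about to be evicted moves to the reference window
            reference.append(recent[0])
        recent.append(err)
        if len(recent) == window and len(reference) == window:
            if sum(recent) / window - sum(reference) / window > threshold:
                alarms.append(i)
    return alarms
```

On a synthetic error sequence that jumps from 0% to 100% error at index 300, the first alarm fires once the recent window accumulates enough mistakes to clear the threshold, i.e. shortly after the change point.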
Streaming Algorithms Via Reductions
In the streaming algorithms model of computation we must process data in order and without enough memory to remember the entire input. We study reductions between problems in the streaming model with an eye to using reductions as an algorithm design technique. Our contributions include:
* Linear Transformation reductions, which compose with existing linear sketch techniques. We use these for small-space algorithms for numeric measurements of distance-from-periodicity, finding the period of a numeric stream, and detecting cyclic shifts.
* The first streaming graph algorithms in the "sliding window" model, where we must consider only the most recent L elements for some fixed threshold L. We develop basic algorithms for connectivity and unweighted maximum matching, then develop a variety of other algorithms via reductions to these problems.
* A new reduction from maximum weighted matching to maximum unweighted matching. This reduction immediately yields improved approximation guarantees for maximum weighted matching in the semistreaming, sliding window, and MapReduce models, and extends to the more general problem of finding maximum independent sets in p-systems.
* Algorithms in a stream-of-samples model which exhibit clear sample vs. space tradeoffs. These algorithms are also inspired by examining reductions. We provide algorithms for calculating F_k frequency moments and graph connectivity.
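The graph connectivity problem in the stream-of-samples model illustrates the space side of such tradeoffs well: a union-find structure decides connectivity from arbitrarily many (possibly repeated) sampled edges using memory that depends only on the number of vertices, never on the stream length. The sketch below is my own illustration, not an algorithm from the thesis.

```python
def is_connected_from_samples(n, edge_samples):
    """Decide connectivity of an n-vertex graph from a stream of
    (possibly repeated) edge samples, using O(n) space.

    A union-find structure with path halving absorbs each sampled
    edge; repeated samples are harmless, and memory is independent
    of how many samples arrive.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for u, v in edge_samples:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
            if components == 1:
                return True  # early exit: graph already spanned
    return components == 1
```

Feeding it the edges of a path graph (even with each edge repeated several times, as an IID sample stream would) yields True, while a graph missing a bridging edge yields False.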
Parallel paradigms for Data Stream Processing
The aim of this thesis is to address Data Stream Processing issues from the point of view of High Performance Computing. In particular our work focused on the definition of parallel paradigms for DaSP problems.
An implementation of a parallel scheme for the solution of the stream join problem is given, along with the tests performed and their analysis.
Streaming Algorithms for High Throughput Massive Datasets
The field of streaming algorithms has enjoyed a great deal of attention from the theoretical computer science community over the last 20 years. Many great algorithms and mathematical results have been developed in this time, allowing a broad class of functions to be computed and problems to be solved in the streaming model. Over the same period, the amount of data being generated by practical computer systems has become simply staggering.
In this thesis, we focus on solving problems in the streaming model with the unified goal of being relevant to practical problems outside of the theory community. The common technical thread throughout this work is an attention to runtime and the ability to handle large datasets that are challenging not only in terms of the memory available, but also in the throughput of the data and the speed at which it must be processed.
We provide these solutions in the form of both theoretical algorithms and practical systems, and demonstrate that using practice to drive theory, and vice versa, can generate powerful new approaches for difficult problems in the streaming model.
Sketching sampled data streams
Abstract—Sampling is used as a universal method to reduce the running time of computations: the computation is performed on a much smaller sample and the result is then scaled to compensate for the difference in size. Sketches are a popular approximation method for data streams, and they have proved useful for estimating frequency moments and aggregates over joins. A possibility for further improving the time performance of sketches is to compute the sketch over a sample of the stream rather than the entire data stream. In this paper we analyze the behavior of the sketch estimator when computed over a sample of the stream, not the entire data stream, for the size-of-join and self-join size problems. Our analysis is developed for a generic sampling process. We instantiate the results of the analysis for all three major types of sampling: Bernoulli sampling, which is used for load shedding; sampling with replacement, which is used to generate i.i.d. samples from a distribution; and sampling without replacement, which is used by online aggregation engines. We compare these particular results with the results of the basic sketch estimator. Our experimental results show that the accuracy of the sketch computed over a small sample of the data is, in general, close to the accuracy of the sketch estimator computed over the entire data, even when the sample size is only 10% or less of the dataset size. This is equivalent to a speed-up factor of at least 10 when updating the sketch.
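To make the setting concrete, the sketch below is my own minimal illustration (not the paper's implementation) of an AMS-style "tug-of-war" sketch for self-join size, updated only on Bernoulli-sampled elements and then scaled by 1/p². This naive scaling is slightly biased, by roughly (1-p)F1/p, and characterizing the behavior of exactly this kind of sampled-sketch estimator is what the paper's analysis is about. All names and parameters here are assumptions; for brevity the random ±1 signs are stored in a dictionary rather than drawn from a 4-wise independent hash family.

```python
import random

def ams_selfjoin_over_sample(stream, p, k=64, seed=1):
    """AMS (tug-of-war) estimate of self-join size (F2), computed
    over a Bernoulli sample of the stream and scaled by 1/p^2.

    Each of k counters accumulates a random +/-1 sign per distinct
    item; the mean of the squared counters estimates F2 of the
    *sample*, which the 1/p^2 scaling converts into an estimate for
    the full stream. Updates touch only sampled items, so the sketch
    is updated roughly 1/p times faster than on the full stream.
    """
    rng = random.Random(seed)
    signs = [{} for _ in range(k)]   # per-counter random sign per item
    counters = [0] * k
    for x in stream:
        if rng.random() >= p:        # load shedding: skip unsampled items
            continue
        for j in range(k):
            s = signs[j].setdefault(x, rng.choice((-1, 1)))
            counters[j] += s
    f2_sample = sum(c * c for c in counters) / k
    return f2_sample / (p * p)
```

For a stream where 10 distinct items each appear 100 times (true F2 = 100,000, F1 = 1,000), estimates at p = 0.5 concentrate near 101,000, i.e. within the (1-p)F1/p bias of the true value, averaged over sampling runs.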