519 research outputs found
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window
The past decade has witnessed many interesting algorithms for maintaining
statistics over a data stream. This paper initiates a theoretical study of
algorithms for monitoring distributed data streams over a time-based sliding
window (which contains a variable number of items and possibly out-of-order
items). The concern is how to minimize the communication between individual
streams and the root, while allowing the root, at any time, to be able to
report the global statistics of all streams within a given error bound. This
paper presents communication-efficient algorithms for three classical
statistics, namely, basic counting, frequent items and quantiles. The
worst-case communication cost over a window is bounded in bits for basic counting and in words for the other two statistics, in terms of the number of distributed
data streams, the total number of items in the streams that arrive or
expire in the window, and the desired error bound. Matching
and nearly matching lower bounds are also obtained. Comment: 12 pages, to appear in the 27th International Symposium on
Theoretical Aspects of Computer Science (STACS), 201
Time-decaying Sketches for Robust Aggregation of Sensor Data
We present a new sketch for summarizing network data. The sketch has the following properties which make it useful in communication-efficient aggregation in distributed streaming scenarios, such as sensor networks: the sketch is duplicate insensitive, i.e., reinsertions of the same data will not affect the sketch and hence the estimates of aggregates. Unlike previous duplicate-insensitive sketches for sensor data aggregation [S. Nath et al., Synopsis diffusion for robust aggregation in sensor networks, in Proceedings of the 2nd International Conference on Embedded Network Sensor Systems, (2004), pp. 250–262], [J. Considine et al., Approximate aggregation techniques for sensor databases, in Proceedings of the 20th International Conference on Data Engineering (ICDE), 2004, pp. 449–460], it is also time decaying, so that the weight of a data item in the sketch can decrease with time according to a user-specified decay function. The sketch can give provably approximate guarantees for various aggregates of data, including the sum, median, quantiles, and frequent elements. The size of the sketch and the time taken to update it are both polylogarithmic in the size of the relevant data. Further, multiple sketches computed over distributed data can be combined without loss of accuracy. To our knowledge, this is the first sketch that combines all the above properties.
An evaluation of streaming algorithms for distinct counting over a sliding window
Counting the number of distinct elements in a data stream (distinct counting) is a fundamental aggregation task in database query processing, query optimization, and network monitoring. On a stream of elements, it is often necessary to compute an aggregate over only the most recent elements, leading to the problem of distinct counting over a “sliding window” of the stream. We present a detailed experimental study of the performance of different algorithms for distinct counting over a sliding window. We observe that the performance of an algorithm depends on the basic method used, as well as aspects such as the hash function, the mix of queries and updates, and the method used to boost accuracy. We compare the performance of prominent algorithms and evaluate the influence of these factors, leading to practical recommendations for implementation. To the best of our knowledge, this is the first detailed experimental study of distinct counting over a sliding window.
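For orientation, the problem described in this abstract can be illustrated with an exact baseline. This is only a minimal sketch under stated assumptions, not one of the algorithms evaluated in the study: the class name `SlidingDistinctCounter`, its methods, and the time-based window `(now - window, now]` are all hypothetical, and timestamps are assumed to arrive in non-decreasing order. It stores one entry per distinct element in the window, which is the memory cost that approximate streaming algorithms aim to avoid.

```python
from collections import OrderedDict


class SlidingDistinctCounter:
    """Exact distinct count over a time-based sliding window (illustrative baseline)."""

    def __init__(self, window):
        self.window = window
        # Maps element -> latest timestamp, ordered from oldest to newest arrival.
        self.last_seen = OrderedDict()

    def add(self, item, timestamp):
        # Assumes timestamps are non-decreasing across calls.
        # Removing and re-inserting moves the item to the newest end.
        self.last_seen.pop(item, None)
        self.last_seen[item] = timestamp

    def count(self, now):
        # Evict elements whose latest occurrence is outside (now - window, now].
        while self.last_seen:
            item, ts = next(iter(self.last_seen.items()))
            if ts > now - self.window:
                break
            self.last_seen.popitem(last=False)
        return len(self.last_seen)
```

A caller would feed each arriving element to `add` with its timestamp and call `count(now)` whenever the number of distinct elements in the current window is needed.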
Near Optimal Linear Algebra in the Online and Sliding Window Models
We initiate the study of numerical linear algebra in the sliding window
model, where only the most recent updates in a stream form the underlying
data set. We first introduce a unified row-sampling based framework that gives
randomized algorithms for spectral approximation, low-rank
approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in
the sliding window model, which often use nearly optimal space and achieve
nearly input sparsity runtime. Our algorithms are based on "reverse online"
versions of offline sampling distributions such as (ridge) leverage scores,
sensitivities, and Lewis weights to quantify both the importance and
the recency of a row. Our row-sampling framework rather surprisingly implies
connections to the well-studied online model; our structural results also give
the first sample optimal (up to lower order terms) online algorithm for
low-rank approximation/projection-cost preservation. Using this powerful
primitive, we give online algorithms for column/row subset selection and
principal component analysis that resolve the main open question of Bhaskara
et al. (FOCS 2019). We also give the first online algorithm for
$\ell_1$-subspace embeddings. We further formalize the connection between the
online model and the sliding window model by introducing an additional unified
framework for deterministic algorithms using a merge and reduce paradigm and
the concept of online coresets. Our sampling based algorithms in the
row-arrival online model yield online coresets, giving deterministic algorithms
for spectral approximation, low-rank approximation/projection-cost
preservation, and $\ell_1$-subspace embeddings in the sliding window model that
use nearly optimal space.
Approximate order-k Voronoi cells over positional streams
Handling streams of positional updates from numerous moving objects has become a challenging task for many monitoring applications. Several algorithms have been recently proposed for providing exact answers particularly to continuous range and k-nearest neighbor queries against current object positions. In this work, we introduce a processing technique for efficiently maintaining an approximate order-k Voronoi cell around a certain point of interest when all objects continuously change their locations. This heuristic can easily provide a fairly reliable estimate of the k-nearest neighbors for any query point found inside the constructed cell. We further extend our method to handle positional updates that are not received concurrently for all objects, but instead remain valid for a specific time interval according to a sliding window model. Extensive experimental analysis over synthetic datasets confirms the robustness and scalability of this approach, offering near real-time cell maintenance with acceptable error margins.