
769 research outputs found

SQUAD: Combining Sketching and Sampling Is Better than Either for Per-item Quantile Estimation

Author: Basat Ran Ben
Friedman Roy
Shahout Rana
Publication venue: 'American College of Medical Physics (ACMP)'
Publication date: 06/01/2022

Latency quantile measurements are essential as they often capture the user's utility. For example, if a video connection has high tail latency, the perceived quality will suffer, even if the average and median latencies are low. In this work, we consider the problem of approximating the per-item quantiles. Elements in our stream are (ID, latency) tuples, and we wish to track the latency quantiles for each ID. Existing quantile sketches are designed for a single number stream (e.g., containing just the latency). While one could allocate a separate sketch instance for each ID, this may require an infeasible amount of memory. Instead, we consider tracking the quantiles for the heavy hitters (most frequent items), which are often considered particularly important, without knowing them beforehand. We first present a simple sampling algorithm that serves as a benchmark. Then, we design an algorithm that augments a quantile sketch within each entry of a heavy hitter algorithm, resulting in similar space complexity but with a deterministic error guarantee. Finally, we present SQUAD, a method that combines sampling and sketching while improving the asymptotic space complexity. Intuitively, SQUAD uses a background sampling process to capture the behaviour of the latencies of an item before it is allocated a sketch, thereby allowing us to use fewer samples and sketches. Our solutions are rigorously analyzed, and we demonstrate the superiority of our approach using extensive simulations.
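The abstract above mentions a simple sampling baseline without giving details. The following is only a rough sketch of what sampling-based per-item quantile estimation could look like, assuming a uniform reservoir sample over (ID, latency) tuples; the function names `reservoir_sample` and `per_item_quantile` are hypothetical and this is not the paper's algorithm.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of at most k (ID, latency) tuples."""
    sample = []
    for t, item in enumerate(stream):
        if t < k:
            sample.append(item)
        else:
            # Replace a random slot with probability k / (t + 1).
            j = rng.randrange(t + 1)
            if j < k:
                sample[j] = item
    return sample


def per_item_quantile(sample, item_id, q):
    """Estimate the q-quantile of the latencies sampled for one ID."""
    latencies = sorted(lat for (i, lat) in sample if i == item_id)
    if not latencies:
        return None  # the ID never made it into the sample
    index = min(int(q * len(latencies)), len(latencies) - 1)
    return latencies[index]


stream = [("a", 10), ("b", 5), ("a", 30), ("a", 20), ("b", 7)]
sample = reservoir_sample(stream, 10, random.Random(0))
median_a = per_item_quantile(sample, "a", 0.5)
```

Frequent IDs receive more of the sample, so their estimates tend to be more reliable, which matches the abstract's focus on heavy hitters.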

arXiv.org e-Print Archive

UCL Discovery

A Fast Algorithm for Approximate Quantiles in High Speed Data Streams

Author: Qi Zhang
Wei Wang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007

We present a fast algorithm for computing approximate quantiles in high speed data streams with deterministic error bounds. For data streams of size N, where N is unknown in advance, our algorithm partitions the stream into sub-streams of exponentially increasing size as they arrive. For each sub-stream, which has a fixed size, we compute and maintain a multi-level summary structure using a novel algorithm. In order to achieve high speed performance, the algorithm uses simple block-wise merge and sample operations. Overall, our algorithms for fixed-size streams and arbitrary-size streams have a computational cost of O(N log((1/ε) log εN)) and an average per-element update cost of O(log log N) if ε is fixed.
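To illustrate the partitioning idea only, here is a minimal sketch that cuts a stream of unknown length into blocks whose sizes double. The doubling factor, the `first_size` parameter, and the name `exponential_substreams` are assumptions; the abstract does not state the exact sizes, and the summary structure built per block is not shown.

```python
from itertools import islice


def exponential_substreams(stream, first_size=1):
    """Yield consecutive blocks of sizes first_size, 2*first_size, 4*first_size, ..."""
    it = iter(stream)
    size = first_size
    while True:
        block = list(islice(it, size))
        if not block:
            return  # stream exhausted
        yield block
        size *= 2  # assumed growth factor


blocks = list(exponential_substreams(range(10)))
```

With doubling sizes, a stream of N elements produces only O(log N) sub-streams, so the total length never has to be known in advance.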

CiteSeerX

Crossref

Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks

Author: Huang Zengfeng
Yi Ke
Zhang Qin
Publication date: 02/12/2011

We show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are $k$ players, each holding a counter $n_i$ that gets incremented over time, and the goal is to track an $\epsilon$-approximation of their sum $n=\sum_i n_i$ continuously at all times, using minimum communication. While the deterministic communication complexity of the problem is $\Theta(k/\epsilon \cdot \log N)$, where $N$ is the final value of $n$ when the tracking finishes, we show that with randomization, the communication cost can be reduced to $\Theta(\sqrt{k}/\epsilon \cdot \log N)$. Our algorithm is simple and uses only $O(1)$ space at each player, while the lower bound holds even assuming each player has infinite computing power. Then, we extend our techniques to two related distributed tracking problems: frequency-tracking and rank-tracking, and obtain similar improvements over previous deterministic algorithms. Both problems are of central importance in large data monitoring and analysis, and have been extensively studied in the literature.

Comment: 19 pages, 1 figure
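The abstract does not describe the algorithm itself. As a loose illustration of how randomization can reduce messages in count-tracking, the sketch below has each player report its counter with a fixed probability `p` on every increment, and the coordinator corrects for the unreported tail. The fixed `p`, the classes `Site` and `Coordinator`, and the estimator are assumptions made for this example, not the paper's protocol.

```python
import random


class Site:
    """A player holding one counter (O(1) local state)."""

    def __init__(self, p, rng):
        self.p = p
        self.rng = rng
        self.count = 0

    def increment(self):
        """Increment the counter; return the value to report, or None."""
        self.count += 1
        if self.rng.random() < self.p:
            return self.count
        return None


class Coordinator:
    def __init__(self, k, p):
        self.p = p
        self.last = [None] * k  # last value reported by each site
        self.messages = 0

    def receive(self, site_id, value):
        self.last[site_id] = value
        self.messages += 1

    def estimate(self):
        # A site that never reported contributes 0; otherwise the last
        # report is corrected by the expected unreported increments.
        return sum(v - 1 + 1 / self.p for v in self.last if v is not None)


def run(increments, p, seed=0):
    rng = random.Random(seed)
    sites = [Site(p, rng) for _ in increments]
    coord = Coordinator(len(increments), p)
    for site_id, n in enumerate(increments):
        for _ in range(n):
            value = sites[site_id].increment()
            if value is not None:
                coord.receive(site_id, value)
    return coord
```

With `p = 1.0` every increment is reported and the estimate is exact; smaller values of `p` trade accuracy for fewer messages.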

arXiv.org e-Print Archive

Hong Kong University of Science and Technology Institutional Repository

Quantiles over data streams : experimental comparisons, new analyses, and further improvements

Author: Cormode Graham
Luo Ge
Wang Lu
Yi Ke
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 08/02/2016

A fundamental problem in data management and analysis is to generate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to find these quantiles has attracted much study, especially in the case where the data are given incrementally, and we must compute the quantiles in an online, streaming fashion. While such algorithms have proved to be extremely useful in practice, there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this paper, we remedy this deficit by providing a taxonomy of different methods and describing efficient implementations. In doing so, we propose new variants that have not been studied before, yet which outperform existing methods. To illustrate this, we provide detailed experimental comparisons demonstrating the trade-offs between space, time, and accuracy for quantile computation.

Crossref

Warwick Research Archives Portal Repository

Sequential Quantiles via Hermite Series Density Estimation

Author: Macdonald Iain
Stephanou Michael
Varughese Melvin
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2017

Sequential quantile estimation refers to incorporating observations into quantile estimates in an incremental fashion, thus furnishing an online estimate of one or more quantiles at any given point in time. Sequential quantile estimation is also known as online quantile estimation. This area is relevant to the analysis of data streams and to the one-pass analysis of massive data sets. Applications include network traffic and latency analysis, real time fraud detection and high frequency trading. We introduce new techniques for online quantile estimation based on Hermite series estimators in the settings of static quantile estimation and dynamic quantile estimation. In the static quantile estimation setting we apply the existing Gauss-Hermite expansion in a novel manner. In particular, we exploit the fact that Gauss-Hermite coefficients can be updated in a sequential manner. To treat dynamic quantile estimation we introduce a novel expansion with an exponentially weighted estimator for the Gauss-Hermite coefficients, which we term the Exponentially Weighted Gauss-Hermite (EWGH) expansion. These algorithms go beyond existing sequential quantile estimation algorithms in that they allow arbitrary quantiles (as opposed to pre-specified quantiles) to be estimated at any point in time. In doing so we provide a solution to online distribution function and online quantile function estimation on data streams. In particular we derive an analytical expression for the CDF and prove consistency results for the CDF under certain conditions. In addition we analyse the associated quantile estimator. Simulation studies and tests on real data reveal the Gauss-Hermite based algorithms to be competitive with a leading existing algorithm.

Comment: 43 pages, 9 figures. Improved version incorporating referee comments, as appears in Electronic Journal of Statistics
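The sequential update mentioned in the abstract can be written out as a short worked formula. This is a sketch under assumed notation, not the paper's own: $h_k$ denotes the $k$-th normalized Hermite function, $\hat a_k^{(n)}$ the estimate of the $k$-th coefficient after $n$ observations, and $\lambda$ a weighting parameter with $0 < \lambda \le 1$.

```latex
% Static case: the coefficient estimate is a sample mean,
% so it can be updated one observation at a time.
\hat a_k^{(n)} = \frac{1}{n}\sum_{i=1}^{n} h_k(x_i)
             = \hat a_k^{(n-1)} + \frac{1}{n}\left(h_k(x_n) - \hat a_k^{(n-1)}\right)

% Dynamic case (exponential weighting, assumed form):
\hat a_k^{(n)} = \lambda\, h_k(x_n) + (1-\lambda)\, \hat a_k^{(n-1)}
```

In both cases only the current coefficients need to be stored, which is what makes one-pass estimation possible.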

arXiv.org e-Print Archive

Crossref

Optimal Tracking of Distributed Heavy Hitters and Quantiles

Author: Yi Ke
Zhang Qin
Publication date: 30/11/2008

We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let $A$ be a multiset of elements, drawn from the universe $U=\{1,\ldots,u\}$. For a given $0 \le \phi \le 1$, the $\phi$-heavy hitters are those elements of $A$ whose frequency in $A$ is at least $\phi |A|$; the $\phi$-quantile of $A$ is an element $x \in U$ such that at most $\phi|A|$ elements of $A$ are smaller than $x$ and at most $(1-\phi)|A|$ elements of $A$ are greater than $x$. Suppose the elements of $A$ are received at $k$ remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of $\phi$-heavy hitters and the $\phi$-quantile of $A$ approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost $O(k/\epsilon \cdot \log n)$ for both problems, where $n$ is the total number of items in $A$, and $\epsilon$ is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the $\phi$-quantiles for all $0 \le \phi \le 1$.

Comment: 10 pages, 1 figure
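The two definitions in the abstract above can be checked on a small multiset. The sketch below computes them exactly and centrally, with no distribution or approximation; it illustrates the definitions only, and the function names are hypothetical.

```python
from collections import Counter


def heavy_hitters(A, phi):
    """Elements whose frequency in A is at least phi * |A|."""
    threshold = phi * len(A)
    return {x for x, c in Counter(A).items() if c >= threshold}


def is_phi_quantile(A, phi, x):
    """Check the definition: at most phi*|A| elements are smaller than x
    and at most (1 - phi)*|A| elements are greater than x."""
    smaller = sum(1 for a in A if a < x)
    greater = sum(1 for a in A if a > x)
    return smaller <= phi * len(A) and greater <= (1 - phi) * len(A)


def phi_quantile(A, phi):
    """Return one element of A that satisfies the phi-quantile definition."""
    ordered = sorted(A)
    index = min(int(phi * len(ordered)), len(ordered) - 1)
    return ordered[index]


A = [1, 1, 2, 2, 2, 3, 4, 5]
hh = heavy_hitters(A, 0.25)
median = phi_quantile(A, 0.5)
```

Here `|A| = 8`, so the 0.25-heavy hitters are the elements appearing at least twice, and the 0.5-quantile is an element with at most four elements on either side.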

arXiv.org e-Print Archive

CiteSeerX

Hong Kong University of Science and Technology Institutional Repository