
    Heavy Hitters over Interval Queries

    Heavy hitters and frequency measurements are fundamental in many networking applications such as load balancing, QoS, and network security. This paper considers a generalized sliding window model that supports frequency and heavy hitters queries over an interval given at \emph{query time}. This enables drill-down queries, in which the behavior of the network can be examined at finer and finer granularities. For this model, we asymptotically improve the space bounds of existing work, reduce the update and query time to a constant, and provide deterministic solutions. When evaluated over real Internet packet traces, our fastest algorithm processes packets 90--250 times faster, serves queries at least 730 times quicker, and consumes at least 40% less space than the known method.
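The abstract above does not describe its data structure, but the query model it supports can be illustrated with a trivially exact (non-sublinear) baseline: store each item's arrival times and answer interval-frequency queries with two binary searches. All names here are illustrative, not from the paper.

```python
import bisect
from collections import defaultdict

class IntervalFrequency:
    """Exact baseline for interval frequency queries: record each item's
    arrival indices and answer "how often did x appear in [lo, hi]?" with
    two binary searches. The paper's algorithms approximate this in
    sublinear space; this sketch only illustrates the query semantics."""
    def __init__(self):
        self.arrivals = defaultdict(list)  # item -> sorted arrival indices
        self.t = 0

    def add(self, item):
        self.arrivals[item].append(self.t)
        self.t += 1

    def frequency(self, item, lo, hi):
        ts = self.arrivals[item]
        return bisect.bisect_right(ts, hi) - bisect.bisect_left(ts, lo)

iq = IntervalFrequency()
for i in range(100):
    iq.add("flow-A" if i % 4 == 0 else "flow-B")
```

A drill-down query then just narrows the interval, e.g. `iq.frequency("flow-A", 0, 99)` versus `iq.frequency("flow-A", 0, 7)`.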

    CR-precis: A deterministic summary structure for update data streams

    We present the CR-precis structure, a general-purpose, deterministic, and sub-linear data structure for summarizing \emph{update} data streams. The CR-precis structure yields the \emph{first deterministic sub-linear space/time algorithms for update streams} for answering a variety of fundamental stream queries, such as (a) point queries, (b) range queries, (c) finding approximate frequent items, (d) finding approximate quantiles, (e) finding approximate hierarchical heavy hitters, (f) estimating inner products, and (g) near-optimal $B$-bucket histograms. Comment: 11 pages
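The CR-precis construction itself is not given in the abstract. For context, the simplest deterministic sub-linear summary for approximate frequent items (on insert-only streams, unlike CR-precis, which also handles updates) is the classic Misra-Gries summary, sketched here as a point of comparison:

```python
def misra_gries(stream, k):
    """Deterministic Misra-Gries summary: keeps at most k-1 counters.
    Any item with true frequency f survives with an estimate >= f - m/k,
    where m is the stream length, so every item with f > m/k is retained."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement every counter; evict counters that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 10
summary = misra_gries(stream, k=4)  # at most 3 counters kept
```

With m = 100 and k = 4, every item appearing more than 25 times is guaranteed to survive with its count underestimated by at most 25.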

    BPTree: an $\ell_2$ heavy hitters algorithm using constant memory

    The task of finding heavy hitters is one of the best known and well studied problems in the area of data streams. One is given a list $i_1,i_2,\ldots,i_m\in[n]$ and the goal is to identify the items among $[n]$ that appear frequently in the list. In sub-polynomial space, the strongest guarantee available is the $\ell_2$ guarantee, which requires finding all items that occur at least $\epsilon\|f\|_2$ times in the stream, where the vector $f\in\mathbb{R}^n$ is the count histogram of the stream with $i$th coordinate equal to the number of times $i$ appears, $f_i:=\#\{j\in[m]:i_j=i\}$. The first algorithm to achieve the $\ell_2$ guarantee was the CountSketch of [CCF04], which requires $O(\epsilon^{-2}\log n)$ words of memory and $O(\log n)$ update time and is known to be space-optimal if the stream allows for deletions. The recent work of [BCIW16] gave an improved algorithm for insertion-only streams, using only $O(\epsilon^{-2}\log\epsilon^{-1}\log\log n)$ words of memory. In this work, we give an algorithm BPTree for $\ell_2$ heavy hitters in insertion-only streams that achieves $O(\epsilon^{-2}\log\epsilon^{-1})$ words of memory and $O(\log\epsilon^{-1})$ update time, which is the optimal dependence on $n$ and $m$. In addition, we describe an algorithm for tracking $\|f\|_2$ at all times with $O(\epsilon^{-2})$ memory and update time. Our analyses rely on bounding the expected supremum of a Bernoulli process involving Rademachers with limited independence, which we accomplish via a Dudley-like chaining argument that may have applications elsewhere. Comment: v4: PODS'17 camera-ready version, includes improved space for $\ell_2$ tracking (by a $\log(1/\epsilon)$ factor); v3: fixed accidental mis-sorting of author last names; v2: added a section explaining why pick-and-drop sampling fails for $\ell_2$ heavy hitters, and fixed a minor typo
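The CountSketch baseline referenced above is simple to sketch: each of d rows hashes an item to a bucket and a random sign, and a point estimate is the median of the signed counters. This is a minimal illustration, not the paper's BPTree; the hash construction here (Python's built-in `hash` with per-row seeds) stands in for the pairwise-independent hash families the analysis assumes.

```python
import random
import statistics

class CountSketch:
    """Minimal CountSketch [CCF04]-style estimator: d rows of w signed
    counters; the estimate for an item is the median over rows of its
    signed bucket value."""
    def __init__(self, d=5, w=256, seed=0):
        rng = random.Random(seed)
        self.seeds = [(rng.random(), rng.random()) for _ in range(d)]
        self.w = w
        self.table = [[0] * w for _ in range(d)]

    def _bucket_sign(self, row, item):
        s1, s2 = self.seeds[row]
        bucket = hash((s1, item)) % self.w
        sign = 1 if hash((s2, item)) % 2 == 0 else -1
        return bucket, sign

    def update(self, item, delta=1):
        for r in range(len(self.table)):
            b, s = self._bucket_sign(r, item)
            self.table[r][b] += s * delta

    def estimate(self, item):
        vals = []
        for r in range(len(self.table)):
            b, s = self._bucket_sign(r, item)
            vals.append(s * self.table[r][b])
        return statistics.median(vals)

cs = CountSketch()
for _ in range(1000):
    cs.update("heavy")
for i in range(200):
    cs.update(f"light-{i}")
```

Because updates may be negative (`delta=-1`), the same structure supports the deletion streams for which CountSketch is space-optimal.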

    Locality-Sensitive Sketching for Resilient Network Flow Monitoring

    Network monitoring is vital in modern clouds and data center networks for traffic engineering, network diagnosis, and network intrusion detection, which need diverse traffic statistics ranging from flow size distributions to heavy hitters. To cope with increasing network rates and massive traffic volumes, sketch-based approximate measurement has been extensively studied as a way to trade accuracy for memory and computation cost; unfortunately, it is sensitive to hash collisions. In addition, deploying a sketch involves fine-grained performance control and instrumentation. This paper presents a locality-sensitive sketch (LSS) that is resilient to hash collisions. LSS proactively minimizes the estimation error due to hash collisions with an autoencoder-based optimization model, and reduces the estimation variance by mapping similar network flows to the same bucket array. To illustrate the feasibility of the sketch, we develop a disaggregated monitoring application that supports non-intrusive sketching deployment and native network-wide analysis. Testbed experiments show that the framework adapts to line rates and provides accurate query results. Real-world trace-driven simulations show that LSS maintains stable performance over wide ranges of parameters and dramatically outperforms state-of-the-art sketching structures, with a $10^3$ to $10^5$ times reduction in relative errors for per-flow queries as the ratio of the number of buckets to the number of network flows decreases from 10% to 0.1%.

    Memento: Making Sliding Windows Efficient for Heavy Hitters

    Cloud operators require real-time identification of Heavy Hitters (HH) and Hierarchical Heavy Hitters (HHH) for applications such as load balancing, traffic engineering, and attack mitigation. However, existing techniques are slow in detecting new heavy hitters. In this paper, we make the case for identifying heavy hitters through \textit{sliding windows}. Sliding windows detect heavy hitters quicker and more accurately than current methods, but to date have had no practical algorithms. Accordingly, we introduce, design, and analyze the \textit{Memento} family of sliding window algorithms for the HH and HHH problems in the single-device and network-wide settings. Using extensive evaluations, we show that our single-device solutions attain similar accuracy and are up to $273\times$ faster than existing window-based techniques. Furthermore, we exemplify our network-wide HHH detection capabilities on a realistic testbed. To that end, we implemented Memento as an open-source extension to the popular HAProxy cloud load-balancer. In our evaluations, using an HTTP flood by 50 subnets, our network-wide approach detected the new subnets faster and reduced the number of undetected flood requests by up to $37\times$ compared to the alternatives. Comment: This is an extended version of the paper that will appear in ACM CoNEXT 201
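The sliding-window semantics the Memento family approximates can be shown with an exact but space-hungry baseline: keep the last W items verbatim and report items above a threshold fraction of the window. This is not Memento's algorithm, only the query model it targets; names below are illustrative.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Naive exact sliding-window frequency counter: O(W) space for a
    window of W items. Window-based HH algorithms like Memento approximate
    these answers in far less space; this baseline fixes the semantics."""
    def __init__(self, window):
        self.window = window
        self.items = deque()
        self.counts = Counter()

    def add(self, item):
        self.items.append(item)
        self.counts[item] += 1
        if len(self.items) > self.window:
            old = self.items.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def heavy_hitters(self, theta):
        # items occupying more than a theta fraction of the current window
        return {x for x, c in self.counts.items() if c > theta * len(self.items)}

sw = SlidingWindowCounter(window=100)
for i in range(1000):
    sw.add("dos-victim" if i % 2 == 0 else f"flow-{i}")
```

Because only the last 100 packets count, a newly started flood dominates the window (and is reported) quickly, which is the detection-speed argument the abstract makes for sliding windows over full-stream counts.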

    Hokusai - Sketching Streams in Real Time

    We describe Hokusai, a real-time system that is able to capture frequency information for streams of arbitrary sequences of symbols. The algorithm uses the CountMin sketch as its basis and exploits the fact that sketching is linear. It provides real-time statistics for arbitrary events, e.g. streams of queries as a function of time. We use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations. Queries can be answered in constant time. Comment: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI 2012)
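The linearity the abstract exploits is easy to demonstrate: two CountMin sketches built with the same hash functions can be merged by cellwise addition, so per-interval sketches aggregate into coarser time ranges. This minimal sketch (not Hokusai's full item/time factorization) illustrates that property:

```python
class CountMin:
    """Minimal CountMin sketch. Linearity: sketch(A) + sketch(B) equals
    sketch(A + B) cell by cell when both use the same hash functions,
    which is what lets per-hour sketches be summed into a daily one."""
    def __init__(self, d=4, w=512):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _buckets(self, item):
        return [hash((r, item)) % self.w for r in range(self.d)]

    def update(self, item, delta=1):
        for r, b in enumerate(self._buckets(item)):
            self.table[r][b] += delta

    def query(self, item):
        # min over rows: an upper bound on the true count
        return min(self.table[r][b] for r, b in enumerate(self._buckets(item)))

    def merge(self, other):
        merged = CountMin(self.d, self.w)
        for r in range(self.d):
            merged.table[r] = [a + b for a, b in zip(self.table[r], other.table[r])]
        return merged

hour1, hour2 = CountMin(), CountMin()
for _ in range(40):
    hour1.update("query:weather")
for _ in range(60):
    hour2.update("query:weather")
day = hour1.merge(hour2)  # answers (item, coarser-interval) queries
```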

    MacroBase: Prioritizing Attention in Fast Data

    As data volumes continue to rise, manual inspection is becoming increasingly untenable. In response, we present MacroBase, a data analytics engine that prioritizes end-user attention in high-volume fast data streams. MacroBase enables efficient, accurate, and modular analyses that highlight and aggregate important and unusual behavior, acting as a search engine for fast data. MacroBase is able to deliver order-of-magnitude speedups over alternatives by optimizing the combination of explanation and classification tasks and by leveraging a new reservoir sampler and heavy-hitters sketch specialized for fast data streams. As a result, MacroBase delivers accurate results at speeds of up to 2M events per second per query on a single core. The system has delivered meaningful results in production, including at a telematics company monitoring hundreds of thousands of vehicles.Comment: SIGMOD 201

    The Online Event-Detection Problem

    Given a stream $S = (s_1, s_2, \ldots, s_N)$, a $\phi$-heavy hitter is an item $s_i$ that occurs at least $\phi N$ times in $S$. The problem of finding heavy hitters has been extensively studied in the database literature. In this paper, we study a related problem. We say that there is a $\phi$-event at time $t$ if $s_t$ occurs exactly $\phi N$ times in $(s_1, s_2, \ldots, s_t)$. Thus, for each $\phi$-heavy hitter there is a single $\phi$-event, which occurs when its count reaches the reporting threshold $\phi N$. We define the online event-detection problem (OEDP) as: given $\phi$ and a stream $S$, report all $\phi$-events as soon as they occur. Many real-world monitoring systems demand event detection in which all events must be reported (no false negatives), in a timely manner, with no non-events reported (no false positives), and with a low reporting threshold. As a result, the OEDP requires a large amount of space ($\Omega(N)$ words) and is not solvable in the streaming model or via standard sampling-based approaches. Since the OEDP requires large space, we focus on cache-efficient algorithms in the external-memory model. We provide algorithms for the OEDP that are within a log factor of optimal. Our algorithms are tunable: their parameters can be set to allow for bounded false positives and a bounded delay in reporting. None of our relaxations allow false negatives, since reporting all events is a strict requirement of our applications. Finally, we show improved results when the counts of items in the input stream follow a power-law distribution.
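The $\phi$-event definition translates directly into a small exact detector, which also makes the space lower bound plausible: it must track every distinct item's exact count.

```python
from collections import Counter

def phi_events(stream, phi):
    """Exact online phi-event detector: report item s_t at time t the
    moment its count in the prefix (s_1..s_t) reaches phi*N, where
    N = len(stream). Uses space proportional to the number of distinct
    items, in line with the paper's point that sublinear streaming
    solutions cannot exist for exact OEDP."""
    n = len(stream)
    threshold = phi * n
    counts = Counter()
    events = []
    for t, s in enumerate(stream, start=1):
        counts[s] += 1
        if counts[s] == threshold:  # exactly hits the reporting threshold
            events.append((t, s))
    return events

# N = 10 and phi = 0.3, so the reporting threshold is 3 occurrences
stream = ["a", "b", "a", "c", "a", "b", "a", "a", "b", "c"]
events = phi_events(stream, 0.3)
```

Each heavy hitter triggers exactly one event, at the moment its running count hits the threshold, never again afterwards.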

    Beating CountSketch for Heavy Hitters in Insertion Streams

    Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which without loss of generality we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq \epsilon m$. In 2002, Charikar, Chen, and Farach-Colton suggested the {\sf CountSketch} data structure, which finds all such $j$ using $\Theta(\log^2 n)$ bits of space (for constant $\epsilon > 0$). The only known lower bound is $\Omega(\log n)$ bits of space, which comes from the need to specify the identities of the items found. In this paper we show it is possible to achieve $O(\log n \log \log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including (1) the first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n \log \log n)$ bits of space, improving a natural union bound and the algorithm of Huang, Tai, and Yi (2014), and (2) a way to estimate the $\ell_\infty$ norm of a stream up to additive error $\epsilon \sqrt{F_2}$ with $O(\log n \log \log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion-only streams.
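The claim that the $\ell_2$-guarantee is strictly stronger than the $\ell_1$-guarantee can be checked numerically: an item with frequency around $\sqrt{m}$ in a stream of near-unique items clears the $\epsilon\sqrt{F_2}$ threshold but falls far short of $\epsilon m$. The brute-force functions below just evaluate both definitions on an example stream.

```python
import math
from collections import Counter

def l1_heavy_hitters(stream, eps):
    """Items j with f_j >= eps * m (the weaker l1-guarantee)."""
    counts = Counter(stream)
    m = len(stream)
    return {j for j, f in counts.items() if f >= eps * m}

def l2_heavy_hitters(stream, eps):
    """Items j with f_j >= eps * sqrt(F_2), F_2 = sum of squared counts."""
    counts = Counter(stream)
    sqrt_f2 = math.sqrt(sum(f * f for f in counts.values()))
    return {j for j, f in counts.items() if f >= eps * sqrt_f2}

# One item appears 100 times among m = 10000 items, the rest are unique:
# F_2 = 100^2 + 9900, so sqrt(F_2) ~ 141 and "signal" is an l2-heavy
# hitter at eps = 0.25, while eps * m = 2500 makes it invisible to l1.
stream = ["signal"] * 100 + [f"noise-{i}" for i in range(9900)]
```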

    Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data

    We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. Our online sketching algorithm compresses an $N$-element dataset to a sketch of size $O(N^b \log^3 N)$ in $O(N^{(b+1)} \log^3 N)$ time, where $b < 1$. This sketch can correctly report the nearest neighbors of any query that satisfies a stability condition parameterized by $b$. We achieve sublinear memory performance on stable queries by combining recent advances in locality-sensitive hash (LSH)-based estimators, online kernel density estimation, and compressed sensing. Our theoretical results shed new light on the memory-accuracy tradeoff for nearest neighbor search, and our sketch, which consists entirely of short integer arrays, has a variety of attractive features in practice. We evaluate the memory-recall tradeoff of our method on a friend recommendation task on the Google Plus social network. We obtain orders of magnitude better compression than the random-projection-based alternative while retaining the ability to report the nearest neighbors of practical queries. Comment: Published in ICML202
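The LSH building block behind such estimators can be illustrated with signed random projections: each random hyperplane contributes one sign bit, and vectors at a small angle agree on most bits with high probability. This is a generic illustration of the LSH primitive, not the paper's sketch; all parameters are illustrative.

```python
import random

def srp_signature(vec, planes):
    """Signed-random-projection LSH signature: one sign bit per random
    Gaussian hyperplane. Pr[bit agrees] = 1 - angle(u, v) / pi, so near
    neighbors collide on most bits."""
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

rng = random.Random(7)
dim, bits = 8, 16
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

a = [1.0] * dim
b = [1.0] * 7 + [0.9]   # nearly identical direction to a
c = [-1.0] * dim        # opposite direction to a

sig_a = srp_signature(a, planes)
sig_b = srp_signature(b, planes)
sig_c = srp_signature(c, planes)
```

Hamming distance between signatures then serves as a cheap proxy for angular distance: `sig_a` and `sig_b` agree on almost every bit, while `sig_a` and `sig_c` disagree on nearly all of them.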