Heavy Hitters over Interval Queries
Heavy hitters and frequency measurements are fundamental in many networking
applications such as load balancing, QoS, and network security. This paper
considers a generalized sliding window model that supports frequency and heavy
hitters queries over an interval given at \emph{query time}. This enables
drill-down queries, in which the behavior of the network can be examined in
finer and finer granularities. For this model, we asymptotically improve the
space bounds of existing work, reduce the update and query time to a constant,
and provide deterministic solutions. When evaluated over real Internet packet
traces, our fastest algorithm processes packets -- times faster, serves queries at least times quicker, and consumes at least less space than the known method.
CR-precis: A deterministic summary structure for update data streams
We present the \crprecis structure, a general-purpose, deterministic and sub-linear data structure for summarizing \emph{update} data streams. The \crprecis structure yields the \emph{first deterministic sub-linear space/time algorithms for update streams} for answering a variety of fundamental stream queries, such as (a) point queries, (b) range queries, (c) finding approximate frequent items, (d) finding approximate quantiles, (e) finding approximate hierarchical heavy hitters, (f) estimating inner-products, and (g) near-optimal -bucket histograms. Comment: 11 pages
BPTree: an $\ell_2$ heavy hitters algorithm using constant memory
The task of finding heavy hitters is one of the best known and well studied problems in the area of data streams. One is given a list of items and the goal is to identify those that appear frequently in the list. In sub-polynomial space, the strongest guarantee available is the $\ell_2$ guarantee, which requires finding all items that occur at least $\varepsilon \|f\|_2$ times in the stream, where the vector $f$ is the count histogram of the stream with $i$th coordinate equal to the number of times $i$ appears. The first algorithm to achieve the $\ell_2$ guarantee was the CountSketch of [CCF04], which requires $O(\varepsilon^{-2} \log n)$ words of memory and $O(\log n)$ update time and is known to be space-optimal if the stream allows for deletions. The recent work of [BCIW16] gave an improved algorithm for insertion-only streams, using only $O(\varepsilon^{-2} \log \varepsilon^{-1} \log\log n)$ words of memory. In this work, we give an algorithm \bptree for $\ell_2$ heavy hitters in insertion-only streams that achieves $O(\varepsilon^{-2} \log \varepsilon^{-1})$ words of memory and $O(\log \varepsilon^{-1})$ update time, which is the optimal dependence on $n$ and $m$. In addition, we describe an algorithm for tracking $\|f\|_2$ at all times with $O(\varepsilon^{-2})$ memory and update time. Our analyses rely on bounding the expected supremum of a Bernoulli process involving Rademachers with limited independence, which we accomplish via a Dudley-like chaining argument that may have applications elsewhere. Comment: v4: PODS'17 camera-ready version, includes improved space $\ell_2$ tracking (by a $\log(1/\varepsilon)$ factor); v3: fixed accidental mis-sorting of author last names; v2: added section explaining why pick-and-drop sampling fails for $\ell_2$ heavy hitters, and fixed minor typo
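The abstract above is built around CountSketch's frequency guarantee. As an illustrative baseline (not the BPTree algorithm itself), here is a minimal CountSketch in Python, with Python's built-in `hash` plus per-row salts standing in for pairwise-independent hash families:

```python
import random

class CountSketch:
    """Minimal CountSketch: d rows of w signed counters; each item hashes
    to one bucket per row and contributes with a random sign. The median
    of the per-row signed estimates approximates the item's frequency."""

    def __init__(self, d=5, w=256, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.tables = [[0] * w for _ in range(d)]
        # Per-row salts stand in for pairwise-independent hash families.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64))
                      for _ in range(d)]

    def _bucket_sign(self, row, item):
        h_salt, s_salt = self.salts[row]
        bucket = hash((h_salt, item)) % self.w
        sign = 1 if hash((s_salt, item)) % 2 == 0 else -1
        return bucket, sign

    def update(self, item, count=1):
        for r in range(self.d):
            b, s = self._bucket_sign(r, item)
            self.tables[r][b] += s * count

    def estimate(self, item):
        ests = []
        for r in range(self.d):
            b, s = self._bucket_sign(r, item)
            ests.append(s * self.tables[r][b])
        ests.sort()
        return ests[len(ests) // 2]  # median of the d row estimates
```

For a heavy item the estimate is tight with high probability, since a colliding light item only perturbs a minority of rows and the median discards those outliers.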
Locality-Sensitive Sketching for Resilient Network Flow Monitoring
Network monitoring is vital in modern clouds and data center networks for traffic engineering, network diagnosis, and network intrusion detection, all of which need diverse traffic statistics ranging from flow size distributions to heavy hitters. To cope with increasing network rates and massive traffic volumes, sketch-based approximate measurement has been extensively studied as a way to trade accuracy for memory and computation cost; unfortunately, it is sensitive to hash collisions. In addition, deploying the sketch involves fine-grained performance control and instrumentation.
This paper presents a locality-sensitive sketch (LSS) that is resilient to hash collisions. LSS proactively minimizes the estimation error due to hash collisions with an autoencoder-based optimization model, and reduces the estimation variance by mapping similar network flows to the same bucket array. To illustrate the feasibility of the sketch, we develop a disaggregated monitoring application that supports non-intrusive sketching deployment and native network-wide analysis. Testbed experiments show that the framework adapts to line rates and provides accurate query results. Real-world trace-driven simulations show that LSS maintains stable performance under wide ranges of parameters and dramatically outperforms state-of-the-art sketching structures, with over to times reduction in relative errors for per-flow queries as the ratio of the number of buckets to the number of network flows reduces from 10\% to 0.1\%.
Memento: Making Sliding Windows Efficient for Heavy Hitters
Cloud operators require real-time identification of Heavy Hitters (HH) and
Hierarchical Heavy Hitters (HHH) for applications such as load balancing,
traffic engineering, and attack mitigation. However, existing techniques are
slow in detecting new heavy hitters.
In this paper, we make the case for identifying heavy hitters through
\textit{sliding windows}. Sliding windows detect heavy hitters quicker and more accurately than current methods, but until now had no practical algorithms.
Accordingly, we introduce, design and analyze the \textit{Memento} family of
sliding window algorithms for the HH and HHH problems in the single-device and
network-wide settings. Using extensive evaluations, we show that our
single-device solutions attain similar accuracy and are up to faster than existing window-based techniques. Furthermore, we exemplify our
network-wide HHH detection capabilities on a realistic testbed. To that end, we
implemented Memento as an open-source extension to the popular HAProxy cloud
load-balancer. In our evaluations, using an HTTP flood by 50 subnets, our
network-wide approach detected the new subnets faster, and reduced the number
of undetected flood requests by up to compared to the alternatives.Comment: This is an extended version of the paper that will appear in ACM
CoNEXT 201
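As a correctness baseline for the sliding-window model described above (not the Memento algorithms themselves, whose point is to use far less space than exact counting), windowed heavy hitters can be computed exactly with a deque and a counter:

```python
from collections import Counter, deque

class SlidingWindowHH:
    """Exact sliding-window heavy hitters over the last `window` items.
    Uses O(window) memory; practical sliding-window algorithms approximate
    these answers in sublinear space."""

    def __init__(self, window):
        self.window = window
        self.items = deque()
        self.counts = Counter()

    def add(self, item):
        self.items.append(item)
        self.counts[item] += 1
        if len(self.items) > self.window:
            old = self.items.popleft()       # expire the oldest item
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def heavy_hitters(self, theta):
        # Items occupying at least a theta fraction of the current window.
        n = len(self.items)
        return {x: c for x, c in self.counts.items() if c >= theta * n}
```

Any approximate sliding-window algorithm can be checked against this baseline on small traces.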
Hokusai - Sketching Streams in Real Time
We describe Hokusai, a real-time system able to capture frequency information for streams of arbitrary sequences of symbols. The algorithm uses the CountMin sketch as its basis and exploits the fact that sketching is linear. It provides real-time statistics of arbitrary events, e.g. streams of queries as a function of time. We use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations. Queries can be answered in constant time. Comment: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI 2012)
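The linearity that Hokusai exploits is visible in a minimal CountMin sketch: two sketches built with the same hash functions merge by entry-wise addition, so per-interval sketches can be aggregated over time. This is an illustrative sketch of that property, not Hokusai's full item- and time-aggregation scheme:

```python
import random

class CountMin:
    """Minimal CountMin sketch. Sketches built with the same seed (hence
    the same hash salts) can be merged by entry-wise addition, because
    sketching is a linear operation on the count vector."""

    def __init__(self, d=4, w=512, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.tables = [[0] * w for _ in range(d)]

    def update(self, item, count=1):
        for r, salt in enumerate(self.salts):
            self.tables[r][hash((salt, item)) % self.w] += count

    def query(self, item):
        # Minimum over rows: never underestimates, rarely overestimates much.
        return min(self.tables[r][hash((salt, item)) % self.w]
                   for r, salt in enumerate(self.salts))

    def merge(self, other):
        # Linearity: sketch(A) + sketch(B) = sketch(A followed by B).
        assert (self.d, self.w, self.salts) == (other.d, other.w, other.salts)
        for r in range(self.d):
            for b in range(self.w):
                self.tables[r][b] += other.tables[r][b]
```

For example, one sketch per hour can be kept and summed to answer queries over any span of hours.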
MacroBase: Prioritizing Attention in Fast Data
As data volumes continue to rise, manual inspection is becoming increasingly
untenable. In response, we present MacroBase, a data analytics engine that
prioritizes end-user attention in high-volume fast data streams. MacroBase
enables efficient, accurate, and modular analyses that highlight and aggregate
important and unusual behavior, acting as a search engine for fast data.
MacroBase is able to deliver order-of-magnitude speedups over alternatives by
optimizing the combination of explanation and classification tasks and by
leveraging a new reservoir sampler and heavy-hitters sketch specialized for
fast data streams. As a result, MacroBase delivers accurate results at speeds
of up to 2M events per second per query on a single core. The system has
delivered meaningful results in production, including at a telematics company
monitoring hundreds of thousands of vehicles.Comment: SIGMOD 201
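MacroBase's specialized reservoir sampler is not detailed in the abstract above; as background on the underlying primitive, classic reservoir sampling (Algorithm R) maintains a uniform random sample of a stream of unknown length in O(k) memory:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: after processing i items, each item is in the sample
    with probability k / i, using only O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # uniform position in [0, i]
            if j < k:
                sample[j] = item         # replace with probability k/(i+1)
    return sample
```

Systems like MacroBase adapt this idea (e.g. with exponential decay) to bias samples toward recent data.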
The Online Event-Detection Problem
Given a stream $S = a_1, \ldots, a_N$, a $\phi$-heavy hitter is an item that occurs at least $\phi N$ times in $S$. The problem of finding heavy hitters has been extensively studied in the database literature. In this paper, we study a related problem. We say that there is a $\phi$-event at time $t$ if $a_t$ occurs exactly $\phi N$ times in $a_1, \ldots, a_t$. Thus, for each $\phi$-heavy hitter there is a single $\phi$-event, which occurs when its count reaches the reporting threshold $\phi N$. We define the online event-detection problem (OEDP) as: given $\phi$ and a stream $S$, report all $\phi$-events as soon as they occur.
Many real-world monitoring systems demand event detection where all events must be reported (no false negatives), in a timely manner, with no non-events reported (no false positives), and with a low reporting threshold. As a result, the OEDP requires a large amount of space ($\Omega(N)$ words) and is not solvable in the streaming model or via standard sampling-based approaches.
Since the OEDP requires large space, we focus on cache-efficient algorithms in the external-memory model.
We provide algorithms for the OEDP that are within a log factor of optimal. Our algorithms are tunable: their parameters can be set to allow a bounded false-positive rate and a bounded delay in reporting. None of our relaxations allow false negatives, since reporting all events is a strict requirement of our applications. Finally, we show improved results when the counts of items in the input stream follow a power-law distribution.
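A direct exact-counting baseline (not the authors' cache-efficient external-memory algorithms) makes the problem and its space requirement concrete: every distinct item's exact count must be tracked to report an event the instant the count hits the threshold. Here the reporting threshold $T$ (i.e., $\phi N$) is passed in directly:

```python
def detect_events(stream, threshold):
    """Report each event (t, item): the earliest time t at which `item`'s
    count reaches `threshold`. Exact per-item counts are required, hence
    Omega(N) words of space in the worst case."""
    counts = {}
    events = []
    for t, item in enumerate(stream):
        counts[item] = counts.get(item, 0) + 1
        if counts[item] == threshold:   # fires exactly once per heavy item
            events.append((t, item))
    return events
```

The `==` check (rather than `>=`) is what makes each event fire exactly once, at the moment the threshold is first reached.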
Beating CountSketch for Heavy Hitters in Insertion Streams
Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which, without loss of generality, we identify with the set of integers $\{1, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_i f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq \epsilon m$. In 2002, Charikar, Chen, and Farach-Colton suggested the {\sf CountSketch} data structure, which finds all such $j$ using $\Theta(\log^2 n)$ bits of space (for constant $\epsilon$). The only known lower bound is $\Omega(\log n)$ bits of space, which comes from the need to specify the identities of the items found. In this paper we show it is possible to achieve $O(\log n \log\log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including
(1) The first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n \log\log n)$ bits of space, improving a natural union bound and the algorithm of Huang, Tai, and Yi (2014).
(2) A way to estimate the $\ell_\infty$ norm of a stream up to additive error $\epsilon \sqrt{F_2}$ with $O(\log n \log\log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion-only streams
Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data
We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. Our online sketching algorithm compresses an $N$ element dataset to a sketch of size $O(N^b \log^3 N)$ in $O(N^b \log^3 N)$ time, where $b < 1$. This sketch can correctly report the nearest neighbors of any query that satisfies a stability condition parameterized by $b$. We
achieve sublinear memory performance on stable queries by combining recent
advances in locality sensitive hash (LSH)-based estimators, online kernel
density estimation, and compressed sensing. Our theoretical results shed new
light on the memory-accuracy tradeoff for nearest neighbor search, and our
sketch, which consists entirely of short integer arrays, has a variety of
attractive features in practice. We evaluate the memory-recall tradeoff of our
method on a friend recommendation task in the Google Plus social media network.
We obtain orders of magnitude better compression than the random projection
based alternative while retaining the ability to report the nearest neighbors
of practical queries.Comment: Published in ICML202