
    Heavy Hitters over Interval Queries

    Heavy hitters and frequency measurements are fundamental in many networking applications such as load balancing, QoS, and network security. This paper considers a generalized sliding window model that supports frequency and heavy hitters queries over an interval given at \emph{query time}. This enables drill-down queries, in which the behavior of the network can be examined at finer and finer granularities. For this model, we asymptotically improve the space bounds of existing work, reduce the update and query time to a constant, and provide deterministic solutions. When evaluated over real Internet packet traces, our fastest algorithm processes packets 90--250 times faster, serves queries at least 730 times quicker, and consumes at least 40% less space than the known method.
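The abstract above does not describe its data structure, but the query model it supports can be illustrated with a trivially exact (non-sublinear) baseline: store each item's arrival times and answer interval-frequency queries with two binary searches. All names here are illustrative, not from the paper.

```python
import bisect
from collections import defaultdict

class IntervalFrequency:
    """Exact baseline for interval frequency queries: record each item's
    arrival indices and answer "how often did x appear in [lo, hi]?" with
    two binary searches. The paper's algorithms approximate this in
    sublinear space; this sketch only illustrates the query semantics."""
    def __init__(self):
        self.arrivals = defaultdict(list)  # item -> sorted arrival indices
        self.t = 0

    def add(self, item):
        self.arrivals[item].append(self.t)
        self.t += 1

    def frequency(self, item, lo, hi):
        ts = self.arrivals[item]
        return bisect.bisect_right(ts, hi) - bisect.bisect_left(ts, lo)

iq = IntervalFrequency()
for i in range(100):
    iq.add("flow-A" if i % 4 == 0 else "flow-B")
```

A drill-down query then just narrows the interval, e.g. `iq.frequency("flow-A", 0, 99)` versus `iq.frequency("flow-A", 0, 7)`.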

    CR-precis: A deterministic summary structure for update data streams

    We present the CR-precis structure, a general-purpose, deterministic, and sub-linear data structure for summarizing \emph{update} data streams. The CR-precis structure yields the \emph{first deterministic sub-linear space/time algorithms for update streams} for answering a variety of fundamental stream queries, such as (a) point queries, (b) range queries, (c) finding approximate frequent items, (d) finding approximate quantiles, (e) finding approximate hierarchical heavy hitters, (f) estimating inner products, and (g) near-optimal $B$-bucket histograms. Comment: 11 pages
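The CR-precis construction itself is not given in the abstract. For context, the simplest deterministic sub-linear summary for approximate frequent items (on insert-only streams, unlike CR-precis, which also handles updates) is the classic Misra-Gries summary, sketched here as a point of comparison:

```python
def misra_gries(stream, k):
    """Deterministic Misra-Gries summary: keeps at most k-1 counters.
    Any item with true frequency f survives with an estimate >= f - m/k,
    where m is the stream length, so every item with f > m/k is retained."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement every counter; evict counters that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 10
summary = misra_gries(stream, k=4)  # at most 3 counters kept
```

With m = 100 and k = 4, every item appearing more than 25 times is guaranteed to survive with its count underestimated by at most 25.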

    BPTree: an $\ell_2$ heavy hitters algorithm using constant memory

    The task of finding heavy hitters is one of the best known and well studied problems in the area of data streams. One is given a list $i_1,i_2,\ldots,i_m\in[n]$ and the goal is to identify the items among $[n]$ that appear frequently in the list. In sub-polynomial space, the strongest guarantee available is the $\ell_2$ guarantee, which requires finding all items that occur at least $\epsilon\|f\|_2$ times in the stream, where the vector $f\in\mathbb{R}^n$ is the count histogram of the stream with $i$th coordinate equal to the number of times $i$ appears, $f_i:=\#\{j\in[m]:i_j=i\}$. The first algorithm to achieve the $\ell_2$ guarantee was the CountSketch of [CCF04], which requires $O(\epsilon^{-2}\log n)$ words of memory and $O(\log n)$ update time and is known to be space-optimal if the stream allows for deletions. The recent work of [BCIW16] gave an improved algorithm for insertion-only streams, using only $O(\epsilon^{-2}\log\epsilon^{-1}\log\log n)$ words of memory. In this work, we give an algorithm BPTree for $\ell_2$ heavy hitters in insertion-only streams that achieves $O(\epsilon^{-2}\log\epsilon^{-1})$ words of memory and $O(\log\epsilon^{-1})$ update time, which is the optimal dependence on $n$ and $m$. In addition, we describe an algorithm for tracking $\|f\|_2$ at all times with $O(\epsilon^{-2})$ memory and update time. Our analyses rely on bounding the expected supremum of a Bernoulli process involving Rademachers with limited independence, which we accomplish via a Dudley-like chaining argument that may have applications elsewhere. Comment: v4: PODS'17 camera-ready version, includes improved space for $\ell_2$ tracking (by a $\log(1/\epsilon)$ factor); v3: fixed accidental mis-sorting of author last names; v2: added a section explaining why pick-and-drop sampling fails for $\ell_2$ heavy hitters, and fixed a minor typo
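The CountSketch baseline referenced above is simple to sketch: each of d rows hashes an item to a bucket and a random sign, and a point estimate is the median of the signed counters. This is a minimal illustration, not the paper's BPTree; the hash construction here (Python's built-in `hash` with per-row seeds) stands in for the pairwise-independent hash families the analysis assumes.

```python
import random
import statistics

class CountSketch:
    """Minimal CountSketch [CCF04]-style estimator: d rows of w signed
    counters; the estimate for an item is the median over rows of its
    signed bucket value."""
    def __init__(self, d=5, w=256, seed=0):
        rng = random.Random(seed)
        self.seeds = [(rng.random(), rng.random()) for _ in range(d)]
        self.w = w
        self.table = [[0] * w for _ in range(d)]

    def _bucket_sign(self, row, item):
        s1, s2 = self.seeds[row]
        bucket = hash((s1, item)) % self.w
        sign = 1 if hash((s2, item)) % 2 == 0 else -1
        return bucket, sign

    def update(self, item, delta=1):
        for r in range(len(self.table)):
            b, s = self._bucket_sign(r, item)
            self.table[r][b] += s * delta

    def estimate(self, item):
        vals = []
        for r in range(len(self.table)):
            b, s = self._bucket_sign(r, item)
            vals.append(s * self.table[r][b])
        return statistics.median(vals)

cs = CountSketch()
for _ in range(1000):
    cs.update("heavy")
for i in range(200):
    cs.update(f"light-{i}")
```

Because updates may be negative (`delta=-1`), the same structure supports the deletion streams for which CountSketch is space-optimal.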

    Locality-Sensitive Sketching for Resilient Network Flow Monitoring

    Network monitoring is vital in modern clouds and data center networks for traffic engineering, network diagnosis, and network intrusion detection, which need diverse traffic statistics ranging from flow size distributions to heavy hitters. To cope with increasing network rates and massive traffic volumes, sketch-based approximate measurement has been extensively studied as a way to trade accuracy for memory and computation cost; unfortunately, it is sensitive to hash collisions. In addition, deploying a sketch involves fine-grained performance control and instrumentation. This paper presents a locality-sensitive sketch (LSS) that is resilient to hash collisions. LSS proactively minimizes the estimation error due to hash collisions with an autoencoder-based optimization model, and reduces the estimation variance by mapping similar network flows to the same bucket array. To illustrate the feasibility of the sketch, we develop a disaggregated monitoring application that supports non-intrusive sketching deployment and native network-wide analysis. Testbed experiments show that the framework adapts to line rates and provides accurate query results. Real-world trace-driven simulations show that LSS maintains stable performance over wide ranges of parameters and dramatically outperforms state-of-the-art sketching structures, with a $10^3$ to $10^5$ times reduction in relative errors for per-flow queries as the ratio of the number of buckets to the number of network flows decreases from 10% to 0.1%.

    Memento: Making Sliding Windows Efficient for Heavy Hitters

    Cloud operators require real-time identification of Heavy Hitters (HH) and Hierarchical Heavy Hitters (HHH) for applications such as load balancing, traffic engineering, and attack mitigation. However, existing techniques are slow in detecting new heavy hitters. In this paper, we make the case for identifying heavy hitters through \textit{sliding windows}. Sliding windows detect heavy hitters quicker and more accurately than current methods, but to date have had no practical algorithms. Accordingly, we introduce, design, and analyze the \textit{Memento} family of sliding window algorithms for the HH and HHH problems in the single-device and network-wide settings. Using extensive evaluations, we show that our single-device solutions attain similar accuracy and are up to $273\times$ faster than existing window-based techniques. Furthermore, we exemplify our network-wide HHH detection capabilities on a realistic testbed. To that end, we implemented Memento as an open-source extension to the popular HAProxy cloud load-balancer. In our evaluations, using an HTTP flood by 50 subnets, our network-wide approach detected the new subnets faster and reduced the number of undetected flood requests by up to $37\times$ compared to the alternatives. Comment: This is an extended version of the paper that will appear in ACM CoNEXT 201
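The sliding-window semantics the Memento family approximates can be shown with an exact but space-hungry baseline: keep the last W items verbatim and report items above a threshold fraction of the window. This is not Memento's algorithm, only the query model it targets; names below are illustrative.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Naive exact sliding-window frequency counter: O(W) space for a
    window of W items. Window-based HH algorithms like Memento approximate
    these answers in far less space; this baseline fixes the semantics."""
    def __init__(self, window):
        self.window = window
        self.items = deque()
        self.counts = Counter()

    def add(self, item):
        self.items.append(item)
        self.counts[item] += 1
        if len(self.items) > self.window:
            old = self.items.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def heavy_hitters(self, theta):
        # items occupying more than a theta fraction of the current window
        return {x for x, c in self.counts.items() if c > theta * len(self.items)}

sw = SlidingWindowCounter(window=100)
for i in range(1000):
    sw.add("dos-victim" if i % 2 == 0 else f"flow-{i}")
```

Because only the last 100 packets count, a newly started flood dominates the window (and is reported) quickly, which is the detection-speed argument the abstract makes for sliding windows over full-stream counts.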

    Hokusai - Sketching Streams in Real Time

    We describe Hokusai, a real-time system that is able to capture frequency information for streams of arbitrary sequences of symbols. The algorithm uses the CountMin sketch as its basis and exploits the fact that sketching is linear. It provides real-time statistics for arbitrary events, e.g. streams of queries as a function of time. We use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations. Queries can be answered in constant time. Comment: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI 2012)
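The linearity the abstract exploits is easy to demonstrate: two CountMin sketches built with the same hash functions can be merged by cellwise addition, so per-interval sketches aggregate into coarser time ranges. This minimal sketch (not Hokusai's full item/time factorization) illustrates that property:

```python
class CountMin:
    """Minimal CountMin sketch. Linearity: sketch(A) + sketch(B) equals
    sketch(A + B) cell by cell when both use the same hash functions,
    which is what lets per-hour sketches be summed into a daily one."""
    def __init__(self, d=4, w=512):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _buckets(self, item):
        return [hash((r, item)) % self.w for r in range(self.d)]

    def update(self, item, delta=1):
        for r, b in enumerate(self._buckets(item)):
            self.table[r][b] += delta

    def query(self, item):
        # min over rows: an upper bound on the true count
        return min(self.table[r][b] for r, b in enumerate(self._buckets(item)))

    def merge(self, other):
        merged = CountMin(self.d, self.w)
        for r in range(self.d):
            merged.table[r] = [a + b for a, b in zip(self.table[r], other.table[r])]
        return merged

hour1, hour2 = CountMin(), CountMin()
for _ in range(40):
    hour1.update("query:weather")
for _ in range(60):
    hour2.update("query:weather")
day = hour1.merge(hour2)  # answers (item, coarser-interval) queries
```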

    MacroBase: Prioritizing Attention in Fast Data

    As data volumes continue to rise, manual inspection is becoming increasingly untenable. In response, we present MacroBase, a data analytics engine that prioritizes end-user attention in high-volume fast data streams. MacroBase enables efficient, accurate, and modular analyses that highlight and aggregate important and unusual behavior, acting as a search engine for fast data. MacroBase is able to deliver order-of-magnitude speedups over alternatives by optimizing the combination of explanation and classification tasks and by leveraging a new reservoir sampler and heavy-hitters sketch specialized for fast data streams. As a result, MacroBase delivers accurate results at speeds of up to 2M events per second per query on a single core. The system has delivered meaningful results in production, including at a telematics company monitoring hundreds of thousands of vehicles.Comment: SIGMOD 201

    The Online Event-Detection Problem

    Given a stream $S = (s_1, s_2, \ldots, s_N)$, a $\phi$-heavy hitter is an item $s_i$ that occurs at least $\phi N$ times in $S$. The problem of finding heavy hitters has been extensively studied in the database literature. In this paper, we study a related problem. We say that there is a $\phi$-event at time $t$ if $s_t$ occurs exactly $\phi N$ times in $(s_1, s_2, \ldots, s_t)$. Thus, for each $\phi$-heavy hitter there is a single $\phi$-event, which occurs when its count reaches the reporting threshold $\phi N$. We define the online event-detection problem (OEDP) as: given $\phi$ and a stream $S$, report all $\phi$-events as soon as they occur. Many real-world monitoring systems demand event detection in which all events must be reported (no false negatives), in a timely manner, with no non-events reported (no false positives), and with a low reporting threshold. As a result, the OEDP requires a large amount of space ($\Omega(N)$ words) and is not solvable in the streaming model or via standard sampling-based approaches. Since the OEDP requires large space, we focus on cache-efficient algorithms in the external-memory model. We provide algorithms for the OEDP that are within a log factor of optimal. Our algorithms are tunable: their parameters can be set to allow for bounded false positives and a bounded delay in reporting. None of our relaxations allow false negatives, since reporting all events is a strict requirement of our applications. Finally, we show improved results when the counts of items in the input stream follow a power-law distribution.
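The $\phi$-event definition translates directly into a small exact detector, which also makes the space lower bound plausible: it must track every distinct item's exact count.

```python
from collections import Counter

def phi_events(stream, phi):
    """Exact online phi-event detector: report item s_t at time t the
    moment its count in the prefix (s_1..s_t) reaches phi*N, where
    N = len(stream). Uses space proportional to the number of distinct
    items, in line with the paper's point that sublinear streaming
    solutions cannot exist for exact OEDP."""
    n = len(stream)
    threshold = phi * n
    counts = Counter()
    events = []
    for t, s in enumerate(stream, start=1):
        counts[s] += 1
        if counts[s] == threshold:  # exactly hits the reporting threshold
            events.append((t, s))
    return events

# N = 10 and phi = 0.3, so the reporting threshold is 3 occurrences
stream = ["a", "b", "a", "c", "a", "b", "a", "a", "b", "c"]
events = phi_events(stream, 0.3)
```

Each heavy hitter triggers exactly one event, at the moment its running count hits the threshold, never again afterwards.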

    Beating CountSketch for Heavy Hitters in Insertion Streams

    Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which without loss of generality we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq \epsilon m$. In 2002, Charikar, Chen, and Farach-Colton suggested the {\sf CountSketch} data structure, which finds all such $j$ using $\Theta(\log^2 n)$ bits of space (for constant $\epsilon > 0$). The only known lower bound is $\Omega(\log n)$ bits of space, which comes from the need to specify the identities of the items found. In this paper we show it is possible to achieve $O(\log n \log \log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including (1) the first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n \log \log n)$ bits of space, improving a natural union bound and the algorithm of Huang, Tai, and Yi (2014), and (2) a way to estimate the $\ell_\infty$ norm of a stream up to additive error $\epsilon \sqrt{F_2}$ with $O(\log n \log \log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion-only streams.
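The claim that the $\ell_2$-guarantee is strictly stronger than the $\ell_1$-guarantee can be checked numerically: an item with frequency around $\sqrt{m}$ in a stream of near-unique items clears the $\epsilon\sqrt{F_2}$ threshold but falls far short of $\epsilon m$. The brute-force functions below just evaluate both definitions on an example stream.

```python
import math
from collections import Counter

def l1_heavy_hitters(stream, eps):
    """Items j with f_j >= eps * m (the weaker l1-guarantee)."""
    counts = Counter(stream)
    m = len(stream)
    return {j for j, f in counts.items() if f >= eps * m}

def l2_heavy_hitters(stream, eps):
    """Items j with f_j >= eps * sqrt(F_2), F_2 = sum of squared counts."""
    counts = Counter(stream)
    sqrt_f2 = math.sqrt(sum(f * f for f in counts.values()))
    return {j for j, f in counts.items() if f >= eps * sqrt_f2}

# One item appears 100 times among m = 10000 items, the rest are unique:
# F_2 = 100^2 + 9900, so sqrt(F_2) ~ 141 and "signal" is an l2-heavy
# hitter at eps = 0.25, while eps * m = 2500 makes it invisible to l1.
stream = ["signal"] * 100 + [f"noise-{i}" for i in range(9900)]
```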

    Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data

    We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. Our online sketching algorithm compresses an $N$-element dataset to a sketch of size $O(N^b \log^3 N)$ in $O(N^{(b+1)} \log^3 N)$ time, where $b < 1$. This sketch can correctly report the nearest neighbors of any query that satisfies a stability condition parameterized by $b$. We achieve sublinear memory performance on stable queries by combining recent advances in locality-sensitive hash (LSH)-based estimators, online kernel density estimation, and compressed sensing. Our theoretical results shed new light on the memory-accuracy tradeoff for nearest neighbor search, and our sketch, which consists entirely of short integer arrays, has a variety of attractive features in practice. We evaluate the memory-recall tradeoff of our method on a friend recommendation task on the Google Plus social network. We obtain orders of magnitude better compression than the random-projection-based alternative while retaining the ability to report the nearest neighbors of practical queries. Comment: Published in ICML202
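The LSH building block behind such estimators can be illustrated with signed random projections: each random hyperplane contributes one sign bit, and vectors at a small angle agree on most bits with high probability. This is a generic illustration of the LSH primitive, not the paper's sketch; all parameters are illustrative.

```python
import random

def srp_signature(vec, planes):
    """Signed-random-projection LSH signature: one sign bit per random
    Gaussian hyperplane. Pr[bit agrees] = 1 - angle(u, v) / pi, so near
    neighbors collide on most bits."""
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

rng = random.Random(7)
dim, bits = 8, 16
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

a = [1.0] * dim
b = [1.0] * 7 + [0.9]   # nearly identical direction to a
c = [-1.0] * dim        # opposite direction to a

sig_a = srp_signature(a, planes)
sig_b = srp_signature(b, planes)
sig_c = srp_signature(c, planes)
```

Hamming distance between signatures then serves as a cheap proxy for angular distance: `sig_a` and `sig_b` agree on almost every bit, while `sig_a` and `sig_c` disagree on nearly all of them.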