
    Distinct counting with a self-learning bitmap

    Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, driven by its many modern applications, there has been significant interest in addressing the distinct counting problem in a data stream setting, where each incoming data item can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature that require only limited computing and memory resources. However, the performance of these methods is not scale-invariant, in the sense that their relative root mean square estimation errors (RRMSE) depend on the unknown cardinalities. This is undesirable in many applications where cardinalities can be very dynamic or inhomogeneous and many cardinalities need to be estimated. In this paper, we develop a novel approach, called the self-learning bitmap (S-bitmap), that is scale-invariant for cardinalities in a specified range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but scale-invariant. We demonstrate that to achieve a small RRMSE value of ε or less, our approach requires significantly less memory and performs a similar or smaller number of operations than state-of-the-art methods for many cardinality scales common in practice. Both simulation and experimental studies are reported. (Comment: accepted by the Journal of the American Statistical Association.)
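    The bitmap update rule the abstract describes lends itself to a short illustration. The Python below is a minimal sketch under assumptions of our own: the geometric sampling-rate schedule r**k and the waiting-time estimator are not taken from the paper, and all names (SBitmap, add, estimate) are hypothetical.

```python
import hashlib

class SBitmap:
    """Minimal illustrative sketch of an S-bitmap-style counter.

    NOTE: the rate schedule and estimator below are simplifying
    assumptions for this example, not the paper's exact design.
    """

    def __init__(self, m=4096, r=0.999):
        self.m = m            # number of bitmap entries
        self.r = r            # assumed geometric decay of the sampling rate
        self.bits = [0] * m
        self.k = 0            # number of entries already set to 1

    def _hash(self, item):
        # Derive a deterministic (bucket, uniform) pair so duplicates
        # of the same item always reach the same sampling decision.
        h = hashlib.sha1(item.encode()).digest()
        bucket = int.from_bytes(h[:4], "big") % self.m
        u = int.from_bytes(h[4:12], "big") / 2**64
        return bucket, u

    def add(self, item):
        p_k = self.r ** self.k          # current sampling rate
        bucket, u = self._hash(item)
        if self.bits[bucket] == 0 and u <= p_k:
            self.bits[bucket] = 1       # entry updated from 0 to 1
            self.k += 1                 # rate drops for later items

    def estimate(self):
        # A fresh distinct item sets the (j+1)-th bit with probability
        # p_j * (m - j) / m; summing the expected waiting times over
        # the k filled bits yields a cardinality estimate.
        return sum(self.m / (self.r ** j * (self.m - j))
                   for j in range(self.k))
```

    Because the per-item decision is a deterministic function of the item's hash, repeated occurrences of the same element never flip a second bit, so only distinct elements advance the sketch.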

    Efficient algorithms for passive network measurement

    Network monitoring has become a necessity to aid in the management and operation of large networks. Passive network monitoring consists of extracting metrics (or any information of interest) by analyzing the traffic that traverses one or more network links. Extracting information from a high-speed network link is challenging, given the large data volumes and short packet inter-arrival times. These difficulties can be alleviated by using extremely efficient algorithms or by sampling the incoming traffic. This work improves the state of the art in both of these approaches. For one-way packet delay measurement, we propose a series of improvements over a recently proposed technique called the Lossy Difference Aggregator. A key limitation of this technique is that it does not provide per-flow measurements. We propose a data structure called the Lossy Difference Sketch that is capable of providing such per-flow delay measurements and, unlike recent related works, does not rely on any model of packet delays. In the problem of collecting measurements under the sliding window model, we focus on estimating the number of active flows and on traffic filtering. Using a common approach, we propose one algorithm for each problem that achieves high accuracy with significant resource savings. In the traffic sampling area, the selection of the sampling rate is a crucial aspect. The most sensible approach involves dynamically adjusting sampling rates according to network traffic conditions, which is known as adaptive sampling. We propose an algorithm called Cuckoo Sampling that can operate with a fixed memory budget and perform adaptive flow-wise packet sampling. It is based on a very simple data structure and is computationally extremely lightweight. The techniques presented in this work are thoroughly evaluated through a combination of theoretical and experimental analysis. Postprint (published version)
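    The adaptive sampling idea behind this line of work is easy to demonstrate. The sketch below is a generic illustration of fixed-budget, flow-wise adaptive sampling, not the thesis's Cuckoo Sampling structure; the threshold-halving rule and all names here are assumptions made for the example.

```python
import hashlib

class AdaptiveFlowSampler:
    """Generic fixed-memory, flow-wise adaptive sampler (illustration
    only; NOT the Cuckoo Sampling data structure from the thesis).

    A flow is sampled iff its hash falls below the current threshold;
    when the flow table outgrows the budget, the threshold is halved
    and non-qualifying flows are evicted, so the effective sampling
    rate adapts to traffic volume.
    """

    def __init__(self, max_flows=1024):
        self.max_flows = max_flows
        self.threshold = 1.0      # current flow sampling probability
        self.flows = {}           # flow_id -> sampled packet count

    def _hash(self, flow_id):
        h = hashlib.sha1(flow_id.encode()).digest()
        return int.from_bytes(h[:8], "big") / 2**64

    def process_packet(self, flow_id):
        if self._hash(flow_id) >= self.threshold:
            return                # flow not selected at current rate
        self.flows[flow_id] = self.flows.get(flow_id, 0) + 1
        while len(self.flows) > self.max_flows:
            self.threshold /= 2   # adapt: lower the sampling rate
            self.flows = {f: c for f, c in self.flows.items()
                          if self._hash(f) < self.threshold}

    def estimated_active_flows(self):
        # Each surviving flow stands in for ~1/threshold flows.
        return int(len(self.flows) / self.threshold)
```

    Because flows are selected by hash rather than by per-packet coin flips, all packets of a sampled flow are kept, which is what flow-wise sampling requires.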

    Efficient Identification of TOP-K Heavy Hitters over Sliding Windows

    This is the author accepted manuscript. The final version is available from Springer Verlag via the DOI in this record. Due to the increasing volume of network traffic and the growing complexity of network environments, rapid identification of heavy hitters is quite challenging. To deal with massive data streams in real time, an accurate and scalable solution is required. The traditional method of keeping an individual counter for each host across the whole data stream is very resource-consuming. This paper presents a new data structure called FCM and its associated algorithms. FCM combines the count-min sketch with the stream-summary structure for efficient TOP-K heavy hitter identification in one pass. The key point of this algorithm is that it introduces a novel filter-and-jump mechanism. Given that Internet traffic is heavy-tailed and low-frequency hosts account for the majority of IP addresses, FCM periodically filters the mice from the input stream to efficiently improve the accuracy of TOP-K heavy hitter identification. Moreover, considering that abnormal events are always time-sensitive, our algorithm automatically adjusts its measurement window to the newly arrived elements in the data stream. Our experimental results demonstrate that the performance of FCM is superior to previous related algorithms. Additionally, this solution has good prospects for application in advanced network environments. (Funding: Chinese Academy of Sciences; National Natural Science Foundation of China.)
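    The combination the abstract describes (a count-min sketch for frequency estimates plus a small structure tracking the current TOP-K) can be sketched compactly. The Python below is a simplified stand-in: it replaces the stream-summary structure with a plain candidate dictionary and omits FCM's filter-and-jump mechanism and sliding window entirely; class and method names are hypothetical.

```python
class TopKCountMin:
    """Simplified count-min + top-k tracker (illustration only; omits
    FCM's filter-and-jump mechanism, sliding window, and the actual
    stream-summary structure)."""

    def __init__(self, k=10, width=2048, depth=4):
        self.k = k
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.candidates = {}   # current top-k candidates: key -> estimate

    def _cells(self, key):
        # One cell per row; Python's built-in tuple hash stands in for
        # a family of independent hash functions.
        return [(row, hash((row, key)) % self.width)
                for row in range(self.depth)]

    def estimate(self, key):
        # Count-min estimate: the minimum over the key's cells.
        return min(self.table[row][col] for row, col in self._cells(key))

    def add(self, key):
        for row, col in self._cells(key):
            self.table[row][col] += 1
        self.candidates[key] = self.estimate(key)
        if len(self.candidates) > self.k:
            # Evict the candidate with the smallest estimated count.
            victim = min(self.candidates, key=self.candidates.get)
            del self.candidates[victim]

    def top_k(self):
        return sorted(self.candidates.items(), key=lambda kv: -kv[1])
```

    The count-min table keeps memory fixed regardless of the number of distinct hosts, while the small candidate set bounds the bookkeeping for the TOP-K report, which is the same division of labor between FCM's two components.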