18 research outputs found

    Infrequent item mining in multiple data streams

    The problem of extracting infrequent patterns from streams and building associations between these patterns is becoming increasingly relevant today, as many events of interest, such as attacks in network data or unusual stories in news data, occur rarely. The complexity of the problem is compounded when a system is required to deal with data from multiple streams. To address these problems, we present a framework that combines time-based association mining with a pyramidal structure that allows a rolling analysis of the stream and maintains a synopsis of the data without requiring ever-increasing memory resources. We apply the algorithms and demonstrate the usefulness of the techniques. © 2007 Crown Copyright
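
    The pyramidal structure is not detailed in this abstract; the sketch below is a minimal Python illustration of the general idea of a tilted time-frame synopsis, in which each level keeps a bounded number of summaries at geometrically coarser time granularities, so the rolling analysis never needs more memory as the stream grows. All names, window sizes, and level counts here are hypothetical.

    # Hypothetical sketch of a pyramidal time-frame synopsis; names and parameters
    # are illustrative, not taken from the paper. Level k holds at most `capacity`
    # summaries, each covering base_window * 2**k time units, so memory stays
    # O(levels * capacity) regardless of stream length.
    from collections import Counter, deque

    class PyramidalSynopsis:
        def __init__(self, levels=5, capacity=4, base_window=60):
            self.levels = [deque(maxlen=capacity) for _ in range(levels)]
            self.base_window = base_window
            self.current = Counter()      # counts for the still-open base window
            self.window_start = None

        def add(self, item, timestamp):
            if self.window_start is None:
                self.window_start = timestamp
            # Close base windows that this item has moved past.
            while timestamp >= self.window_start + self.base_window:
                self._roll_up(self.current, level=0)
                self.current = Counter()
                self.window_start += self.base_window
            self.current[item] += 1

        def _roll_up(self, summary, level):
            if level >= len(self.levels):
                return                    # oldest, coarsest history is dropped
            dq = self.levels[level]
            if len(dq) == dq.maxlen:
                # Merge the two oldest summaries and promote them one level up,
                # doubling the time span they cover.
                self._roll_up(dq.popleft() + dq.popleft(), level + 1)
            dq.append(summary)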

    Building Wavelet Histograms on Large Data in MapReduce

    MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact, accurate summary of the data is essential. Among the various data summarization tools, histograms have proven to be particularly important and useful, and the wavelet histogram is one of the most widely used. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate that straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We implement our techniques in Hadoop and compare them to baseline solutions with extensive experiments performed on a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets of up to hundreds of gigabytes. The results suggest significant (often orders-of-magnitude) performance improvements achieved by our new algorithms. Comment: VLDB201
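
    As a point of reference for what a wavelet histogram is (the paper's contribution is how to build one efficiently in MapReduce, which is not reproduced here), the following single-machine Python sketch computes the Haar wavelet transform of a frequency vector and retains only the k largest-magnitude coefficients. The un-normalized transform, the function names, and the example data are simplifying assumptions.

    # Minimal single-machine sketch of a wavelet histogram (not the paper's
    # MapReduce algorithms): transform the frequency vector and keep the k
    # coefficients with the largest absolute value as the synopsis.
    import numpy as np

    def haar_transform(freq):
        """Un-normalized Haar transform of a frequency vector (length a power of 2)."""
        coeffs = np.asarray(freq, dtype=float)
        out = []
        while len(coeffs) > 1:
            avg = (coeffs[0::2] + coeffs[1::2]) / 2.0
            diff = (coeffs[0::2] - coeffs[1::2]) / 2.0
            out.append(diff)          # detail coefficients at this resolution
            coeffs = avg
        out.append(coeffs)            # overall average
        return np.concatenate(out[::-1])

    def wavelet_histogram(freq, k):
        """Keep the k largest-magnitude coefficients; all others are dropped (treated as 0)."""
        w = haar_transform(freq)
        keep = np.argsort(np.abs(w))[-k:]
        return {int(i): float(w[i]) for i in keep}

    # Example: an 8-bucket frequency vector summarized by its 3 largest coefficients.
    print(wavelet_histogram([4, 4, 5, 5, 0, 0, 9, 1], k=3))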

    On Frequency Estimation and Detection of Heavy Hitters in Data Streams

    A stream can be thought of as a very large, possibly infinite, sequence of data items that arrives sequentially and must be processed without being stored in full. The memory available to the algorithm is limited, so the stream is scanned upon arrival and summarized in a succinct data structure that maintains only the information of interest. Two of the main tasks in data stream processing are frequency estimation and heavy hitter detection. Frequency estimation requires estimating the frequency of each item, that is, the number of times (or the total weight with which) it appears in the stream, while heavy hitter detection requires reporting all items whose frequency exceeds a fixed threshold. In this work we design and analyze ACMSS, an algorithm for frequency estimation and heavy hitter detection, and compare it against the state-of-the-art ASKETCH algorithm. We show that, given the same budgeted amount of memory, our algorithm outperforms ASKETCH in accuracy for the task of frequency estimation. Furthermore, we show that, under the assumptions stated by its authors, ASKETCH may fail to report all of the heavy hitters, whereas ACMSS provides, with high probability, the full list of heavy hitters.
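
    Neither ACMSS nor ASKETCH is specified in this abstract, so the sketch below uses a standard Count-Min sketch purely to make the two tasks concrete: estimating per-item frequencies from a small, fixed-size summary and flagging items whose estimate exceeds a threshold fraction of the stream. All parameter names are illustrative. A sketch by itself can only answer point queries; reporting the full list of heavy hitters additionally requires tracking a small set of candidate items, a detail omitted here.

    # Illustrative only: a plain Count-Min sketch for frequency estimation plus a
    # threshold test for heavy hitters. This is NOT the ACMSS or ASKETCH data
    # structure, just a standard baseline for the two tasks described above.
    import random

    class CountMinSketch:
        def __init__(self, width=2048, depth=4, seed=42):
            rng = random.Random(seed)
            self.width, self.depth = width, depth
            self.tables = [[0] * width for _ in range(depth)]
            self.salts = [rng.getrandbits(64) for _ in range(depth)]
            self.total = 0

        def _buckets(self, item):
            for salt, table in zip(self.salts, self.tables):
                yield table, hash((salt, item)) % self.width

        def update(self, item, weight=1):
            self.total += weight
            for table, b in self._buckets(item):
                table[b] += weight

        def estimate(self, item):
            # The minimum over rows is an overestimate of the true frequency.
            return min(table[b] for table, b in self._buckets(item))

        def is_heavy_hitter(self, item, phi=0.01):
            # Flag items whose estimated frequency exceeds a phi fraction of the stream.
            return self.estimate(item) >= phi * self.total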

    kBF: A Bloom Filter for key-value storage with an application on approximate state machines

    Enabling event-triggered data plane monitoring

    We propose a push-based approach to network monitoring that allows the detection of traffic aggregates within the data plane. Notifications from the switch to the controller are sent only when required, avoiding the transmission and processing of unnecessary data. Furthermore, the data plane iteratively refines the responsible IP prefixes, allowing the controller to receive information at a flexible granularity. We implemented our solution, Elastic Trie, in P4 and for two different FPGA devices, and evaluated it with packet traces from an ISP backbone. Our approach can spot changes in traffic patterns and detect, with 95% accuracy, hierarchical heavy hitters using less than 8 KB of memory and superspreaders using less than 300 KB. Additionally, it reduces controller-data plane communication overhead by up to two orders of magnitude with respect to state-of-the-art solutions.
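
    The following software sketch illustrates the prefix-refinement idea described above in Python; the actual Elastic Trie runs in P4 on FPGA targets, and the thresholds, epoch handling, and expansion rule used here are simplifying assumptions rather than the paper's design (prefix collapsing is also omitted).

    # Software sketch of prefix refinement for hierarchical heavy hitter detection.
    # Hot prefixes are split into their two children each epoch until they reach
    # full length, at which point they are reported to the controller.
    import ipaddress
    from collections import defaultdict

    class PrefixRefiner:
        def __init__(self, threshold_bytes=10_000, max_prefix_len=32):
            self.threshold = threshold_bytes
            self.max_len = max_prefix_len
            self.monitored = {ipaddress.ip_network("0.0.0.0/0")}   # start coarse
            self.counters = defaultdict(int)

        def observe(self, src_ip, size):
            addr = ipaddress.ip_address(src_ip)
            # Charge the packet to the longest monitored prefix containing it.
            best = max((p for p in self.monitored if addr in p),
                       key=lambda p: p.prefixlen)
            self.counters[best] += size

        def end_epoch(self):
            """Refine hot prefixes, report those already at full length, then reset."""
            reports = []
            for prefix, count in self.counters.items():
                if count >= self.threshold:
                    if prefix.prefixlen >= self.max_len:
                        reports.append((prefix, count))           # push to controller
                    else:
                        self.monitored.discard(prefix)
                        self.monitored.update(prefix.subnets(prefixlen_diff=1))
            self.counters.clear()
            return reports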

    Optimal sampling algorithms for frequency estimation in distributed data

    Identifying Global Icebergs in Distributed Streams

    We consider the problem of identifying global iceberg attacks in massive and physically distributed streams. A global iceberg is a distributed denial-of-service attack in which some elements globally recur many times across the distributed streams but, locally, do not appear as a denial of service. A natural solution to defend against global iceberg attacks is to rely on multiple routers that locally scan their network traffic and regularly provide monitoring information to a server in charge of collecting and aggregating all the monitored information. Any relevant solution to this problem must minimise the communication between the routers and the coordinator, as well as the space required by each node to analyse its stream. We propose a distributed algorithm that tracks global icebergs on the fly with guaranteed error bounds and limited memory and processing requirements. We present a thorough analysis of our algorithm's performance. In particular, we derive a tight upper bound on the number of bits communicated between the multiple routers and the coordinator in the presence of an oblivious adversary. Finally, we present the main results of the experiments we have run on a cluster of single-board computers. Those experiments confirm the efficiency and accuracy of our algorithm in tracking global icebergs hidden in very large input data streams exhibiting different shapes.
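
    As a rough illustration of the router/coordinator split described above (not the paper's algorithm, which relies on probabilistic summaries with guaranteed error bounds), the sketch below has each router count items locally and push a notification only when an item crosses an assumed per-node share of the global threshold, so each router sends at most one message per candidate iceberg rather than per packet. The threshold-sharing rule and all names are assumptions.

    # Generic push-based sketch of distributed iceberg detection; thresholds and
    # the per-node sharing rule are illustrative assumptions, not the paper's design.
    from collections import Counter, defaultdict

    class Coordinator:
        def __init__(self, global_threshold):
            self.global_threshold = global_threshold
            self.partial = defaultdict(int)
            self.icebergs = set()

        def notify(self, node_id, item, count):
            self.partial[item] += count
            if self.partial[item] >= self.global_threshold:
                self.icebergs.add(item)           # candidate global iceberg

    class LocalRouter:
        def __init__(self, node_id, coordinator, global_threshold, num_nodes):
            self.node_id = node_id
            self.coordinator = coordinator
            self.local_threshold = global_threshold / num_nodes
            self.counts = Counter()
            self.reported = set()

        def observe(self, item):
            self.counts[item] += 1
            if item not in self.reported and self.counts[item] >= self.local_threshold:
                self.reported.add(item)
                self.coordinator.notify(self.node_id, item, self.counts[item])

    # Wiring: one coordinator, several routers observing their own traffic.
    coord = Coordinator(global_threshold=1000)
    routers = [LocalRouter(i, coord, global_threshold=1000, num_nodes=4) for i in range(4)]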