10 research outputs found

    Stream Aggregation Through Order Sampling

    Full text link
    This paper introduces a new single-pass reservoir weighted-sampling stream aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling is a powerful and efficient method for weighted sampling from a stream of uniquely keyed items, no current algorithm realizes the benefits of order sampling in the context of stream aggregation over non-unique keys. A naive approach that order-samples regardless of key and then aggregates the results is hopelessly inefficient. In contrast, our proposed algorithm uses a single persistent random variable across the lifetime of each key in the cache, and maintains unbiased estimates of the key aggregates that can be queried at any point in the stream. The basic approach can be supplemented with a Sample and Hold pre-sampling stage whose sampling rate adaptation is controlled by PBA. This approach represents a considerable reduction in computational complexity compared with the state of the art in adapting Sample and Hold to operate with a fixed cache size. Concerning statistical properties, we prove that PBA provides unbiased estimates of the true aggregates. We analyze the computational complexity of PBA and its variants, and provide a detailed evaluation of its accuracy on synthetic and trace data. Weighted relative error is reduced by 40% to 65% at sampling rates of 5% to 17%, relative to Adaptive Sample and Hold; there is also substantial improvement for rank queries.
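    As background for the order-sampling primitive this abstract builds on, the following is a minimal Python sketch of classic priority sampling over uniquely keyed items; PBA's contribution, per the abstract, is a single persistent random variable per key so that the same idea applies to aggregation over non-unique keys. Names and structure here are illustrative, not taken from the paper.

        import heapq
        import random

        def priority_sample(weighted_items, k):
            """Classic priority sampling, the order-sampling primitive PBA
            extends. Each item of weight w draws u ~ Uniform(0,1] once and
            gets priority w/u; the k highest-priority items are kept. With
            tau the (k+1)-st highest priority, max(w, tau) is an unbiased
            estimate of w for each kept item (and 0 for dropped items)."""
            heap = []  # min-heap holding the k+1 largest (priority, key, weight)
            for key, w in weighted_items:
                u = 1.0 - random.random()      # uniform on (0, 1]
                q = w / u                      # the item's priority
                if len(heap) <= k:
                    heapq.heappush(heap, (q, key, w))
                elif q > heap[0][0]:
                    heapq.heapreplace(heap, (q, key, w))
            if len(heap) <= k:                 # fewer than k+1 items: exact
                return {key: float(w) for _, key, w in heap}
            tau = heapq.heappop(heap)[0]       # (k+1)-st highest priority
            return {key: max(float(w), tau) for _, key, w in heap}

        # e.g. priority_sample([("a", 5.0), ("b", 1.0), ("c", 3.0)], k=2)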

    Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

    Full text link
    We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items or heavy hitters in a dataset. For the former, the sketch provides unbiased estimates with state-of-the-art accuracy. It handles the challenging scenario when the data is disaggregated, so that computing the per-unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. Thus the sketch is suitable for use in a wide range of applications, including computing historical click-through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show that the sketch has good properties for both the disaggregated subset sum estimation and frequent item problems. On i.i.d. data, it not only picks out the frequent items but gives strongly consistent estimates for the proportion of each frequent item. The resulting sketch asymptotically draws a probability-proportional-to-size sample that is optimal for estimating sums over the data. For non-i.i.d. data, we show that it typically does much better than random sampling for the frequent item problem and never does worse. For subset sum estimation, we show that even for pathological sequences, the variance is close to that of an optimal sampling design. Empirically, despite the disadvantage of operating on disaggregated data, our method matches or bests priority sampling, a state-of-the-art method for pre-aggregated data, and performs orders of magnitude better on skewed data compared to uniform sampling. We propose extensions to the sketch that allow it to be used for combining multiple datasets, in distributed systems, and for time-decayed aggregation.
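    The "sum under arbitrary filter conditions" use case reduces to inverse-probability weighting once a sample with known inclusion probabilities is available; the abstract notes the sketch asymptotically draws a probability-proportional-to-size sample of exactly this kind. Below is a generic Horvitz-Thompson illustration (not the paper's sketch itself), assuming each sampled item carries its inclusion probability.

        def subset_sum_estimate(sample, predicate):
            """sample: iterable of (key, weight, inclusion_probability)
            triples from any sampling design with known probabilities.
            Returns an unbiased (Horvitz-Thompson) estimate of the total
            weight over keys matching a filter chosen a posteriori."""
            return sum(w / p for key, w, p in sample if predicate(key))

        # e.g. total clicks from mobile users, filtered after sketching
        # (the "mobile:" key prefix is hypothetical):
        # subset_sum_estimate(sample, lambda k: k.startswith("mobile:"))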

    Tighter estimation using bottom k sketches

    Full text link
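    No abstract is available for this entry, but the title refers to the bottom-k (k-minimum-values) sketch family. The following is a minimal Python sketch of the basic construction, assuming a pseudo-uniform hash; the paper's tighter estimators refine what can be inferred from the retained ranks, and are not reproduced here.

        import hashlib
        import heapq

        def rank(key):
            """Hash a key to a pseudo-uniform rank in (0, 1)."""
            h = hashlib.sha1(key.encode()).digest()
            return (int.from_bytes(h[:8], "big") + 1) / 2.0 ** 64

        def bottom_k(stream, k):
            """Bottom-k sketch: retain the k smallest distinct ranks seen,
            using a max-heap (negated ranks) so the largest is evictable."""
            heap, kept = [], set()
            for key in stream:
                r = rank(key)
                if r in kept:
                    continue
                if len(heap) < k:
                    heapq.heappush(heap, -r)
                    kept.add(r)
                elif r < -heap[0]:
                    kept.discard(-heapq.heapreplace(heap, -r))
                    kept.add(r)
            return sorted(-x for x in heap)

        def distinct_estimate(sketch, k):
            """(k - 1) / r_k estimates the number of distinct keys once the
            sketch is full; below k distinct keys the count is exact."""
            return (k - 1) / sketch[-1] if len(sketch) == k else float(len(sketch))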

    Load Balance and Resource Efficiency in Communication Networks

    Get PDF
    Network management is critical for today's networks. This study investigates both load balancing and resource efficiency in network management. For load balancing, one unfavorable situation is that active traffic uses only a portion of the equal-cost paths instead of all of them. The root causes of load imbalance are not easily identified and located by network operators. Most related research in this area concerns the design of load balancing mechanisms or network-wide troubleshooting that does not pinpoint the causes of load imbalance. In this study, we describe a computational framework based on network measurements to identify the correlation mechanism causing the load imbalance. We also describe a novel framework based on Coprime to mitigate the load imbalance caused by hash correlations; a toy illustration of the underlying effect follows this abstract. In evaluations based on real network traces and topologies, we show that our approach reduces the error (CV or K-S statistic) by at least one order of magnitude.

    For resource efficiency, today's networks demand increasing switch memory to support essential functions such as forwarding and monitoring. However, cache memory is restricted when processing data streams, in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). This study introduces a new single-pass reservoir weighted-sampling stream aggregation algorithm, Priority-Based Aggregation (PBA). A naive approach that order-samples regardless of key and then aggregates the results is hopelessly inefficient. In contrast, our proposed algorithm uses a single persistent random variable across the lifetime of each key in the cache and maintains unbiased estimates of the key aggregates that can be queried at any point in the stream. Concerning statistical properties, we prove that PBA provides unbiased estimates of the true aggregates. We analyze the computational complexity of PBA and its variants and provide a detailed evaluation of its accuracy on synthetic and trace data.

    In addition to sampling, this study also considers placing classification rules into switches from various network functions. While much work has focused on compressing the rules, most of it proposes solutions operating in the memory of a single switch. Instead, this study proposes a collaborative approach encompassing switches and network functions. This architecture enables a trade-off between the usage of (expensive) switch memory and (cheaper) downstream network bandwidth and network function resources. Our system can reduce memory usage significantly compared to strawman approaches, as demonstrated by extensive simulations and a prototype evaluation with real traffic traces and rules.
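    The hash-correlation problem referenced above is easy to reproduce: if two ECMP stages hash the same flow key with the same function, the second stage receives only flows that already agree on the first hash, so it concentrates them all onto a single downstream path. A toy Python illustration follows; Python's built-in hash stands in for a switch's hardware hash function, and the thesis's Coprime framework (the principled mitigation) is not reproduced here.

        def ecmp_next_hop(flow_key, num_paths, seed):
            """Pick one of num_paths equal-cost paths by hashing the flow
            key; seed models per-switch hash-function diversity."""
            return hash((flow_key, seed)) % num_paths

        flows = [f"flow-{i}" for i in range(10000)]

        # Correlated stages: both switches use the same hash (seed 0), so
        # flows reaching the second switch all hash to the same next hop.
        arriving = [f for f in flows if ecmp_next_hop(f, 4, seed=0) == 0]
        print(sorted({ecmp_next_hop(f, 4, seed=0) for f in arriving}))  # [0]

        # De-correlated stages (distinct seeds per switch): all paths used.
        print(sorted({ecmp_next_hop(f, 4, seed=1) for f in arriving}))  # [0, 1, 2, 3]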

    Sketching unaggregated data streams for subpopulation-size queries

    No full text
    IP packet streams consist of multiple interleaving IP flows. Statistical summaries of these streams, collected for different measurement periods, are used for characterization of traffic, billing, anomaly detection, inferring traffic demands, configuring packet filters and routing protocols, and more. While queries are posed over the set of flows, the summarization algorithm is applied to the stream of packets. Aggregation of traffic into flows before summarization requires storage of per-flow counters, which is often infeasible. Therefore, the summary has to be produced over the unaggregated stream. An important aggregate performed over a summary is to approximate the size of a subpopulation of flows that is specified a posteriori: for example, flows belonging to an application such as Web or DNS, or flows that originate from a certain Autonomous System. We design efficient streaming algorithms that summarize unaggregated streams and provide corresponding unbiased estimators for subpopulation sizes. In terms of estimation accuracy, our summaries outperform those produced by the packet sampling deployed in Cisco's sampled NetFlow, the most widely deployed such system. The performance of our best method, step sample-and-hold, is close to that of summaries that can be obtained from pre-aggregated traffic.
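    Below is a minimal sketch of plain sample-and-hold, the family that the paper's best method, step sample-and-hold, refines (the step variant itself is not reproduced here). A flow enters the cache with probability p per packet and is counted exactly thereafter; adding 1/p - 1 to each counter compensates, in expectation, for the packets missed before entry, yielding unbiased subpopulation-size estimates.

        import random

        def sample_and_hold(packets, p):
            """packets: iterable of flow keys, one per packet. A flow enters
            the cache with probability p per packet; once cached, every later
            packet of that flow increments its counter exactly."""
            counts = {}
            for flow in packets:
                if flow in counts:
                    counts[flow] += 1
                elif random.random() < p:
                    counts[flow] = 1
            return counts

        def subpopulation_size(counts, p, predicate):
            """Unbiased estimate of the total packet count of flows matching
            a predicate chosen a posteriori: count + 1/p - 1 per cached flow
            corrects for the geometric number of packets missed pre-entry."""
            return sum(c + 1.0 / p - 1.0 for f, c in counts.items() if predicate(f))

        # e.g., with flow keys of the form (src, dst, proto, sport, dport),
        # estimated DNS traffic (hypothetical key layout):
        # subpopulation_size(cache, 0.01, lambda flow: flow[4] == 53)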