
    Stream Aggregation Through Order Sampling

    This paper introduces a new single-pass reservoir weighted-sampling stream aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling is a powerful and efficient method for weighted sampling from a stream of uniquely keyed items, no current algorithm realizes the benefits of order sampling in the context of stream aggregation over non-unique keys. A naive approach, order sampling regardless of key and then aggregating the results, is hopelessly inefficient. In contrast, our proposed algorithm uses a single persistent random variable across the lifetime of each key in the cache, and maintains unbiased estimates of the key aggregates that can be queried at any point in the stream. The basic approach can be supplemented with a Sample and Hold pre-sampling stage whose sampling rate adaptation is controlled by PBA. This represents a considerable reduction in computational complexity compared with the state of the art in adapting Sample and Hold to operate with a fixed cache size. Concerning statistical properties, we prove that PBA provides unbiased estimates of the true aggregates. We analyze the computational complexity of PBA and its variants, and provide a detailed evaluation of its accuracy on synthetic and trace data. Weighted relative error is reduced by 40% to 65% at sampling rates of 5% to 17%, relative to Adaptive Sample and Hold; there is also substantial improvement for rank queries.
    Comment: 10 pages
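    The abstract builds on order (priority) sampling over uniquely keyed items. The paper's PBA algorithm itself is not reproduced here; the sketch below shows only the classic priority-sampling building block from the order-sampling literature, keeping the k items with the largest priorities w/u. The function name and interface are illustrative assumptions, not the paper's API.

```python
import heapq
import random

def priority_sample(stream, k):
    """Classic priority (order) sampling: assign each weighted item a
    priority w/u with u ~ Uniform(0,1), and keep the k largest priorities
    in a min-heap, using one pass and O(k) state."""
    heap = []  # min-heap of (priority, key, weight)
    for key, weight in stream:
        prio = weight / random.random()
        if len(heap) < k:
            heapq.heappush(heap, (prio, key, weight))
        elif prio > heap[0][0]:
            heapq.heapreplace(heap, (prio, key, weight))
    return heap
```

    In the full priority-sampling scheme, each sampled item's weight is estimated as max(w, tau), where tau is the (k+1)-st largest priority; PBA's contribution, per the abstract, is extending this machinery to aggregation over non-unique keys.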

    Stream Sampling for Frequency Cap Statistics

    Unaggregated data, in streamed or distributed form, are prevalent and come from diverse application domains, including interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries) and elements with different keys interleave. Analytics on such data typically utilizes statistics stated in terms of the frequencies of keys. The two most common statistics are distinct, the number of active keys in a specified segment, and sum, the sum of the frequencies of keys in the segment. Both are special cases of cap statistics, defined as the sum of frequencies capped by a parameter T, which are popular in online advertising platforms. Aggregation by key, however, is costly, requiring state proportional to the number of distinct keys; we are therefore interested in estimating these statistics or, more generally, sampling the data, without aggregation. We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design provides the first effective solution for general frequency cap statistics. Our ℓ-capped samples provide estimates with tight statistical guarantees for cap statistics with T = Θ(ℓ) and nonnegative unbiased estimates of any monotone non-decreasing frequency statistic. An added benefit of our unified design is facilitating multi-objective samples, which provide estimates with statistical guarantees for a specified set of different statistics, using a single, smaller sample.
    Comment: 21 pages, 4 figures, preliminary version will appear in KDD 201
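    As a point of reference for the quantity being estimated, the exact cap statistic over fully aggregated data is a one-line computation: sum over keys of min(frequency, T), with T = 1 recovering the distinct count and T = ∞ the plain sum. The helper below is a hypothetical illustration of the definition, not the paper's sampling framework, which exists precisely to avoid this per-key aggregation.

```python
from collections import Counter

def cap_statistic(elements, T):
    """Exact cap statistic: sum over keys of min(frequency, T).
    T=1 gives the distinct count; T=inf gives the plain frequency sum.
    Requires state proportional to the number of distinct keys."""
    freq = Counter(elements)
    return sum(min(f, T) for f in freq.values())
```

    For example, on the key stream a, a, a, b, b, c the statistic with T = 2 is min(3,2) + min(2,2) + min(1,2) = 5.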

    Evaluating techniques for sampling stream crayfish (Paranephrops planifrons)

    We evaluated several capture and analysis techniques for estimating the abundance and size structure of freshwater crayfish (Paranephrops planifrons, koura) in a forested North Island, New Zealand stream, to provide a methodological basis for future population studies. Direct observation at night and collecting with baited traps were not considered useful. A quadrat sampler was highly biased toward collecting small individuals. Handnetting at night with abundances estimated by the depletion method was less efficient than handnetting on different dates with analysis by a mark-recapture technique. Electrofishing was effective in collecting koura from different habitats and yielded the highest abundance estimates, and mark-recapture estimates appeared to be more precise than depletion estimates, especially when multiple recaptures were made. Handnetting captured more large crayfish than electrofishing or the quadrat sampler.
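    The mark-recapture arithmetic behind abundance estimates like those above can be sketched with Chapman's bias-corrected form of the Lincoln-Petersen estimator, a standard textbook formula. The function name and inputs here are illustrative, not taken from the study, and single-sample Lincoln-Petersen is only one of several mark-recapture designs.

```python
def chapman_estimate(marked, captured, recaptured):
    """Chapman's bias-corrected Lincoln-Petersen estimator of population
    size: N ~ (M+1)(C+1)/(R+1) - 1, where M animals were marked and
    released, and a later sample of C animals contained R recaptures."""
    return (marked + 1) * (captured + 1) / (recaptured + 1) - 1
```

    For instance, marking 50 crayfish and later capturing 60 of which 19 carry marks gives an estimate of (51 × 61 / 20) − 1 ≈ 155 animals.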

    Weighted Reservoir Sampling from Distributed Streams

    We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream has only a few heavy items, which may dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) eludes this issue, since such heavy items can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream, once the heaviest items are removed. Residual heavy hitters generalize the notion of ℓ_1 heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we also provide a lower bound on the message complexity that is nearly tight, up to a log(1/ε) factor. Finally, we use our weighted sampling algorithm to improve the message complexity of distributed L_1 tracking, also known as count tracking, a widely studied problem in distributed streaming. We also derive a tight message lower bound, which closes the message complexity of this fundamental problem.
    Comment: To appear in PODS 201
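    A standard centralized baseline for weighted SWOR is the Efraimidis-Spirakis key technique: each item draws a random key u^(1/w) and the k largest keys form the sample. The single-stream sketch below illustrates that baseline only; the paper's contribution is a message-optimal distributed protocol, which is considerably more involved, and the function name here is an illustrative assumption.

```python
import heapq
import random

def weighted_swor(stream, k):
    """Weighted sampling without replacement (Efraimidis-Spirakis):
    each item (item, w) draws key u**(1/w) with u ~ Uniform(0,1);
    the k items with the largest keys are the sample. Each item can
    appear at most once, so heavy items cannot dominate the sample."""
    heap = []  # min-heap of (key, item)
    for item, w in stream:
        key = random.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

    Because the heap tracks only the k largest keys, the procedure runs in one pass with O(k) state, matching the reservoir-style setting the abstract describes for a single stream.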

    Biofilm monitoring coupon system and method of use

    An apparatus and method are disclosed for biofilm monitoring of a water distribution system. At least one fitting is mounted in a wall port of a manifold in the water distribution system, with a passage through the fitting in communication with the flow stream. A biofilm sampling member is inserted through the fitting, with planar sampling surfaces of different surface treatments provided on linearly arrayed sample coupons; the coupons are disposed in the flow stream of the manifold, edge-on and parallel to the direction of flow, under fluid-tight sealed conditions. The sampling member is adapted to be aseptically removed from or inserted into the fitting and manifold under positive pressure, and the fitting passage is sealed immediately thereafter by appropriate closure means to preclude contamination of the water distribution system through the fitting. The apparatus includes means for clamping the sampling member and for establishing electrical continuity between the sampling surfaces and the system to minimize electropotential effects. The apparatus may also include a plurality of fittings and sampling members mounted on the manifold to permit extraction of sampling members in a timed sequence throughout the monitoring period.