Stream Aggregation Through Order Sampling
This paper introduces a new single-pass reservoir weighted-sampling stream
aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling
is a powerful and efficient method for weighted sampling from a stream of
uniquely keyed items, there is no current algorithm that realizes the benefits
of order sampling in the context of stream aggregation over non-unique keys. A
naive approach that order samples regardless of key and then aggregates the
results is hopelessly inefficient. In contrast, our proposed algorithm uses a single
persistent random variable across the lifetime of each key in the cache, and
maintains unbiased estimates of the key aggregates that can be queried at any
point in the stream. The basic approach can be supplemented with a Sample and
Hold pre-sampling stage with a sampling rate adaptation controlled by PBA. This
approach represents a considerable reduction in computational complexity
compared with the state of the art in adapting Sample and Hold to operate with
a fixed cache size. Concerning statistical properties, we prove that PBA
provides unbiased estimates of the true aggregates. We analyze the
computational complexity of PBA and its variants, and provide a detailed
evaluation of its accuracy on synthetic and trace data. Weighted relative error
is reduced by 40% to 65% at sampling rates of 5% to 17%, relative to Adaptive
Sample and Hold; there is also substantial improvement for rank queries.
Comment: 10 pages
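As background for the abstract above, here is a minimal sketch of plain order (priority) sampling over uniquely keyed items, the building block that PBA extends. It is not the PBA algorithm itself, which handles repeated keys. The function name `priority_sample` and its parameters are illustrative assumptions, not taken from the paper.

```python
import heapq
import random


def priority_sample(stream, k, rng=None):
    """Priority sampling over (key, weight) pairs with unique keys, weight > 0.

    Returns (estimates, tau): estimates maps each sampled key to an estimate
    of its weight, and tau is the sampling threshold.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (priority, key, weight), at most k + 1 entries
    for key, weight in stream:
        u = 1.0 - rng.random()      # uniform on (0, 1], avoids division by zero
        entry = (weight / u, key, weight)
        if len(heap) <= k:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)  # keep the k + 1 largest priorities
    if len(heap) <= k:
        # Fewer than k + 1 items were seen: everything is kept exactly.
        return {key: weight for _, key, weight in heap}, 0.0
    tau, _, _ = heapq.heappop(heap)  # the (k + 1)-th largest priority
    # Each sampled item is estimated as max(weight, tau).
    return {key: max(weight, tau) for _, key, weight in heap}, tau
```

Summing the returned estimates over any subset of keys gives an estimate of that subset's total weight. Applying this directly to a stream with repeated keys would treat every arrival as a separate item, which is the naive approach the abstract calls inefficient.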
Stream Sampling for Frequency Cap Statistics
Unaggregated data, in streamed or distributed form, is prevalent and comes
from diverse application domains which include interactions of users with web
services and IP traffic. Data elements have {\em keys} (cookies, users,
queries) and elements with different keys interleave. Analytics on such data
typically utilizes statistics stated in terms of the frequencies of keys. The
two most common statistics are {\em distinct}, which is the number of active
keys in a specified segment, and {\em sum}, which is the sum of the frequencies
of keys in the segment. Both are special cases of {\em cap} statistics, defined
as the sum of frequencies {\em capped} by a parameter $T$, which are popular in
online advertising platforms. Aggregation by key, however, is costly, requiring
state proportional to the number of distinct keys, and therefore we are
interested in estimating these statistics or more generally, sampling the data,
without aggregation. We present a sampling framework for unaggregated data that
uses a single pass (for streams) or two passes (for distributed data) and state
proportional to the desired sample size. Our design provides the first
effective solution for general frequency cap statistics. Our $\ell$-capped
samples provide estimates with tight statistical guarantees for cap statistics
with cap parameter $T=\Theta(\ell)$ and nonnegative unbiased estimates of {\em any} monotone
non-decreasing frequency statistics. An added benefit of our unified design is
facilitating {\em multi-objective samples}, which provide estimates with
statistical guarantees for a specified set of different statistics, using a
single, smaller sample.
Comment: 21 pages, 4 figures, preliminary version will appear in KDD 201
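To make the definition of cap statistics in the abstract above concrete, the following sketch computes the exact value by full aggregation. This is the costly baseline whose state grows with the number of distinct keys, not the paper's sampling framework. The function name `cap_statistic` and the parameter name `cap` are illustrative assumptions.

```python
import math
from collections import Counter


def cap_statistic(elements, cap):
    """Exact cap statistic: sum over keys of min(frequency, cap).

    elements is an iterable of keys (one entry per data element).
    """
    freq = Counter(elements)  # state proportional to the number of distinct keys
    return sum(min(count, cap) for count in freq.values())


keys = ["a", "b", "a", "c", "a", "b"]    # frequencies: a=3, b=2, c=1
distinct = cap_statistic(keys, 1)         # cap of 1 counts each active key once
total = cap_statistic(keys, math.inf)     # unbounded cap gives the sum of frequencies
capped = cap_statistic(keys, 2)           # frequencies capped at 2
```

With a cap of 1 the statistic reduces to the distinct count, and with an unbounded cap it reduces to the sum, which is how both arise as special cases.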
Evaluating techniques for sampling stream crayfish (Paranephrops planifrons)
We evaluated several capture and analysis techniques for estimating abundance and size structure of freshwater crayfish (Paranephrops planifrons) (koura) from a forested North Island, New Zealand stream to provide a methodological basis for future population studies. Direct observation at night and collecting with baited traps were not considered useful. A quadrat sampler was highly biased toward collecting small individuals. Handnetting at night and estimating abundances using the depletion method were not as efficient as handnetting on different dates and analysing by a mark-recapture technique. Electrofishing was effective in collecting koura from different habitats and resulted in the highest abundance estimates, and mark-recapture estimates appeared to be more precise than depletion estimates, especially if multiple recaptures were made. Handnetting captured more large crayfish relative to electrofishing or the quadrat sampler.
Weighted Reservoir Sampling from Distributed Streams
We consider message-efficient continuous random sampling from a distributed
stream, where the probability of inclusion of an item in the sample is
proportional to a weight associated with the item. The unweighted version,
where all weights are equal, is well studied, and admits tight upper and lower
bounds on message complexity. For weighted sampling with replacement, there is
a simple reduction to unweighted sampling with replacement. However, in many
applications the stream has only a few heavy items which may dominate a random
sample when chosen with replacement. Weighted sampling \textit{without
replacement} (weighted SWOR) avoids this issue, since such heavy items can be
sampled at most once.
In this work, we present the first message-optimal algorithm for weighted
SWOR from a distributed stream. Our algorithm also has optimal space and time
complexity. As an application of our algorithm for weighted SWOR, we derive the
first distributed streaming algorithms for tracking \textit{heavy hitters with
residual error}. Here the goal is to identify stream items that contribute
significantly to the residual stream, once the heaviest items are removed.
Residual heavy hitters generalize the notion of heavy hitters and are
important in streams that have a skewed distribution of weights. In addition to
the upper bound, we also provide a lower bound on the message complexity that
is nearly tight up to a $\log(1/\epsilon)$ factor. Finally, we use our weighted
sampling algorithm to improve the message complexity of distributed $L_1$
tracking, also known as count tracking, which is a widely studied problem in
distributed streaming. We also derive a tight message lower bound, which closes
the message complexity of this fundamental problem.
Comment: To appear in PODS 201
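For readers unfamiliar with weighted SWOR, a minimal single-machine sketch follows. It uses the well-known exponential-key reservoir technique (each item gets the score log(u)/w, and the k largest scores are kept). It is not the message-optimal distributed algorithm from the abstract above, and the name `weighted_swor` is an illustrative assumption.

```python
import heapq
import math
import random


def weighted_swor(stream, k, rng=None):
    """Weighted sampling without replacement of up to k items.

    stream yields (item, weight) pairs; items with weight <= 0 are skipped.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (score, arrival_index, item), at most k entries
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue
        u = 1.0 - rng.random()        # uniform on (0, 1], so log(u) is defined
        score = math.log(u) / weight  # same ordering as u ** (1 / weight)
        entry = (score, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)  # evict the current lowest score
    return [item for _, _, item in heap]
```

Because every stream entry receives one score and occupies at most one slot, a heavy item can appear in the sample at most once, which is the property the abstract contrasts with sampling with replacement.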
Biofilm monitoring coupon system and method of use
An apparatus and method are disclosed for biofilm monitoring of a water distribution system which includes the mounting of at least one fitting in a wall port of a manifold in the water distribution system with a passage through the fitting in communication. The insertion of a biofilm sampling member is through the fitting with planar sampling surfaces of different surface treatment provided on linearly arrayed sample coupons of the sampling member disposed in the flow stream in edge-on parallel relation to the direction of the flow stream of the manifold under fluid-tight sealed conditions. The sampling member is adapted to be aseptically removed from or inserted in the fitting and manifold under a positive pressure condition and the fitting passage sealed immediately thereafter by appropriate closure means so as to preclude contamination of the water distribution system through the fitting. The apparatus includes means for clamping the sampling member and for establishing electrical continuity between the sampling surfaces and the system for minimizing electropotential effects. The apparatus may also include a plurality of fittings and sampling members mounted on the manifold to permit extraction of the sampling members in a timed sequence throughout the monitoring period.