Structure-Aware Sampling: Flexible and Accurate Summarization
In processing large quantities of data, a fundamental problem is to obtain a
summary which supports approximate query answering. Random sampling yields
flexible summaries which naturally support subset-sum queries with unbiased
estimators and well-understood confidence bounds.
Classic sample-based summaries, however, are designed for arbitrary subset
queries and are oblivious to the structure in the set of keys. The particular
structure, such as hierarchy, order, or product space (multi-dimensional),
makes range queries much more relevant for most analysis of the data.
Dedicated summarization algorithms for range-sum queries have also been
extensively studied. They can outperform existing sampling schemes in terms of
accuracy on range queries per summary size. Their accuracy, however, rapidly
degrades when, as is often the case, the query spans multiple ranges. They are
also less flexible - being targeted for range sum queries alone - and are often
quite costly to build and use.
In this paper we propose and evaluate variance optimal sampling schemes that
are structure-aware. These summaries improve over the accuracy of existing
structure-oblivious sampling schemes on range queries while retaining the
benefits of sample-based summaries: flexible summaries, with high accuracy on
both range queries and arbitrary subset queries.
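The subset-sum estimation that sample-based summaries support can be illustrated with a minimal sketch. It uses plain Poisson PPS sampling with a Horvitz-Thompson estimator, which is a structure-oblivious baseline and not the structure-aware scheme the abstract proposes. The function names, the threshold parameter `tau`, and the toy weights are assumptions made for illustration.

```python
import random

def pps_poisson_sample(weights, tau, rng):
    """Keep each key independently with probability min(1, w / tau).

    Returns {key: adjusted_weight}, where adjusted_weight = w / p is the
    Horvitz-Thompson estimate of that key's weight.
    """
    sample = {}
    for key, w in weights.items():
        p = min(1.0, w / tau)
        if rng.random() < p:
            sample[key] = w / p  # equals max(w, tau) up to rounding
    return sample

def estimate_subset_sum(sample, predicate):
    """Unbiased estimate of the total weight of keys satisfying predicate."""
    return sum(adj for key, adj in sample.items() if predicate(key))

# Hypothetical toy data: integer keys with small positive weights.
weights = {key: float(1 + key % 7) for key in range(1000)}
sample = pps_poisson_sample(weights, tau=20.0, rng=random.Random(42))

# The same sample answers a range query and an arbitrary subset query.
range_estimate = estimate_subset_sum(sample, lambda k: 100 <= k < 200)
subset_estimate = estimate_subset_sum(sample, lambda k: k % 3 == 0)
```

Because each key is kept independently here, the sample size is random and the estimator ignores key order; a structure-aware scheme would instead coordinate inclusions so that ranges are represented more evenly.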
Signature inversion for monotone paths
The aim of this article is to provide a simple sampling procedure to
reconstruct any monotone path from its signature. For every N, we sample a
lattice path of N steps with weights given by the coefficient of the
corresponding word in the signature. We show that these weights on lattice
paths satisfy the large deviations principle. In particular, this implies that
the probability of picking up a "wrong" path is exponentially small in N. The
argument relies on a probabilistic interpretation of the signature for monotone
paths.
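For reference, the weights mentioned above can be written out using the standard definition of the signature. The notation ($X$, $d$, $w$, $\ell$, $p_N$) is assumed here and does not come from the abstract, and the normalization shown is one natural choice that the article itself may state differently.

```latex
% Path X : [0,1] -> R^d, word w = i_1 i_2 ... i_N with letters in {1, ..., d}.
S(X)^{w} \;=\; \int_{0 < t_1 < \cdots < t_N < 1}
    dX^{i_1}_{t_1} \, dX^{i_2}_{t_2} \cdots dX^{i_N}_{t_N}

% If X is monotone (every coordinate non-decreasing), each coefficient is
% non-negative. Summing over all words of length N gives
\sum_{|w| = N} S(X)^{w} \;=\; \frac{\ell^{N}}{N!},
\qquad \ell \;=\; \sum_{i=1}^{d} \bigl( X^{i}_{1} - X^{i}_{0} \bigr),

% so the normalized coefficients form a probability distribution on words,
% i.e. on lattice paths of N steps:
p_N(w) \;=\; \frac{N!}{\ell^{N}} \, S(X)^{w}.
```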
On the tradeoff between stability and fit
In computing, as in many aspects of life, changes incur cost. Many optimization problems are formulated as a one-time instance starting from scratch. However, a common case that arises is when we already have a set of prior assignments and must decide how to respond to a new set of constraints, given that each change from the current assignment comes at a price. That is, we would like to maximize the fitness or efficiency of our system, but we need to balance it with the changeout cost from the previous state.
We provide a precise formulation for this tradeoff and analyze the resulting stable extensions of some fundamental problems in measurement and analytics. Our main technical contribution is a stable extension of Probability Proportional to Size (PPS) weighted random sampling, with applications to monitoring and anomaly detection problems. We also provide a general framework that applies to top-k, minimum spanning tree, and assignment. In both cases, we are able to provide exact solutions and discuss efficient incremental algorithms that can find new solutions as the input changes.
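To make the stability-versus-fit tradeoff concrete, here is a minimal sketch for the top-k case. The objective is an assumption chosen for illustration (total weight of the chosen keys minus a fixed `change_cost` for every chosen key that was not in the previous solution) and may differ from the paper's formulation; the function name is hypothetical. Under this objective, adding `change_cost` to the weight of each previously chosen key and taking the k largest adjusted weights gives an exact solution.

```python
def stable_top_k(weights, previous, k, change_cost):
    """Pick k keys maximizing
        sum(weights[key] for chosen keys) - change_cost * (# chosen keys not in previous).

    This equals ranking by weight plus a bonus of change_cost for incumbents,
    since the two objectives differ only by the constant change_cost * k.
    """
    def adjusted(key):
        return weights[key] + (change_cost if key in previous else 0.0)

    # Sort by adjusted weight (descending), breaking ties by key for determinism.
    ranked = sorted(weights, key=lambda key: (-adjusted(key), key))
    return set(ranked[:k])

# Hypothetical example: "b" narrowly beats the incumbent "c" on raw weight.
weights = {"a": 10.0, "b": 9.0, "c": 8.5, "d": 3.0}
previous = {"a", "c"}
fresh = stable_top_k(weights, previous, k=2, change_cost=0.0)   # ignores history
stable = stable_top_k(weights, previous, k=2, change_cost=1.0)  # keeps the incumbent
```

With no change cost the solution switches to `{"a", "b"}`; with a change cost of 1.0 the gain of 0.5 from swapping in `"b"` no longer justifies the change, so `{"a", "c"}` is retained.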
Graph Sample and Hold: A Framework for Big-Graph Analytics
Sampling is a standard approach in big-graph analytics; the goal is to
efficiently estimate the graph properties by consulting a sample of the whole
population. A perfect sample is assumed to mirror every property of the whole
population. Unfortunately, such a perfect sample is hard to collect in complex
populations such as graphs (e.g., web graphs, social networks, etc.), where an
underlying network connects the units of the population. Therefore, a good
sample will be representative in the sense that graph properties of interest
can be estimated with a known degree of accuracy. While previous work focused
particularly on sampling schemes used to estimate certain graph properties
(e.g. triangle count), much less is known for the case when we need to estimate
various graph properties with the same sampling scheme. In this paper, we
propose a generic stream sampling framework for big-graph analytics, called
Graph Sample and Hold (gSH). To begin, the proposed framework samples from
massive graphs sequentially in a single pass, one edge at a time, while
maintaining a small state. We then show how to produce unbiased estimators for
various graph properties from the sample. Given that the graph analysis
algorithms will run on a sample instead of the whole population, the runtime
complexity of these algorithms is kept under control. Moreover, given that the
estimators of graph properties are unbiased, the approximation error is kept
under control. Finally, we show the performance of the proposed framework (gSH)
on various types of graphs, such as social graphs, among others.
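The single-pass, small-state sampling with unbiased estimation described above can be sketched as follows. The two-probability rule is an assumption about how sample-and-hold applies to edges: an arriving edge is kept with probability `q` if it touches an already sampled edge and with probability `p` otherwise. The function names are hypothetical, and only the simplest property, the edge count, is estimated.

```python
import random

def graph_sample_and_hold(edge_stream, p, q, rng):
    """One pass over the edges; returns [(edge, inclusion_probability)].

    Assumed rule: keep an arriving edge with probability q if it shares an
    endpoint with an already sampled edge ("hold"), else with probability p.
    Both p and q must be in (0, 1].
    """
    sampled_nodes = set()  # the only state kept besides the sample itself
    sample = []
    for u, v in edge_stream:
        prob = q if (u in sampled_nodes or v in sampled_nodes) else p
        if rng.random() < prob:
            sample.append(((u, v), prob))
            sampled_nodes.update((u, v))
    return sample

def estimate_edge_count(sample):
    """Weight each sampled edge by the inverse of the probability used for it."""
    return sum(1.0 / prob for _, prob in sample)

# Hypothetical stream: a ring on 200 nodes, one edge at a time.
edges = [(i, (i + 1) % 200) for i in range(200)]
sample = graph_sample_and_hold(edges, p=0.2, q=0.8, rng=random.Random(7))
estimate = estimate_edge_count(sample)
```

Each sampled edge contributes the inverse of the probability in force when it arrived, so its expected contribution given the history is 1, and the sum is an unbiased estimate of the number of edges. Estimators for richer properties such as triangle counts need products of such probabilities and are not shown here.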
Tiresias: Online Anomaly Detection for Hierarchical Operational Network Data
Operational network data, management data such as customer care call logs and
equipment system logs, is a very important source of information for network
operators to detect problems in their networks. Unfortunately, there is a lack of
efficient tools to automatically track and detect anomalous events on
operational data, causing ISP operators to rely on manual inspection of this
data. While anomaly detection has been widely studied in the context of network
data, operational data presents several new challenges, including the
volatility and sparseness of data, and the need to perform fast detection
(complicating application of schemes that require offline processing or
large/stable data sets to converge).
To address these challenges, we propose Tiresias, an automated approach to
locating anomalous events on hierarchical operational data. Tiresias leverages
the hierarchical structure of operational data to identify high-impact
aggregates (e.g., locations in the network, failure modes) likely to be
associated with anomalous events. To accommodate different kinds of operational
network data, Tiresias consists of an online detection algorithm with low time
and space complexity, while preserving high detection accuracy. We present
results from two case studies using operational data collected at a large
commercial IP network operated by a Tier-1 ISP: customer care call logs and
set-top box crash logs. By comparing with a reference set verified by the ISP's
operational group, we validate that Tiresias can achieve >94% accuracy in
locating anomalies. Tiresias also discovered several previously unknown
anomalies in the ISP's customer care cases, demonstrating its effectiveness.
Stream Aggregation Through Order Sampling
This paper introduces a new single-pass reservoir weighted-sampling stream
aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling
is a powerful and efficient method for weighted sampling from a stream of
uniquely keyed items, there is no current algorithm that realizes the benefits
of order sampling in the context of stream aggregation over non-unique keys. A
naive approach, order sampling regardless of key and then aggregating the
results, is hopelessly inefficient. In contrast, our proposed algorithm uses a single
persistent random variable across the lifetime of each key in the cache, and
maintains unbiased estimates of the key aggregates that can be queried at any
point in the stream. The basic approach can be supplemented with a Sample and
Hold pre-sampling stage with a sampling rate adaptation controlled by PBA. This
approach represents a considerable reduction in computational complexity
compared with the state of the art in adapting Sample and Hold to operate with
a fixed cache size. Concerning statistical properties, we prove that PBA
provides unbiased estimates of the true aggregates. We analyze the
computational complexity of PBA and its variants, and provide a detailed
evaluation of its accuracy on synthetic and trace data. Weighted relative error
is reduced by 40% to 65% at sampling rates of 5% to 17%, relative to Adaptive
Sample and Hold; there is also substantial improvement for rank queries.
Comment: 10 page
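As background for the order sampling that the abstract builds on, the following sketch shows classic priority sampling for a stream of uniquely keyed items. It is not PBA and does not handle repeated keys; the function name and toy stream are assumptions. Each item gets priority weight / u with u uniform on (0, 1], the k items with the highest priorities are kept, and the (k+1)-th highest priority serves as the threshold for the weight estimates.

```python
import heapq
import random

def priority_sample(stream, k, rng):
    """Priority sampling of k items from (key, weight) pairs with unique keys.

    Returns {key: estimated_weight}; summing estimates over any subset of
    keys gives an unbiased estimate of that subset's total weight.
    """
    heap = []  # min-heap of (priority, key, weight), holds at most k + 1 items
    for key, weight in stream:
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
        item = (weight / u, key, weight)
        if len(heap) <= k:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)  # drop the lowest priority
    if len(heap) <= k:
        # Fewer than k + 1 items arrived: everything is kept exactly.
        return {key: weight for _, key, weight in heap}
    tau = heap[0][0]  # (k + 1)-th highest priority is the threshold
    return {key: max(weight, tau) for _, key, weight in heap[1:]}

# Hypothetical stream of 500 uniquely keyed items.
stream = [(i, float(1 + i % 10)) for i in range(500)]
sample = priority_sample(stream, k=50, rng=random.Random(3))
estimated_total = sum(sample.values())
```

The reservoir never holds more than k + 1 items, which is the fixed-cache property that order sampling offers for unique keys; extending it to aggregates over repeated keys is the problem the abstract addresses.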