There is a growing realization that uncertain information is a first-class
citizen in modern database management. As such, we need techniques to correctly
and efficiently process uncertain data in database systems. In particular, data
reduction techniques that can produce concise, accurate synopses of large
probabilistic relations are crucial. Similar to their deterministic relation
counterparts, such compact probabilistic data synopses can form the foundation
for human understanding and interactive data exploration, probabilistic query
planning and optimization, and fast approximate query processing in
probabilistic database systems.
In this paper, we introduce definitions and algorithms for building
histogram- and wavelet-based synopses on probabilistic data. The core problem
is to choose a set of histogram bucket boundaries or wavelet coefficients to
optimize the accuracy of the approximate representation of a collection of
probabilistic tuples under a given error metric. For a variety of different
error metrics, we devise efficient algorithms that construct optimal or near
optimal B-term histogram and wavelet synopses. This requires careful analysis
of the structure of the probability distributions, and novel extensions of
known dynamic-programming-based techniques for the deterministic domain. Our
experiments show that this approach clearly outperforms simple ideas, such as
building summaries for samples drawn from the data distribution, while taking
equal or less time