Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation
We introduce and study a new data sketch for processing massive datasets. It
addresses two common problems: 1) computing a sum given arbitrary filter
conditions and 2) identifying the frequent items or heavy hitters in a data
set. For the former, the sketch provides unbiased estimates with
state-of-the-art accuracy. It handles the challenging scenario in which the data is
disaggregated so that computing the per unit metric of interest requires an
expensive aggregation. For example, the metric of interest may be total clicks
per user while the raw data is a click stream with multiple rows per user. Thus
the sketch is suitable for use in a wide range of applications including
computing historical click-through rates for ad prediction, reporting user
metrics from event streams, and measuring network traffic for IP flows.
We prove and empirically show the sketch has good properties for both the
disaggregated subset sum estimation and frequent item problems. On i.i.d. data,
it not only picks out the frequent items but gives strongly consistent
estimates for the proportion of each frequent item. The resulting sketch
asymptotically draws a probability proportional to size sample that is optimal
for estimating sums over the data. For non-i.i.d. data, we show that it
typically does much better than random sampling for the frequent item problem
and never does worse. For subset sum estimation, we show that even for
pathological sequences, the variance is close to that of an optimal sampling
design. Empirically, despite the disadvantage of operating on disaggregated
data, our method matches or beats priority sampling, a state-of-the-art method
for pre-aggregated data, and performs orders of magnitude better than uniform
sampling on skewed data. We propose extensions to the sketch that allow it
to be used for combining multiple data sets, in distributed systems, and for
time-decayed aggregation.
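To make the subset-sum setting concrete, here is a minimal sketch of priority sampling, the pre-aggregated baseline the abstract compares against (not the paper's own sketch). The helper names `priority_sample` and `subset_sum_estimate` are illustrative, and the sketch assumes items have already been aggregated into per-key weights.

```python
import random

def priority_sample(weights, k):
    """Priority sampling: keep the k items with the largest priorities
    q_i = w_i / u_i, where u_i ~ Uniform(0, 1].  The (k+1)-th largest
    priority tau serves as the estimation threshold."""
    priorities = {i: w / (1.0 - random.random()) for i, w in weights.items()}
    order = sorted(priorities, key=priorities.get, reverse=True)
    kept = order[:k]
    tau = priorities[order[k]] if len(order) > k else 0.0
    # Unbiased per-item weight estimate for kept items: max(w_i, tau).
    return {i: max(weights[i], tau) for i in kept}

def subset_sum_estimate(sample, predicate):
    """Unbiased estimate of sum(w_i for items i satisfying predicate)."""
    return sum(est for i, est in sample.items() if predicate(i))

weights = {i: float(i + 1) for i in range(20)}
sample = priority_sample(weights, k=10)
estimate = subset_sum_estimate(sample, lambda i: i % 2 == 0)
```

When k is at least the number of items, tau is 0 and the estimate is exact; as k shrinks, the estimate stays unbiased but its variance grows. The disaggregated setting of the paper is harder: each key's weight is itself a sum over many raw rows, which this sketch assumes away.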
Quantum Amplitude Amplification and Estimation
Consider a Boolean function $\chi : X \to \{0,1\}$ that partitions set $X$
between its good and bad elements, where $x$ is good if $\chi(x)=1$ and bad
otherwise. Consider also a quantum algorithm $\mathcal{A}$ such that
$\mathcal{A}|0\rangle = \sum_x \alpha_x |x\rangle$ is a quantum superposition of the
elements of $X$, and let $a$ denote the probability that a good element is
produced if $\mathcal{A}|0\rangle$ is measured. If we repeat the process of running $\mathcal{A}$,
measuring the output, and using $\chi$ to check the validity of the result, we
shall expect to repeat $1/a$ times on the average before a solution is found.
*Amplitude amplification* is a process that allows one to find a good $x$ after an
expected number of applications of $\mathcal{A}$ and its inverse which is proportional to
$1/\sqrt{a}$, assuming algorithm $\mathcal{A}$ makes no measurements. This is a
generalization of Grover's searching algorithm in which $\mathcal{A}$ was restricted to
producing an equal superposition of all members of $X$ and we had a promise
that a single $x$ existed such that $\chi(x)=1$. Our algorithm works whether or
not the value of $a$ is known ahead of time. In case the value of $a$ is known,
we can find a good $x$ after a number of applications of $\mathcal{A}$ and its inverse
which is proportional to $1/\sqrt{a}$ even in the worst case. We show that this
quadratic speedup can also be obtained for a large family of search problems
for which good classical heuristics exist. Finally, as our main result, we
combine ideas from Grover's and Shor's quantum algorithms to perform amplitude
estimation, a process that allows one to estimate the value of $a$. We apply
amplitude estimation to the problem of *approximate counting*, in which we wish
to estimate the number of $x \in X$ such that $\chi(x)=1$. We obtain optimal
quantum algorithms in a variety of settings.
Comment: 32 pages, no figures
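As a quick numerical illustration of the quadratic speedup claimed above (a sketch of the standard textbook analysis, not code from the paper): after $t$ amplification iterations the success probability is $\sin^2((2t+1)\theta_a)$, where $\sin^2\theta_a = a$, so roughly $\pi/(4\sqrt{a})$ iterations suffice.

```python
import math

def success_prob(a, t):
    """Probability of measuring a good element after t amplitude-
    amplification iterations, given initial success probability a.
    Standard analysis: p(t) = sin^2((2t + 1) * theta), sin^2(theta) = a."""
    theta = math.asin(math.sqrt(a))
    return math.sin((2 * t + 1) * theta) ** 2

a = 0.001                                          # initial success probability
t = int(math.pi / (4 * math.asin(math.sqrt(a))))   # ~ pi / (4 sqrt(a)) iterations
# t is about 24 here, versus roughly 1/a = 1000 expected classical repetitions.
```

With no iterations ($t = 0$) this reduces to the bare success probability $a$, matching the classical repeat-and-check baseline described in the abstract.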
The quantum complexity of approximating the frequency moments
The $k$'th frequency moment of a sequence of integers is defined as
$F_k = \sum_j n_j^k$, where $n_j$ is the number of times that $j$ occurs in the
sequence. Here we study the quantum complexity of approximately computing the
frequency moments in two settings. In the query complexity setting, we wish to
minimise the number of queries to the input used to approximate $F_k$ up to
relative error $\epsilon$. We give quantum algorithms which outperform the best
possible classical algorithms up to quadratically. In the multiple-pass
streaming setting, we see the elements of the input one at a time, and seek to
minimise the amount of storage space, or number of passes over the data, used to
approximate $F_k$. We describe quantum algorithms for $F_0$, $F_2$ and
$F_\infty$ in this model which substantially outperform the best possible
classical algorithms in certain parameter regimes.
Comment: 22 pages; v3: essentially published version
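For concreteness, the classical quantity being approximated is easy to compute exactly on small inputs. A minimal sketch (this is the exact classical computation, not one of the quantum algorithms from the abstract):

```python
from collections import Counter

def frequency_moment(seq, k):
    """F_k = sum of n_j**k over distinct values j, where n_j is the
    number of times j occurs in seq.  F_0 is the number of distinct
    elements and F_1 is the length of the sequence."""
    return sum(n ** k for n in Counter(seq).values())

def f_infinity(seq):
    """F_infinity: the frequency of the most frequent element."""
    return max(Counter(seq).values())

seq = [1, 1, 2, 3, 3, 3]
# F_0 = 3, F_1 = 6, F_2 = 2**2 + 1**2 + 3**2 = 14, F_infinity = 3
```

The streaming question in the abstract is how well these values can be approximated without storing the full `Counter`, which grows with the number of distinct elements.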