Composable Sketches for Functions of Frequencies: Beyond the Worst Case
Recently there has been increased interest in using machine learning
techniques to improve classical algorithms. In this paper we study when it is
possible to construct compact, composable sketches for weighted sampling and
statistics estimation according to functions of data frequencies. Such
structures are now central components of large-scale data analytics and machine
learning pipelines. However, many common functions, such as thresholds and p-th
frequency moments with p > 2, are known to require polynomial-size sketches in
the worst case. We explore performance beyond the worst case under two
different types of assumptions. The first is having access to noisy advice on
item frequencies. This continues the line of work of Hsu et al. (ICLR 2019),
who assume predictions are provided by a machine learning model. The second is
providing guaranteed performance on a restricted class of input frequency
distributions that are better aligned with what is observed in practice. This
extends the work on heavy hitters under Zipfian distributions in a seminal
paper of Charikar et al. (ICALP 2002). Surprisingly, we show analytically and
empirically that "in practice" small polylogarithmic-size sketches provide
accuracy for "hard" functions.
Comment: Full version of a paper from ICML 2020. Python implementation
available as part of the supplemental material accompanying the ICML
publication.
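To make "composable" concrete, here is a minimal sketch of a classical Count-Min sketch whose summaries of two data shards can be merged into a summary of their union. It is a textbook baseline for frequency estimation, not the construction from the paper above and not its supplemental implementation; the class name, the default `width` and `depth`, and the use of BLAKE2 hashing are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (hypothetical helper, not the paper's code)."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Deterministic per-row hash so that separately built sketches agree.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, item, count=1):
        # Add `count` to one counter in every row.
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item):
        # With non-negative updates, collisions only inflate counters,
        # so the minimum over rows never underestimates the true frequency.
        return min(self.rows[r][self._index(r, item)] for r in range(self.depth))

    def merge(self, other):
        # Composability: adding counters cell by cell gives the sketch
        # of the combined data, provided both sketches share parameters.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same width and depth")
        merged = CountMinSketch(self.width, self.depth)
        merged.rows = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.rows, other.rows)
        ]
        return merged
```

Because the structure is linear in the frequency vector, merging per-shard sketches yields exactly the counters that a single pass over all the data would have produced.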
Stream sketches, sampling, and sabotage
Exact solutions are unattainable for important problems. The calculations are limited by the memory of our computers and the length of time that we can wait for a solution. The field of approximation algorithms has grown to address this problem; it is practically important and theoretically fascinating. We address three questions along these lines. What are the limits of streaming computation? Can we efficiently compute the likelihood of a given network of relationships? How robust are the solutions to combinatorial optimization problems?
High-speed network monitoring and rapid acquisition of scientific data require the development of space-efficient algorithms. In these settings it is impractical or impossible to store all of the data; nonetheless, the need to analyze it persists. Typically, the goal is to compute some simple statistics on the input using sublinear, or even polylogarithmic, space. Our main contribution here is the complete classification of the space necessary for several types of statistics. Our sharpest results characterize the complexity in terms of the domain size and stream length. Furthermore, our algorithms are universal for their respective classes of statistics.
A network of relationships, for example friendships or species-habitat pairings, can often be represented as a binary contingency table, which is a {0,1}-matrix with given row and column sums. A natural null model for hypothesis testing here is the uniform distribution on the set of binary contingency tables with the same line sums as the observation. However, exact calculation, asymptotic approximation, and even Monte-Carlo approximation of p-values are so far practically unattainable for many interesting examples. This thesis presents two new algorithms for sampling contingency tables. One is a hybrid algorithm that combines elements of two previously known algorithms. It is intended to exploit certain properties of the margins that are observed in some data sets. Our other algorithm samples from a larger set of tables, but it has the advantage of being fast.
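For readers unfamiliar with the object being sampled, the following is a minimal sketch of the standard "checkerboard swap" Markov chain on binary contingency tables. It is a commonly used baseline and is not either of the two algorithms the thesis presents; the function names and the step count in any call are hypothetical. Each move flips a 2x2 submatrix of the form [[1,0],[0,1]] or [[0,1],[1,0]], which leaves every row and column sum unchanged.

```python
import random


def checkerboard_swap_step(table, rng):
    """Attempt one swap move in place; return True if the table changed."""
    n_rows, n_cols = len(table), len(table[0])
    r1, r2 = rng.sample(range(n_rows), 2)  # two distinct rows
    c1, c2 = rng.sample(range(n_cols), 2)  # two distinct columns
    a, b = table[r1][c1], table[r1][c2]
    c, d = table[r2][c1], table[r2][c2]
    # Only a checkerboard pattern can be flipped without changing margins.
    if a == d and b == c and a != b:
        table[r1][c1], table[r1][c2] = 1 - a, 1 - b
        table[r2][c1], table[r2][c2] = 1 - c, 1 - d
        return True
    return False


def sample_by_swaps(table, n_steps, rng=None):
    """Run the swap chain for n_steps on a copy of `table` and return the copy."""
    rng = rng or random.Random()
    current = [row[:] for row in table]  # leave the caller's table untouched
    for _ in range(n_steps):
        checkerboard_swap_step(current, rng)
    return current
```

How many steps are needed for the output to be close to uniform depends on the margins, and no step count is suggested here; the difficulty of obtaining practical samplers is the motivation stated in the abstract.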
The robustness of a system can be assessed from optimal attack strategies. Interdiction problems ask about the worst-case impact of a limited change to an underlying optimization problem. Most interdiction problems are NP-hard, and furthermore, even designing efficient approximation algorithms that allow for estimating the order of magnitude of a worst-case impact has turned out to be very difficult. We suggest a general method to obtain pseudoapproximations for many interdiction problems
- …