    Estimating Entropy of Data Streams Using Compressed Counting

    Shannon entropy is a widely used summary statistic, with applications in, for example, network traffic measurement, anomaly detection, neural computations, and spike trains. This study focuses on estimating the Shannon entropy of data streams. It is known that Shannon entropy can be approximated by Rényi entropy or Tsallis entropy, both of which are functions of the p-th frequency moments and approach Shannon entropy as p->1. Compressed Counting (CC) is a new method for approximating the p-th frequency moments of data streams. Our contributions include: 1) We prove that Rényi entropy is (much) better than Tsallis entropy for approximating Shannon entropy. 2) We propose the optimal quantile estimator for CC, which considerably improves on the previous estimators. 3) Our experiments demonstrate that CC is indeed highly effective in approximating the moments and entropies. We also demonstrate the crucial importance of utilizing the variance-bias trade-off.
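
    As a small illustration of the relationship described above, the following Python sketch computes the Shannon, Rényi, and Tsallis entropies exactly from a table of item counts, using the standard textbook definitions. It does not implement Compressed Counting or the quantile estimator; the function names and the parameter name `alpha` (playing the role of p in the abstract) are assumptions made for this example.

```python
import math


def shannon_entropy(counts):
    """H = -sum_i p_i * log(p_i), where p_i = count_i / total."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def frequency_moment(counts, alpha):
    """F_alpha = sum_i count_i ** alpha (the alpha-th frequency moment)."""
    return sum(c ** alpha for c in counts if c > 0)


def renyi_entropy(counts, alpha):
    """H_alpha = log(sum_i p_i ** alpha) / (1 - alpha), for alpha != 1.

    Written via moments: sum_i p_i ** alpha = F_alpha / F_1 ** alpha.
    """
    total = sum(counts)
    log_power_sum = math.log(frequency_moment(counts, alpha)) - alpha * math.log(total)
    return log_power_sum / (1 - alpha)


def tsallis_entropy(counts, alpha):
    """T_alpha = (1 - sum_i p_i ** alpha) / (alpha - 1), for alpha != 1."""
    total = sum(counts)
    power_sum = frequency_moment(counts, alpha) / total ** alpha
    return (1 - power_sum) / (alpha - 1)
```

    For `alpha` close to 1, for instance `renyi_entropy([5, 3, 2], 1.001)` and `tsallis_entropy([5, 3, 2], 1.001)`, both values should be close to `shannon_entropy([5, 3, 2])`.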

    The Sketching Complexity of Graph and Hypergraph Counting

    Subgraph counting is a fundamental primitive in graph processing, with applications in social network analysis (e.g., estimating the clustering coefficient of a graph), database processing and other areas. The space complexity of subgraph counting has been studied extensively in the literature, but many natural settings are still not well understood. In this paper we revisit the subgraph (and hypergraph) counting problem in the sketching model, where the algorithm's state as it processes a stream of updates to the graph is a linear function of the stream. This model has recently received a lot of attention in the literature, and has become a standard model for solving dynamic graph streaming problems. In this paper we give a tight bound on the sketching complexity of counting the number of occurrences of a small subgraph H in a bounded degree graph G presented as a stream of edge updates. Specifically, we show that the space complexity of the problem is governed by the fractional vertex cover number of the graph H. Our subgraph counting algorithm implements a natural vertex sampling approach, with sampling probabilities governed by the vertex cover of H. Our main technical contribution lies in a new set of Fourier analytic tools that we develop to analyze multiplayer communication protocols in the simultaneous communication model, allowing us to prove a tight lower bound. We believe that our techniques are likely to find applications in other settings. Besides giving tight bounds for all graphs H, both our algorithm and lower bounds extend to the hypergraph setting, albeit with some loss in space complexity.
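
    The vertex sampling idea can be illustrated, in a much simplified form, for the special case where the subgraph is a triangle. The sketch below is not the paper's algorithm: it assumes an insertion-only edge stream, samples every vertex independently with the same hypothetical probability `prob` (rather than probabilities derived from a fractional vertex cover), and stores sampling decisions in a dictionary instead of a linear sketch. All function names are illustrative.

```python
import random


def count_triangles(edges):
    """Exactly count triangles in an undirected simple graph given as an edge list.

    Vertex labels are assumed to be mutually comparable (e.g. all ints).
    """
    edges = list(edges)
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    unique_edges = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    count = 0
    for u, v in unique_edges:
        # Here u < v; requiring w > v counts each triangle exactly once.
        count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count


def estimate_triangles(edge_stream, prob, seed=0):
    """Estimate the triangle count by keeping only edges between sampled vertices.

    A triangle survives when all three of its vertices are sampled, which
    happens with probability prob ** 3, so the surviving count is rescaled
    by 1 / prob ** 3 to obtain an unbiased estimate.
    """
    if not 0 < prob <= 1:
        raise ValueError("prob must lie in (0, 1]")
    rng = random.Random(seed)
    keep = {}  # vertex -> sampling decision, drawn the first time it is seen

    def sampled(vertex):
        if vertex not in keep:
            keep[vertex] = rng.random() < prob
        return keep[vertex]

    kept_edges = [(u, v) for u, v in edge_stream if sampled(u) and sampled(v)]
    return count_triangles(kept_edges) / prob ** 3
```

    With `prob=1.0` every vertex is kept and the estimate equals the exact count; smaller values store fewer edges at the cost of higher variance.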

    Separating k-Player from t-Player One-Way Communication, with Applications to Data Streams

    In a k-party communication problem, the k players with inputs x_1, x_2, ..., x_k, respectively, want to evaluate a function f(x_1, x_2, ..., x_k) using as little communication as possible. We consider the message-passing model, in which the inputs are partitioned in an arbitrary, possibly worst-case manner, among a smaller number t of players (t<k). The t-player communication cost of computing f can only be smaller than the k-player communication cost, since the t players can trivially simulate the k-player protocol. But how much smaller can it be? We study deterministic and randomized protocols in the one-way model, and provide separations for product input distributions, which are optimal for low error probability protocols. We also provide much stronger separations when the input distribution is non-product. A key application of our results is in proving lower bounds for data stream algorithms. In particular, we give an optimal Omega(epsilon^{-2} log(N) log log(mM)) bits of space lower bound for the fundamental problem of (1 +/- epsilon)-approximating the number ||x||_0 of non-zero entries of an n-dimensional vector x after m updates each of magnitude M, and with success probability >= 2/3, in a strict turnstile stream. Our result matches the best known upper bound when epsilon >= 1/polylog(mM). It also improves on the prior Omega(epsilon^{-2} log(mM)) lower bound and separates the complexity of approximating L_0 from approximating the p-norm L_p for p bounded away from 0, since the latter has an O(epsilon^{-2} log(mM)) bit upper bound.
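
    To make the problem statement concrete, the following sketch computes ||x||_0 exactly for a strict turnstile stream, i.e. one in which updates may be negative but no coordinate ever drops below zero. It uses space linear in the number of non-zero coordinates, so it is only a reference baseline and not a small-space algorithm of the kind the lower bound concerns; the function name and the `(index, delta)` update format are assumptions.

```python
from collections import defaultdict


def l0_after_updates(updates):
    """Return the number of non-zero coordinates of x after all updates.

    `updates` is an iterable of (index, delta) pairs applied as x[index] += delta.
    """
    x = defaultdict(int)
    for index, delta in updates:
        x[index] += delta
        if x[index] < 0:
            # Strict turnstile streams never drive a coordinate below zero.
            raise ValueError("coordinate became negative: not a strict turnstile stream")
        if x[index] == 0:
            del x[index]  # keep only non-zero coordinates in memory
    return len(x)
```

    For example, inserting into coordinates 0 and 1, deleting coordinate 0 entirely, and then inserting into coordinate 2 leaves two non-zero coordinates.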

    Finding structure in data streams: correlations, independent sets, and matchings

    The streaming model supposes that, rather than being available all at once, the data is received in a piecemeal fashion. In a world of massive data sets, streaming algorithms give a complementary approach to distributed algorithms: the data is all available in one place but at different times, rather than at the same time in different places. We examine three different single-pass streaming problems where existing results show limited feasibility. We consider realistic relaxations or restrictions of these problems which allow for more efficient algorithms. In the correlation outliers problem, we wish to identify pairs of unusually correlated signals from a streamed matrix of observations. We show that a simple application of an existing technique is space-optimal but has slow query time when the outlier threshold is small. We demonstrate how we can achieve faster query times at the cost of storing a larger data summary. In the maximum independent set problem, we wish to find an edge-less induced subgraph of maximum size. For arbitrary graphs, given as a stream of edges, it is known that no space-efficient algorithm exists. We consider a variant streaming model, where the graph is received vertex by vertex. While we show this model still does not admit efficient algorithms for general graphs, we demonstrate efficient approximation algorithms for various special graph classes. In the maximum matching problem, we wish to find a disjoint subset of edges of largest possible size. The greedy algorithm gives us an easy 2-approximation for streams of edges, but the problem becomes infeasible to solve if we allow unlimited edge deletions. We consider a model where, instead, a limited number of deletions are allowed. We describe several new approximation algorithms with complexity parameterised by the number of deletions. We also present new techniques which may lead to the development of corresponding tight lower bounds.
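
    The greedy 2-approximation mentioned for insertion-only edge streams is the standard maximal matching procedure, which can be sketched in a few lines of Python. The function name is an assumption, and the sketch covers only the insertion-only case, not the models with deletions studied in the thesis.

```python
def greedy_matching(edge_stream):
    """Single-pass greedy maximal matching over a stream of (u, v) edges.

    An edge is kept exactly when neither endpoint is already matched. The
    result is a maximal matching, so it has at least half as many edges as
    a maximum matching.
    """
    matched = set()   # vertices already covered by the matching
    matching = []     # edges chosen so far
    for u, v in edge_stream:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

    On the path 1-2-3-4, the arrival order matters: if the middle edge (2, 3) arrives first, greedy keeps only that edge, whereas the maximum matching {(1, 2), (3, 4)} has two edges, which shows that the factor of 2 can be attained.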

    Stream sketches, sampling, and sabotage

    Exact solutions are unattainable for important problems. The calculations are limited by the memory of our computers and the length of time that we can wait for a solution. The field of approximation algorithms has grown to address this problem; it is practically important and theoretically fascinating. We address three questions along these lines. What are the limits of streaming computation? Can we efficiently compute the likelihood of a given network of relationships? How robust are the solutions to combinatorial optimization problems? High-speed network monitoring and rapid acquisition of scientific data require the development of space-efficient algorithms. In these settings it is impractical or impossible to store all of the data; nonetheless, the need to analyze it persists. Typically, the goal is to compute some simple statistics on the input using sublinear, or even polylogarithmic, space. Our main contributions here are the complete classification of the space necessary for several types of statistics. Our sharpest results characterize the complexity in terms of the domain size and stream length. Furthermore, our algorithms are universal for their respective classes of statistics. A network of relationships, for example friendships or species-habitat pairings, can often be represented as a binary contingency table, which is a {0,1}-matrix with given row and column sums. A natural null model for hypothesis testing here is the uniform distribution on the set of binary contingency tables with the same line sums as the observation. However, exact calculation, asymptotic approximation, and even Monte Carlo approximation of p-values are so far practically unattainable for many interesting examples. This thesis presents two new algorithms for sampling contingency tables. One is a hybrid algorithm that combines elements of two previously known algorithms. It is intended to exploit certain properties of the margins that are observed in some data sets. Our other algorithm samples from a larger set of tables, but it has the advantage of being fast. The robustness of a system can be assessed from optimal attack strategies. Interdiction problems ask about the worst-case impact of a limited change to an underlying optimization problem. Most interdiction problems are NP-hard, and furthermore, even designing efficient approximation algorithms that allow for estimating the order of magnitude of a worst-case impact has turned out to be very difficult. We suggest a general method to obtain pseudoapproximations for many interdiction problems.

    On the Exact Space Complexity of Sketching and Streaming Small Norms

    The streaming model has supplied solutions for handling enormous data flows for over 20 years now. The model works with sequential data access and imposes sublinear memory as its primary restriction. Although the majority of the algorithms are randomized and approximate, the field facilitates numerous applications, from handling networking traffic to analyzing cosmology simulations and beyond. This thesis focuses on one of the most foundational and well-studied problems, finding heavy hitters, i.e. frequent items: 1. We challenge the long-lasting complexity gap in finding heavy hitters with an L2 guarantee in the insertion-only stream and present the first optimal algorithm, with a space complexity of O(1) words and O(1) update time. Our result improves on the Count Sketch algorithm of Charikar et al. 2002 [39], which has space and time complexity of O(log n). 2. We consider the L2-heavy hitter problem in the interval query setting, which is rapidly emerging in the field. Compared to the well-known sliding window model, where an algorithm is required to report the function of interest computed over the last N updates, the interval query model provides query flexibility: at any moment t one can query the function value on any interval (t1,t2)⊆(t−N,t). We present the first L2-heavy hitter algorithm in that model and extend the result to the estimation of all streamable functions of a frequency vector. 3. We provide an experimental study of the recent space-optimal result on streaming quantiles by Karnin et al. 2016 [85]. The problem can be considered a generalization of heavy hitters. Additionally, we suggest several variations of the algorithm which improve the running time from O(1/ε) to O(log 1/ε), provide a twice better space vs. precision trade-off, and extend the algorithm to the case of weighted updates. 4. We establish the connection between finding "halos", i.e. dense areas, in cosmology N-body simulations and finding heavy hitters. We build the first halo finder and scale it up to handle data sets with up to 10^12 particles via GPU boosting, sampling and parallel I/O. We investigate its behavior and compare it to traditional in-memory halo finders. Our solution pushes the memory footprint from several terabytes down to less than a gigabyte, thereby making the problem feasible for small servers and even desktops.
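
    For readers unfamiliar with the baseline mentioned in item 1, the following is a minimal Python sketch of a Count Sketch frequency estimator in the style of Charikar et al. It is not the thesis's optimal algorithm. The class name, the default parameters, and the simple modular hash functions (which assume non-negative integer items) are assumptions chosen for brevity.

```python
import random
import statistics


class CountSketch:
    """Minimal Count Sketch: several rows of signed counters, median of row estimates."""

    PRIME = (1 << 61) - 1  # large prime modulus for the hash functions

    def __init__(self, rows=5, width=256, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.tables = [[0] * width for _ in range(rows)]
        # Per row: coefficients (a, b) for the bucket hash and (c, d) for the sign hash.
        self.params = [
            tuple(rng.randrange(1, self.PRIME) for _ in range(4)) for _ in range(rows)
        ]

    def _bucket_and_sign(self, row, item):
        a, b, c, d = self.params[row]
        bucket = ((a * item + b) % self.PRIME) % self.width
        sign = 1 if ((c * item + d) % self.PRIME) % 2 == 0 else -1
        return bucket, sign

    def update(self, item, delta=1):
        """Process one stream update: the frequency of `item` changes by `delta`."""
        for row, table in enumerate(self.tables):
            bucket, sign = self._bucket_and_sign(row, item)
            table[bucket] += sign * delta

    def estimate(self, item):
        """Return the median over rows of the signed counter for `item`."""
        row_estimates = []
        for row, table in enumerate(self.tables):
            bucket, sign = self._bucket_and_sign(row, item)
            row_estimates.append(sign * table[bucket])
        return statistics.median(row_estimates)
```

    An item is then reported as a candidate heavy hitter when its estimate is large relative to the rest of the stream. With more rows and wider tables the estimates concentrate more tightly around the true frequencies, at the cost of more space.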
