18 research outputs found

    AMS Without 4-Wise Independence on Product Domains

    In their seminal work, Alon, Matias, and Szegedy introduced several sketching techniques, including showing that 4-wise independence is sufficient to obtain good approximations of the second frequency moment. In this work, we show that their sketching technique can be extended to product domains $[n]^k$ by using the product of 4-wise independent functions on $[n]$. Our work extends that of Indyk and McGregor, who showed the result for $k = 2$. Their primary motivation was the problem of identifying correlations in data streams. In their model, a stream of pairs $(i,j) \in [n]^2$ arrives, giving a joint distribution $(X,Y)$, and they find approximation algorithms for how close the joint distribution is to the product of the marginal distributions under various metrics, which naturally corresponds to how close $X$ and $Y$ are to being independent. By using our technique, we obtain a new result for the problem of approximating the $\ell_2$ distance between the joint distribution and the product of the marginal distributions for $k$-ary vectors, instead of just pairs, in a single pass. Our analysis gives a randomized algorithm that is a $(1 \pm \epsilon)$ approximation (with probability $1-\delta$) that requires space logarithmic in $n$ and $m$ and proportional to $3^k$.
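
The idea in this abstract, taking the sign of a $k$-tuple to be the product of $k$ independent 4-wise independent sign functions, can be illustrated with a small AMS-style estimator of the second frequency moment. This is a minimal sketch under stated assumptions, not the paper's algorithm: the class name `ProductAMS`, the degree-3 polynomial hash over the prime $2^{61}-1$, the use of the low bit as the sign (which is only approximately unbiased), and the median-of-means parameters `groups` and `per_group` are all illustrative choices.

```python
import random
import statistics

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash field (assumption)


def make_sign_hash(rng):
    """Return a sign function [n] -> {-1, +1} built from a random
    degree-3 polynomial mod PRIME (a standard 4-wise independent family).
    Taking the low bit as the sign is only approximately unbiased."""
    coeffs = [rng.randrange(PRIME) for _ in range(4)]

    def sign(x):
        acc = 0
        for c in coeffs:  # Horner evaluation of the polynomial
            acc = (acc * x + c) % PRIME
        return 1 if acc & 1 else -1

    return sign


class ProductAMS:
    """AMS-style F2 estimator over [n]^k with product sign functions."""

    def __init__(self, k, groups=9, per_group=64, seed=0):
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        total = groups * per_group
        # One independent list of k sign functions per counter.
        self.hashes = [[make_sign_hash(rng) for _ in range(k)]
                       for _ in range(total)]
        self.counters = [0] * total

    def update(self, item, delta=1):
        """Process one stream update for the k-tuple `item`."""
        for idx, hs in enumerate(self.hashes):
            s = 1
            for h, coord in zip(hs, item):
                s *= h(coord)  # product of per-coordinate signs
            self.counters[idx] += s * delta

    def estimate_f2(self):
        """Median of group means of squared counters."""
        means = []
        for g in range(self.groups):
            chunk = self.counters[g * self.per_group:(g + 1) * self.per_group]
            means.append(sum(c * c for c in chunk) / self.per_group)
        return statistics.median(means)
```

Because the product of signs is no longer 4-wise independent over $[n]^k$, the variance of each counter's square grows with $k$; the abstract's $3^k$ space factor reflects that cost, so `per_group` would need to scale accordingly in any serious use.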

    Finding Subcube Heavy Hitters in Analytics Data Streams

    Data streams typically have items with a large number of dimensions. We study the fundamental heavy-hitters problem in this setting. Formally, the data stream consists of $d$-dimensional items $x_1,\ldots,x_m \in [n]^d$. A $k$-dimensional subcube $T$ is a subset of distinct coordinates $\{ T_1,\cdots,T_k \} \subseteq [d]$. A subcube heavy hitter query ${\rm Query}(T,v)$, $v \in [n]^k$, outputs YES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T(v)$ is the fraction of stream items whose coordinates $T$ have joint values $v$. The all subcube heavy hitters query ${\rm AllQuery}(T)$ outputs all joint values $v$ that return YES to ${\rm Query}(T,v)$. The one-dimensional version of this problem, where $d=1$, has been heavily studied in data stream theory, databases, networking and signal processing. The subcube heavy hitters problem is applicable in all these cases. We present a simple reservoir sampling based one-pass streaming algorithm to solve the subcube heavy hitters problem in $\tilde{O}(kd/\gamma)$ space. This is optimal up to poly-logarithmic factors given the established lower bound. In the worst case, this is $\Theta(d^2/\gamma)$, which is prohibitive for large $d$, and our goal is to circumvent this quadratic bottleneck. Our main contribution is a model-based approach to the subcube heavy hitters problem. In particular, we assume that the dimensions are related to each other via the Naive Bayes model, with or without a latent dimension. Under this assumption, we present a new two-pass, $\tilde{O}(d/\gamma)$-space algorithm for our problem, and a fast algorithm for answering ${\rm AllQuery}(T)$ in $O(k/\gamma^2)$ time. Our work develops the direction of model-based data stream analysis, with much that remains to be explored. Comment: To appear in WWW 201
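
The reservoir sampling baseline mentioned in this abstract can be sketched roughly as follows. This is an illustration under assumptions rather than the authors' implementation: the class name `SubcubeReservoir`, the 0-indexed coordinates in `T`, and the decision threshold of $\gamma/2$ (any value strictly between $\gamma/4$ and $\gamma$ would serve the same purpose) are hypothetical, and the sample size needed for the stated guarantee is not derived here.

```python
import random


class SubcubeReservoir:
    """Uniform reservoir sample of d-dimensional items, used to answer
    subcube heavy hitter queries approximately."""

    def __init__(self, sample_size, seed=0):
        self.sample_size = sample_size
        self.rng = random.Random(seed)
        self.sample = []
        self.seen = 0

    def update(self, item):
        """Classic reservoir sampling: keep each item with prob. s/seen."""
        self.seen += 1
        if len(self.sample) < self.sample_size:
            self.sample.append(tuple(item))
        else:
            j = self.rng.randrange(self.seen)
            if j < self.sample_size:
                self.sample[j] = tuple(item)

    def estimate(self, T, v):
        """Fraction of sampled items whose coordinates T equal v."""
        if not self.sample:
            return 0.0
        v = tuple(v)
        hits = sum(1 for x in self.sample
                   if tuple(x[t] for t in T) == v)
        return hits / len(self.sample)

    def query(self, T, v, gamma):
        """YES/NO answer; the gamma/2 cutoff is an assumed threshold."""
        return self.estimate(T, v) >= gamma / 2
```

For example, after feeding a stream through `update`, a call such as `query((0, 2), (1, 3), gamma=0.1)` asks whether the joint value `(1, 3)` on coordinates 0 and 2 is a heavy hitter. One sample serves every subcube `T`, which is why a single pass suffices.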

    Approximating Subadditive Hadamard Functions on Implicit Matrices

    An important challenge in the streaming model is to maintain small-space approximations of entrywise functions performed on a matrix that is generated by the outer product of two vectors given as a stream. In other works, streams typically define matrices in a standard way via a sequence of updates, as in the work of Woodruff (2014) and others. We describe the matrix formed by the outer product, and other matrices that do not fall into this category, as implicit matrices. As such, we consider the general problem of computing over such implicit matrices with Hadamard functions, which are functions applied entrywise on a matrix. In this paper, we apply this generalization to provide new techniques for identifying independence between two vectors in the streaming model. The previous state of the art algorithm of Braverman and Ostrovsky (2010) gave a $(1 \pm \epsilon)$-approximation for the $L_1$ distance between the product and joint distributions, using space $O(\log^{1024}(nm) \epsilon^{-1024})$, where $m$ is the length of the stream and $n$ denotes the size of the universe from which stream elements are drawn. Our general techniques include the $L_1$ distance as a special case, and we give an improved space bound of $O(\log^{12}(n) \log^{2}({nm \over \epsilon})\epsilon^{-7})$.
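
For reference, the $L_1$ distance between the joint and product distributions can be written out as below. The notation is an assumption chosen for illustration and does not come from the abstract: $f_{ij}$ is the number of times the pair $(i,j)$ appears in a stream of length $m$.

```latex
\[
  p_{ij} = \frac{f_{ij}}{m}, \qquad
  p_i = \sum_{j=1}^{n} p_{ij}, \qquad
  q_j = \sum_{i=1}^{n} p_{ij},
\]
\[
  \bigl\| \text{joint} - \text{product} \bigr\|_1
  = \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl| p_{ij} - p_i \, q_j \bigr| .
\]
```

The distance is zero exactly when $p_{ij} = p_i q_j$ for all $i, j$, that is, when the two coordinates are independent. The matrix with entries $p_i q_j$ is the outer product of the two marginal vectors and is never given explicitly in the stream, which is the sense in which the abstract calls it an implicit matrix.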

    Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space

    We prove that Fp sketch, a well-celebrated streaming algorithm for frequency moments estimation, is differentially private as is when p ∈ (0, 1]. Fp sketch uses only polylogarithmic space, exponentially better than existing DP baselines and only worse than the optimal non-private baseline by a logarithmic factor. The evaluation shows that Fp sketch can achieve reasonable accuracy with a differential privacy guarantee. The evaluation code is included in the supplementary material.

    Correlation clustering in data streams

    In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, O(n·polylog n)-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. However, the standard LP and SDP formulations are not obviously solvable in O(n·polylog n)-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling. Note that the improved space and running-time bounds achieved from streaming algorithms are also useful for offline settings such as MapReduce models.

    Private Data Stream Analysis for Universal Symmetric Norm Estimation

    We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include L_p norms, k-support norms, top-k norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which the approximation of all symmetric norms can be simultaneously computed. Our framework partitions the coordinates of the underlying frequency vector into different levels based on their magnitude and releases approximate frequencies for the "heavy" coordinates in important levels and releases approximate level sizes for the "light" coordinates in important levels. Surprisingly, our mechanism allows for the release of an arbitrary number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism permits (1+?)-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters