
    Subspace exploration: Bounds on Projected Frequency Estimation

    Given an $n \times d$ dimensional dataset $A$, a projection query specifies a subset $C \subseteq [d]$ of columns, which yields a new $n \times |C|$ array. We study the space complexity of computing data analysis functions over such subspaces, including heavy hitters and norms, when the subspaces are revealed only after observing the data. We show that this important class of problems is typically hard: for many problems, we show $2^{\Omega(d)}$ lower bounds. However, we present upper bounds which demonstrate space dependency better than $2^d$. That is, for $c, c' \in (0,1)$ and a parameter $N = 2^d$, an $N^c$-approximation can be obtained in space $\min(N^{c'}, n)$, showing that it is possible to improve on the naïve approach of keeping information for all $2^d$ subsets of $d$ columns. Our results are based on careful constructions of instances using coding theory and novel combinatorial reductions that exhibit such space-approximation tradeoffs.
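    To make the query model concrete, here is a minimal Python sketch of a projection query followed by heavy-hitter computation on the projected rows. This is an illustration of the problem setup only, not the paper's sublinear-space algorithm; the helper names `project` and `heavy_hitters` are hypothetical.

```python
from collections import Counter

def project(A, C):
    """Restrict each row of A to the column subset C (a tuple of indices)."""
    return [tuple(row[j] for j in C) for row in A]

def heavy_hitters(rows, phi):
    """Return the rows whose frequency is at least phi * len(rows)."""
    counts = Counter(rows)
    threshold = phi * len(rows)
    return {r for r, c in counts.items() if c >= threshold}

# A 6 x 3 dataset; the query reveals the column subset C = (0, 2)
# only after the data has been observed.
A = [
    (1, 0, 1),
    (1, 1, 1),
    (1, 0, 1),
    (0, 1, 0),
    (1, 2, 1),
    (0, 0, 0),
]
C = (0, 2)
print(heavy_hitters(project(A, C), phi=0.5))  # {(1, 1)}: 4 of 6 rows project to (1, 1)
```

    The difficulty the abstract refers to is that this exact computation needs the full data (space $n$), while precomputing answers for every possible query touches all $2^d$ subsets; the paper's tradeoff sits between these extremes.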

    Sampling Sketches for Concave Sublinear Functions of Frequencies

    We consider massive distributed datasets that consist of elements modeled as key-value pairs, and the task of computing statistics or aggregates where the contribution of each key is weighted by a function of its frequency (the sum of the values of its elements). This fundamental problem has a wealth of applications in data analytics and machine learning, in particular with concave sublinear functions of the frequencies, which mitigate the disproportionate effect of keys with high frequency. The family of concave sublinear functions includes low frequency moments ($p \leq 1$), capping, logarithms, and their compositions. A common approach is to sample keys, ideally proportionally to their contributions, and estimate statistics from the sample. A simple but costly way to do this is to aggregate the data into a table of keys and their frequencies, apply the function to the frequency values, and then apply a weighted sampling scheme. Our main contribution is the design of composable sampling sketches that can be tailored to any concave sublinear function of the frequencies. Our sketch structure size is very close to the desired sample size, and our samples provide statistical guarantees on the estimation quality that are very close to those of an ideal sample of the same size computed over aggregated data. Finally, we demonstrate experimentally the simplicity and effectiveness of our methods.

    Comment: Full version of a NeurIPS 2019 paper.
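    The "aggregate, reweight, then sample" baseline described in the abstract can be sketched as follows. This is a toy illustration of the costly exact approach the paper improves upon, not the composable sketches themselves; the function names are illustrative.

```python
import math
import random
from collections import defaultdict

def aggregate(pairs):
    """Sum the values of each key's elements to obtain its frequency."""
    freq = defaultdict(float)
    for key, value in pairs:
        freq[key] += value
    return dict(freq)

def weighted_sample(weights, k, rng):
    """Draw k keys with probability proportional to their weights (with replacement)."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[key] for key in keys], k=k)

# Two concave sublinear functions from the family mentioned in the abstract:
# a frequency moment with p <= 1, and capping at a threshold T.
def moment(p):
    return lambda w: w ** p

def cap(T):
    return lambda w: min(w, T)

pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 2), ("a", 1), ("b", 1)]
freq = aggregate(pairs)                                  # {"a": 3.0, "b": 2.0, "c": 2.0}
weights = {k: moment(0.5)(w) for k, w in freq.items()}   # sqrt of each frequency
sample = weighted_sample(weights, k=4, rng=random.Random(0))
```

    The costly step is `aggregate`: on a massive distributed dataset it requires materializing the full key-frequency table before any sampling can happen, which is exactly what a composable sketch avoids.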