
    Optimal Sketching Bounds for Sparse Linear Regression

    We study oblivious sketching for $k$-sparse linear regression under various loss functions, such as an $\ell_p$ norm or a loss from a broad class of hinge-like loss functions that includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression there is a distribution over oblivious sketches with $\Theta(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This extends to $\ell_p$ loss with an additional additive $O(k\log(k/\varepsilon)/\varepsilon^2)$ term in the upper bound. This establishes a surprising separation from the related sparse recovery problem, an important special case of sparse regression: for that problem, under the $\ell_2$ norm, we observe an upper bound of $O(k\log(d)/\varepsilon + k\log(k/\varepsilon)/\varepsilon^2)$ rows, showing that sparse recovery is strictly easier to sketch than sparse regression. For sparse regression under hinge-like loss functions, including sparse logistic and sparse ReLU regression, we give the first known sketching bounds achieving $o(d)$ rows, showing that $O(\mu^2 k\log(\mu n d/\varepsilon)/\varepsilon^2)$ rows suffice, where $\mu$ is a natural complexity parameter needed to obtain relative-error bounds for these loss functions. We again show that this dimension is tight, up to lower-order terms and the dependence on $\mu$. Finally, we show that similar sketching bounds can be achieved for LASSO regression, a popular convex relaxation of sparse regression, in which one minimizes $\|Ax-b\|_2^2+\lambda\|x\|_1$ over $x\in\mathbb{R}^d$. We show that sketching dimension $O(\log(d)/(\lambda\varepsilon)^2)$ suffices and that the dependence on $d$ and $\lambda$ is tight.
    Comment: AISTATS 202
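
    To make the sketch-and-solve paradigm concrete, here is a minimal NumPy illustration of oblivious $\ell_2$ sketching: the sketch $S$ is drawn independently of $(A, b)$, and the regression is solved on the much smaller pair $(SA, Sb)$. This sketch assumes a plain Gaussian sketching matrix and unconstrained least squares for simplicity; the paper's sketch distributions and the $k$-sparse constraint are not reproduced here.

```python
# Sketch-and-solve for l2 regression: an illustrative example of oblivious
# sketching. Assumes a dense Gaussian sketch, not the paper's construction.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 50, 400          # m = number of sketch rows, m << n

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Oblivious sketch: S is drawn independently of (A, b).
S = rng.standard_normal((m, n)) / np.sqrt(m)

x_full, *_ = np.linalg.lstsq(A, b, rcond=None)          # exact solve
x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)    # sketched solve

err_full = np.linalg.norm(A @ x_full - b)
err_sk = np.linalg.norm(A @ x_sk - b)
print(f"relative cost increase: {err_sk / err_full - 1:.4f}")
```

    With $m$ on the order of a few hundred rows, the sketched solution's residual cost is typically within a small relative factor of the optimum, which is the kind of $(1+\varepsilon)$-approximation guarantee the bounds above quantify.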

    Time lower bounds for nonadaptive turnstile streaming algorithms

    We say a turnstile streaming algorithm is "non-adaptive" if, during updates, the memory cells written and read depend only on the index being updated and on random coins tossed at the beginning of the stream (and not on the memory contents of the algorithm); memory cells read during queries may be chosen adaptively. All known turnstile streaming algorithms in the literature are non-adaptive. We prove the first non-trivial update-time lower bounds for both randomized and deterministic turnstile streaming algorithms, which hold when the algorithms are non-adaptive. While there has been abundant success in proving space lower bounds, no non-trivial update-time lower bounds were previously known in the turnstile model. Our lower bounds apply to classically studied problems such as heavy hitters, point query, entropy estimation, and moment estimation. For some deterministic algorithms, our lower bounds nearly match known upper bounds.
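
    For intuition, the following CountSketch-style point-query structure is a typical example of a non-adaptive turnstile algorithm in the above sense: the cells an update touches are a function only of the updated index and of randomness fixed before the stream begins. This is an illustrative sketch, not a construction from the paper; hashing stands in for the pairwise-independent hash families usually used.

```python
# A CountSketch-style turnstile data structure. Updates are non-adaptive:
# the cells written depend only on the index i and on seeds fixed up front.
import random

R, W = 5, 1024                      # rows x width of the sketch table
random.seed(0)
# Randomness tossed once, before the stream: per-row location and sign seeds.
seeds = [(random.getrandbits(64), random.getrandbits(64)) for _ in range(R)]
table = [[0] * W for _ in range(R)]

def _cell(i, r):
    h_seed, s_seed = seeds[r]
    col = hash((h_seed, i)) % W              # location fixed by (i, seeds) only
    sign = 1 if hash((s_seed, i)) % 2 else -1
    return col, sign

def update(i, delta):                        # turnstile update: v[i] += delta
    for r in range(R):
        col, sign = _cell(i, r)
        table[r][col] += sign * delta

def point_query(i):                          # estimate v[i]; reads may be adaptive
    ests = []
    for r in range(R):
        col, sign = _cell(i, r)
        ests.append(sign * table[r][col])
    ests.sort()
    return ests[R // 2]                      # median of the row estimates

update(7, 3); update(7, 2); update(9, -1)
print(point_query(7))                        # ~5 (exact barring hash collisions)
```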

    Taming Big Data By Streaming

    Data streams have emerged as a natural computational model for numerous applications of big data processing. In this model, algorithms are assumed to have access to a limited amount of memory and can make only a single pass (or a few passes) over the data, yet they need to produce sufficiently accurate answers for some objective functions on the dataset. This model captures various real-world applications and stimulates new scalable tools for solving important problems in the big data era. This dissertation focuses on the following two aspects of the streaming model.

    1. Understanding the capability of the streaming model. For a vector aggregation stream, i.e., when the stream is a sequence of updates to an underlying $n$-dimensional vector $v$ (for very large $n$), we establish nearly tight space bounds on streaming algorithms approximating functions of the form $\sum_{i=1}^n g(v_i)$ for nearly all one-variable functions $g$, and $l(v)$ for all symmetric norms $l$ (the sketch after this abstract illustrates one special case). These results provide a deeper understanding of the streaming computation model.

    2. Tighter upper bounds. We provide better streaming $k$-median clustering algorithms for a dynamic stream of points, i.e., a stream of insertions and deletions of points in a discrete Euclidean space ($[\Delta]^d$ for sufficiently large $\Delta$ and $d$). Our algorithms use $k \cdot \mathrm{poly}(d \log \Delta)$ space and update time, and with high probability maintain an approximate $k$-median solution to the streaming dataset. All previous algorithms for computing an approximation to the $k$-median problem over dynamic data streams required space and update time exponential in $d$.
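
    As one concrete instance of approximating $\sum_{i=1}^n g(v_i)$ over a turnstile vector-aggregation stream, the classic AMS sketch estimates the second frequency moment $F_2 = \sum_i v_i^2$, i.e., the case $g(x) = x^2$. The minimal version below is illustrative and not the dissertation's construction; hashing stands in for the 4-wise independent sign functions the analysis requires.

```python
# AMS-style estimator for F2 = sum_i v[i]^2 over a turnstile stream
# (illustrative sketch; hashing idealizes 4-wise independent signs).
import random

K = 200                              # number of independent estimators
random.seed(1)
sign_seeds = [random.getrandbits(64) for _ in range(K)]
acc = [0.0] * K                      # one running inner product <s_j, v> each

def sign(j, i):
    return 1 if hash((sign_seeds[j], i)) % 2 else -1

def update(i, delta):                # turnstile update: v[i] += delta
    for j in range(K):
        acc[j] += sign(j, i) * delta

def f2_estimate():                   # mean of squares concentrates around F2
    return sum(z * z for z in acc) / K

for i, d in [(1, 3), (2, -2), (1, 1), (5, 4)]:
    update(i, d)
print(f2_estimate())                 # true F2 = 4^2 + (-2)^2 + 4^2 = 36
```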