221 research outputs found
Optimal Sketching Bounds for Sparse Linear Regression
We study oblivious sketching for $k$-sparse linear regression under various
loss functions such as an $\ell_p$ norm, or from a broad class of hinge-like
loss functions, which includes the logistic and ReLU losses. We show that for
sparse $\ell_2$ norm regression, there is a distribution over oblivious
sketches with $\Theta(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a
constant factor. This extends to $\ell_p$ loss with an additional additive
$O(k\log(k/\varepsilon)/\varepsilon^2)$ term in the upper bound. This
establishes a surprising separation from the related sparse recovery problem,
which is an important special case of sparse regression. For this problem,
under the $\ell_2$ norm, we observe an upper bound of
$O(k\log(d)/\varepsilon + k\log(k/\varepsilon)/\varepsilon^2)$ rows, showing that sparse recovery is
strictly easier to sketch than sparse regression. For sparse regression under
hinge-like loss functions including sparse logistic and sparse ReLU regression,
we give the first known sketching bounds that achieve $o(d)$ rows, showing that
$\tilde{O}(\mu^2 k/\varepsilon^2)$ rows suffice, where $\mu$
is a natural complexity parameter needed to obtain relative error bounds for
these loss functions. We again show that this dimension is tight, up to lower
order terms and the dependence on $\mu$. Finally, we show that similar
sketching bounds can be achieved for LASSO regression, a popular convex
relaxation of sparse regression, where one aims to minimize
$\|Ax-b\|_2^2 + \lambda\|x\|_1$ over $x \in \mathbb{R}^d$. We show that sketching
dimension $O(\log(d)/(\lambda\varepsilon)^2)$ suffices and that the dependence
on $d$ and $\lambda$ is tight.
Comment: AISTATS 2023
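To make the sketching setup concrete, here is a minimal numpy illustration (ours, not from the paper): an oblivious Gaussian sketch $S$ is drawn independently of the data $(A, b)$, and the $k$-sparse problem is solved on the compressed pair $(SA, Sb)$. The brute-force support enumeration stands in for a real sparse solver and is only viable for tiny $k$ and $d$.

```python
import itertools
import numpy as np

def sparse_regression(A, b, k):
    """Brute-force k-sparse least squares: try every size-k support.
    Exponential in k; for illustration only."""
    d = A.shape[1]
    best_x, best_err = None, np.inf
    for support in itertools.combinations(range(d), k):
        cols = list(support)
        xs, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
        err = np.linalg.norm(A[:, cols] @ xs - b)
        if err < best_err:
            best_err = err
            best_x = np.zeros(d)
            best_x[cols] = xs
    return best_x, best_err

rng = np.random.default_rng(0)
n, d, k, eps = 2000, 30, 2, 0.5

# Ground-truth k-sparse model plus noise.
x_true = np.zeros(d)
x_true[:k] = [3.0, -2.0]
A = rng.standard_normal((n, d))
b = A @ x_true + 0.1 * rng.standard_normal(n)

# Oblivious Gaussian sketch: S is drawn independently of (A, b),
# with m = O(k log(d/k) / eps^2) rows as in the bound above.
m = int(np.ceil(k * np.log(d / k) / eps**2))
S = rng.standard_normal((m, n)) / np.sqrt(m)

# Solve the k-sparse problem on the compressed pair (SA, Sb).
x_sk, _ = sparse_regression(S @ A, S @ b, k)
x_opt, _ = sparse_regression(A, b, k)

cost_sk = np.linalg.norm(A @ x_sk - b)
cost_opt = np.linalg.norm(A @ x_opt - b)
print(f"m = {m} sketched rows vs n = {n}")
print(f"relative cost increase: {cost_sk / cost_opt - 1:.3f}")  # typically on the order of eps
```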
Time lower bounds for nonadaptive turnstile streaming algorithms
We say a turnstile streaming algorithm is "non-adaptive" if, during updates,
the memory cells written and read depend only on the index being updated and
random coins tossed at the beginning of the stream (and not on the memory
contents of the algorithm). Memory cells read during queries may be decided
upon adaptively. All known turnstile streaming algorithms in the literature are
non-adaptive.
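As a concrete instance of this definition (our illustration, not from the paper), here is a Count-Min-style turnstile sketch in Python: every update touches cells determined solely by the updated index and by hash seeds drawn before the stream, so it is non-adaptive in the above sense, while the query is free to read cells adaptively.

```python
import random

class CountMin:
    """Count-Min sketch over a turnstile stream of (index, delta) updates.
    All randomness is fixed up front, so which cells an update touches
    depends only on the index i -- the non-adaptive property above."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)            # coins tossed before the stream
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # One random hash seed per row, fixed once and for all.
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]

    def _cell(self, row, i):
        # Cell location is a function of (row seed, i) only,
        # never of the current table contents.
        return hash((self.seeds[row], i)) % self.width

    def update(self, i, delta):              # non-adaptive writes
        for r in range(len(self.table)):
            self.table[r][self._cell(r, i)] += delta

    def query(self, i):                      # reads may be adaptive
        return min(self.table[r][self._cell(r, i)]
                   for r in range(len(self.table)))

cm = CountMin(width=256, depth=4)
cm.update(7, +5); cm.update(7, -2); cm.update(3, +1)
print(cm.query(7))  # estimate of x_7 = 3 (one-sided error when entries stay nonnegative)
```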
We prove the first non-trivial update time lower bounds for both randomized
and deterministic turnstile streaming algorithms, which hold when the
algorithms are non-adaptive. While there has been abundant success in proving
space lower bounds, there have been no non-trivial update time lower bounds in
the turnstile model. Our lower bounds hold against classically studied problems
such as heavy hitters, point query, entropy estimation, and moment estimation.
In some cases, our lower bounds for deterministic algorithms nearly match known
upper bounds.
Taming Big Data By Streaming
Data streams have emerged as a natural computational model for numerous applications of big data processing. In this model, algorithms are assumed to have access to a limited amount of memory and can only make a single pass (or a few passes) over the data, but need to produce sufficiently accurate answers for some objective functions on the dataset. This model captures various real-world applications and stimulates new scalable tools for solving important problems in the big data era.
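As a minimal illustration of the model (our example, not from the dissertation): a single-pass algorithm that uses $O(1)$ words of memory, namely Welford's method for the running mean and variance of a numeric stream.

```python
def streaming_mean_var(stream):
    """One pass, O(1) words of memory: Welford's online mean/variance.
    Each stream element is seen once and then discarded."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / n if n else float("nan")

print(streaming_mean_var(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])))
# -> (5.0, 4.0)
```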
This dissertation focuses on the following two aspects of the streaming model.
1. Understanding the capability of the streaming model.
For a vector aggregation stream, i.e., when the stream is a sequence of updates to an underlying $n$-dimensional vector $x$ (for very large $n$), we establish nearly tight space bounds on streaming algorithms approximating functions of the form $\sum_{i=1}^{n} g(|x_i|)$ for nearly all functions $g$ of one variable and for all symmetric norms $l(x)$.
These results provide a deeper understanding of the streaming computation model.
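For one concrete instance of this setting (an illustrative sketch of ours, not the dissertation's algorithm), the classic AMS estimator approximates the symmetric norm $l(x) = \|x\|_2$, i.e., $g(t) = t^2$ summed over coordinates, from turnstile updates using far fewer counters than $n$:

```python
import numpy as np

def ams_l2(updates, n, reps=300, seed=0):
    """AMS sketch: approximates ||x||_2^2 for the vector x defined by a
    stream of (i, delta) turnstile updates, using `reps` counters instead
    of n. In theory the signs come from 4-wise independent hash seeds, so
    memory stays O(reps); we store a full sign matrix here for brevity."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(reps, n))
    z = np.zeros(reps)               # the `reps` counters: the whole memory
    for i, delta in updates:         # single pass over the stream
        z += signs[:, i] * delta
    return float(np.mean(z**2))      # E[z_r^2] = ||x||_2^2

n = 1000
updates = [(3, 5.0), (3, -2.0), (7, 4.0), (42, 1.0)]  # x_3=3, x_7=4, x_42=1
print(ams_l2(updates, n))            # ~ 26 = 3^2 + 4^2 + 1^2
```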
2. Tighter upper bounds.
We provide better streaming $k$-median clustering algorithms in a dynamic point stream, i.e., a stream of insertions and deletions of points on a discrete Euclidean space $[\Delta]^d$ (for sufficiently large $\Delta$ and $d$).
Our algorithms use $k\cdot\poly(d \log \Delta)$ space/update time and maintain with high probability an approximate $k$-median solution to the streaming dataset. All previous algorithms for computing an approximation for the $k$-median problem over dynamic data streams required space and update time exponential in $d$.
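To fix the dynamic-stream model, here is a toy Python baseline (ours, illustrative only: it keeps exact per-cell counts, so it does not attempt the $k\cdot\poly(d \log \Delta)$ space of the algorithms above). Insertions and deletions maintain a multiset of grid points, and a Lloyd-style weighted heuristic with $\ell_1$ distances computes an approximate $k$-median of the survivors.

```python
import random
from collections import Counter

def process(stream):
    """Maintain the multiset of currently-present points under
    insertions (+1) and deletions (-1). Exact counts: space grows with
    the number of distinct points, unlike the streaming algorithms above."""
    counts = Counter()
    for op, p in stream:
        counts[p] += op
        if counts[p] == 0:
            del counts[p]
    return counts

def median(vals):
    return sorted(vals)[len(vals) // 2]

def weighted_k_median(counts, k, iters=20, seed=0):
    """Lloyd-style heuristic for weighted k-median under l1 distance."""
    l1 = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))
    pts = list(counts)
    centers = random.Random(seed).sample(pts, k)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in pts:
            clusters[min(range(k), key=lambda j: l1(p, centers[j]))].append(p)
        # the coordinate-wise weighted median minimizes the l1 cost per cluster
        for j, ps in enumerate(clusters):
            if ps:
                centers[j] = tuple(
                    median([p[t] for p in ps for _ in range(counts[p])])
                    for t in range(len(ps[0])))
    return centers

# Points on the grid [Delta]^d with d = 2; insert then delete (50, 50).
stream = [(+1, (5, 5)), (+1, (6, 5)), (+1, (90, 90)), (+1, (91, 92)),
          (+1, (50, 50)), (-1, (50, 50))]
print(weighted_k_median(process(stream), k=2))  # roughly one center per cluster
```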