561 research outputs found

    Input Sparsity and Hardness for Robust Subspace Approximation

    Full text link
    In the subspace approximation problem, we seek a k-dimensional subspace F of R^d that minimizes the sum of p-th powers of Euclidean distances to a given set of n points a_1, ..., a_n in R^d, for p >= 1. More generally than minimizing sum_i dist(a_i,F)^p, we may wish to minimize sum_i M(dist(a_i,F)) for some loss function M(), for example an M-estimator such as the Huber or Tukey loss function. Such subspaces provide alternatives to the singular value decomposition (SVD), which is the p = 2 case, finding an F that minimizes the sum of squares of distances. For p in [1,2), and for typical M-estimators, the minimizing F gives a solution that is more robust to outliers than that provided by the SVD. We give several algorithmic and hardness results for these robust subspace approximation problems. We think of the n points as forming an n x d matrix A, and let nnz(A) denote the number of non-zero entries of A. Our results hold for p in [1,2). We use poly(n) to denote n^{O(1)} as n -> infty. We obtain: (1) For minimizing sum_i dist(a_i,F)^p, we give an algorithm running in O(nnz(A) + (n+d) poly(k/eps) + exp(poly(k/eps))) time. (2) We show that the problem of minimizing sum_i dist(a_i,F)^p is NP-hard, even to output a (1+1/poly(d))-approximation, answering a question of Kannan and Vempala, and complementing prior results which held for p > 2. (3) For the loss functions of a wide class of M-estimators, we give a problem-size reduction: for a parameter K = (log n)^{O(log k)}, our reduction takes O(nnz(A) log n + (n+d) poly(K/eps)) time to reduce the problem to a constrained version involving matrices whose dimensions are poly(K eps^{-1} log n). We also give bicriteria solutions. (4) Our techniques lead to the first O(nnz(A) + poly(d/eps)) time algorithms for (1+eps)-approximate regression for a wide class of convex M-estimators. Comment: paper appeared in FOCS, 2015
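
    To make the objective concrete, here is a minimal Python sketch (not the paper's nnz(A)-time algorithm) that evaluates sum_i dist(a_i,F)^p and a Huber variant for a candidate subspace F given by an orthonormal basis V; all names and parameters are illustrative. The SVD subspace used in the demo is the exact minimizer only for p = 2.

```python
import numpy as np

# Illustrative evaluator for the abstract's objectives; this is a plain
# cost function, not the paper's sketching-based algorithm.

def subspace_cost(A, V, p=1.0):
    """Sum of p-th powers of distances from the rows of A to the
    k-dimensional subspace spanned by the orthonormal columns of V."""
    residuals = A - (A @ V) @ V.T      # remove each row's projection onto span(V)
    dists = np.linalg.norm(residuals, axis=1)
    return np.sum(dists ** p)

def huber_cost(A, V, delta=1.0):
    """Same objective with the Huber M-estimator applied to each distance."""
    d = np.linalg.norm(A - (A @ V) @ V.T, axis=1)
    return np.sum(np.where(d <= delta, 0.5 * d**2, delta * (d - 0.5 * delta)))

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
A[0] *= 50                              # one gross outlier
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:2].T                            # the k = 2 SVD subspace (optimal for p = 2)
print(subspace_cost(A, V, p=2.0), subspace_cost(A, V, p=1.0), huber_cost(A, V))
```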

    Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality

    Full text link
    We study the tradeoff between the statistical error and communication cost of distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the m machines receives n data points from a d-dimensional Gaussian distribution with unknown mean θ, which is promised to be k-sparse. The machines communicate by message passing and aim to estimate the mean θ. We provide a tight (up to logarithmic factors) tradeoff between the estimation error and the number of bits communicated between the machines. This directly leads to a lower bound for the distributed sparse linear regression problem: to achieve the statistical minimax error, the total communication is at least Ω(min{n, d} m), where n is the number of observations that each machine receives and d is the ambient dimension. These lower bounds improve upon [Sha14, SD'14] by allowing a multi-round, iterative communication model. We also give the first optimal simultaneous protocol in the dense case for mean estimation. As our main technique, we prove a distributed data processing inequality, a generalization of the usual data processing inequality, which might be of independent interest and useful for other problems. Comment: To appear at STOC 2016. Fixed typos in Theorem 4.5 and incorporated reviewers' suggestions
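
    As a point of reference for the communication costs discussed here, the following Python sketch implements the naive simultaneous protocol: each machine sends one quantized local mean (d · b bits) and the coordinator averages the messages. This is purely illustrative, not the paper's optimal protocol; the quantization scheme and all parameters are assumptions.

```python
import numpy as np

def quantize(v, bits=8, scale=4.0):
    """Uniform b-bit quantization of each coordinate to the range [-scale, scale]."""
    levels = 2 ** bits - 1
    idx = np.round((np.clip(v, -scale, scale) + scale) / (2 * scale) * levels)
    return idx / levels * 2 * scale - scale

def simultaneous_mean(machine_data, bits=8):
    """Each machine sends one quantized local mean; the coordinator averages."""
    return np.mean([quantize(X.mean(axis=0), bits) for X in machine_data], axis=0)

# m machines, n points each, from a d-dimensional Gaussian with k-sparse mean.
rng = np.random.default_rng(1)
m, n, d, k = 20, 50, 100, 5
theta = np.zeros(d)
theta[:k] = 1.0
data = [theta + rng.normal(size=(n, d)) for _ in range(m)]
est = simultaneous_mean(data)
print("squared error:", np.sum((est - theta) ** 2))
print("total communication (bits):", m * d * 8)
```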

    Las Vegas Academy Jazz Band III: A Celebration of Black History Month

    Full text link
    Program listing performers and works performed

    Computing confidence intervals on solution costs for stochastic grid generation expansion problems.

    Get PDF
    A range of core operations and planning problems for the national electrical grid are naturally formulated and solved as stochastic programming problems, which minimize expected costs subject to a range of uncertain outcomes relating to, for example, uncertain demands or generator output. A critical decision issue relating to such stochastic programs is: how many scenarios are required to ensure a specific error bound on the solution cost? Scenarios are the key mechanism used to sample from the uncertainty space, and the number of scenarios drives computational difficulty. We explore this question in the context of a long-term grid generation expansion problem, using a bounding procedure introduced by Mak, Morton, and Wood. We discuss experimental results using problem formulations that independently minimize expected cost and downside risk. Our results indicate that we can use a surprisingly small number of scenarios to yield tight error bounds in the case of expected cost minimization, which has key practical implications. In contrast, error bounds in the case of risk minimization are significantly larger, suggesting more research is required in this area in order to achieve rigorous solutions for decision makers.
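
    The Mak, Morton, and Wood procedure referenced above pairs a statistical lower bound (the mean of independent sample-average-approximation optima) with an upper bound (out-of-sample cost of a fixed candidate solution) to bound the optimality gap. The Python sketch below computes a confidence bound on the gap from those two samples; it is a simplified variant under standard normality assumptions, and the inputs shown are synthetic stand-ins for solver output.

```python
import numpy as np
from scipy import stats

def gap_confidence_bound(batch_optima, candidate_costs, alpha=0.05):
    """Approximate (1 - alpha) upper confidence bound on the optimality gap:
    E[SAA optimum] lower-bounds the true optimum, while the candidate's
    expected cost upper-bounds it, so their difference bounds the gap."""
    lo = np.asarray(batch_optima)       # optimal value of each independent SAA batch
    up = np.asarray(candidate_costs)    # candidate's cost on fresh scenarios
    gap = up.mean() - lo.mean()
    se = np.sqrt(lo.var(ddof=1) / lo.size + up.var(ddof=1) / up.size)
    t = stats.t.ppf(1 - alpha, min(lo.size, up.size) - 1)  # conservative d.o.f.
    return max(gap, 0.0), max(gap + t * se, 0.0)

# Synthetic inputs: 30 SAA batch optima, 500 out-of-sample evaluations.
rng = np.random.default_rng(2)
gap, gap_ub = gap_confidence_bound(rng.normal(100, 2, 30), rng.normal(103, 5, 500))
print(f"point estimate of gap: {gap:.2f}, 95% upper bound: {gap_ub:.2f}")
```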

    The stochastic vehicle routing problem : a literature review, part II : solution methods

    Get PDF
    Building on the work of Gendreau et al. (Oper Res 44(3):469–477, 1996), and complementing the first part of this survey, we review the solution methods used over the past 20 years in the scientific literature on stochastic vehicle routing problems (SVRP). We describe the methods and indicate how they are used when dealing with stochastic vehicle routing problems. Keywords: vehicle routing (VRP), stochastic programming, SVRP

    The stochastic vehicle routing problem : a literature review, part I : models

    Get PDF
    Building on the work of Gendreau et al. (Eur J Oper Res 88(1):3–12, 1996), we review the past 20 years of scientific literature on stochastic vehicle routing problems. The numerous variants of the problem that have been studied in the literature are described and categorized. Keywords: vehicle routing (VRP), stochastic programming, SVRP

    On Deterministic Sketching and Streaming for Sparse Recovery and Norm Estimation

    Get PDF
    We study classic streaming and sparse recovery problems using deterministic linear sketches, including ℓ1/ℓ1 and ℓ∞/ℓ1 sparse recovery problems (the latter also being known as ℓ1-heavy hitters), norm estimation, and approximate inner product. We focus on devising a fixed matrix A ∈ R^{m×n} and a deterministic recovery/estimation procedure which work for all possible input vectors simultaneously. Our results improve upon existing work, the following being our main contributions:
    • A proof that ℓ∞/ℓ1 sparse recovery and inner product estimation are equivalent, and that incoherent matrices can be used to solve both problems. Our upper bound on the number of measurements is m = O(ε^{-2} min{log n, (log n / log(1/ε))^2}). We can also obtain fast sketching and recovery algorithms by making use of the Fast Johnson–Lindenstrauss transform. Both our running times and number of measurements improve upon previous work. We can also obtain better error guarantees than previous work, in terms of a smaller tail of the input vector.
    • A new lower bound on the number of linear measurements required to solve ℓ1/ℓ1 sparse recovery. We show Ω(k/ε^2 + k log(n/k)/ε) measurements are required to recover an x′ with ‖x − x′‖_1 ≤ (1+ε)‖x_{tail(k)}‖_1, where x_{tail(k)} is x projected onto all but its largest k coordinates in magnitude.
    • A tight bound of m = Θ(ε^{-2} log(ε^2 n)) on the number of measurements required to solve deterministic norm estimation, i.e., to recover ‖x‖_2 ± ε‖x‖_1.
    For all the problems we study, tight bounds are already known for the randomized complexity from previous work, except in the case of ℓ1/ℓ1 sparse recovery, where a nearly tight bound is known. Our work thus aims to study the deterministic complexities of these problems. We remark that some of the matrices used in our algorithms, although known to exist, are not yet explicit, in the sense that deterministic polynomial-time constructions are not yet known, although in all cases polynomial-time Monte Carlo algorithms are known.
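
    The ℓ∞/ℓ1 guarantee from an incoherent matrix can be demonstrated in a few lines: if the columns of A are unit vectors with pairwise inner products at most ε, then ⟨A_i, Ax⟩ = x_i ± ε‖x‖_1 for every coordinate i. Since, as the abstract notes, the deterministic constructions are not yet explicit, the Python sketch below uses a random matrix as a Monte Carlo stand-in; the dimensions and names are illustrative.

```python
import numpy as np

def incoherent_matrix(m, n, rng):
    """Monte Carlo stand-in for an incoherent matrix: normalized random
    columns have pairwise inner products O(sqrt(log(n)/m)) w.h.p."""
    A = rng.normal(size=(m, n))
    return A / np.linalg.norm(A, axis=0)

def point_query(A, sketch, i):
    """l_inf/l_1 point query: <A_i, Ax> equals x_i plus cross terms, each
    bounded by the coherence, so the error is at most coherence * ||x||_1."""
    return A[:, i] @ sketch

rng = np.random.default_rng(3)
n, m = 2_000, 500
A = incoherent_matrix(m, n, rng)
x = np.zeros(n)
x[[3, 7, 42]] = [5.0, -3.0, 1.0]        # a sparse input vector
y = A @ x                               # the (non-adaptive) linear sketch
est = np.array([point_query(A, y, i) for i in range(10)])
print(np.round(est, 2))                 # approximately x_0 .. x_9
```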