On the Complexity of Random Satisfiability Problems with Planted Solutions
The problem of identifying a planted assignment given a random k-SAT
formula consistent with the assignment exhibits a large algorithmic gap: while
the planted solution becomes unique and can be identified given a formula with
O(n log n) clauses, there are distributions over clauses for which the best
known efficient algorithms require n^{k/2} clauses. We propose and study a
unified model for planted k-SAT, which captures well-known special cases. An
instance is described by a planted assignment σ and a distribution on
clauses with k literals. We define its distribution complexity as the largest
r for which the distribution is not r-wise independent (1 ≤ r ≤ k for
any distribution with a planted assignment).
Our main result is an unconditional lower bound of Ω(n^{r/2}) clauses, tight
up to logarithmic factors, for statistical (query) algorithms [Kearns 1998,
Feldman et al. 2012], matching known upper bounds, which, as we show, can be
implemented using a statistical algorithm. Since known approaches for problems
over distributions have statistical analogues (spectral, MCMC, gradient-based,
convex optimization, etc.), this lower bound provides a rigorous explanation of
the observed algorithmic gap. The proof introduces a new general technique for
the analysis of statistical query algorithms. It also points to a geometric
paring phenomenon in the space of all planted assignments.
We describe consequences of our lower bounds for Feige's refutation hypothesis
[Feige 2002] and for lower bounds on general convex programs that solve planted
k-SAT. Our bounds also extend to other planted k-CSP models and, in
particular, provide concrete evidence for the security of Goldreich's one-way
function and the associated pseudorandom generator when used with a
sufficiently hard predicate [Goldreich 2000].
Comment: Extended abstract appeared in STOC 201
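To make the unified model concrete, the following is a minimal sketch of sampling a planted k-SAT instance: a hidden assignment σ is fixed, and each clause is drawn by choosing k distinct variables and setting literal signs so that the pattern of literal truth values under σ follows a given distribution (the all-false pattern gets probability zero, so σ satisfies every clause). All names here are illustrative; this is not the paper's code, and the pattern-distribution encoding is one possible reading of the model.

```python
import random

def sample_planted_ksat(n, m, k, pattern_dist, seed=0):
    """Sample m clauses of a planted k-SAT instance on n variables.

    pattern_dist maps each k-tuple of 0/1 values (the truth values of
    the clause's literals under the hidden assignment sigma) to a
    probability weight. Excluding the all-zero pattern guarantees that
    sigma satisfies every sampled clause. (Illustrative sketch only.)
    """
    rng = random.Random(seed)
    sigma = [rng.choice([True, False]) for _ in range(n)]
    patterns, weights = zip(*pattern_dist.items())
    clauses = []
    for _ in range(m):
        vars_ = rng.sample(range(n), k)          # k distinct variables
        pat = rng.choices(patterns, weights=weights)[0]
        # pick the sign of each literal so it evaluates to pat[i] under sigma
        clause = [(v + 1) if (sigma[v] == bool(want)) else -(v + 1)
                  for v, want in zip(vars_, pat)]
        clauses.append(clause)
    return sigma, clauses

# Uniform weight on all satisfying patterns: a "noiseless" planted 3-SAT model.
sat_patterns = {(a, b, c): 1.0
                for a in (0, 1) for b in (0, 1) for c in (0, 1)
                if a or b or c}
sigma, clauses = sample_planted_ksat(n=50, m=200, k=3,
                                     pattern_dist=sat_patterns)
```

Biasing `pattern_dist` (e.g., favoring patterns with few true literals) yields the harder distributions whose lowest non-independent moment r governs the n^{r/2} bound.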
Efficient Algorithms and Lower Bounds for Robust Linear Regression
We study the problem of high-dimensional linear regression in a robust model
where an ε-fraction of the samples can be adversarially corrupted. We
focus on the fundamental setting where the covariates of the uncorrupted
samples are drawn from a Gaussian distribution N(0, Σ) on
R^d. We give nearly tight upper bounds and computational lower
bounds for this problem. Specifically, our main contributions are as follows:
For the case that the covariance matrix is known to be the identity, we give
a sample near-optimal and computationally efficient algorithm that outputs a
candidate hypothesis vector β̂ which approximates the unknown
regression vector β within ℓ2-norm O(ε log(1/ε) σ), where σ is the standard
deviation of the random observation noise. An error of Ω(εσ) is
information-theoretically necessary, even with infinite sample size. Prior
work gave an algorithm for this problem with sample complexity Ω̃(d^2/ε^2)
whose error guarantee scales with the ℓ2-norm of β.
For the case of unknown covariance, we show that we can efficiently achieve
the same error guarantee as in the known covariance case using an additional
Õ(d^2/ε^2) unlabeled examples. On the other hand, an error of
O(εσ) can be information-theoretically attained with O(d/ε^2)
samples. We prove a Statistical Query (SQ) lower bound
providing evidence that this quadratic tradeoff in the sample size is inherent.
More specifically, we show that any polynomial time SQ learning algorithm for
robust linear regression (in Huber's contamination model) with estimation
complexity O(d^{2-c}), where c > 0 is an arbitrarily small constant, must
incur an error of Ω(√ε σ).
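As a point of contrast with the paper's guarantees, here is a naive robust baseline: alternately fit ordinary least squares and discard the samples with the largest residuals. This toy trimming scheme is not the authors' algorithm (which uses a more careful filtering of corrupted samples), but it illustrates the identity-covariance contamination setting; all names and constants below are illustrative.

```python
import numpy as np

def trimmed_least_squares(X, y, eps, iters=10):
    """Toy robust baseline: repeatedly fit OLS on the kept samples,
    then keep only the samples with the smallest absolute residuals.
    Discards a 2*eps fraction to cover the eps-fraction of outliers."""
    n = len(y)
    keep = np.ones(n, dtype=bool)
    n_drop = int(2 * eps * n)
    for _ in range(iters):
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = np.abs(y - X @ beta)
        # threshold at the (n - n_drop)-th smallest residual
        thresh = np.partition(resid, n - n_drop - 1)[n - n_drop - 1]
        keep = resid <= thresh
    return beta

# Synthetic contaminated instance: identity-covariance Gaussian covariates,
# an eps-fraction of responses shifted by a large offset.
rng = np.random.default_rng(0)
n, d, eps, noise_sd = 2000, 5, 0.05, 0.1
beta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ beta_true + noise_sd * rng.normal(size=n)
bad = rng.choice(n, int(eps * n), replace=False)
y[bad] += 10.0
beta_hat = trimmed_least_squares(X, y, eps)
```

Against gross corruptions like the +10 shift above, the initial OLS fit already makes the outliers' residuals stand out, so trimming recovers β accurately; against cleverer corruptions this baseline degrades, which is exactly the gap the paper's filtering algorithm addresses.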
Statistical Query Algorithms for Mean Vector Estimation and Stochastic Convex Optimization
Stochastic convex optimization, in which the objective is the expectation of a random convex function, is an important and widely used method with numerous applications in machine learning, statistics, operations research, and other areas. We study the complexity of stochastic convex optimization given only statistical query (SQ) access to the objective function. We show that well-known and popular first-order iterative methods can be implemented using only statistical queries. For many cases of interest, we derive nearly matching upper and lower bounds on the estimation (sample) complexity, including linear optimization in the most general setting. We then present several consequences for machine learning, differential privacy, and proving concrete lower bounds on the power of convex optimization–based methods. The key ingredient of our work is SQ algorithms and lower bounds for estimating the mean vector of a distribution over vectors supported on a convex body in R^d. This natural problem has not been previously studied, and we show that our solutions can be used to obtain substantially improved SQ versions of Perceptron and other online algorithms for learning halfspaces.
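The SQ access model above can be sketched as follows: an algorithm never sees samples, only answers to bounded queries φ that are correct up to a tolerance τ. The sketch below simulates such an oracle with an empirical mean plus noise of magnitude at most τ, and uses one query per coordinate to estimate a mean vector; this is an illustrative reading of the model, not the paper's construction.

```python
import numpy as np

def make_sq_oracle(samples, tau, rng):
    """Simulate a statistical query oracle: for a query
    phi: R^d -> [-1, 1], return E[phi(x)] up to additive tolerance tau.
    (Simulated here by an empirical mean plus uniform noise in
    [-tau, tau]; a real SQ oracle may answer adversarially within tau.)"""
    def oracle(phi):
        est = np.mean([phi(x) for x in samples])
        return est + rng.uniform(-tau, tau)
    return oracle

def sq_mean_estimate(oracle, d):
    """Estimate the mean of a distribution on [-1, 1]^d with one
    statistical query per coordinate: phi_i(x) = x[i]."""
    return np.array([oracle(lambda x, i=i: x[i]) for i in range(d)])

# Demo on a distribution supported (after clipping) on [-1, 1]^4.
rng = np.random.default_rng(1)
d, tau = 4, 0.01
mu = np.array([0.5, -0.2, 0.0, 0.3])
samples = np.clip(rng.normal(mu, 0.1, size=(5000, d)), -1, 1)
mu_hat = sq_mean_estimate(make_sq_oracle(samples, tau, rng), d)
```

Coordinate-wise queries give ℓ∞ error about τ; the paper's contribution includes doing better than this naive scheme for mean estimation over general convex bodies, which is what powers the SQ implementations of first-order methods.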