Agnostically Learning Halfspaces
We consider the problem of learning a halfspace in the agnostic framework of Kearns et al., where a learner is given access to a distribution on labelled examples but the labelling may be arbitrary. The learner's goal is to output a hypothesis which performs almost as well as the optimal halfspace with respect to future draws from this distribution. Although the agnostic learning framework does not explicitly deal with noise, it is closely related to learning in worst-case noise models such as malicious noise. We give the first polynomial-time algorithm for agnostically learning halfspaces with respect to several distributions, such as the uniform distribution over the n-dimensional Boolean cube {0,1}^n or the unit sphere in n-dimensional Euclidean space, as well as any log-concave distribution in n-dimensional Euclidean space. Given any constant additive factor eps>0, our algorithm runs in poly(n) time and constructs a hypothesis whose error rate is within an additive eps of the optimal halfspace. We also show this algorithm agnostically learns Boolean disjunctions in time roughly 2^{\sqrt{n}} with respect to any distribution; this is the first subexponential-time algorithm for this problem. Finally, we obtain a new algorithm for PAC learning halfspaces under the uniform distribution on the unit sphere which can tolerate the highest level of malicious noise of any algorithm to date. Our main tool is a polynomial regression algorithm which finds a polynomial that best fits a set of points with respect to a particular metric. We show that, in fact, this algorithm is an arbitrary-distribution generalization of the well-known "low-degree" Fourier algorithm of Linial, Mansour, and Nisan and has excellent noise tolerance properties when minimizing with respect to the L_1 norm.
We apply this algorithm in conjunction with a non-standard Fourier transform (which does not use the traditional parity basis) for learning halfspaces over the uniform distribution on the unit sphere; we believe this technique is of independent interest.
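As a rough illustration of the "low-degree" Fourier algorithm that the abstract above says polynomial regression generalizes, the following Python sketch estimates all Fourier coefficients up to a given degree and predicts with the sign of the resulting polynomial. It is not the L_1 regression algorithm itself. The function names, the use of {-1,+1}^n in place of {0,1}^n, and the noiseless 5-bit majority target are assumptions made for illustration only.

```python
from itertools import combinations, product
from math import prod


def low_degree_learn(samples, n, degree):
    """Estimate Fourier coefficients of degree <= `degree` from labelled samples
    (x in {-1,+1}^n, y in {-1,+1}) and return the sign of the fitted polynomial."""
    subsets = [s for d in range(degree + 1) for s in combinations(range(n), d)]
    # Empirical coefficient for S: average of y * prod_{i in S} x_i over the samples.
    coeffs = {
        s: sum(y * prod(x[i] for i in s) for x, y in samples) / len(samples)
        for s in subsets
    }

    def hypothesis(x):
        value = sum(c * prod(x[i] for i in s) for s, c in coeffs.items())
        return 1 if value >= 0 else -1

    return hypothesis


def majority(x):
    """A halfspace: sign of the sum of the coordinates (n is odd, so no ties)."""
    return 1 if sum(x) > 0 else -1


n = 5
cube = list(product([-1, 1], repeat=n))
# Hypothetical setup: use the whole cube as the sample, i.e. exact uniform expectations.
samples = [(x, majority(x)) for x in cube]
h = low_degree_learn(samples, n, degree=1)
error = sum(h(x) != majority(x) for x in cube) / len(cube)
```

In this toy case the degree-1 coefficients are all equal by symmetry, so the hypothesis coincides with majority. With noisy labels or other targets, the low-degree hypothesis would in general only approximate the best halfspace.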
Moment-Matching Polynomials
We give a new framework for proving the existence of low-degree, polynomial
approximators for Boolean functions with respect to broad classes of
non-product distributions. Our proofs use techniques related to the classical
moment problem and deviate significantly from known Fourier-based methods,
which require the underlying distribution to have some product structure.
Our main application is the first polynomial-time algorithm for agnostically
learning any function of a constant number of halfspaces with respect to any
log-concave distribution (for any constant accuracy parameter). This result was
not known even for the case of learning the intersection of two halfspaces
without noise. Additionally, we show that in the "smoothed-analysis" setting,
the above results hold with respect to distributions that have sub-exponential
tails, a property satisfied by many natural and well-studied distributions in
machine learning.
Given that our algorithms can be implemented using Support Vector Machines
(SVMs) with a polynomial kernel, these results give a rigorous theoretical
explanation as to why many kernel methods work so well in practice.
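The link between polynomial approximators and polynomial-kernel SVMs mentioned above rests on a standard identity: a polynomial kernel equals an inner product in the space of low-degree monomials. The Python sketch below checks this numerically for degree 2. The kernel form (1 + <x, y>)^2, the feature map, and the sample vectors are illustrative assumptions and are not taken from the paper.

```python
from itertools import combinations
from math import isclose, sqrt


def poly_kernel(x, y, degree=2):
    """Inhomogeneous polynomial kernel (1 + <x, y>)^degree."""
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** degree


def degree2_features(x):
    """Explicit monomial features whose inner product equals (1 + <x, y>)^2."""
    feats = [1.0]
    feats += [sqrt(2.0) * a for a in x]
    feats += [a * a for a in x]
    feats += [sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return feats


# Hypothetical sample points, chosen only to exercise the identity.
x = (0.5, -1.0, 2.0)
y = (1.5, 0.25, -0.5)
kernel_value = poly_kernel(x, y)
explicit_value = sum(a * b for a, b in zip(degree2_features(x), degree2_features(y)))
```

A linear classifier over the explicit features is therefore a degree-2 polynomial threshold function of the input, which is the kind of hypothesis a polynomial-kernel method searches over.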
Weighted Polynomial Approximations: Limits for Learning and Pseudorandomness
Polynomial approximations to boolean functions have led to many positive
results in computer science. In particular, polynomial approximations to the
sign function underlie algorithms for agnostically learning halfspaces, as well
as pseudorandom generators for halfspaces. In this work, we investigate the
limits of these techniques by proving inapproximability results for the sign
function.
Firstly, the polynomial regression algorithm of Kalai et al. (SIAM J. Comput.
2008) shows that halfspaces can be learned with respect to log-concave
distributions on in the challenging agnostic learning model. The
power of this algorithm relies on the fact that under log-concave
distributions, halfspaces can be approximated arbitrarily well by low-degree
polynomials. We ask whether this technique can be extended beyond log-concave
distributions, and establish a negative result. We show that polynomials of any
degree cannot approximate the sign function to within arbitrarily low error for
a large class of non-log-concave distributions on the real line, including
those with densities proportional to .
Secondly, we investigate the derandomization of Chernoff-type concentration
inequalities. Chernoff-type tail bounds on sums of independent random variables
have pervasive applications in theoretical computer science. Schmidt et al.
(SIAM J. Discrete Math. 1995) showed that these inequalities can be established
for sums of random variables with only -wise independence,
for a tail probability of . We show that their results are tight up to
constant factors.
These results rely on techniques from weighted approximation theory, which
studies how well functions on the real line can be approximated by polynomials
under various distributions. We believe that these techniques will have further
applications in other areas of computer science.
Comment: 22 pages
From average case complexity to improper learning complexity
The basic problem in the PAC model of computational learning theory is to
determine which hypothesis classes are efficiently learnable. There is
presently a dearth of results showing hardness of learning problems. Moreover,
the existing lower bounds fall short of the best known algorithms.
The biggest challenge in proving complexity results is to establish hardness
of improper learning (a.k.a. representation independent learning). The
difficulty in proving lower bounds for improper learning is that the standard
reductions from NP-hard problems do not seem to apply in this
context. There is essentially only one known approach to proving lower bounds
on improper learning. It was initiated in (Kearns and Valiant 89) and relies on
cryptographic assumptions.
We introduce a new technique for proving hardness of improper learning, based
on reductions from problems that are hard on average. We put forward a (fairly
strong) generalization of Feige's assumption (Feige 02) about the complexity of
refuting random constraint satisfaction problems. Combining this assumption
with our new technique yields far reaching implications. In particular,
1. Learning 's is hard.
2. Agnostically learning halfspaces with a constant approximation ratio is
hard.
3. Learning an intersection of halfspaces is hard.
Comment: 34 pages
Learning Kernel-Based Halfspaces with the Zero-One Loss
We describe and analyze a new algorithm for agnostically learning
kernel-based halfspaces with respect to the zero-one loss function.
Unlike most previous formulations which rely on surrogate convex loss functions
(e.g. hinge-loss in SVM and log-loss in logistic regression), we provide finite
time/sample guarantees with respect to the more natural zero-one loss function.
The proposed algorithm can learn kernel-based halfspaces in worst-case time
poly(exp(L log(L/epsilon))), for any distribution, where L is a
Lipschitz constant (which can be thought of as the reciprocal of the margin),
and the learned classifier is worse than the optimal halfspace by at most
epsilon. We also prove a hardness result, showing that under a certain
cryptographic assumption, no algorithm can learn kernel-based halfspaces in
time polynomial in L.
Comment: This is a full version of the paper appearing in the 23rd
International Conference on Learning Theory (COLT 2010). Compared to the
previous arXiv version, this version contains some small corrections in the
proof of Lemma 3 and in appendix
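To make the contrast between the zero-one loss and a convex surrogate concrete, the following Python sketch evaluates both losses for a fixed linear classifier on a tiny hand-made sample. The data, the weight vector, and the function names are hypothetical, and the sketch does not implement the algorithm described in the abstract above.

```python
def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def zero_one_loss(w, examples):
    """Fraction of examples whose label disagrees with sign(<w, x>)."""
    return sum(1 for x, y in examples if y * dot(w, x) <= 0) / len(examples)


def hinge_loss(w, examples):
    """Average hinge loss max(0, 1 - y <w, x>), the surrogate used by SVMs."""
    return sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in examples) / len(examples)


# Hypothetical sample: the third point is mislabelled with respect to w.
examples = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-1.0, 0.0), 1), ((-2.0, 0.0), -1)]
w = (1.0, 0.0)
zo = zero_one_loss(w, examples)
hl = hinge_loss(w, examples)
```

The hinge loss upper-bounds the zero-one loss pointwise, so minimizing it controls classification error only indirectly. A single badly misclassified point can dominate the hinge objective while counting as just one mistake under the zero-one loss.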
Agnostic Learning of Disjunctions on Symmetric Distributions
We consider the problem of approximating and learning disjunctions (or
equivalently, conjunctions) on symmetric distributions over .
Symmetric distributions are distributions whose PDF is invariant under any
permutation of the variables. We give a simple proof that for every symmetric
distribution , there exists a set of
functions , such that for every disjunction , there is function
, expressible as a linear combination of functions in , such
that -approximates in distance on or
. This directly
gives an agnostic learning algorithm for disjunctions on symmetric
distributions that runs in time . The best known
previous bound is and follows from approximation of the
more general class of halfspaces (Wimmer, 2010). We also show that there exists
a symmetric distribution , such that the minimum degree of a
polynomial that -approximates the disjunction of all variables is
distance on is . Therefore the
learning result above cannot be achieved via -regression with a
polynomial basis used in most other agnostic learning algorithms.
Our technique also gives a simple proof that for any product distribution
and every disjunction , there exists a polynomial of
degree such that -approximates in
distance on . This was first proved by Blais et al.
(2008) via a more involved argument.
Approximate resilience, monotonicity, and the complexity of agnostic learning
A function is -resilient if all its Fourier coefficients of degree at
most are zero, i.e., is uncorrelated with all low-degree parities. We
study the notion of of Boolean
functions, where we say that is -approximately -resilient if
is -close to a -valued -resilient function in
distance. We show that approximate resilience essentially characterizes the
complexity of agnostic learning of a concept class over the uniform
distribution. Roughly speaking, if all functions in a class are far from
being -resilient then can be learned agnostically in time and
conversely, if contains a function close to being -resilient then
agnostic learning of in the statistical query (SQ) framework of Kearns has
complexity of at least . This characterization is based on the
duality between approximation by degree- polynomials and
approximate -resilience that we establish. In particular, it implies that
approximation by low-degree polynomials, known to be sufficient for
agnostic learning over product distributions, is in fact necessary.
Focusing on monotone Boolean functions, we exhibit the existence of
near-optimal -approximately
-resilient monotone functions for all
. Prior to our work, it was conceivable even that every monotone
function is -far from any -resilient function. Furthermore, we
construct simple, explicit monotone functions based on and that are close to highly resilient functions. Our constructions are
based on a fairly general resilience analysis and amplification. These
structural results, together with the characterization, imply nearly optimal
lower bounds for agnostic learning of monotone juntas.
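The exact (non-approximate) notion of resilience defined at the start of the abstract above can be checked by brute force on small examples. The Python sketch below does this under stated assumptions: the degree parameter is called d here because the symbol is missing from the listing, functions are taken to map {-1,+1}^n to {-1,+1}, and the helper names are hypothetical.

```python
from itertools import combinations, product


def fourier_sum(f, n, subset):
    """Return 2^n times the Fourier coefficient of f on `subset` (an exact integer)."""
    total = 0
    for x in product([-1, 1], repeat=n):
        chi = 1
        for i in subset:
            chi *= x[i]  # parity character on the chosen coordinates
        total += f(x) * chi
    return total


def is_resilient(f, n, d):
    """True if every Fourier coefficient of degree at most d is zero."""
    return all(
        fourier_sum(f, n, s) == 0
        for k in range(d + 1)
        for s in combinations(range(n), k)
    )


def parity3(x):
    return x[0] * x[1] * x[2]


def majority3(x):
    return 1 if sum(x) > 0 else -1
```

For example, `is_resilient(parity3, 3, 2)` holds because parity on three bits has all of its Fourier weight at degree 3. Majority on three bits is balanced but correlates with each single variable, so it fails the check already for d = 1.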
The Average Sensitivity of an Intersection of Half Spaces
We prove new bounds on the average sensitivity of the indicator function of
an intersection of halfspaces. In particular, we prove the optimal bound of
. This generalizes a result of Nazarov, who proved the
analogous result in the Gaussian case, and improves upon a result of Harsha,
Klivans and Meka. Furthermore, our result has implications for the runtime
required to learn intersections of halfspaces.
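For intuition about the quantity being bounded, the following Python sketch computes average sensitivity by brute force over the Boolean cube. It uses the standard definition (the expected number of coordinates whose flip changes the function's value, under the uniform distribution) and a hypothetical intersection of two halfspaces on three variables. The helper names and the example are assumptions for illustration.

```python
from itertools import product


def halfspace(weights, threshold):
    """Indicator of the halfspace <weights, x> >= threshold."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) >= threshold


def intersection(halfspaces):
    """Indicator of the intersection of the given halfspaces."""
    return lambda x: all(h(x) for h in halfspaces)


def average_sensitivity(f, n):
    """Average over x in {-1,+1}^n of the number of i with f(x) != f(x with bit i flipped)."""
    total = 0
    for x in product([-1, 1], repeat=n):
        for i in range(n):
            flipped = x[:i] + (-x[i],) + x[i + 1:]
            total += f(x) != f(flipped)
    return total / 2 ** n


n = 3
# Hypothetical example: {x : x_0 >= 1} intersected with {x : x_1 >= 1}.
f = intersection([halfspace((1, 0, 0), 1), halfspace((0, 1, 0), 1)])
value = average_sensitivity(f, n)
```

The enumeration takes time n * 2^n, so it is useful only as a sanity check on small n, not as a way to evaluate the asymptotic bounds discussed in the abstract above.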