Depth Separation for Neural Networks
Let $f:\mathbb{S}^{d-1}\times\mathbb{S}^{d-1}\to\mathbb{R}$ be a function of the form $f(\mathbf{x},\mathbf{x}')=g(\langle\mathbf{x},\mathbf{x}'\rangle)$ for $g:[-1,1]\to\mathbb{R}$. We give a simple proof that poly-size depth two neural networks with (exponentially) bounded weights cannot approximate $f$ whenever $g$ cannot be approximated by a low degree polynomial. Moreover, for many $g$'s, such as $g(x)=\sin(\pi d^3 x)$, the number of neurons must be $2^{\Omega(d\log d)}$. Furthermore, the result holds w.r.t.\ the uniform distribution on $\mathbb{S}^{d-1}\times\mathbb{S}^{d-1}$.
As many functions of the above form can be well approximated by poly-size depth three networks with poly-bounded weights, this establishes a separation between depth two and depth three networks w.r.t.\ the uniform distribution on $\mathbb{S}^{d-1}\times\mathbb{S}^{d-1}$.
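The ease at depth three can be seen via a standard construction sketch (the generic route for inner-product compositions, not necessarily the construction used in the paper): a first depth-two block approximates each product $x_i x_i'=\tfrac{1}{4}\big((x_i+x_i')^2-(x_i-x_i')^2\big)$, since squaring a bounded scalar is a smooth univariate function that poly-size depth-two networks approximate well; summing over $i$ gives $\langle\mathbf{x},\mathbf{x}'\rangle$, and one more layer approximates the univariate $g$ on $[-1,1]$:
\[
f(\mathbf{x},\mathbf{x}') \;=\; g\!\Big(\sum_{i=1}^{d}\tfrac{1}{4}\big((x_i+x_i')^2-(x_i-x_i')^2\big)\Big)\;\approx\; N_g\big(N_{\langle\cdot,\cdot\rangle}(\mathbf{x},\mathbf{x}')\big),
\]
where $N_{\langle\cdot,\cdot\rangle}$ and $N_g$ are depth-two networks whose composition has depth three; when $g$ has a polynomially bounded Lipschitz constant (e.g.\ $g(x)=\sin(\pi d^3 x)$), the size and the weights stay polynomial.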
Complexity Theoretic Limitations on Learning Halfspaces
We study the problem of agnostically learning halfspaces, which is defined by a fixed but unknown distribution $\mathcal{D}$ over $\mathbb{Q}^n\times\{\pm 1\}$. We define $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$ as the least error of a halfspace classifier for $\mathcal{D}$. A learner who can access $\mathcal{D}$ has to return a hypothesis whose error is small compared to $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$.
Using the recently developed method of the author, Linial and Shalev-Shwartz, we prove hardness-of-learning results under a natural assumption on the complexity of refuting random $K$-XOR formulas. We show that no efficient learning algorithm has non-trivial worst-case performance even under the guarantees that $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})\le\eta$ for an arbitrarily small constant $\eta>0$, and that $\mathcal{D}$ is supported on $\{\pm 1\}^n\times\{\pm 1\}$. Namely, even under these favorable conditions its error must be $\ge\frac{1}{2}-\frac{1}{n^c}$ for every $c>0$. In particular, no efficient algorithm can achieve a constant approximation ratio. Under a stronger version of the assumption (where $K$ can be poly-logarithmic in $n$), we can take $\eta=2^{-\log^{1-\nu}(n)}$ for an arbitrarily small $\nu>0$. Interestingly, this is even stronger than the best known lower bounds (Arora et al. 1993, Feldman et al. 2006, Guruswami and Raghavendra 2006) for the case that the learner is restricted to return a halfspace classifier (i.e., proper learning).
Locally Private Learning without Interaction Requires Separation
We consider learning under the constraint of local differential privacy
(LDP). For many learning problems, known efficient algorithms in this model
require many rounds of communication between the server and the clients holding
the data points. Yet multi-round protocols are prohibitively slow in practice
due to network latency and, as a result, currently deployed large-scale systems
are limited to a single round. Despite significant research interest, very
little is known about which learning problems can be solved by such
non-interactive systems. The only lower bound we are aware of is for PAC
learning an artificial class of functions with respect to a uniform
distribution (Kasiviswanathan et al. 2011).
We show that the margin complexity of a class of Boolean functions is a lower
bound on the complexity of any non-interactive LDP algorithm for
distribution-independent PAC learning of the class. In particular, the classes
of linear separators and decision lists require an exponential number of samples to learn non-interactively, even though they can be learned in polynomial time
by an interactive LDP algorithm. This gives the first example of a natural
problem that is significantly harder to solve without interaction and also
resolves an open problem of Kasiviswanathan et al. (2011). We complement this
lower bound with a new efficient learning algorithm whose complexity is
polynomial in the margin complexity of the class. Our algorithm is
non-interactive on labeled samples but still needs interactive access to
unlabeled samples. All of our results also apply to the statistical query model
and any model in which the number of bits communicated about each data point is
constrained.
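To make the non-interactive local model concrete, here is a minimal sketch of a single-round LDP protocol (my own toy illustration of randomized response, not the paper's algorithm; names and parameters are hypothetical): each client sends one privatized bit about its data point, and the server de-biases the aggregate.

    import numpy as np

    def randomized_response(bit: int, eps: float, rng) -> int:
        """Client-side: report the true bit with probability e^eps/(1+e^eps),
        otherwise flip it. One message per client, no further interaction."""
        p_keep = np.exp(eps) / (1.0 + np.exp(eps))
        return bit if rng.random() < p_keep else 1 - bit

    def estimate_mean(reports, eps: float) -> float:
        """Server-side: unbiased estimate of the true mean from the noisy reports."""
        p_keep = np.exp(eps) / (1.0 + np.exp(eps))
        return (np.mean(reports) - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)

    rng = np.random.default_rng(0)
    bits = rng.integers(0, 2, size=100_000)                  # each client's private bit
    reports = [randomized_response(int(b), 1.0, rng) for b in bits]
    print(bits.mean(), estimate_mean(reports, 1.0))

A learner that only needs a few such statistical queries can therefore run in a single round; the lower bound says that for linear separators and decision lists, any such one-shot protocol needs exponentially many samples.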
The price of bandit information in multiclass online classification
We consider two scenarios of multiclass online learning of a hypothesis class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$. In the {\em full information} scenario, the learner is exposed to instances together with their labels. In the {\em bandit} scenario, the true label is not exposed, but rather an indication whether the learner's prediction is correct or not. We show that the ratio between the error rates in the two scenarios is at most $O(k\log k)$ in the realizable case, and $\tilde{O}(\sqrt{k})$ in the agnostic case, where $k=|\mathcal{Y}|$ is the number of labels. The results are tight up to a logarithmic factor and essentially answer an open question from (Daniely et al. - Multiclass learnability and the ERM principle).
We apply these results to the class of $\gamma$-margin multiclass linear classifiers in $\mathbb{R}^d$. We show that the bandit error rate of this class is $\tilde{\Theta}(k/\gamma^2)$ in the realizable case and $\tilde{\Theta}(\sqrt{kT}/\gamma)$ in the agnostic case. This resolves an open question from (Kakade et al. - Efficient bandit algorithms for online multiclass prediction).
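The difference between the two feedback models is easiest to see as a protocol. The sketch below is only a schematic (the learner interface and names are hypothetical, not from the paper): in full-information mode the true label is revealed after each prediction, while in bandit mode only a single correctness bit is.

    import random

    def online_loop(learner, examples, bandit: bool) -> int:
        """Run one online multiclass game and count mistakes."""
        mistakes = 0
        for x, y in examples:
            y_hat = learner.predict(x)
            mistakes += int(y_hat != y)
            if bandit:
                learner.update_bandit(x, y_hat, correct=(y_hat == y))  # 1 bit of feedback
            else:
                learner.update_full(x, y)                              # true label revealed
        return mistakes

    class RandomGuesser:
        """Trivial baseline implementing the hypothetical learner interface."""
        def __init__(self, num_labels): self.k = num_labels
        def predict(self, x): return random.randrange(self.k)
        def update_full(self, x, y): pass
        def update_bandit(self, x, y_hat, correct): pass

The theorem bounds how much larger the best achievable mistake or regret count can be in the bandit loop than in the full-information loop over the same hypothesis class.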
Tight products and Expansion
In this paper we study a new product of graphs called the {\em tight product}. A graph $H$ is said to be a tight product of two (undirected multi) graphs $G_1$ and $G_2$ if $V(H)=V(G_1)\times V(G_2)$ and both projection maps $V(H)\to V(G_1)$ and $V(H)\to V(G_2)$ are covering maps. It is not a priori clear when two given graphs have a tight product (in fact, it is NP-hard to decide). We investigate the conditions under which this is possible. This perspective yields a new characterization of class-1 $(2k+1)$-regular graphs. We also obtain a new model of random $d$-regular graphs whose second eigenvalue is almost surely at most $O(d^{3/4})$. This construction resembles random graph lifts, but requires fewer random bits.
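To make the covering-map condition concrete, here is a small checker I sketched for the simple-graph case (my own toy code; it ignores multi-edges, which the definition above allows, and the names are hypothetical). A projection is a covering map if it maps each vertex's neighbourhood bijectively onto the neighbourhood of its image.

    from itertools import product

    def is_covering(H_adj, proj, G_adj) -> bool:
        """proj is a covering map H -> G if, for every vertex v of H, it maps
        the neighbours of v bijectively onto the neighbours of proj(v) in G."""
        for v, nbrs in H_adj.items():
            image = [proj(w) for w in nbrs]
            if len(set(image)) != len(image):      # not injective on the neighbourhood
                return False
            if set(image) != G_adj[proj(v)]:       # not onto proj(v)'s neighbourhood
                return False
        return True

    def is_tight_product(H_adj, G1_adj, G2_adj) -> bool:
        assert set(H_adj) == set(product(G1_adj, G2_adj))
        return (is_covering(H_adj, lambda v: v[0], G1_adj) and
                is_covering(H_adj, lambda v: v[1], G2_adj))

    K2 = {0: {1}, 1: {0}}                                    # a single edge
    H = {(0, 0): {(1, 1)}, (1, 1): {(0, 0)},
         (0, 1): {(1, 0)}, (1, 0): {(0, 1)}}                 # a perfect matching
    print(is_tight_product(H, K2, K2))                       # True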
Competitive ratio versus regret minimization: achieving the best of both worlds
We consider online algorithms under both the competitive ratio criterion and the regret minimization one. Our main goal is to build a unified methodology
that would be able to guarantee both criteria simultaneously.
For a general class of online algorithms, namely any Metrical Task System
(MTS), we show that one can simultaneously guarantee the best known competitive
ratio and a natural regret bound. For the paging problem we further show an
efficient online algorithm (polynomial in the number of pages) with this
guarantee.
To this end, we extend an existing regret minimization algorithm
(specifically, Kapralov and Panigrahy) to handle movement cost (the cost of
switching between states of the online system). We then show how to use the
extended regret minimization algorithm to combine multiple online algorithms.
Our end result is an online algorithm that can combine a "base" online
algorithm, having a guaranteed competitive ratio, with a range of online
algorithms that guarantee a small regret over any interval of time. The
combined algorithm guarantees both that the competitive ratio matches that of
the base algorithm and a low regret over any time interval.
As a by-product, we obtain an expert algorithm with a close to optimal regret bound on every time interval, even in the presence of switching costs. This result is of independent interest.
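The combining step can be pictured with an experts-style sketch. The code below is only my own schematic under a Fixed-Share-style update (it is not the Kapralov-Panigrahy algorithm nor the paper's construction, and every constant is a guess): maintain weights over the candidate online algorithms, probabilistically follow one of them, and charge both its service cost and a movement cost whenever the followed algorithm changes.

    import numpy as np

    def combine_online_algorithms(costs, move_cost, eta=0.1, alpha=0.01, seed=0):
        """costs[t][i] = cost paid by candidate algorithm i at step t;
        move_cost is paid whenever the combiner switches which one it follows."""
        rng = np.random.default_rng(seed)
        T, K = len(costs), len(costs[0])
        w = np.ones(K) / K
        followed, total = None, 0.0
        for t in range(T):
            choice = int(rng.choice(K, p=w))       # follow an algorithm drawn from w
            if followed is not None and choice != followed:
                total += move_cost                 # pay for switching state
            total += costs[t][choice]
            followed = choice
            # Hedge update on the observed costs, plus mixing so no weight collapses,
            # which is what keeps the regret small on every time interval
            w = w * np.exp(-eta * np.asarray(costs[t], dtype=float))
            w = (1 - alpha) * w / w.sum() + alpha / K
        return total

In the paper's setting one of the candidates is the "base" algorithm with a guaranteed competitive ratio, and the analysis must ensure that neither the movement cost nor the interval regret blows up.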
Complexity theoretic limitations on learning DNF's
Using the recently developed framework of [Daniely et al, 2014], we show that
under a natural assumption on the complexity of refuting random K-SAT formulas,
learning DNF formulas is hard. Furthermore, the same assumption implies the
hardness of learning intersections of halfspaces,
agnostically learning conjunctions, as well as virtually all (distribution
free) learning problems that were previously shown hard (under complexity
assumptions).
Memorizing Gaussians with no over-parameterization via gradient descent on neural networks
We prove that a single step of gradient descent over a depth two network, with $q$ hidden neurons, starting from orthogonal initialization, can memorize $\tilde{\Omega}(dq)$ independent and randomly labeled Gaussians in $\mathbb{R}^d$. The result is valid for a large class of activation functions, which includes the absolute value.
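A toy numpy sketch of the setup (my own illustration; the loss, step size and sizes below are guesses, and nothing here reproduces the theorem's constants or guarantees): draw randomly labeled Gaussians, initialize a depth-two network with orthogonal hidden weights and absolute-value activation, take one gradient step on the hidden layer, and measure training accuracy.

    import numpy as np

    def one_step_memorization(d=64, q=64, n=512, lr=1.0, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.standard_normal((n, d)) / np.sqrt(d)        # n Gaussian points in R^d
        y = rng.choice([-1.0, 1.0], size=n)                 # independent random labels

        # depth-two network f(x) = sum_j v_j * |<w_j, x>| with orthonormal rows w_j
        W = np.linalg.qr(rng.standard_normal((d, q)))[0].T  # q x d, orthonormal rows
        v = rng.choice([-1.0, 1.0], size=q) / np.sqrt(q)

        def forward(W):
            return np.abs(X @ W.T) @ v

        # one gradient step of the squared loss, taken w.r.t. the hidden layer only
        grad_pred = 2.0 * (forward(W) - y) / n
        S = np.sign(X @ W.T)                                # subgradient of |.|
        grad_W = (S * grad_pred[:, None] * v[None, :]).T @ X
        W = W - lr * grad_W

        return np.mean(np.sign(forward(W)) == y)            # training accuracy

    print(one_step_memorization())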
Optimal Learners for Multiclass Problems
The fundamental theorem of statistical learning states that for binary
classification problems, any Empirical Risk Minimization (ERM) learning rule
has close to optimal sample complexity. In this paper we seek a generic
optimal learner for multiclass prediction. We start by proving a surprising
result: a generic optimal multiclass learner must be improper, namely, it must
have the ability to output hypotheses which do not belong to the hypothesis
class, even though it knows that all the labels are generated by some
hypothesis from the class. In particular, no ERM learner is optimal. This
brings back the fundamental question of "how to learn?" We give a complete answer to this question via a new analysis of the one-inclusion multiclass learner of Rubinstein et al. (2006), showing that its sample
complexity is essentially optimal. Then, we turn to study the popular
hypothesis class of generalized linear classifiers. We derive optimal learners
that, unlike the one-inclusion algorithm, are computationally efficient.
Furthermore, we show that the sample complexity of these learners is better
than the sample complexity of the ERM rule, thus settling in the negative an open question due to Collins (2005).
Neural Networks Learning and Memorization with (almost) no Over-Parameterization
Many results in recent years established polynomial time learnability of
various models via neural network algorithms. However, unless the model is linearly separable, or the activation is a polynomial, these results require very large networks -- much larger than what is needed for the mere existence of a
good predictor.
In this paper we prove that SGD on depth two neural networks can memorize
samples, learn polynomials with bounded weights, and learn certain kernel
spaces, with near optimal network size, sample complexity, and runtime. In
particular, we show that SGD on a depth two network with $\tilde{O}(m/d)$ hidden neurons (and hence $\tilde{O}(m)$ parameters) can memorize $m$ random labeled points in $\mathbb{S}^{d-1}$.