Algorithmic luckiness
Classical statistical learning theory studies the generalisation performance of machine
learning algorithms rather indirectly. One of the main detours is that algorithms are studied
in terms of the hypothesis class that they draw their hypotheses from. In this paper,
motivated by the luckiness framework of Shawe-Taylor et al. (1998), we study learning
algorithms more directly and in a way that allows us to exploit the serendipity of the
training sample. The main di erence to previous approaches lies in the complexity measure;
rather than covering all hypotheses in a given hypothesis space it is only necessary to cover
the functions which could have been learned using the xed learning algorithm. We show
how the resulting framework relates to the VC, luckiness and compression frameworks.
Finally, we present an application of this framework to the maximum margin algorithm
for linear classi ers which results in a bound that exploits the margin, the sparsity of the
resultant weight vector, and the degree of clustering of the training data in feature space
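The two quantities the final bound exploits, margin and weight-vector sparsity, can be illustrated with a minimal numpy sketch. The data, labels, and weight vector below are hypothetical toy values chosen only to show how each quantity is computed; a real application would use the weight vector returned by a maximum margin solver.

```python
import numpy as np

# Hypothetical linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Assume w is the weight vector of a linear classifier (e.g. found by a
# max-margin algorithm); here it is simply chosen to separate the data.
w = np.array([1.0, 1.0])

# Geometric margin: smallest signed distance of a training point to the
# separating hyperplane; a larger margin tightens margin-based bounds.
margin = np.min(y * (X @ w)) / np.linalg.norm(w)

# Sparsity of the weight vector: fraction of exactly-zero coordinates.
sparsity = np.mean(w == 0)

print(margin, sparsity)
```

Both quantities are data-dependent, which is the point of the luckiness-style analysis: they can only be measured after seeing the training sample.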
On the importance of small coordinate projections
It has been recently shown that sharp generalization bounds can be obtained when the function
class from which the algorithm chooses its hypotheses is “small” in the sense that the Rademacher
averages of this function class are small. We show that a new more general principle guarantees
good generalization bounds. The new principle requires that random coordinate projections of the
function class evaluated on random samples are “small” with high probability and that the random
class of functions allows symmetrization. As an example, we prove that this geometric property
of the function class is exactly the reason why two recently proposed frameworks, the luckiness
(Shawe-Taylor et al., 1998) and the algorithmic luckiness (Herbrich and Williamson, 2002), can be
used to establish generalization bounds.
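The notion of a function class being "small" in the Rademacher sense can be made concrete with a short Monte Carlo sketch. The finite function class below is a hypothetical stand-in (each row is one function evaluated on a fixed sample); the estimator itself is the standard empirical Rademacher average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite function class: 10 functions, each represented by its
# values on a fixed sample of m points, with values in [-1, 1].
m = 50
F = rng.uniform(-1, 1, size=(10, m))

# Empirical Rademacher average: E_sigma[ sup_f (1/m) sum_i sigma_i f(x_i) ],
# estimated by averaging over random sign vectors sigma.
n_trials = 2000
sups = []
for _ in range(n_trials):
    sigma = rng.choice([-1.0, 1.0], size=m)
    sups.append(np.max(F @ sigma) / m)
rademacher = float(np.mean(sups))

print(rademacher)
```

Small values of this quantity indicate that no function in the class can correlate well with random noise on the sample, which is what drives sharp generalization bounds.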
The Sample Complexity of Dictionary Learning
A large set of signals can sometimes be described sparsely using a
dictionary, that is, every element can be represented as a linear combination
of few elements from the dictionary. Algorithms for various signal processing
applications, including classification, denoising and signal separation, learn
a dictionary from a set of signals to be represented. Can we expect that the
representation found by such a dictionary for a previously unseen example from
the same source will have L_2 error of the same magnitude as those for the
given examples? We assume signals are generated from a fixed distribution, and
study this question from a statistical learning theory perspective.
We develop generalization bounds on the quality of the learned dictionary for
two types of constraints on the coefficient selection, as measured by the
expected L_2 error in representation when the dictionary is used. For the case
of l_1 regularized coefficient selection we provide a generalization bound of
the order of O(sqrt(np log(m lambda)/m)), where n is the dimension, p is the
number of elements in the dictionary, lambda is a bound on the l_1 norm of the
coefficient vector and m is the number of samples, which complements existing
results. For the case of representing a new signal as a combination of at most
k dictionary elements, we provide a bound of the order O(sqrt(np log(m k)/m))
under an assumption on the level of orthogonality of the dictionary (low Babel
function). We further show that this assumption holds for most dictionaries in
high dimensions in a strong probabilistic sense. Our results further yield fast
rates of order 1/m as opposed to 1/sqrt(m) using localized Rademacher
complexity. We provide similar results in a general setting using kernels with
weak smoothness requirements.
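The l_1-regularized coefficient selection analysed in the first bound can be sketched in a few lines of numpy. The dictionary, signal, and regularization weight below are hypothetical, and the solver shown is plain ISTA (proximal gradient for the lasso), used here only to illustrate the representation error that the bound controls.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 20, 30                        # signal dimension n, dictionary size p
D = rng.standard_normal((n, p))
D /= np.linalg.norm(D, axis=0)       # unit-norm dictionary atoms

# A signal that truly is a sparse combination of 3 atoms, plus small noise.
c_true = np.zeros(p)
c_true[[2, 11, 25]] = [1.0, -0.7, 0.5]
x = D @ c_true + 0.01 * rng.standard_normal(n)

# l_1-regularized coefficient selection:
#   minimize 0.5 * ||x - D c||_2^2 + lam * ||c||_1
# solved by ISTA: a gradient step followed by soft thresholding.
lam = 0.05
step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of gradient
c = np.zeros(p)
for _ in range(500):
    grad = D.T @ (D @ c - x)
    z = c - step * grad
    c = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

# The L_2 representation error on this signal; the generalization bounds in
# the abstract control the expected value of this quantity on unseen signals.
error = float(np.linalg.norm(x - D @ c))
print(error)
```

The bound then says that, with enough samples m, the error observed on the training signals is representative of the expected error on fresh signals from the same distribution.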
Learning the kernel with hyperkernels
This paper addresses the problem of choosing a kernel suitable for estimation with a support
vector machine, hence further automating machine learning. This goal is achieved by defining
a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a
statistical estimation problem similar to the problem of minimizing a regularized risk functional.
We state the equivalent representer theorem for the choice of kernels and present a semidefinite
programming formulation of the resulting optimization problem. Several recipes for constructing
hyperkernels are provided, as well as the details of common machine learning problems. Experimental
results for classification, regression and novelty detection on UCI data show the feasibility
of our approach.
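One of the hyperkernel recipes from this line of work, the harmonic hyperkernel, can be sketched directly: it is a kernel whose arguments are themselves pairs of inputs, so its reproducing kernel Hilbert space is a space of kernels. The function names, the base RBF kernel, and the parameter values below are illustrative assumptions.

```python
import numpy as np

def base_kernel(x, y, gamma=1.0):
    # Ordinary Gaussian RBF kernel between two input points.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def harmonic_hyperkernel(xp, yp, lam=0.6, gamma=1.0):
    # Harmonic hyperkernel: a kernel on the space of kernels. Its two
    # arguments xp and yp are each a *pair* of inputs, and for a base
    # kernel bounded in [0, 1] and 0 < lam < 1 it is positive definite.
    x, x2 = xp
    y, y2 = yp
    k1 = base_kernel(x, x2, gamma) * base_kernel(y, y2, gamma)
    return (1.0 - lam) / (1.0 - lam * k1)

x1, x2 = np.array([0.0]), np.array([1.0])
y1, y2 = np.array([0.5]), np.array([0.5])
val = harmonic_hyperkernel((x1, x2), (y1, y2))
print(val)
```

Expanding the learned kernel in terms of such a hyperkernel evaluated at the training pairs is what makes a representer theorem for the choice of kernel possible.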
Generalization in Deep Learning
This paper provides theoretical insights into why and how deep learning can
generalize well, despite its large capacity, complexity, possible algorithmic
instability, nonrobustness, and sharp minima, responding to an open question in
the literature. We also discuss approaches to provide non-vacuous
generalization guarantees for deep learning. Based on theoretical observations,
we propose new open problems and discuss the limitations of our results.
Comment: To appear in Mathematics of Deep Learning, Cambridge University
Press. All previous results remain unchanged.