On the Sample Complexity of Subspace Learning
A large number of algorithms in machine learning, from principal component
analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral
embedding and support estimation methods, rely on estimating a linear subspace
from samples. In this paper we introduce a general formulation of this problem
and derive novel learning error estimates. Our results rely on natural
assumptions on the spectral properties of the covariance operator associated to
the data distribution, and hold for a wide class of metrics between
subspaces. As special cases, we discuss sharp error estimates for the
reconstruction properties of PCA and spectral support estimation. Key to our
analysis is an operator theoretic approach that has broad applicability to
spectral learning methods.
Comment: Extended version of conference paper.
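As a purely illustrative companion to this setting (not the authors' method), the minimal sketch below estimates a k-dimensional principal subspace from samples and evaluates its reconstruction error on held-out data; the toy data model, the dimensions n, d, k, and the noise level are assumptions made only for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed toy model: samples lie near a k-dimensional subspace, plus small isotropic noise.
    n, d, k = 500, 20, 3
    basis = np.linalg.qr(rng.standard_normal((d, k)))[0]          # true orthonormal basis
    X_train = rng.standard_normal((n, k)) @ basis.T + 0.1 * rng.standard_normal((n, d))
    X_test  = rng.standard_normal((n, k)) @ basis.T + 0.1 * rng.standard_normal((n, d))

    # PCA: estimate the subspace spanned by the top-k eigenvectors of the empirical covariance.
    cov = X_train.T @ X_train / n
    _, eigvecs = np.linalg.eigh(cov)
    P_hat = eigvecs[:, -k:]                                        # estimated orthonormal basis

    # Held-out reconstruction error, E ||x - P P^T x||^2, one of the quantities such bounds control.
    resid = X_test - X_test @ P_hat @ P_hat.T
    print("held-out reconstruction error:", np.mean(np.sum(resid ** 2, axis=1)))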
Estimates of the Approximation Error Using Rademacher Complexity: Learning Vector-Valued Functions
For certain families of multivariable vector-valued functions to be approximated, the accuracy of approximation schemes made up of linear combinations of computational units containing adjustable parameters is investigated. Upper bounds on the approximation error are derived that depend on the Rademacher complexities of the families. The estimates exploit possible relationships among the components of the multivariable vector-valued functions. All such components are approximated simultaneously, in such a way as to use, for a desired approximation accuracy, fewer computational units than are required by componentwise approximation. An application to -stage optimization problems is discussed.
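For reference, the bounds above are stated in terms of Rademacher complexity. The standard empirical definition for a class \(\mathcal{F}\) of real-valued functions on a fixed sample \(x_1, \dots, x_m\) (the paper works with a vector-valued generalization, whose exact form may differ) is

    \[
      \hat{\mathcal{R}}_m(\mathcal{F})
        = \mathbb{E}_{\sigma}\Big[\,\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, f(x_i)\Big],
      \qquad \sigma_1, \dots, \sigma_m \ \text{i.i.d. uniform on } \{-1, +1\}.
    \]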
On information plus noise kernel random matrices
Kernel random matrices have attracted a lot of interest in recent years, from
both practical and theoretical standpoints. Most of the theoretical work so far
has focused on the case where the data is sampled from a low-dimensional
structure. Very recently, the first results concerning kernel random matrices
with high-dimensional input data were obtained, in a setting where the data was
sampled from a genuinely high-dimensional structure---similar to standard
assumptions in random matrix theory. In this paper, we consider the case where
the data is of the type "information plus noise." In other words, each
observation is the sum of two independent elements: one sampled from a
"low-dimensional" structure, the signal part of the data, the other being
high-dimensional noise, normalized to not overwhelm but still affect the
signal. We consider two types of noise, spherical and elliptical. In the
spherical setting, we show that the spectral properties of kernel random
matrices can be understood from a new kernel matrix, computed only from the
signal part of the data, but using (in general) a slightly different kernel.
The Gaussian kernel has some special properties in this setting. The elliptical
setting, which is important from a robustness standpoint, is less prone to easy
interpretation.
Comment: Published at http://dx.doi.org/10.1214/10-AOS801 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
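To make the "information plus noise" model concrete, the sketch below (a toy illustration only; the dimensions, the spherical-noise scaling, the Gaussian kernel, and its bandwidth are assumptions for the example) builds the kernel matrix of signal-plus-noise observations and compares its leading eigenvalues with those of a kernel matrix computed from the signal part alone; the paper's result says the former behaves like a signal-only kernel matrix with, in general, a slightly different kernel.

    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(1)

    n, d = 300, 400                                           # noise dimension comparable to sample size
    signal = rng.standard_normal((n, 2))                      # low-dimensional signal part
    noise = rng.standard_normal((n, d)) / np.sqrt(d)          # spherical noise with O(1) norm per row
    X = np.hstack([signal, np.zeros((n, d - 2))]) + noise     # observation = signal + noise

    def gaussian_kernel(A, B, bandwidth=1.0):
        return np.exp(-cdist(A, B, "sqeuclidean") / (2 * bandwidth ** 2))

    K_obs = gaussian_kernel(X, X)            # kernel matrix of the noisy observations
    K_sig = gaussian_kernel(signal, signal)  # kernel matrix of the signal part only

    # Compare the top eigenvalues; the theory predicts agreement up to a modified effective kernel.
    top5 = lambda K: np.sort(np.linalg.eigvalsh(K))[-5:][::-1]
    print("top eigenvalues, signal + noise:", top5(K_obs))
    print("top eigenvalues, signal only   :", top5(K_sig))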
The Sample Complexity of Dictionary Learning
A large set of signals can sometimes be described sparsely using a
dictionary, that is, every element can be represented as a linear combination
of few elements from the dictionary. Algorithms for various signal processing
applications, including classification, denoising and signal separation, learn
a dictionary from a set of signals to be represented. Can we expect that the
representation found by such a dictionary for a previously unseen example from
the same source will have L_2 error of the same magnitude as that for the
given examples? We assume signals are generated from a fixed distribution, and
study this question from a statistical learning theory perspective.
We develop generalization bounds on the quality of the learned dictionary for
two types of constraints on the coefficient selection, as measured by the
expected L_2 error in representation when the dictionary is used. For the case
of l_1 regularized coefficient selection we provide a generalization bound of
the order of O(sqrt(np log(m lambda)/m)), where n is the dimension, p is the
number of elements in the dictionary, lambda is a bound on the l_1 norm of the
coefficient vector and m is the number of samples, which complements existing
results. For the case of representing a new signal as a combination of at most
k dictionary elements, we provide a bound of the order O(sqrt(np log(m k)/m))
under an assumption on the level of orthogonality of the dictionary (low Babel
function). We further show that this assumption holds for most dictionaries in
high dimensions in a strong probabilistic sense. Our results further yield fast
rates of order 1/m as opposed to 1/sqrt(m) using localized Rademacher
complexity. We provide similar results in a general setting using kernels with
weak smoothness requirements.
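A minimal sketch of the quantity these bounds control, using scikit-learn for the l_1-regularized coefficient selection (the toy sparse data model, the dimensions, and the regularization value alpha are assumptions for illustration, not the paper's setup): learn a dictionary on a training sample, then compare the mean squared L_2 representation error on the training signals with that on fresh signals from the same distribution.

    import numpy as np
    from sklearn.decomposition import DictionaryLearning, sparse_encode

    rng = np.random.default_rng(2)

    # Assumed toy source: each signal is a sparse combination of hidden atoms plus small noise.
    n_dim, p_atoms, m_train, m_test, sparsity = 20, 30, 400, 400, 3
    D_true = rng.standard_normal((p_atoms, n_dim))
    D_true /= np.linalg.norm(D_true, axis=1, keepdims=True)

    def sample(m):
        codes = np.zeros((m, p_atoms))
        for row in codes:
            row[rng.choice(p_atoms, sparsity, replace=False)] = rng.standard_normal(sparsity)
        return codes @ D_true + 0.01 * rng.standard_normal((m, n_dim))

    X_train, X_test = sample(m_train), sample(m_test)

    # Learn a dictionary with l_1-regularized (lasso) coefficient selection.
    dl = DictionaryLearning(n_components=p_atoms, alpha=0.1,
                            transform_algorithm="lasso_lars", transform_alpha=0.1,
                            random_state=0)
    codes_train = dl.fit_transform(X_train)
    codes_test = sparse_encode(X_test, dl.components_, algorithm="lasso_lars", alpha=0.1)

    # Mean squared L_2 representation error on seen vs. unseen signals.
    err = lambda X, C: np.mean(np.linalg.norm(X - C @ dl.components_, axis=1) ** 2)
    print("train representation error:", err(X_train, codes_train))
    print("test  representation error:", err(X_test, codes_test))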