Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm
How many training data are needed to learn a supervised task? It is often
observed that the generalization error decreases as $n^{-\beta}$, where $n$ is
the number of training examples and $\beta$ an exponent that depends on both
data and algorithm. In this work we measure $\beta$ when applying kernel
methods to real datasets. For MNIST we find and for CIFAR10
, for both regression and classification tasks, and for
Gaussian or Laplace kernels. To rationalize the existence of non-trivial
exponents that can be independent of the specific kernel used, we study the
Teacher-Student framework for kernels. In this scheme, a Teacher generates data
according to a Gaussian random field, and a Student learns them via kernel
regression. With a simplifying assumption -- namely that the data are sampled
from a regular lattice -- we derive $\beta$ analytically for translation
invariant kernels, using previous results from the kriging literature. Provided
that the Student is not too sensitive to high frequencies, $\beta$ depends only
on the smoothness and dimension of the training data. We confirm numerically
that these predictions hold when the training points are sampled at random on a
hypersphere. Overall, the test error is found to be controlled by the magnitude
of the projection of the true function on the kernel eigenvectors whose rank is
larger than $n$. Using this idea we relate the exponent $\beta$ to an
exponent describing how the coefficients of the true function in the
eigenbasis of the kernel decay with rank. We extract this decay exponent from real data by
performing kernel PCA, leading to for MNIST and
for CIFAR10, in good agreement with observations. We argue
that these rather large exponents are possible due to the small effective
dimension of the data.
Comment: We added (i) the prediction of the exponent for real data
using kernel PCA; (ii) the generalization of our results to non-Gaussian data
from reference [11] (Bordelon et al., "Spectrum Dependent Learning Curves in
Kernel Regression and Wide Neural Networks")
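The power-law form $n^{-\beta}$ described in this abstract suggests a simple way to estimate the exponent from measured test errors: fit a straight line in log-log coordinates. The following is a minimal sketch of that idea, not the authors' procedure. The function name `fit_learning_curve_exponent`, the prefactor `c`, and the synthetic data are illustrative assumptions; the exponent 0.5 used below is made up and is not a value from the paper.

```python
import numpy as np

def fit_learning_curve_exponent(n_train, test_error):
    """Least-squares fit of log(error) = log(c) - beta * log(n).

    Returns (beta, c) for the assumed model error = c * n**(-beta).
    """
    log_n = np.log(np.asarray(n_train, dtype=float))
    log_err = np.log(np.asarray(test_error, dtype=float))
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, intercept = np.polyfit(log_n, log_err, 1)
    return -slope, np.exp(intercept)

# Synthetic, noise-free learning curve with a made-up exponent (assumption).
n_values = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
errors = 2.0 * n_values ** -0.5

beta, c = fit_learning_curve_exponent(n_values, errors)
print(f"beta = {beta:.3f}, c = {c:.3f}")
```

On real data the fit would only be meaningful over the range of $n$ where the curve is already in its asymptotic regime, so the smallest training-set sizes are usually excluded.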
A Noise-Robust Fast Sparse Bayesian Learning Model
This paper utilizes the hierarchical model structure from the Bayesian Lasso
in the Sparse Bayesian Learning process to develop a new type of probabilistic
supervised learning approach. The hierarchical model structure in this Bayesian
framework is designed such that the priors not only penalize the unnecessary
complexity of the model but are also conditioned on the variance of the
random noise in the data. The hyperparameters in the model are estimated by the
Fast Marginal Likelihood Maximization algorithm, which achieves sparsity, low
computational cost and a faster learning process. We compare our methodology with
two other popular learning models: the Relevance Vector Machine and the
Bayesian Lasso. We test our model on examples involving both simulated and
empirical data, and the results show that this approach has several performance
advantages, such as being fast, sparse and also robust to the variance in
random noise. In addition, our method gives a more stable estimate of the
variance of the random error than the other methods in the study.
Comment: 15 page
The Neural Representation Benchmark and its Evaluation on Brain and Machine
A key requirement for the development of effective learning representations
is their evaluation and comparison to representations we know to be effective.
In natural sensory domains, the community has viewed the brain as a source of
inspiration and as an implicit benchmark for success. However, it has not been
possible to test representational learning algorithms directly against
the representations contained in neural systems. Here, we propose a new
benchmark for visual representations on which we have directly tested the
neural representation in multiple visual cortical areas in macaque (utilizing
data from [Majaj et al., 2012]), and on which any computer vision algorithm
that produces a feature space can be tested. The benchmark measures the
effectiveness of the neural or machine representation by computing the
classification loss on the ordered eigendecomposition of a kernel matrix
[Montavon et al., 2011]. In our analysis we find that the neural representation
in visual area IT is superior to visual area V4. In our analysis of
representational learning algorithms, we find that three-layer models approach
the representational performance of V4 and the algorithm in [Le et al., 2012]
surpasses the performance of V4. Impressively, we find that a recent supervised
algorithm [Krizhevsky et al., 2012] achieves performance comparable to that of
IT for an intermediate level of image variation difficulty, and surpasses IT at
a higher difficulty level. We believe this result represents a major milestone:
it is the first learning algorithm we have found that exceeds our current
estimate of IT representation performance. We hope that this benchmark will
assist the community in matching the representational performance of visual
cortex and will serve as an initial rallying point for further correspondence
between representations derived in brains and machines.
Comment: The v1 version contained incorrectly computed kernel analysis curves
and KA-AUC values for V4, IT, and the HT-L3 models. They have been corrected
in this version
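The abstract above describes scoring a representation by a loss computed on the ordered eigendecomposition of a kernel matrix. The following is a rough sketch of that general idea, not the benchmark's implementation. It assumes a Gaussian kernel with a hypothetical width `sigma`, skips kernel centering, and replaces the classification loss with the squared error left after projecting the labels onto the `d` leading eigenvectors. The function name `kernel_analysis_curve` and the random data are illustrative only.

```python
import numpy as np

def kernel_analysis_curve(X, y, sigma=1.0):
    """Residual mean squared error of y after projection on the d leading
    eigenvectors of a Gaussian kernel matrix, for d = 0, 1, ..., n."""
    # Pairwise squared Euclidean distances between feature vectors.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # eigh returns eigenvalues in ascending order; reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1]
    U = eigvecs[:, order]
    # Eigenvectors are orthonormal, so the energy captured by the first d
    # components is the cumulative sum of squared projection coefficients.
    coeffs = U.T @ y
    captured = np.concatenate([[0.0], np.cumsum(coeffs ** 2)])
    return (np.sum(y ** 2) - captured) / len(y)

# Made-up features and labels standing in for a representation and a task.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(X[:, 0])
curve = kernel_analysis_curve(X, y)
```

A representation that is well matched to the task concentrates the labels on the leading eigenvectors, so its curve drops quickly with `d`; summarizing the curve by its area is one way to reduce it to a single score.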