704 research outputs found
Learning with Spectral Kernels and Heavy-Tailed Data
Two ubiquitous aspects of large-scale data analysis are that the data often
have heavy-tailed properties and that diffusion-based or spectral-based methods
are often used to identify and extract structure of interest. Perhaps
surprisingly, popular distribution-independent methods such as those based on
the VC dimension fail to provide nontrivial results for even simple learning
problems such as binary classification in these two settings. In this paper, we
develop distribution-dependent learning methods that can be used to provide
dimension-independent sample complexity bounds for the binary classification
problem in these two popular settings. In particular, we provide bounds on the
sample complexity of maximum margin classifiers when the magnitude of the
entries in the feature vector decays according to a power law and also when
learning is performed with the so-called Diffusion Maps kernel. Both of these
results rely on bounding the annealed entropy of gap-tolerant classifiers in a
Hilbert space. We provide such a bound, and we demonstrate that our proof
technique generalizes to the case when the margin is measured with respect to
more general Banach space norms. The latter result is of potential interest in
cases where modeling the relationship between data elements as a dot product in
a Hilbert space is too restrictive.
Comment: 21 pages. Substantially revised and extended relative to the first
version.
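For readers who have not encountered the Diffusion Maps kernel mentioned in this abstract, the sketch below shows one common construction (Gaussian affinities, graph normalization, a diffusion time t); the function name and the parameter choices `epsilon`, `t`, and `n_components` are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def diffusion_maps_kernel(X, epsilon=1.0, t=2, n_components=10):
    """A common Diffusion Maps construction (sketch only, not the paper's exact recipe).

    Gaussian affinities are normalized into a Markov transition matrix; its leading
    eigenvectors, scaled by eigenvalues**t, give a diffusion embedding whose Gram
    matrix is returned as the kernel. t should be a positive integer.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    W = np.exp(-sq / epsilon)                                    # Gaussian affinity matrix
    deg = W.sum(axis=1)                                          # node degrees
    A = W / np.sqrt(np.outer(deg, deg))                          # symmetric normalization
    vals, vecs = np.linalg.eigh(A)                               # spectrum of the normalized matrix
    idx = np.argsort(vals)[::-1][:n_components]                  # keep the leading eigenpairs
    lam, V = vals[idx], vecs[:, idx]
    psi = V / np.sqrt(deg)[:, None]                              # right eigenvectors of the Markov matrix
    embedding = psi * (lam ** t)                                 # diffusion map at time t
    return embedding @ embedding.T                               # kernel = Gram matrix of embeddings

# Example: a 100 x 100 kernel matrix on random points in R^5
K = diffusion_maps_kernel(np.random.randn(100, 5), epsilon=2.0, t=3)
```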
Convex Risk Minimization and Conditional Probability Estimation
This paper proves, in very general settings, that convex risk minimization is
a procedure to select a unique conditional probability model determined by the
classification problem. Unlike most previous work, we give results that are
general enough to include cases in which no minimum exists, as occurs
typically, for instance, with standard boosting algorithms. Concretely, we
first show that any sequence of predictors minimizing convex risk over the
source distribution will converge to this unique model when the class of
predictors is linear (but potentially of infinite dimension). Secondly, we show
the same result holds for \emph{empirical} risk minimization whenever this
class of predictors is finite dimensional, where the essential technical
contribution is a norm-free generalization bound.
Comment: To appear, COLT 201
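As a concrete instance of the connection between convex risk minimization and conditional probability estimation (a textbook fact included here only for orientation, not the paper's general result), the pointwise minimizer of the logistic risk recovers the conditional class probability through the logit link:

```latex
% Pointwise logistic-risk minimization with \eta(x) = P(Y = +1 \mid X = x):
\begin{aligned}
f^\star(x) &= \arg\min_{z \in \mathbb{R}} \;
  \eta(x)\,\ln\!\big(1 + e^{-z}\big) + \big(1 - \eta(x)\big)\,\ln\!\big(1 + e^{z}\big)
  \;=\; \ln\frac{\eta(x)}{1 - \eta(x)}, \\
\eta(x) &= \frac{1}{1 + e^{-f^\star(x)}} .
\end{aligned}
```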
Bayesian Inference with Posterior Regularization and applications to Infinite Latent SVMs
Existing Bayesian models, especially nonparametric Bayesian methods, rely on
specially conceived priors to incorporate domain knowledge for discovering
improved latent representations. While priors can affect posterior
distributions through Bayes' rule, imposing posterior regularization is
arguably more direct and in some cases more natural and general. In this paper,
we present regularized Bayesian inference (RegBayes), a novel computational
framework that performs posterior inference with a regularization term on the
desired post-data posterior distribution under an information-theoretic
formulation. RegBayes is more flexible than the procedure that elicits expert
knowledge via priors, and it covers both directed Bayesian networks and
undirected Markov networks whose Bayesian formulation results in hybrid chain
graph models. When the regularization is induced from a linear operator on the
posterior distributions, such as the expectation operator, we present a general
convex-analysis theorem to characterize the solution of RegBayes. Furthermore,
we present two concrete examples of RegBayes, infinite latent support vector
machines (iLSVM) and multi-task infinite latent support vector machines
(MT-iLSVM), which explore the large-margin idea in combination with a
nonparametric Bayesian model for discovering predictive latent features for
classification and multi-task learning, respectively. We present efficient
inference methods and report empirical studies on several benchmark datasets,
which appear to demonstrate the merits inherited from both large-margin
learning and Bayesian nonparametrics. Such results were not available until
now, and they contribute to pushing forward the interface between these two important
subfields, which have largely been treated as isolated in the community.
Comment: 49 pages, 11 figures
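Schematically, and with details that vary by model, the RegBayes objective can be read as a variational problem over post-data distributions: stay close (in KL) to the usual Bayesian posterior while paying a regularization cost, e.g. large-margin slack, for violating posterior constraints. The display below is a generic sketch of that shape, with notation chosen here rather than quoted from the paper.

```latex
% Schematic RegBayes problem: q(M) ranges over distributions on the model /
% latent variables M, D is the data, p(M | D) is the ordinary Bayesian
% posterior, \xi are slack variables penalized by U, and P_post(\xi) is the
% constrained set of feasible post-data posteriors.
\inf_{q(M),\, \xi} \;\;
\mathrm{KL}\big(q(M)\,\big\|\,p(M \mid \mathcal{D})\big) \;+\; U(\xi)
\qquad \text{s.t.} \qquad q(M) \in \mathcal{P}_{\mathrm{post}}(\xi).
```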
Multi-task Learning in Vector-valued Reproducing Kernel Banach Spaces with the $\ell^1$ Norm
Targeting sparse multi-task learning, we consider regularization models
with an $\ell^1$ penalty on the coefficients of kernel functions. In order to
provide a kernel method for this model, we construct a class of vector-valued
reproducing kernel Banach spaces with the $\ell^1$ norm. The notion of
multi-task admissible kernels is proposed so that the constructed spaces could
have desirable properties including the crucial linear representer theorem.
Such kernels are related to the boundedness of Lebesgue constants for an associated kernel
interpolation problem. We study the Lebesgue constant of multi-task kernels
and provide examples of admissible kernels. Furthermore, we present numerical
experiments for both synthetic data and real-world benchmark data to
demonstrate the advantages of the proposed construction and regularization
models.
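For orientation, the linear representer theorem referred to above lets the learned vector-valued function be expanded over kernel sections at the training inputs, so the $\ell^1$ penalty acts directly on the expansion coefficients; the display is a generic sketch in our own notation, not the paper's exact statement.

```latex
% K is a (multi-task) matrix-valued kernel, x_1, ..., x_n the training inputs,
% c_1, ..., c_n coefficient vectors, L a loss, and \lambda > 0 a regularization weight.
f(\cdot) \;=\; \sum_{j=1}^{n} K(\cdot, x_j)\, c_j ,
\qquad
\min_{c_1, \dots, c_n} \;\; \sum_{i=1}^{n} L\big(f(x_i), y_i\big)
\;+\; \lambda \sum_{j=1}^{n} \lVert c_j \rVert_1 .
```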
Solving $\ell^p$-norm regularization with tensor kernels
In this paper, we discuss how a suitable family of tensor kernels can be used
to efficiently solve nonparametric extensions of $\ell^p$-regularized learning
methods. Our main contribution is proposing a fast dual algorithm and showing
that it allows the problem to be solved efficiently. Our results contrast with recent
findings suggesting that kernel methods cannot be extended beyond the Hilbert setting.
Numerical experiments confirm the effectiveness of the method.
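For context, the generic $\ell^p$-regularized learning problem has the shape below (primal form only; the paper's contribution concerns a fast dual algorithm and the tensor-kernel machinery that makes the nonparametric version tractable, which is not reproduced here).

```latex
% Generic \ell^p-regularized problem over a feature map \Phi; \ell is a convex
% loss and values of p near 1 promote sparsity of w.
\min_{w} \;\; \frac{1}{n} \sum_{i=1}^{n}
\ell\big(\langle w, \Phi(x_i) \rangle,\, y_i\big)
\;+\; \frac{\lambda}{p}\, \lVert w \rVert_p^{p},
\qquad p \in (1, 2].
```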
Online Learning via Sequential Complexities
We consider the problem of sequential prediction and provide tools to study
the minimax value of the associated game. Classical statistical learning theory
provides several useful complexity measures to study learning with i.i.d. data.
Our proposed sequential complexities can be seen as extensions of these
measures to the sequential setting. The developed theory is shown to yield
precise learning guarantees for the problem of sequential prediction. In
particular, we show necessary and sufficient conditions for online learnability
in the setting of supervised learning. Several examples show the utility of our
framework: we can establish learnability without having to exhibit an explicit
online learning algorithm.
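One central example of such a sequential complexity is the sequential Rademacher complexity, which replaces i.i.d. samples with binary trees of inputs; a standard form of the definition follows.

```latex
% Sequential Rademacher complexity of a class F. The outer supremum is over
% X-valued binary trees x of depth n, \epsilon_1, ..., \epsilon_n are i.i.d.
% Rademacher signs, and x_t(\epsilon) is the node reached by the path \epsilon_{1:t-1}.
\mathfrak{R}_n(\mathcal{F})
\;=\;
\sup_{\mathbf{x}} \;
\mathbb{E}_{\epsilon}
\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n}
\epsilon_t \, f\big(\mathbf{x}_t(\epsilon)\big) \right].
```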
Empirical margin distributions and bounding the generalization error of combined classifiers
We prove new probabilistic upper bounds on generalization error of complex
classifiers that are combinations of simple classifiers. Such combinations
could be implemented by neural networks or by voting methods of combining the
classifiers, such as boosting and bagging. The bounds are in terms of the
empirical distribution of the margin of the combined classifier. They are based
on the methods of the theory of Gaussian and empirical processes (comparison
inequalities, symmetrization method, concentration inequalities) and they
improve previous results of Bartlett (1998) on bounding the generalization
error of neural networks in terms of l_1-norms of the weights of neurons and of
Schapire, Freund, Bartlett and Lee (1998) on bounding the generalization error
of boosting. We also obtain rates of convergence in the Lévy distance of the empirical
margin distribution to the true margin distribution, uniformly over the classes
of classifiers, and prove the optimality of these rates.
Comment: 35 pages, 1 figure
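The flavor of such results can be conveyed by a schematic margin bound (a generic form for orientation, not the paper's sharpest statement): the true error of the combined classifier is controlled by the fraction of training points with margin at most δ plus a complexity term that shrinks with δ and the sample size.

```latex
% Schematic margin bound: f is the combined classifier, n the sample size,
% \delta > 0 a margin parameter, R_n(F) a Rademacher-type complexity of the
% class, and the bound holds with probability at least 1 - \alpha.
P\big( y f(x) \le 0 \big)
\;\le\;
\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\, y_i f(x_i) \le \delta \,\}
\;+\; \frac{C \, R_n(\mathcal{F})}{\delta}
\;+\; \sqrt{\frac{\log(1/\alpha)}{2n}} .
```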
Non-asymptotic Analysis of $\ell^1$-norm Support Vector Machines
Support Vector Machines (SVMs) with an $\ell^1$ penalty have become a standard tool in the
analysis of high-dimensional classification problems with sparsity constraints
in many applications, including bioinformatics and signal processing. Although
SVMs have been studied intensively in the literature, this paper provides, to our
knowledge, the first non-asymptotic results on the performance of the $\ell^1$-SVM in the
identification of sparse classifiers. We show that a $d$-dimensional $s$-sparse
classification vector can be (with high probability) well approximated from
only $O(s \log d)$ Gaussian trials. The methods used in the proof include
concentration of measure and probability in Banach spaces.
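A toy experiment in the spirit of this setting (sparse true classifier, Gaussian design) can be run with an off-the-shelf $\ell^1$-penalized linear SVM; the scikit-learn call below and all parameter values (n, d, s, C) are illustrative choices of ours, not the paper's experimental protocol.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, d, s = 400, 1000, 5                         # Gaussian trials, ambient dimension, sparsity

w_true = np.zeros(d)                           # s-sparse, unit-norm classification vector
w_true[rng.choice(d, size=s, replace=False)] = 1.0 / np.sqrt(s)

X = rng.standard_normal((n, d))                # Gaussian design
y = np.sign(X @ w_true)                        # noiseless labels in {-1, +1}

# l1-penalized linear SVM (squared hinge, solved in the primal) gives a sparse estimate
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.5, max_iter=5000)
clf.fit(X, y)

w_hat = clf.coef_.ravel()
w_hat = w_hat / (np.linalg.norm(w_hat) + 1e-12)   # compare directions only
print("direction error:", np.linalg.norm(w_hat - w_true))
print("nonzeros in w_hat:", np.count_nonzero(w_hat))
```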
On the Sample Complexity of Predictive Sparse Coding
The goal of predictive sparse coding is to learn a representation of examples
as sparse linear combinations of elements from a dictionary, such that a
learned hypothesis linear in the new representation performs well on a
predictive task. Predictive sparse coding algorithms recently have demonstrated
impressive performance on a variety of supervised tasks, but their
generalization properties have not been studied. We establish the first
generalization error bounds for predictive sparse coding, covering two
settings: 1) the overcomplete setting, where the number of features k exceeds
the original dimensionality d; and 2) the high or infinite-dimensional setting,
where only dimension-free bounds are useful. Both learning bounds intimately
depend on stability properties of the learned sparse encoder, as measured on
the training sample. Consequently, we first present a fundamental stability
result for the LASSO, a result characterizing the stability of the sparse codes
with respect to perturbations to the dictionary. In the overcomplete setting,
we present an estimation error bound that decays as $\tilde{O}(\sqrt{dk/m})$ with
respect to d and k. In the high or infinite-dimensional setting, we show a
dimension-free bound that is $\tilde{O}(\sqrt{k^2 s/m})$ with respect to k and
s, where s is an upper bound on the number of non-zeros in the sparse code for
any training data point.
Comment: Sparse Coding Stability Theorem from version 1 has been relaxed
considerably using a new notion of coding margin. Old Sparse Coding Stability
Theorem still in new version, now as Theorem 2. Presentation of all proofs
simplified/improved considerably. Paper reorganized. Empirical analysis
showing the new coding margin is non-trivial on a real dataset.
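As a rough illustration of the pipeline analyzed here (dictionary, LASSO-based sparse encoder, linear hypothesis on the codes), the sketch below uses scikit-learn components; note that it learns the dictionary and the predictor separately, whereas predictive sparse coding as studied in the paper couples them, and all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
m, d, k = 300, 20, 40                  # examples, input dimension, dictionary size (overcomplete: k > d)

X = rng.standard_normal((m, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # toy labels, for illustration only

# Learn a dictionary and LASSO-based sparse codes for each example
dico = DictionaryLearning(n_components=k, alpha=0.5,
                          transform_algorithm="lasso_lars",
                          transform_alpha=0.5, random_state=0)
codes = dico.fit_transform(X)                      # shape (m, k), sparse rows

# Linear hypothesis learned on the sparse representation
clf = LogisticRegression(max_iter=1000).fit(codes, y)
print("training accuracy on codes:", clf.score(codes, y))
print("average nonzeros per code:", np.count_nonzero(codes, axis=1).mean())
```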
A Bayes consistent 1-NN classifier
We show that a simple modification of the 1-nearest neighbor classifier
yields a strongly Bayes consistent learner. Prior to this work, the only
strongly Bayes consistent proximity-based method was the k-nearest neighbor
classifier, for k growing appropriately with sample size. We will argue that a
margin-regularized 1-NN enjoys considerable statistical and algorithmic
advantages over the k-NN classifier. These include user-friendly finite-sample
error bounds, as well as time- and memory-efficient learning and test-point
evaluation algorithms with a principled speed-accuracy tradeoff. Encouraging
empirical results are reported.
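The toy sketch below illustrates the general idea of margin-regularized nearest-neighbor prediction, pruning training points that lie within a margin gamma of an opposite-label point before fitting an ordinary 1-NN; it is an illustration of the idea only, not the compression-based algorithm or the guarantees of the paper, and gamma here is chosen by hand.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

def margin_pruned_1nn(X, y, gamma):
    """Toy margin-regularized 1-NN: drop points within distance gamma of an
    opposite-label point, then fit 1-NN on the survivors.
    (Illustration only; in practice gamma would be tuned, and one should guard
    against pruning away the entire sample.)"""
    D = pairwise_distances(X)
    opposite = y[None, :] != y[:, None]          # mask of cross-label pairs
    d_opp = np.where(opposite, D, np.inf)        # distance to nearest opposite-label point
    keep = d_opp.min(axis=1) > gamma             # points with margin larger than gamma
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X[keep], y[keep])
    return clf

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(+1.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
clf = margin_pruned_1nn(X, y, gamma=0.5)
print("training accuracy:", clf.score(X, y))
```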