
    Learning with Spectral Kernels and Heavy-Tailed Data

    Two ubiquitous aspects of large-scale data analysis are that the data often have heavy-tailed properties and that diffusion-based or spectral-based methods are often used to identify and extract structure of interest. Perhaps surprisingly, popular distribution-independent methods such as those based on the VC dimension fail to provide nontrivial results for even simple learning problems such as binary classification in these two settings. In this paper, we develop distribution-dependent learning methods that can be used to provide dimension-independent sample complexity bounds for the binary classification problem in these two popular settings. In particular, we provide bounds on the sample complexity of maximum margin classifiers when the magnitude of the entries in the feature vector decays according to a power law and also when learning is performed with the so-called Diffusion Maps kernel. Both of these results rely on bounding the annealed entropy of gap-tolerant classifiers in a Hilbert space. We provide such a bound, and we demonstrate that our proof technique generalizes to the case when the margin is measured with respect to more general Banach space norms. The latter result is of potential interest in cases where modeling the relationship between data elements as a dot product in a Hilbert space is too restrictive. Comment: 21 pages. Substantially revised and extended relative to the first version.
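
    As a hedged illustration of the setting (not the paper's construction or its entropy bounds), the sketch below builds a Diffusion Maps kernel from Gaussian affinities and feeds it to a maximum margin classifier. The bandwidth eps, the diffusion time t, the number of retained eigenvectors, the toy data, and the use of scikit-learn's SVC are all assumptions made for this example.

        import numpy as np
        from sklearn.svm import SVC

        def diffusion_map_kernel(X, eps=1.0, t=2, n_components=10):
            # Gaussian affinity matrix
            sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
            W = np.exp(-sq / eps)
            d = W.sum(axis=1)
            # symmetric conjugate S = D^{-1/2} W D^{-1/2} of the Markov matrix P = D^{-1} W
            S = W / np.sqrt(np.outer(d, d))
            vals, vecs = np.linalg.eigh(S)
            order = np.argsort(vals)[::-1][1:n_components + 1]   # drop the trivial top eigenvector
            # right eigenvectors of P, scaled by eigenvalue^t: the diffusion coordinates
            psi = (vecs[:, order] / np.sqrt(d)[:, None]) * (vals[order] ** t)
            return psi @ psi.T                                   # Gram matrix of the diffusion map

        # toy data: two noisy clusters in 5 dimensions
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0.0, 1.0, (50, 5)), rng.normal(3.0, 1.0, (50, 5))])
        y = np.repeat([0, 1], 50)

        K = diffusion_map_kernel(X, eps=5.0, t=2)
        clf = SVC(kernel="precomputed").fit(K, y)    # max-margin classifier on the precomputed kernel
        print("training accuracy:", clf.score(K, y))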

    Convex Risk Minimization and Conditional Probability Estimation

    This paper proves, in very general settings, that convex risk minimization is a procedure for selecting a unique conditional probability model determined by the classification problem. Unlike most previous work, we give results that are general enough to include cases in which no minimum exists, as typically occurs, for instance, with standard boosting algorithms. Concretely, we first show that any sequence of predictors minimizing convex risk over the source distribution will converge to this unique model when the class of predictors is linear (but potentially of infinite dimension). Second, we show the same result holds for \emph{empirical} risk minimization whenever this class of predictors is finite dimensional, where the essential technical contribution is a norm-free generalization bound. Comment: To appear, COLT 201
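
    A toy numerical check of the phenomenon the abstract describes, under the assumption of a well-specified one-dimensional logistic model: minimizing the (convex) logistic risk recovers the conditional probability P(y=1|x) through the loss's link function. The data-generating parameters and the nearly unregularized LogisticRegression are choices made for this sketch, not anything taken from the paper.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n = 20000
        X = rng.normal(size=(n, 1))
        true_w, true_b = 2.0, -0.5
        p = 1.0 / (1.0 + np.exp(-(true_w * X[:, 0] + true_b)))   # true conditional probability
        y = (rng.uniform(size=n) < p).astype(int)

        # very large C ~ (almost) unregularized convex risk minimization with the logistic loss
        clf = LogisticRegression(C=1e6).fit(X, y)

        x0 = np.array([[1.0]])
        print("estimated P(y=1 | x=1):", clf.predict_proba(x0)[0, 1])
        print("true      P(y=1 | x=1):", 1.0 / (1.0 + np.exp(-(true_w * 1.0 + true_b))))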

    Bayesian Inference with Posterior Regularization and applications to Infinite Latent SVMs

    Existing Bayesian models, especially nonparametric Bayesian methods, rely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations. While priors can affect posterior distributions through Bayes' rule, imposing posterior regularization is arguably more direct and in some cases more natural and general. In this paper, we present regularized Bayesian inference (RegBayes), a novel computational framework that performs posterior inference with a regularization term on the desired post-data posterior distribution under an information-theoretic formulation. RegBayes is more flexible than the procedure that elicits expert knowledge via priors, and it covers both directed Bayesian networks and undirected Markov networks whose Bayesian formulation results in hybrid chain graph models. When the regularization is induced from a linear operator on the posterior distributions, such as the expectation operator, we present a general convex-analysis theorem to characterize the solution of RegBayes. Furthermore, we present two concrete examples of RegBayes, infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the large-margin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets, which appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics. Such results were not available until now, and they help push forward the interface between these two important subfields, which have largely been treated as isolated in the community. Comment: 49 pages, 11 figures.
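
    Roughly, and in my own shorthand rather than the paper's notation, the post-data regularized objective alluded to above can be written as follows, where p(M|D) is the ordinary Bayes posterior, U is a convex penalty on slack variables xi, and P_post(xi) is a feasible set cut out by (for example, expectation) constraints on the post-data distribution q:

        % One way to write the information-theoretic RegBayes objective the abstract
        % alludes to (notation is illustrative, not copied from the paper):
        \inf_{q(M),\,\xi}\;
            \mathrm{KL}\bigl(q(M)\,\|\,p(M \mid \mathcal{D})\bigr) \;+\; U(\xi)
        \quad\text{s.t.}\quad q(M) \in \mathcal{P}_{\mathrm{post}}(\xi)

    When U vanishes and the constraint set is unrestricted, the optimum is the usual posterior itself, which is the sense in which such a formulation strictly generalizes Bayes' rule.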

    Multi-task Learning in Vector-valued Reproducing Kernel Banach Spaces with the $\ell^1$ Norm

    Targeting sparse multi-task learning, we consider regularization models with an $\ell^1$ penalty on the coefficients of kernel functions. In order to provide a kernel method for this model, we construct a class of vector-valued reproducing kernel Banach spaces with the $\ell^1$ norm. The notion of multi-task admissible kernels is proposed so that the constructed spaces have desirable properties, including the crucial linear representer theorem. Such kernels are related to bounded Lebesgue constants of an associated kernel interpolation problem. We study the Lebesgue constant of multi-task kernels and provide examples of admissible kernels. Furthermore, we present numerical experiments for both synthetic data and real-world benchmark data to demonstrate the advantages of the proposed construction and regularization models.
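
    To convey the flavour of the regularization model (only the flavour: the vector-valued RKBS construction, the admissibility conditions, and the Lebesgue-constant analysis are not reproduced), here is a sketch in which each task's predictor is an $\ell^1$-penalized expansion in kernel functions centred at the training points. The Gaussian kernel, the penalty level alpha, and the synthetic two-task data are assumptions of the example.

        import numpy as np
        from sklearn.linear_model import Lasso

        def gaussian_kernel(A, B, gamma=0.5):
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * sq)

        rng = np.random.default_rng(1)
        X = rng.normal(size=(80, 3))
        # two related regression tasks sharing the same inputs
        Y = np.stack([np.sin(X[:, 0]) + 0.1 * rng.normal(size=80),
                      np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=80)], axis=1)

        K = gaussian_kernel(X, X)
        for t in range(Y.shape[1]):
            # l1 penalty on the kernel-expansion coefficients -> sparse representer
            model = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000).fit(K, Y[:, t])
            print(f"task {t}: {np.count_nonzero(model.coef_)} of {K.shape[1]} kernel atoms used")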

    Solving $\ell^p$-norm regularization with tensor kernels

    In this paper, we discuss how a suitable family of tensor kernels can be used to efficiently solve nonparametric extensions of $\ell^p$-regularized learning methods. Our main contribution is to propose a fast dual algorithm and to show that it allows the problem to be solved efficiently. Our results contrast with recent findings suggesting that kernel methods cannot be extended beyond the Hilbert setting. Numerical experiments confirm the effectiveness of the method.
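
    For orientation, the primal problem being extended is, schematically, the one below; for p different from 2 the natural hypothesis space is a Banach rather than a Hilbert space, which is why the standard kernel trick does not apply directly and tensor kernels are brought in. The notation is illustrative, and the paper's dual algorithm is not reproduced here.

        % schematic \ell^p-regularised primal (notation illustrative)
        \min_{w}\;\frac{1}{n}\sum_{i=1}^{n}
            \ell\bigl(\langle w,\Phi(x_i)\rangle,\,y_i\bigr)
            \;+\;\lambda\,\|w\|_{p}^{p}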

    Online Learning via Sequential Complexities

    We consider the problem of sequential prediction and provide tools to study the minimax value of the associated game. Classical statistical learning theory provides several useful complexity measures to study learning with i.i.d. data. Our proposed sequential complexities can be seen as extensions of these measures to the sequential setting. The developed theory is shown to yield precise learning guarantees for the problem of sequential prediction. In particular, we show necessary and sufficient conditions for online learnability in the setting of supervised learning. Several examples show the utility of our framework: we can establish learnability without having to exhibit an explicit online learning algorithm.
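
    For concreteness, one representative sequential complexity is the sequential Rademacher complexity, whose standard definition is recalled below; it replaces the expectation over an i.i.d. sample by a supremum over X-valued binary trees x of depth n, with epsilon_t i.i.d. Rademacher signs. This is included as background, not as a quotation from the paper.

        % sequential Rademacher complexity of a class F (standard definition)
        \mathfrak{R}^{\mathrm{seq}}_{n}(\mathcal{F})
            \;=\; \sup_{\mathbf{x}}\;\mathbb{E}_{\epsilon}
              \Bigl[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{t=1}^{n}
              \epsilon_t\, f\bigl(\mathbf{x}_t(\epsilon_{1:t-1})\bigr)\Bigr]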

    Empirical margin distributions and bounding the generalization error of combined classifiers

    We prove new probabilistic upper bounds on the generalization error of complex classifiers that are combinations of simple classifiers. Such combinations could be implemented by neural networks or by voting methods of combining the classifiers, such as boosting and bagging. The bounds are in terms of the empirical distribution of the margin of the combined classifier. They are based on the methods of the theory of Gaussian and empirical processes (comparison inequalities, symmetrization method, concentration inequalities), and they improve previous results of Bartlett (1998) on bounding the generalization error of neural networks in terms of $\ell_1$-norms of the weights of neurons and of Schapire, Freund, Bartlett and Lee (1998) on bounding the generalization error of boosting. We also obtain rates of convergence in the Lévy distance of the empirical margin distribution to the true margin distribution uniformly over the classes of classifiers and prove the optimality of these rates. Comment: 35 pages, 1 figure.
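
    The quantity being bounded is easy to compute in practice. The sketch below fits a boosted ensemble and reports its empirical margin distribution, i.e. the fraction of training points whose normalized voting margin y * (sum_t a_t h_t(x)) / (sum_t a_t) falls below a threshold theta. The dataset, the ensemble size, and the thresholds are arbitrary choices made for the illustration.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import AdaBoostClassifier

        X, y = make_classification(n_samples=400, n_features=10, random_state=0)
        y_pm = 2 * y - 1                                   # labels in {-1, +1}

        clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

        # normalized voting margin of the fitted ensemble, in [-1, 1]
        votes = np.array([2 * h.predict(X) - 1 for h in clf.estimators_])   # shape (T, n)
        a = clf.estimator_weights_[:len(clf.estimators_)]
        margins = y_pm * (a @ votes) / a.sum()

        for theta in (0.0, 0.1, 0.25):
            print(f"fraction of training points with margin <= {theta}: "
                  f"{np.mean(margins <= theta):.3f}")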

    Non-asymptotic Analysis of $\ell_1$-norm Support Vector Machines

    Support Vector Machines (SVM) with an $\ell_1$ penalty have become a standard tool in the analysis of high-dimensional classification problems with sparsity constraints in many applications, including bioinformatics and signal processing. Although SVMs have been studied intensively in the literature, this paper gives, to our knowledge, the first non-asymptotic results on the performance of the $\ell_1$-SVM in the identification of sparse classifiers. We show that a $d$-dimensional $s$-sparse classification vector can be (with high probability) well approximated from only $O(s\log(d))$ Gaussian trials. The methods used in the proof include concentration of measure and probability in Banach spaces.
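
    A hedged simulation of the statement (constants such as the oversampling factor and the SVM regularization parameter C are arbitrary): draw a Gaussian design, label it with a planted s-sparse vector, fit an $\ell_1$-penalized linear SVM on roughly s log d samples, and inspect whether the largest coefficients land on the planted support.

        import numpy as np
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(0)
        d, s = 500, 5
        n = int(10 * s * np.log(d))                     # ~ O(s log d) Gaussian samples

        w_true = np.zeros(d)
        support = rng.choice(d, size=s, replace=False)
        w_true[support] = rng.choice([-1.0, 1.0], size=s)

        X = rng.normal(size=(n, d))                     # Gaussian design
        y = np.sign(X @ w_true)

        clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=20000).fit(X, y)
        w_hat = clf.coef_.ravel()

        top = np.argsort(np.abs(w_hat))[::-1][:s]
        print("planted support:            ", np.sort(support))
        print("indices of largest |w_hat|: ", np.sort(top))
        print("cosine similarity:",
              w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true) + 1e-12))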

    On the Sample Complexity of Predictive Sparse Coding

    The goal of predictive sparse coding is to learn a representation of examples as sparse linear combinations of elements from a dictionary, such that a learned hypothesis linear in the new representation performs well on a predictive task. Predictive sparse coding algorithms have recently demonstrated impressive performance on a variety of supervised tasks, but their generalization properties have not been studied. We establish the first generalization error bounds for predictive sparse coding, covering two settings: 1) the overcomplete setting, where the number of features $k$ exceeds the original dimensionality $d$; and 2) the high or infinite-dimensional setting, where only dimension-free bounds are useful. Both learning bounds intimately depend on stability properties of the learned sparse encoder, as measured on the training sample. Consequently, we first present a fundamental stability result for the LASSO, a result characterizing the stability of the sparse codes with respect to perturbations to the dictionary. In the overcomplete setting, we present an estimation error bound that decays as $\tilde{O}(\sqrt{dk/m})$ with respect to $d$ and $k$. In the high or infinite-dimensional setting, we show a dimension-free bound that is $\tilde{O}(\sqrt{k^2 s/m})$ with respect to $k$ and $s$, where $s$ is an upper bound on the number of non-zeros in the sparse code for any training data point. Comment: Sparse Coding Stability Theorem from version 1 has been relaxed considerably using a new notion of coding margin. Old Sparse Coding Stability Theorem still in new version, now as Theorem 2. Presentation of all proofs simplified/improved considerably. Paper reorganized. Empirical analysis showing new coding margin is non-trivial on real datasets.
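
    As a stand-in for the pipeline being analyzed (a two-stage reconstructive version, not the jointly trained predictive sparse coding of the paper, and with an undercomplete dictionary for speed rather than the overcomplete regime k > d), the sketch below learns a dictionary, computes LASSO sparse codes, and trains a linear hypothesis on the codes; the dataset and all hyperparameters are assumptions of the example.

        import numpy as np
        from sklearn.datasets import load_digits
        from sklearn.decomposition import MiniBatchDictionaryLearning
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        X, y = load_digits(return_X_y=True)
        X = X / 16.0
        Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

        # stage 1: learn a dictionary and compute LASSO sparse codes
        dico = MiniBatchDictionaryLearning(n_components=50, alpha=1.0,
                                           transform_algorithm="lasso_lars",
                                           transform_alpha=0.1, random_state=0)
        Ztr = dico.fit_transform(Xtr)
        Zte = dico.transform(Xte)

        # stage 2: linear hypothesis on top of the sparse codes
        clf = LogisticRegression(max_iter=2000).fit(Ztr, ytr)
        print("test accuracy on sparse codes:", clf.score(Zte, yte))
        print("avg non-zeros per code:", np.mean(np.count_nonzero(Ztr, axis=1)))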

    A Bayes consistent 1-NN classifier

    We show that a simple modification of the 1-nearest neighbor classifier yields a strongly Bayes consistent learner. Prior to this work, the only strongly Bayes consistent proximity-based method was the k-nearest neighbor classifier, for k growing appropriately with sample size. We argue that a margin-regularized 1-NN enjoys considerable statistical and algorithmic advantages over the k-NN classifier. These include user-friendly finite-sample error bounds, as well as time- and memory-efficient learning and test-point evaluation algorithms with a principled speed-accuracy tradeoff. Encouraging empirical results are reported.
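
    The snippet below is only a crude caricature of the idea of trading margin for sample size, not the compression-based procedure the paper actually analyzes: it prunes training points whose nearest opposite-label neighbour is closer than a margin parameter gamma and then runs a plain 1-NN rule on what remains. The dataset, noise level, and gamma values are arbitrary choices for the illustration.

        import numpy as np
        from sklearn.datasets import make_moons
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_moons(n_samples=600, noise=0.35, random_state=0)
        Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

        def margin_pruned_1nn(Xtr, ytr, gamma):
            # crude pruning: drop training points whose nearest opposite-label point
            # is closer than gamma, then run plain 1-NN on what remains
            keep = [i for i in range(len(Xtr))
                    if np.min(np.linalg.norm(Xtr[ytr != ytr[i]] - Xtr[i], axis=1)) >= gamma]
            keep = np.array(keep)
            return KNeighborsClassifier(n_neighbors=1).fit(Xtr[keep], ytr[keep])

        for gamma in (0.0, 0.2, 0.4):
            clf = margin_pruned_1nn(Xtr, ytr, gamma)
            print(f"gamma={gamma}: test accuracy {clf.score(Xte, yte):.3f}")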