    Learning with SGD and Random Features

    Sketching and stochastic gradient methods are arguably the most common techniques used to derive efficient large-scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient descent with mini-batches and random features. The latter can be seen as a form of nonlinear sketching and can be used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as the number of features, the number of iterations, the step-size, and the mini-batch size, control the learning properties of the solutions. We do this by deriving optimal finite-sample bounds under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
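    As a rough illustration of the estimator studied here, the following sketch runs mini-batch SGD for least squares on random Fourier features of a Gaussian kernel, with no explicit penalty. The feature map, function names, and all hyperparameter values are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def random_fourier_features(X, W, b):
    """Map inputs to random Fourier features approximating a Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def sgd_random_features(X, y, n_features=300, n_passes=20, batch_size=32,
                        step_size=0.5, sigma=1.0, seed=0):
    """Mini-batch SGD for least squares on random features (no explicit penalty)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random feature parameters for the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2)).
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    w = np.zeros(n_features)                        # linear weights in feature space
    for t in range(n_passes * (n // batch_size)):
        idx = rng.integers(0, n, size=batch_size)   # random mini-batch
        Z = random_fourier_features(X[idx], W, b)
        grad = Z.T @ (Z @ w - y[idx]) / batch_size  # unbiased gradient of the square loss
        w -= step_size * grad
    return lambda Xnew: random_fourier_features(Xnew, W, b) @ w

# Example: regression on a noisy sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
predict = sgd_random_features(X, y)
print(np.mean((predict(X) - y) ** 2))  # training mean squared error
```

    In this sketch the number of features, the step-size, the mini-batch size, and the number of passes are the only tuning knobs, which is exactly the sense in which regularization is implicit rather than enforced by an explicit penalty.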

    Generalization Properties of Doubly Stochastic Learning Algorithms

    Doubly stochastic learning algorithms are scalable kernel methods that perform very well in practice. However, their generalization properties are not well understood, and their analysis is challenging since the corresponding learning sequence may not lie in the hypothesis space induced by the kernel. In this paper, we provide an in-depth theoretical analysis of different variants of doubly stochastic learning algorithms in the setting of nonparametric regression in a reproducing kernel Hilbert space with the square loss. In particular, we derive convergence results on the generalization error for the studied algorithms, either with or without an explicit penalty term. To the best of our knowledge, the derived results for the unregularized variants are the first of this kind, while the results for the regularized variants improve those in the literature. The novelties in our proof are a sample error bound that requires controlling the trace norm of a cumulative operator, and a refined analysis for bounding the initial error. Comment: 24 pages. To appear in Journal of Complexity.
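    The regularized/unregularized distinction is easiest to see in the iterates themselves. The snippet below is a minimal, illustrative single update for the square loss with random Fourier features (the feature choice and all names are assumptions, not taken from the paper): lam = 0.0 gives the unregularized variant, while lam > 0 adds the explicit penalty by shrinking all past coefficients at each step.

```python
import numpy as np

def doubly_stochastic_step(alpha, feat_fn, x_t, y_t, omega_t, gamma_t, lam=0.0):
    """One doubly stochastic update for the square loss.

    alpha:   list of (coefficient, random feature) pairs from past iterations
    omega_t: the random feature drawn at this step
    lam:     explicit penalty; lam = 0.0 recovers the unregularized variant
    """
    # Evaluate the current iterate f_t(x_t) with the features sampled so far.
    f_xt = sum(a * feat_fn(omega, x_t) for a, omega in alpha)
    # With an explicit penalty, all past coefficients shrink by (1 - gamma_t * lam).
    alpha = [(a * (1.0 - gamma_t * lam), omega) for a, omega in alpha]
    # Stochastic functional gradient of the square loss along the new random feature.
    alpha.append((-gamma_t * (f_xt - y_t), omega_t))
    return alpha

# Minimal usage with random Fourier features of a Gaussian kernel (1-D inputs).
rng = np.random.default_rng(0)
feat = lambda omega, x: np.sqrt(2.0) * np.cos(omega[0] * x + omega[1])
alpha = []
for t in range(1, 201):
    x_t = rng.uniform(-3, 3)
    y_t = np.sin(x_t)
    omega_t = (rng.normal(), rng.uniform(0, 2 * np.pi))
    alpha = doubly_stochastic_step(alpha, feat, x_t, y_t, omega_t,
                                   gamma_t=1.0 / np.sqrt(t), lam=1e-3)
```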

    Leveraging Low-Rank Relations Between Surrogate Tasks in Structured Prediction

    We study the interplay between surrogate methods for structured prediction and techniques from multitask learning designed to leverage relationships between surrogate outputs. We propose an efficient algorithm based on trace norm regularization which, differently from previous methods, does not require explicit knowledge of the coding/decoding functions of the surrogate framework. As a result, our algorithm can be applied to the broad class of problems in which the surrogate space is large or even infinite dimensional. We study excess risk bounds for trace norm regularized structured prediction, implying consistency and learning rates for our estimator. We also identify relevant regimes in which our approach can enjoy better generalization performance than previous methods. Numerical experiments on ranking problems indicate that enforcing low-rank relations among surrogate outputs may indeed provide a significant advantage in practice. Comment: 42 pages, 1 table.
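    The trace norm regularized surrogate step at the heart of this kind of approach can be sketched with a plain least-squares surrogate solved by proximal gradient descent (singular value soft-thresholding). The function names, the choice of surrogate loss, and the synthetic data below are assumptions for illustration, not the authors' implementation; the decoding step back to structured outputs is omitted.

```python
import numpy as np

def soft_threshold_singular_values(W, tau):
    """Proximal operator of the trace norm: shrink singular values by tau."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def trace_norm_surrogate_regression(X, Y, lam=0.1, step=None, n_iter=500):
    """Proximal gradient descent for min_W ||X W - Y||_F^2 / (2n) + lam * ||W||_*.

    X: (n, d) input features; Y: (n, T) surrogate encodings of the structured
    outputs. The trace norm penalty encourages low-rank relations between the
    T surrogate coordinates.
    """
    n, d = X.shape
    if step is None:
        step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    W = np.zeros((d, Y.shape[1]))
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y) / n
        W = soft_threshold_singular_values(W - step * grad, step * lam)
    return W

# Example: 5 surrogate outputs that really depend on a rank-2 structure.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
W_true = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 5))
Y = X @ W_true + 0.1 * rng.normal(size=(200, 5))
W_hat = trace_norm_surrogate_regression(X, Y, lam=0.05)
print(np.linalg.svd(W_hat, compute_uv=False).round(2))  # trailing values near zero
```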

    Scalable Kernel Methods via Doubly Stochastic Gradients

    The general perception is that kernel methods are not scalable, and neural nets are the methods of choice for nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve these problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. We show that a function produced by this procedure after $t$ iterations converges to the optimal function in the reproducing kernel Hilbert space at rate $O(1/t)$, and achieves a generalization performance of $O(1/\sqrt{t})$. This double stochasticity also allows us to avoid keeping the support vectors and to implement the algorithm with a small memory footprint, which is linear in the number of iterations and independent of the data dimension. Our approach can readily scale kernel methods up to regimes which are dominated by neural nets. We show that our method can achieve performance competitive with neural nets on datasets such as 8 million handwritten digits from MNIST, 2.3 million energy materials from MolecularSpace, and 1 million photos from ImageNet. Comment: 32 pages, 22 figures.
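    A minimal sketch of the unregularized doubly stochastic functional gradient for kernel least squares is given below: at each iteration one random training point and one random Fourier feature of a Gaussian kernel are drawn, and a step is taken along the resulting noisy functional gradient, so memory grows with the number of iterations rather than the data dimension. The names, step-size schedule, and the square loss are illustrative assumptions, not a reproduction of the paper's implementation.

```python
import numpy as np

def doubly_stochastic_kernel_regression(X, y, T=2000, sigma=1.0,
                                        step=lambda t: 1.0 / np.sqrt(t), seed=0):
    """Doubly stochastic functional gradient descent for kernel least squares."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(T, d))     # one random feature per iteration
    b = rng.uniform(0.0, 2.0 * np.pi, size=T)
    coef = np.zeros(T)                                 # one coefficient per sampled feature

    def f(x, upto):
        """Evaluate the current iterate using the features sampled so far."""
        if upto == 0:
            return 0.0
        phi = np.sqrt(2.0) * np.cos(W[:upto] @ x + b[:upto])
        return phi @ coef[:upto]

    for t in range(T):
        i = rng.integers(n)                            # random training point
        residual = f(X[i], t) - y[i]                   # square-loss derivative at x_i
        phi_t = np.sqrt(2.0) * np.cos(W[t] @ X[i] + b[t])
        coef[t] = -step(t + 1) * residual * phi_t      # new coefficient for feature t

    def predict(Xnew):
        Phi = np.sqrt(2.0) * np.cos(Xnew @ W.T + b)
        return Phi @ coef

    return predict

# Example on synthetic data.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.normal(size=1000)
predict = doubly_stochastic_kernel_regression(X, y)
print(np.mean((predict(X) - y) ** 2))  # training mean squared error
```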

    Density Estimation in Infinite Dimensional Exponential Families

    In this paper, we consider an infinite dimensional exponential family $\mathcal{P}$ of probability densities, which are parametrized by functions in a reproducing kernel Hilbert space $H$, and show it to be quite rich in the sense that a broad class of densities on $\mathbb{R}^d$ can be approximated arbitrarily well in Kullback-Leibler (KL) divergence by elements in $\mathcal{P}$. The main goal of the paper is to estimate an unknown density $p_0$ through an element in $\mathcal{P}$. Standard techniques like maximum likelihood estimation (MLE) or pseudo-MLE (based on the method of sieves), which are based on minimizing the KL divergence between $p_0$ and $\mathcal{P}$, do not yield practically useful estimators because of their inability to efficiently handle the log-partition function. Instead, we propose an estimator $\hat{p}_n$ based on minimizing the Fisher divergence $J(p_0\Vert p)$ between $p_0$ and $p\in\mathcal{P}$, which involves solving a simple finite-dimensional linear system. When $p_0\in\mathcal{P}$, we show that the proposed estimator is consistent, and provide a convergence rate of $n^{-\min\left\{\frac{2}{3},\frac{2\beta+1}{2\beta+2}\right\}}$ in Fisher divergence under the smoothness assumption that $\log p_0\in\mathcal{R}(C^\beta)$ for some $\beta\ge 0$, where $C$ is a certain Hilbert-Schmidt operator on $H$ and $\mathcal{R}(C^\beta)$ denotes the image of $C^\beta$. We also investigate the misspecified case of $p_0\notin\mathcal{P}$ and show that $J(p_0\Vert\hat{p}_n)\rightarrow \inf_{p\in\mathcal{P}}J(p_0\Vert p)$ as $n\rightarrow\infty$, and provide a rate for this convergence under a similar smoothness condition as above. Through numerical simulations we demonstrate that the proposed estimator outperforms the non-parametric kernel density estimator, and that the advantage of the proposed estimator grows as $d$ increases. Comment: 58 pages, 8 figures; fixed some errors and typos.
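    The reduction to a linear system is easiest to see in a finite-dimensional exponential family $p_\theta(x)\propto q_0(x)\exp(\theta^\top\phi(x))$: the empirical Fisher (score matching) divergence is quadratic in $\theta$, so the minimizer solves a linear system. The one-dimensional sketch below uses Gaussian-bump features, a Gaussian base density, and a small ridge term; these are illustrative assumptions, not the kernel construction of the paper.

```python
import numpy as np

def fit_score_matching(x, centers, h=0.5, tau=2.0, lam=1e-3):
    """Fit theta in p_theta(x) proportional to q0(x) * exp(theta^T phi(x)) by
    minimizing the empirical Fisher (score matching) divergence, which is
    quadratic in theta and hence reduces to a linear system.

    q0 = N(0, tau^2) base density; phi_j = Gaussian bumps at `centers`.
    """
    x = np.asarray(x)[:, None]                     # (n, 1)
    c = np.asarray(centers)[None, :]               # (1, m)
    phi = np.exp(-(x - c) ** 2 / (2 * h ** 2))     # (n, m) features
    dphi = -(x - c) / h ** 2 * phi                 # first derivatives of the features
    ddphi = ((x - c) ** 2 / h ** 4 - 1.0 / h ** 2) * phi   # second derivatives
    s0 = -x / tau ** 2                             # derivative of log q0
    n, m = phi.shape
    A = dphi.T @ dphi / n                          # quadratic term of the objective
    b = (ddphi + s0 * dphi).mean(axis=0)           # linear term of the objective
    return np.linalg.solve(A + lam * np.eye(m), -b)

# Example: data from a mixture of two Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.5, 0.4, 400), rng.normal(1.0, 0.6, 600)])
centers = np.linspace(-3, 3, 15)
theta = fit_score_matching(x, centers)
print(theta.round(2))   # coefficients of the fitted unnormalized log-density
```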

    Exponential convergence of testing error for stochastic gradient methods

    We consider binary classification problems with positive definite kernels and the square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast if low-noise conditions are assumed.
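    A toy illustration of this gap, in a linear rather than kernel setting and with all choices made up for the example: least-squares SGD on a hard-margin binary problem, tracking the test squared loss of the averaged iterate, which keeps decreasing slowly toward its minimum over linear predictors, against its classification error, which drops to zero almost immediately.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)

def sample(n):
    """Binary labels with a hard margin around the boundary (a strong low-noise condition)."""
    X = np.empty((0, d))
    while X.shape[0] < n:
        cand = rng.normal(size=(2 * n + 10, d))
        X = np.vstack([X, cand[np.abs(cand @ w_star) > 0.3]])
    X = X[:n]
    return X, np.sign(X @ w_star)

X_test, y_test = sample(2000)

w, w_avg = np.zeros(d), np.zeros(d)
for t in range(1, 20001):
    x_t, y_t = sample(1)
    x_t, y_t = x_t[0], y_t[0]
    w -= (1.0 / np.sqrt(t)) * (x_t @ w - y_t) * x_t     # SGD step on the square loss
    w_avg += (w - w_avg) / t                            # running average of iterates
    if t % 5000 == 0:
        sq_loss = np.mean((X_test @ w_avg - y_test) ** 2)
        err = np.mean(np.sign(X_test @ w_avg) != y_test)
        print(f"t={t:6d}  squared loss={sq_loss:.3f}  classification error={err:.4f}")
```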