1,271 research outputs found

    Convergence of Unregularized Online Learning Algorithms

    In this paper we study the convergence of online gradient descent algorithms in reproducing kernel Hilbert spaces (RKHSs) without regularization. We establish a sufficient condition and a necessary condition for the convergence of excess generalization errors in expectation. A sufficient condition for the almost sure convergence is also given. With high probability, we provide explicit convergence rates of the excess generalization errors for both averaged iterates and the last iterate, which in turn also imply convergence rates with probability one. To the best of our knowledge, this is the first high-probability convergence rate for the last iterate of online gradient descent algorithms without strong convexity. Without any boundedness assumptions on iterates, our results are derived by a novel use of two measures of the algorithm's one-step progress, respectively by generalization errors and by distances in RKHSs, where the variances of the involved martingales are cancelled out by the descent property of the algorithm.
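
    As a rough illustration of the kind of algorithm this abstract studies, the sketch below runs unregularized online gradient descent in an RKHS for the squared loss. The Gaussian kernel, the bandwidth, the step size 0.5 / sqrt(t), and the sine target are all assumptions chosen for the example; none of them is taken from the paper.

```python
import math
import random


def gaussian_kernel(x, z, bandwidth=0.2):
    """Gaussian (RBF) kernel on the real line; its values lie in (0, 1]."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))


def predict(centers, coeffs, x):
    """Evaluate f(x) = sum_i coeffs[i] * K(centers[i], x)."""
    return sum(a * gaussian_kernel(c, x) for a, c in zip(coeffs, centers))


def online_kernel_gd(stream, step_size):
    """Unregularized online gradient descent for the loss 0.5 * (f(x) - y)^2.

    Each example adds one kernel section:
        f <- f - eta_t * (f(x_t) - y_t) * K(x_t, .)
    Earlier coefficients are never shrunk, because there is no regularizer.
    """
    centers, coeffs = [], []
    for t, (x, y) in enumerate(stream, start=1):
        residual = predict(centers, coeffs, x) - y
        centers.append(x)
        coeffs.append(-step_size(t) * residual)
    return centers, coeffs


def mean_squared_error(centers, coeffs, grid, target):
    """Average squared error of the current function on a fixed grid."""
    return sum((predict(centers, coeffs, x) - target(x)) ** 2 for x in grid) / len(grid)


def target(x):
    """Hypothetical regression function used only for this demo."""
    return math.sin(2.0 * math.pi * x)


rng = random.Random(0)
samples = [(x, target(x)) for x in (rng.random() for _ in range(300))]
centers, coeffs = online_kernel_gd(samples, step_size=lambda t: 0.5 / math.sqrt(t))
grid = [i / 50.0 for i in range(51)]
print(mean_squared_error(centers, coeffs, grid, target))
```

    The function is stored as one coefficient per example seen, so memory grows linearly with the stream length in this naive form.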

    Scalable Kernel Methods via Doubly Stochastic Gradients

    The general perception is that kernel methods are not scalable, and neural nets are the methods of choice for nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve the problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. We show that a function produced by this procedure after $t$ iterations converges to the optimal function in the reproducing kernel Hilbert space in rate $O(1/t)$, and achieves a generalization performance of $O(1/\sqrt{t})$. This doubly stochasticity also allows us to avoid keeping the support vectors and to implement the algorithm in a small memory footprint, which is linear in number of iterations and independent of data dimension. Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show that our method can achieve competitive performance to neural nets in datasets such as 8 million handwritten digits from MNIST, 2.3 million energy materials from MolecularSpace, and 1 million photos from ImageNet. Comment: 32 pages, 22 figures
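
    The two sources of randomness described in this abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: it uses the squared loss, a Gaussian kernel approximated by random Fourier features, a step size theta / t, and a small regularization constant nu, and every function name here is hypothetical. Each feature is rebuilt from an integer seed, so only one coefficient per iteration is stored.

```python
import math
import random


def random_feature(seed, x, bandwidth=0.5):
    """Random Fourier feature for the Gaussian kernel, regenerated from its seed."""
    rng = random.Random(seed)
    omega = rng.gauss(0.0, 1.0 / bandwidth)       # frequency ~ N(0, 1 / bandwidth^2)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return math.sqrt(2.0) * math.cos(omega * x + phase)


def evaluate(alphas, x):
    """f(x) = sum_i alphas[i] * phi_i(x), where feature i is rebuilt from seed i."""
    return sum(a * random_feature(i, x) for i, a in enumerate(alphas))


def doubly_stochastic_gd(data, n_iters, theta=1.0, nu=1e-3, seed=12345):
    """Functional gradient descent with a random data point and a random feature per step."""
    rng = random.Random(seed)
    alphas = []
    for t in range(1, n_iters + 1):
        x, y = rng.choice(data)                   # randomness 1: a training point
        gamma = theta / t                         # assumed step-size schedule
        residual = evaluate(alphas, x) - y        # derivative of 0.5 * (f(x) - y)^2
        # Shrink old coefficients (effect of the regularizer nu).
        alphas = [(1.0 - gamma * nu) * a for a in alphas]
        # Randomness 2: a fresh random feature, identified by seed t - 1.
        alphas.append(-gamma * residual * random_feature(t - 1, x))
    return alphas


data = [(i / 20.0, math.sin(2.0 * math.pi * i / 20.0)) for i in range(21)]
alphas = doubly_stochastic_gd(data, n_iters=50)
print(len(alphas), evaluate(alphas, 0.25))
```

    The stored state is one number per iteration regardless of the input dimension, which mirrors the memory claim in the abstract; a short run like this one is only meant to show the update mechanics, not the reported accuracy.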

    Iteration Complexity of Randomized Primal-Dual Methods for Convex-Concave Saddle Point Problems

    In this paper we propose a class of randomized primal-dual methods to contend with large-scale saddle point problems defined by a convex-concave function $\mathcal{L}(\mathbf{x},y)\triangleq\sum_{i=1}^m f_i(x_i)+\Phi(\mathbf{x},y)-h(y)$. We analyze the convergence rate of the proposed method under the settings of mere convexity and strong convexity in $\mathbf{x}$-variable. In particular, assuming $\nabla_y\Phi(\cdot,\cdot)$ is Lipschitz and $\nabla_\mathbf{x}\Phi(\cdot,y)$ is coordinate-wise Lipschitz for any fixed $y$, the ergodic sequence generated by the algorithm achieves the convergence rate of $\mathcal{O}(m/k)$ in a suitable error metric where $m$ denotes the number of coordinates for the primal variable. Furthermore, assuming that $\mathcal{L}(\cdot,y)$ is uniformly strongly convex for any $y$, and that $\Phi(\cdot,y)$ is linear in $y$, the scheme displays convergence rate of $\mathcal{O}(m/k^2)$. We implemented the proposed algorithmic framework to solve kernel matrix learning problem, and tested it against other state-of-the-art solvers.

    Stochastic functional descent for learning Support Vector Machines

    We present a novel method for learning Support Vector Machines (SVMs) in the online setting. Our method is generally applicable in that it handles the online learning of the binary, multiclass, and structural SVMs in a unified view. The SVM learning problem consists of optimizing a convex objective function that is composed of two parts: the hinge loss and quadratic regularization. To date, the predominant family of approaches for online SVM learning has been gradient-based methods, such as Stochastic Gradient Descent (SGD). Unfortunately, we note that there are two drawbacks in such approaches: first, gradient-based methods are based on a local linear approximation to the function being optimized, but since the hinge loss is piecewise-linear and nonsmooth, this approximation can be ill-behaved. Second, existing online SVM learning approaches share the same problem formulation with batch SVM learning methods, and they all need to tune a fixed global regularization parameter by cross validation. On the one hand, global regularization is ineffective in handling local irregularities encountered in the online setting; on the other hand, even though the learning problem for a particular global regularization parameter value may be efficiently solved, repeatedly solving for a wide range of values can be costly. We intend to tackle these two problems with our approach. To address the first problem, we propose to perform implicit online update steps to optimize the hinge loss, as opposed to explicit (or gradient-based) updates that utilize subgradients to perform local linearization. Regarding the second problem, we propose to enforce local regularization that is applied to individual classifier update steps, rather than having a fixed global regularization term. 
Our theoretical analysis suggests that our classifier update steps progressively optimize the structured hinge loss, with the rate controlled by a sequence of regularization parameters; setting these parameters is analogous to setting the stepsizes in gradient-based methods. In addition, we give sufficient conditions for the algorithm's convergence. Experimentally, our online algorithm can match the best classification performance achieved by other state-of-the-art online SVM learning methods, as well as batch learning methods, after only one or two passes over the training data. More importantly, our algorithm can attain these results without doing cross validation, while all other methods must perform time-consuming cross validation to determine the optimal choice of the global regularization parameter.
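
    To make the contrast between implicit and explicit updates concrete, the sketch below shows one common closed-form implicit (proximal) step for the binary hinge loss with a per-step regularization parameter. Whether this matches the exact update used in the paper is an assumption; the function name and the parameter `lam` are hypothetical.

```python
def implicit_hinge_step(w, x, y, lam):
    """One implicit update for the binary hinge loss with local regularization.

    Solves argmin_v 0.5 * lam * ||v - w||^2 + max(0, 1 - y * <v, x>) in closed form.
    The minimizer is w + tau * y * x with tau = min(1 / lam, loss / ||x||^2),
    so the step never overshoots the point where the hinge loss reaches zero.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)                      # nothing to correct, or x carries no signal
    tau = min(1.0 / lam, loss / sq_norm)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
w = implicit_hinge_step(w, x=[1.0, 2.0], y=1, lam=0.01)   # weak local regularization
print(w)
```

    An explicit subgradient step would instead move by a fixed multiple of `y * x` and could overshoot or undershoot the margin; here `lam` plays the role that the stepsize plays in gradient-based methods, as the abstract notes.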