Convergence of Unregularized Online Learning Algorithms

Authors: Guo Zheng-Chu, Lei Yunwen, Shi Lei
Publication date: 09/08/2017

In this paper we study the convergence of online gradient descent algorithms in reproducing kernel Hilbert spaces (RKHSs) without regularization. We establish a sufficient condition and a necessary condition for the convergence of excess generalization errors in expectation. A sufficient condition for the almost sure convergence is also given. With high probability, we provide explicit convergence rates of the excess generalization errors for both averaged iterates and the last iterate, which in turn also imply convergence rates with probability one. To the best of our knowledge, this is the first high-probability convergence rate for the last iterate of online gradient descent algorithms without strong convexity. Without any boundedness assumptions on iterates, our results are derived by a novel use of two measures of the algorithm's one-step progress, respectively by generalization errors and by distances in RKHSs, where the variances of the involved martingales are cancelled out by the descent property of the algorithm.
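The update studied in this abstract can be illustrated with a minimal sketch, assuming the squared loss, a Gaussian kernel on the real line, and a step size of the form 0.5/sqrt(t). None of these choices come from the abstract, and the function names are hypothetical. The sketch keeps both the last iterate and a uniform average of the iterates, the two quantities for which the abstract reports rates.

```python
import math
import random

def gaussian_kernel(x, z, bandwidth=1.0):
    """Gaussian kernel on the real line (the bandwidth is an illustrative choice)."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))

def evaluate(centers, coeffs, x, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_i coeffs[i] * K(centers[i], x)."""
    return sum(a * kernel(c, x) for c, a in zip(centers, coeffs))

def online_kernel_gd(stream, step_size, kernel=gaussian_kernel):
    """Unregularized online gradient descent for the squared loss in an RKHS.

    Update: f_{t+1} = f_t - eta_t * (f_t(x_t) - y_t) * K(x_t, .), with f_1 = 0.
    Returns the centers, the coefficients of the last iterate, and the
    coefficients of the uniform average of the iterates f_2, ..., f_{T+1}.
    """
    centers, coeffs, coeff_sums = [], [], []
    for t, (x, y) in enumerate(stream, start=1):
        residual = evaluate(centers, coeffs, x, kernel) - y
        centers.append(x)
        # No shrinkage of earlier coefficients: there is no regularization term.
        coeffs.append(-step_size(t) * residual)
        coeff_sums.append(0.0)
        for i, a in enumerate(coeffs):
            coeff_sums[i] += a
    n = len(centers)
    averaged = [s / n for s in coeff_sums] if n else []
    return centers, coeffs, averaged

def mean_squared_error(centers, coeffs, grid, target):
    """Average squared deviation from a target function over a grid of points."""
    return sum((evaluate(centers, coeffs, x) - target(x)) ** 2 for x in grid) / len(grid)

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical noiseless data stream: y = sin(x) with x uniform on [-3, 3].
    stream = [(x, math.sin(x)) for x in (rng.uniform(-3.0, 3.0) for _ in range(300))]
    centers, last, avg = online_kernel_gd(stream, lambda t: 0.5 / math.sqrt(t))
    grid = [-3.0 + 0.1 * i for i in range(61)]
    print("last iterate MSE:", mean_squared_error(centers, last, grid, math.sin))
    print("averaged iterate MSE:", mean_squared_error(centers, avg, grid, math.sin))
```

Each step appends one kernel center and leaves earlier coefficients untouched, which is what distinguishes the unregularized update from a regularized one that would shrink them.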

arXiv.org e-Print Archive

University of Birmingham Research Portal

Scalable Kernel Methods via Doubly Stochastic Gradients

Authors: Balcan Maria-Florina, Dai Bo, He Niao, Liang Yingyu, Raj Anant, Song Le, Xie Bo
Publication date: 10/09/2015

The general perception is that kernel methods are not scalable, and neural nets are the methods of choice for nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve the problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. We show that a function produced by this procedure after $t$ iterations converges to the optimal function in the reproducing kernel Hilbert space at rate $O(1/t)$, and achieves a generalization performance of $O(1/\sqrt{t})$. This double stochasticity also allows us to avoid keeping the support vectors and to implement the algorithm in a small memory footprint, which is linear in the number of iterations and independent of the data dimension. Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show that our method can achieve competitive performance to neural nets in datasets such as 8 million handwritten digits from MNIST, 2.3 million energy materials from MolecularSpace, and 1 million photos from ImageNet. Comment: 32 pages, 22 figures
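The two sources of randomness described in this abstract can be sketched as follows. This is an illustrative sketch only, assuming the squared loss, a Gaussian kernel approximated by random Fourier features, a step size of the form step/t, and a small ridge parameter `reg`; the function names and all parameter values are hypothetical. Each random feature is regenerated from its integer seed whenever it is needed, so only one coefficient per iteration is stored and no training point or frequency vector is kept.

```python
import math
import random

def random_feature(seed, x, bandwidth=1.0):
    """Regenerate the seed-th random Fourier feature and evaluate it at x.

    x is a list of floats. The frequency is drawn from N(0, bandwidth^-2 I)
    and the phase from U[0, 2*pi], so that the expectation of
    feature(x) * feature(z) is the Gaussian kernel exp(-|x - z|^2 / (2 bandwidth^2)).
    """
    rng = random.Random(seed)
    omega = [rng.gauss(0.0, 1.0 / bandwidth) for _ in x]
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return math.sqrt(2.0) * math.cos(sum(w * v for w, v in zip(omega, x)) + phase)

def predict(alphas, x, bandwidth=1.0):
    """f(x) = sum_i alphas[i] * feature_i(x); features are rebuilt from their seeds."""
    return sum(a * random_feature(i, x, bandwidth) for i, a in enumerate(alphas))

def doubly_stochastic_gd(data, n_iters, step=1.0, reg=1e-3, seed=0, bandwidth=1.0):
    """Functional gradient descent with a random data point and a random feature per step.

    data is a list of (x, y) pairs. Returns the list of coefficients, one per iteration.
    """
    # Offset the seed so the data sampler does not share a stream with feature 0.
    data_rng = random.Random(10**9 + seed)
    alphas = []
    for t in range(1, n_iters + 1):
        x, y = data_rng.choice(data)                   # randomness 1: training point
        gamma = step / t
        residual = predict(alphas, x, bandwidth) - y   # derivative of 0.5 * (f(x) - y)^2
        alphas = [(1.0 - gamma * reg) * a for a in alphas]  # shrinkage from the ridge term
        # randomness 2: a fresh random feature, identified only by its seed t - 1
        alphas.append(-gamma * residual * random_feature(t - 1, x, bandwidth))
    return alphas

if __name__ == "__main__":
    # Hypothetical one-dimensional regression data.
    data = [([0.1 * i], math.sin(0.1 * i)) for i in range(-20, 21)]
    alphas = doubly_stochastic_gd(data, 100)
    print("stored coefficients:", len(alphas))
    print("prediction at 1.0:", predict(alphas, [1.0]))
```

The memory cost is one float per iteration regardless of the input dimension, which matches the footprint claim in the abstract. Evaluating the function costs one feature regeneration per stored coefficient, so the sketch trades computation for memory.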

arXiv.org e-Print Archive

CiteSeerX

Iteration Complexity of Randomized Primal-Dual Methods for Convex-Concave Saddle Point Problems

Authors: Aybat N. S., Hamedani E. Yazdandoost, Jalilzadeh A., Shanbhag U. V.
Publication date: 02/07/2018

In this paper we propose a class of randomized primal-dual methods to contend with large-scale saddle point problems defined by a convex-concave function $\mathcal{L}(\mathbf{x},y)\triangleq\sum_{i=1}^m f_i(x_i)+\Phi(\mathbf{x},y)-h(y)$. We analyze the convergence rate of the proposed method under the settings of mere convexity and strong convexity in the $\mathbf{x}$-variable. In particular, assuming $\nabla_y\Phi(\cdot,\cdot)$ is Lipschitz and $\nabla_\mathbf{x}\Phi(\cdot,y)$ is coordinate-wise Lipschitz for any fixed $y$, the ergodic sequence generated by the algorithm achieves the convergence rate of $\mathcal{O}(m/k)$ in a suitable error metric, where $m$ denotes the number of coordinates for the primal variable. Furthermore, assuming that $\mathcal{L}(\cdot,y)$ is uniformly strongly convex for any $y$, and that $\Phi(\cdot,y)$ is linear in $y$, the scheme displays a convergence rate of $\mathcal{O}(m/k^2)$. We implemented the proposed algorithmic framework to solve the kernel matrix learning problem, and tested it against other state-of-the-art solvers.
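For orientation, the saddle point problem in this abstract can be written in the standard min-max form. The block below is a restatement under the usual definition of a saddle point, not a formula quoted from the paper. The feasible sets $X$ and $Y$ and the symbols $\mathbf{x}^*$, $y^*$ are notation assumed here.

```latex
\[
\min_{\mathbf{x}\in X}\;\max_{y\in Y}\;
\mathcal{L}(\mathbf{x},y)\triangleq\sum_{i=1}^m f_i(x_i)+\Phi(\mathbf{x},y)-h(y),
\qquad \mathbf{x}=(x_1,\dots,x_m).
\]
% A pair (x^*, y^*) is a saddle point when neither variable can improve unilaterally:
\[
\mathcal{L}(\mathbf{x}^*,y)\;\le\;\mathcal{L}(\mathbf{x}^*,y^*)\;\le\;\mathcal{L}(\mathbf{x},y^*)
\qquad\text{for all } \mathbf{x}\in X,\ y\in Y.
\]
```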

arXiv.org e-Print Archive

Stochastic functional descent for learning Support Vector Machines

Author: He Kun
Publication date: 22/01/2016

We present a novel method for learning Support Vector Machines (SVMs) in the online setting. Our method is generally applicable in that it handles the online learning of the binary, multiclass, and structural SVMs in a unified view. The SVM learning problem consists of optimizing a convex objective function that is composed of two parts: the hinge loss and quadratic regularization. To date, the predominant family of approaches for online SVM learning has been gradient-based methods, such as Stochastic Gradient Descent (SGD). Unfortunately, we note that there are two drawbacks in such approaches: first, gradient-based methods are based on a local linear approximation to the function being optimized, but since the hinge loss is piecewise-linear and nonsmooth, this approximation can be ill-behaved. Second, existing online SVM learning approaches share the same problem formulation with batch SVM learning methods, and they all need to tune a fixed global regularization parameter by cross validation. On the one hand, global regularization is ineffective in handling local irregularities encountered in the online setting; on the other hand, even though the learning problem for a particular global regularization parameter value may be efficiently solved, repeatedly solving for a wide range of values can be costly. We intend to tackle these two problems with our approach. To address the first problem, we propose to perform implicit online update steps to optimize the hinge loss, as opposed to explicit (or gradient-based) updates that utilize subgradients to perform local linearization. Regarding the second problem, we propose to enforce local regularization that is applied to individual classifier update steps, rather than having a fixed global regularization term. 
Our theoretical analysis suggests that our classifier update steps progressively optimize the structured hinge loss, with the rate controlled by a sequence of regularization parameters; setting these parameters is analogous to setting the stepsizes in gradient-based methods. In addition, we give sufficient conditions for the algorithm's convergence. Experimentally, our online algorithm can match optimal classification performances given by other state-of-the-art online SVM learning methods, as well as batch learning methods, after only one or two passes over the training data. More importantly, our algorithm can attain these results without doing cross validation, while all other methods must perform time-consuming cross validation to determine the optimal choice of the global regularization parameter.
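The contrast between implicit and explicit updates can be made concrete for the simplest case, a binary linear classifier. The sketch below is an assumption-laden illustration, not the author's algorithm: it solves the local problem of minimizing (lam/2)*||v - w||^2 plus the hinge loss on one example in closed form, which coincides with the well-known Passive-Aggressive (PA-I) update with aggressiveness 1/lam. The function names and the parameter `lam` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))

def hinge(w, x, y):
    """Binary hinge loss max(0, 1 - y * <w, x>) with label y in {-1, +1}."""
    return max(0.0, 1.0 - y * dot(w, x))

def implicit_hinge_step(w, x, y, lam):
    """Implicit step: argmin_v (lam / 2) * ||v - w||^2 + hinge(v, x, y).

    The minimizer is w + tau * y * x with tau = min(1 / lam, loss / ||x||^2),
    so the step never moves past the point where the hinge loss reaches zero.
    """
    loss = hinge(w, x, y)
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)
    tau = min(1.0 / lam, loss / norm_sq)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

def explicit_subgradient_step(w, x, y, lam):
    """Explicit step for comparison: a subgradient step of fixed size 1 / lam."""
    if hinge(w, x, y) == 0.0:
        return list(w)
    return [wi + (1.0 / lam) * y * xi for wi, xi in zip(w, x)]

def local_objective(v, w, x, y, lam):
    """The locally regularized objective that the implicit step minimizes."""
    return 0.5 * lam * sum((a - b) ** 2 for a, b in zip(v, w)) + hinge(v, x, y)

if __name__ == "__main__":
    w, x, y, lam = [0.0, 0.0], [2.0, 0.0], 1, 0.5
    v_implicit = implicit_hinge_step(w, x, y, lam)
    v_explicit = explicit_subgradient_step(w, x, y, lam)
    print("implicit:", v_implicit, local_objective(v_implicit, w, x, y, lam))
    print("explicit:", v_explicit, local_objective(v_explicit, w, x, y, lam))
```

In this sketch `lam` plays the role of the per-step regularization parameter: a larger value keeps the new classifier closer to the previous one, much as a smaller step size would in a gradient-based method.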

Boston University Institutional Repository (OpenBU)