
    Generalization Bounds for Stochastic Gradient Descent via Localized $\varepsilon$-Covers

    In this paper, we propose a new covering technique localized to the trajectories of SGD. This localization provides an algorithm-specific complexity measured by the covering number, which can have dimension-independent cardinality, in contrast to standard uniform covering arguments that result in exponential dimension dependency. Based on this localized construction, we show that if the objective function is a finite perturbation of a piecewise strongly convex and smooth function with $P$ pieces, i.e. non-convex and non-smooth in general, the generalization error can be upper bounded by $O(\sqrt{(\log n \log(nP))/n})$, where $n$ is the number of data samples. In particular, this rate is independent of dimension and does not require early stopping or a decaying step size. Finally, we employ these results in various contexts and derive generalization bounds for multi-index linear models, multi-class support vector machines, and $K$-means clustering for both hard and soft label setups, improving the known state-of-the-art rates.
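    As a rough illustration of the rate above, here is a minimal Python sketch (hypothetical values of $n$ and $P$, constants hidden by the $O(\cdot)$ omitted) showing that the bound decays with the sample size and grows only logarithmically in the number of pieces $P$, with no dependence on the ambient dimension.

```python
import math

def sgd_generalization_rate(n: int, P: int) -> float:
    """Evaluate the stated rate sqrt(log(n) * log(n * P) / n).

    n: number of data samples, P: number of strongly convex/smooth pieces.
    Constants hidden by the O(.) notation are omitted.
    """
    return math.sqrt(math.log(n) * math.log(n * P) / n)

# Hypothetical values: the bound shrinks with n and grows only
# logarithmically in P, independently of the ambient dimension.
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, [round(sgd_generalization_rate(n, P), 4) for P in (10, 1000)])
```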

    Optimal algorithms for smooth and strongly convex distributed optimization in networks

    In this paper, we determine the optimal convergence rates for strongly convex and smooth distributed optimization in two settings: centralized and decentralized communications over a network. For centralized (i.e. master/slave) algorithms, we show that distributing Nesterov's accelerated gradient descent is optimal and achieves a precision $\varepsilon > 0$ in time $O(\sqrt{\kappa_g}(1+\Delta\tau)\ln(1/\varepsilon))$, where $\kappa_g$ is the condition number of the (global) function to optimize, $\Delta$ is the diameter of the network, and $\tau$ (resp. $1$) is the time needed to communicate values between two neighbors (resp. perform local computations). For decentralized algorithms based on gossip, we provide the first optimal algorithm, called the multi-step dual accelerated (MSDA) method, that achieves a precision $\varepsilon > 0$ in time $O(\sqrt{\kappa_l}(1+\frac{\tau}{\sqrt{\gamma}})\ln(1/\varepsilon))$, where $\kappa_l$ is the condition number of the local functions and $\gamma$ is the (normalized) eigengap of the gossip matrix used for communication between nodes. We then verify the efficiency of MSDA against state-of-the-art methods for two problems: least-squares regression and classification by logistic regression.
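    The following is a small numeric sketch, not the MSDA method itself: it builds a simple ring-graph gossip matrix, uses its spectral gap as a stand-in for the (normalized) eigengap $\gamma$, and plugs assumed values of $\kappa_l$, $\tau$, and $\varepsilon$ into the decentralized time bound above, just to show how $\gamma$ enters the rate.

```python
import numpy as np

def ring_gossip_matrix(k: int) -> np.ndarray:
    """Symmetric, doubly stochastic gossip matrix for a k-node ring:
    each node averages itself with its two neighbours."""
    W = np.zeros((k, k))
    for i in range(k):
        W[i, i] = 1 / 3
        W[i, (i - 1) % k] = 1 / 3
        W[i, (i + 1) % k] = 1 / 3
    return W

def spectral_gap(W: np.ndarray) -> float:
    """1 minus the second-largest eigenvalue magnitude; used here as a
    stand-in for the paper's (normalized) eigengap gamma."""
    eig = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - eig[1]

# Hypothetical problem constants (not taken from the paper).
kappa_l, tau, eps = 100.0, 5.0, 1e-6
W = ring_gossip_matrix(20)
gamma = spectral_gap(W)

# Decentralized time bound reported above, up to constants:
# sqrt(kappa_l) * (1 + tau / sqrt(gamma)) * ln(1 / eps)
t_msda = np.sqrt(kappa_l) * (1 + tau / np.sqrt(gamma)) * np.log(1 / eps)
print(f"gamma ~= {gamma:.4f}, decentralized time bound ~= {t_msda:.1f}")
```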

    Toward a unified theory of sparse dimensionality reduction in Euclidean space

    Let Ί∈Rm×n\Phi\in\mathbb{R}^{m\times n} be a sparse Johnson-Lindenstrauss transform [KN14] with ss non-zeroes per column. For a subset TT of the unit sphere, Δ∈(0,1/2)\varepsilon\in(0,1/2) given, we study settings for m,sm,s required to ensure EΊsup⁥x∈T∣∄Ίx∄22−1∣<Δ, \mathop{\mathbb{E}}_\Phi \sup_{x\in T} \left|\|\Phi x\|_2^2 - 1 \right| < \varepsilon , i.e. so that Ί\Phi preserves the norm of every x∈Tx\in T simultaneously and multiplicatively up to 1+Δ1+\varepsilon. We introduce a new complexity parameter, which depends on the geometry of TT, and show that it suffices to choose ss and mm such that this parameter is small. Our result is a sparse analog of Gordon's theorem, which was concerned with a dense Ί\Phi having i.i.d. Gaussian entries. We qualitatively unify several results related to the Johnson-Lindenstrauss lemma, subspace embeddings, and Fourier-based restricted isometries. Our work also implies new results in using the sparse Johnson-Lindenstrauss transform in numerical linear algebra, classical and model-based compressed sensing, manifold learning, and constrained least squares problems such as the Lasso

    For Kernel Range Spaces a Constant Number of Queries Are Sufficient

    We introduce the notion of an $\varepsilon$-cover for a kernel range space. A kernel range space concerns a set of points $X \subset \mathbb{R}^d$ and the space of all queries by a fixed kernel (e.g., a Gaussian kernel $K(p,\cdot) = \exp(-\|p-\cdot\|^2)$). For a point set $X$ of size $n$, a query returns a vector of values $R_p \in \mathbb{R}^n$, where the $i$th coordinate $(R_p)_i = K(p,x_i)$ for $x_i \in X$. An $\varepsilon$-cover is a subset of points $Q \subset \mathbb{R}^d$ such that for any $p \in \mathbb{R}^d$ we have $\frac{1}{n} \|R_p - R_q\|_1 \leq \varepsilon$ for some $q \in Q$. This is a smooth analog of Haussler's notion of $\varepsilon$-covers for combinatorial range spaces (e.g., defined by subsets of points within a ball query), where the resulting vectors $R_p$ are in $\{0,1\}^n$ instead of $[0,1]^n$. The kernel versions of these range spaces show up in data analysis tasks where the coordinates may be uncertain or imprecise, and hence one wishes to add some flexibility to the notion of inside and outside of a query range. Our main result is that, unlike combinatorial range spaces, the size of kernel $\varepsilon$-covers is independent of the input size $n$ and dimension $d$. We obtain a bound of $(1/\varepsilon)^{\tilde O(1/\varepsilon^2)}$, where $\tilde{O}(f(1/\varepsilon))$ hides log factors in $(1/\varepsilon)$ that can depend on the kernel. This implies that by relaxing the notion of boundaries in range queries, the curse of dimensionality eventually disappears, which may help explain the success of machine learning in very high dimensions. We also complement this result with a lower bound of almost $(1/\varepsilon)^{\Omega(1/\varepsilon)}$, showing that the exponential dependence on $1/\varepsilon$ is necessary.
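    A minimal sketch of the objects in this definition, assuming a Gaussian kernel and synthetic data: it forms the query vectors $R_p$ and $R_q$ and evaluates the distance $\frac{1}{n}\|R_p - R_q\|_1$ that an $\varepsilon$-cover must keep below $\varepsilon$ for some $q \in Q$.

```python
import numpy as np

rng = np.random.default_rng(0)

def query_vector(X: np.ndarray, p: np.ndarray) -> np.ndarray:
    """R_p: the vector of Gaussian kernel values K(p, x_i) = exp(-||p - x_i||^2)
    over all points x_i in X."""
    return np.exp(-np.sum((X - p) ** 2, axis=1))

def query_distance(X: np.ndarray, p: np.ndarray, q: np.ndarray) -> float:
    """(1/n) * ||R_p - R_q||_1, the quantity an eps-cover must control."""
    return np.abs(query_vector(X, p) - query_vector(X, q)).mean()

# Hypothetical data: n points in d dimensions and two nearby query points.
n, d = 1000, 50
X = rng.standard_normal((n, d))
p = rng.standard_normal(d)
q = p + 0.05 * rng.standard_normal(d)
print("query distance:", query_distance(X, p, q))
```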