35 research outputs found
Generalization Bounds for Stochastic Gradient Descent via Localized ε-Covers
In this paper, we propose a new covering technique localized for the
trajectories of SGD. This localization provides an algorithm-specific
complexity measured by the covering number, which can have
dimension-independent cardinality in contrast to standard uniform covering
arguments that result in exponential dimension dependency. Based on this
localized construction, we show that if the objective function is a finite
perturbation of a piecewise strongly convex and smooth function with P pieces,
i.e. non-convex and non-smooth in general, the generalization error can be
upper bounded by O(√(log n · log(nP) / n)), where n is the number of data
samples. In particular, this rate is independent of dimension and does not
require early stopping and decaying step size. Finally, we employ these results
in various contexts and derive generalization bounds for multi-index linear
models, multi-class support vector machines, and k-means clustering for both
hard and soft label setups, improving the known state-of-the-art rates.
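The localized-cover idea above can be illustrated very loosely (this is not the paper's actual construction) by recording the iterates of SGD on a toy strongly convex objective and greedily covering only that trajectory, rather than the whole parameter space. The objective, step size, and function names below are all hypothetical choices for the sketch.

```python
import numpy as np

def sgd_trajectory(x0, lr=0.1, steps=200, seed=0):
    """Run SGD on a toy strongly convex objective f(x) = 0.5*||x - mu||^2
    with additive gradient noise, recording every iterate (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    mu = np.ones_like(x0)
    x = x0.copy()
    traj = [x.copy()]
    for _ in range(steps):
        g = (x - mu) + 0.1 * rng.standard_normal(x.shape)  # noisy gradient
        x = x - lr * g
        traj.append(x.copy())
    return np.array(traj)

def greedy_eps_cover(points, eps):
    """Greedily pick centers so every point is within eps (Euclidean) of a center."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return centers

traj = sgd_trajectory(np.zeros(20))
cover = greedy_eps_cover(traj, eps=0.5)
```

The point of the localization is that the cover only has to account for the iterates SGD actually visits, not for all of parameter space.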
Optimal algorithms for smooth and strongly convex distributed optimization in networks
In this paper, we determine the optimal convergence rates for strongly convex and smooth distributed optimization in two settings: centralized and decentralized communications over a network. For centralized (i.e. master/slave) algorithms, we show that distributing Nesterov's accelerated gradient descent is optimal and achieves a precision ε > 0 in time O(√κ_g (1 + Δτ) ln(1/ε)), where κ_g is the condition number of the (global) function to optimize, Δ is the diameter of the network, and τ (resp. 1) is the time needed to communicate values between two neighbors (resp. perform local computations). For decentralized algorithms based on gossip, we provide the first optimal algorithm, called the multi-step dual accelerated (MSDA) method, that achieves a precision ε > 0 in time O(√κ_l (1 + τ/√γ) ln(1/ε)), where κ_l is the condition number of the local functions and γ is the (normalized) eigengap of the gossip matrix used for communication between nodes. We then verify the efficiency of MSDA against state-of-the-art methods for two problems: least-squares regression and classification by logistic regression.
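The centralized result distributes Nesterov's accelerated gradient descent; a minimal single-machine sketch of the constant-momentum variant on a toy quadratic, where the condition number κ drives the √κ · ln(1/ε) rate, can look like the following. The quadratic and all names here are assumptions of this sketch, not the paper's distributed algorithm.

```python
import numpy as np

def nesterov_agd(grad, x0, L, mu, steps):
    """Nesterov's accelerated gradient descent for an L-smooth,
    mu-strongly convex objective (constant-momentum variant)."""
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)  # momentum coefficient
    x, y = x0.copy(), x0.copy()
    for _ in range(steps):
        x_next = y - grad(y) / L          # gradient step from extrapolated point
        y = x_next + beta * (x_next - x)  # momentum extrapolation
        x = x_next
    return x

# Toy quadratic f(x) = 0.5 * x^T A x with eigenvalues in [mu, L], minimizer 0
A = np.diag(np.linspace(1.0, 100.0, 10))  # condition number kappa = 100
grad = lambda x: A @ x
x = nesterov_agd(grad, np.ones(10), L=100.0, mu=1.0, steps=300)
```

With κ = 100, the error contracts roughly like (1 − 1/√κ) per iteration, i.e. √κ ≈ 10 times faster than plain gradient descent on this problem.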
Toward a unified theory of sparse dimensionality reduction in Euclidean space
Let Φ be an m × n sparse Johnson-Lindenstrauss transform [KN14] with s
non-zeroes per column. For a subset T of the unit sphere and ε ∈ (0, 1/2)
given, we study settings for m, s required to ensure
E_Φ sup_{x ∈ T} |‖Φx‖₂² − 1| < ε, i.e. so that Φ preserves the norm of every
x ∈ T simultaneously and multiplicatively up to 1 ± ε. We introduce a new
complexity parameter, which depends on the geometry of T, and show that it
suffices to choose m and s such that this parameter is small.
Our result is a sparse analog of Gordon's theorem, which was concerned with a
dense Φ having i.i.d. Gaussian entries. We qualitatively unify several
results related to the Johnson-Lindenstrauss lemma, subspace embeddings, and
Fourier-based restricted isometries. Our work also implies new results in using
the sparse Johnson-Lindenstrauss transform in numerical linear algebra,
classical and model-based compressed sensing, manifold learning, and
constrained least squares problems such as the Lasso.
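A simplified sparse JL construction can be sketched as follows: each column gets exactly s nonzero entries, each ±1/√s, in uniformly chosen rows. This is an illustrative variant, not the exact [KN14] block construction, and the function names and parameter choices are assumptions of the sketch.

```python
import numpy as np

def sparse_jl(m, n, s, seed=0):
    """Build an m x n sparse JL matrix: each column has exactly s nonzero
    entries, each +-1/sqrt(s), in uniformly chosen rows (simplified variant)."""
    rng = np.random.default_rng(seed)
    Phi = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        signs = rng.choice([-1.0, 1.0], size=s)
        Phi[rows, j] = signs / np.sqrt(s)
    return Phi

# Check norm preservation on a few random unit vectors in R^n
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 20))
X /= np.linalg.norm(X, axis=0)          # 20 columns on the unit sphere
Phi = sparse_jl(m=300, n=1000, s=8)
distortion = np.abs(np.linalg.norm(Phi @ X, axis=0) ** 2 - 1.0)
```

Each column has unit norm by construction and the random signs make ‖Φx‖₂² unbiased, so the squared norms of the embedded vectors concentrate around 1.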
For Kernel Range Spaces a Constant Number of Queries Are Sufficient
We introduce the notion of an ε-cover for a kernel range space. A kernel range
space concerns a set of points X ⊂ ℝ^d and the space of all queries by a fixed
kernel (e.g., a Gaussian kernel K(p, x) = exp(−‖p − x‖²)). For a point set X
of size n, a query returns a vector of values R_p ∈ ℝ^n, where the ith
coordinate is (R_p)_i = K(p, x_i) for x_i ∈ X. An ε-cover is a subset of
points Q ⊂ ℝ^d so that for any p ∈ ℝ^d, ‖R_p − R_q‖_∞ ≤ ε for some q ∈ Q.
This is a smooth analog of Haussler's notion of ε-covers for combinatorial
range spaces (e.g., defined by subsets of points within a ball query) where
the resulting vectors are in {0, 1}^n instead of [0, 1]^n. The kernel versions
of these range spaces show up in data analysis tasks where the coordinates may
be uncertain or imprecise, and hence one wishes to add some flexibility in the
notion of inside and outside of a query range.
Our main result is that, unlike combinatorial range spaces, the size of kernel
ε-covers is independent of the input size n and dimension d. We obtain a bound
of (1/ε)^(Õ(1/ε²)), where Õ(·) hides log factors in 1/ε that can depend on the
kernel. This implies that by relaxing the notion of boundaries in range
queries, eventually the curse of dimensionality disappears, and may help
explain the success of machine learning in very high dimensions. We also
complement this result with a lower bound of almost (1/ε)^(Ω(1/ε)), showing
the exponential dependence on 1/ε is necessary.
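The ε-cover definition above suggests a naive greedy procedure, useful only as an illustration: scan candidate queries and keep any whose kernel-value vector R_p is farther than ε in ℓ∞ from every vector kept so far. The 1-d point set, candidate grid, and function names below are assumptions of this sketch; the paper's bounds do not come from this procedure.

```python
import numpy as np

def kernel_vector(p, X):
    """R_p: vector of Gaussian kernel values K(p, x_i) = exp(-||p - x_i||^2)."""
    return np.exp(-np.sum((X - p) ** 2, axis=1))

def greedy_kernel_cover(candidates, X, eps):
    """Pick a subset Q of candidate queries so that every candidate p has
    some q in Q with ||R_p - R_q||_inf <= eps (greedy, illustrative only)."""
    Q, RQ = [], []
    for p in candidates:
        Rp = kernel_vector(p, X)
        if not any(np.max(np.abs(Rp - Rq)) <= eps for Rq in RQ):
            Q.append(p)
            RQ.append(Rp)
    return Q

X = np.linspace(-2, 2, 50).reshape(-1, 1)        # n = 50 points on a line
cands = np.linspace(-5, 5, 401).reshape(-1, 1)   # dense 1-d query grid
Q = greedy_kernel_cover(cands, X, eps=0.2)
```

Far from the point set, every R_p is uniformly small, so a single cover element handles all distant queries at once; this is the "smooth boundary" effect that keeps kernel ε-covers small.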