
    Generalization Bounds for Stochastic Gradient Descent via Localized $\varepsilon$-Covers

    In this paper, we propose a new covering technique localized to the trajectories of SGD. This localization provides an algorithm-specific complexity measured by the covering number, which can have dimension-independent cardinality, in contrast to standard uniform covering arguments that result in exponential dimension dependency. Based on this localized construction, we show that if the objective function is a finite perturbation of a piecewise strongly convex and smooth function with $P$ pieces, i.e. non-convex and non-smooth in general, the generalization error can be upper bounded by $O(\sqrt{(\log n \log(nP))/n})$, where $n$ is the number of data samples. In particular, this rate is independent of dimension and does not require early stopping or a decaying step size. Finally, we employ these results in various contexts and derive generalization bounds for multi-index linear models, multi-class support vector machines, and $K$-means clustering for both hard and soft label setups, improving the known state-of-the-art rates.
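    As a rough illustration of the rate above, here is a minimal Python sketch (hypothetical values of $n$ and $P$, constants hidden by the $O(\cdot)$ omitted) showing that the bound decays with the sample size and grows only logarithmically in the number of pieces $P$, with no dependence on the ambient dimension.

```python
import math

def sgd_generalization_rate(n: int, P: int) -> float:
    """Evaluate the stated rate sqrt(log(n) * log(n * P) / n).

    n: number of data samples, P: number of strongly convex/smooth pieces.
    Constants hidden by the O(.) notation are omitted.
    """
    return math.sqrt(math.log(n) * math.log(n * P) / n)

# Hypothetical values: the bound shrinks with n and grows only
# logarithmically in P, independently of the ambient dimension.
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, [round(sgd_generalization_rate(n, P), 4) for P in (10, 1000)])
```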

    Optimal algorithms for smooth and strongly convex distributed optimization in networks

    In this paper, we determine the optimal convergence rates for strongly convex and smooth distributed optimization in two settings: centralized and decentralized communications over a network. For centralized (i.e. master/slave) algorithms, we show that distributing Nesterov's accelerated gradient descent is optimal and achieves a precision $\varepsilon > 0$ in time $O(\sqrt{\kappa_g}(1+\Delta\tau)\ln(1/\varepsilon))$, where $\kappa_g$ is the condition number of the (global) function to optimize, $\Delta$ is the diameter of the network, and $\tau$ (resp. $1$) is the time needed to communicate values between two neighbors (resp. perform local computations). For decentralized algorithms based on gossip, we provide the first optimal algorithm, called the multi-step dual accelerated (MSDA) method, that achieves a precision $\varepsilon > 0$ in time $O(\sqrt{\kappa_l}(1+\frac{\tau}{\sqrt{\gamma}})\ln(1/\varepsilon))$, where $\kappa_l$ is the condition number of the local functions and $\gamma$ is the (normalized) eigengap of the gossip matrix used for communication between nodes. We then verify the efficiency of MSDA against state-of-the-art methods for two problems: least-squares regression and classification by logistic regression.
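    The following is a small numeric sketch, not the MSDA method itself: it builds a simple ring-graph gossip matrix, uses its spectral gap as a stand-in for the (normalized) eigengap $\gamma$, and plugs assumed values of $\kappa_l$, $\tau$, and $\varepsilon$ into the decentralized time bound above, just to show how $\gamma$ enters the rate.

```python
import numpy as np

def ring_gossip_matrix(k: int) -> np.ndarray:
    """Symmetric, doubly stochastic gossip matrix for a k-node ring:
    each node averages itself with its two neighbours."""
    W = np.zeros((k, k))
    for i in range(k):
        W[i, i] = 1 / 3
        W[i, (i - 1) % k] = 1 / 3
        W[i, (i + 1) % k] = 1 / 3
    return W

def spectral_gap(W: np.ndarray) -> float:
    """1 minus the second-largest eigenvalue magnitude; used here as a
    stand-in for the paper's (normalized) eigengap gamma."""
    eig = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - eig[1]

# Hypothetical problem constants (not taken from the paper).
kappa_l, tau, eps = 100.0, 5.0, 1e-6
W = ring_gossip_matrix(20)
gamma = spectral_gap(W)

# Decentralized time bound reported above, up to constants:
# sqrt(kappa_l) * (1 + tau / sqrt(gamma)) * ln(1 / eps)
t_msda = np.sqrt(kappa_l) * (1 + tau / np.sqrt(gamma)) * np.log(1 / eps)
print(f"gamma ~= {gamma:.4f}, decentralized time bound ~= {t_msda:.1f}")
```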

    Toward a unified theory of sparse dimensionality reduction in Euclidean space

    Let Ί∈Rm×n\Phi\in\mathbb{R}^{m\times n} be a sparse Johnson-Lindenstrauss transform [KN14] with ss non-zeroes per column. For a subset TT of the unit sphere, Δ∈(0,1/2)\varepsilon\in(0,1/2) given, we study settings for m,sm,s required to ensure EΊsup⁥x∈T∣∄Ίx∄22−1∣<Δ, \mathop{\mathbb{E}}_\Phi \sup_{x\in T} \left|\|\Phi x\|_2^2 - 1 \right| < \varepsilon , i.e. so that Ί\Phi preserves the norm of every x∈Tx\in T simultaneously and multiplicatively up to 1+Δ1+\varepsilon. We introduce a new complexity parameter, which depends on the geometry of TT, and show that it suffices to choose ss and mm such that this parameter is small. Our result is a sparse analog of Gordon's theorem, which was concerned with a dense Ί\Phi having i.i.d. Gaussian entries. We qualitatively unify several results related to the Johnson-Lindenstrauss lemma, subspace embeddings, and Fourier-based restricted isometries. Our work also implies new results in using the sparse Johnson-Lindenstrauss transform in numerical linear algebra, classical and model-based compressed sensing, manifold learning, and constrained least squares problems such as the Lasso

    For Kernel Range Spaces a Constant Number of Queries Are Sufficient

    We introduce the notion of an $\varepsilon$-cover for a kernel range space. A kernel range space concerns a set of points $X \subset \mathbb{R}^d$ and the space of all queries by a fixed kernel (e.g., a Gaussian kernel $K(p,\cdot) = \exp(-\|p-\cdot\|^2)$). For a point set $X$ of size $n$, a query returns a vector of values $R_p \in \mathbb{R}^n$, where the $i$th coordinate $(R_p)_i = K(p,x_i)$ for $x_i \in X$. An $\varepsilon$-cover is a subset of points $Q \subset \mathbb{R}^d$ such that for any $p \in \mathbb{R}^d$ we have $\frac{1}{n} \|R_p - R_q\|_1 \leq \varepsilon$ for some $q \in Q$. This is a smooth analog of Haussler's notion of $\varepsilon$-covers for combinatorial range spaces (e.g., defined by subsets of points within a ball query), where the resulting vectors $R_p$ are in $\{0,1\}^n$ instead of $[0,1]^n$. The kernel versions of these range spaces show up in data analysis tasks where the coordinates may be uncertain or imprecise, and hence one wishes to add some flexibility to the notion of inside and outside of a query range. Our main result is that, unlike combinatorial range spaces, the size of kernel $\varepsilon$-covers is independent of the input size $n$ and dimension $d$. We obtain a bound of $(1/\varepsilon)^{\tilde O(1/\varepsilon^2)}$, where $\tilde{O}(f(1/\varepsilon))$ hides log factors in $(1/\varepsilon)$ that can depend on the kernel. This implies that by relaxing the notion of boundaries in range queries, the curse of dimensionality eventually disappears, which may help explain the success of machine learning in very high dimensions. We also complement this result with a lower bound of almost $(1/\varepsilon)^{\Omega(1/\varepsilon)}$, showing that the exponential dependence on $1/\varepsilon$ is necessary.
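    A minimal sketch of the objects in this definition, assuming a Gaussian kernel and synthetic data: it forms the query vectors $R_p$ and $R_q$ and evaluates the distance $\frac{1}{n}\|R_p - R_q\|_1$ that an $\varepsilon$-cover must keep below $\varepsilon$ for some $q \in Q$.

```python
import numpy as np

rng = np.random.default_rng(0)

def query_vector(X: np.ndarray, p: np.ndarray) -> np.ndarray:
    """R_p: the vector of Gaussian kernel values K(p, x_i) = exp(-||p - x_i||^2)
    over all points x_i in X."""
    return np.exp(-np.sum((X - p) ** 2, axis=1))

def query_distance(X: np.ndarray, p: np.ndarray, q: np.ndarray) -> float:
    """(1/n) * ||R_p - R_q||_1, the quantity an eps-cover must control."""
    return np.abs(query_vector(X, p) - query_vector(X, q)).mean()

# Hypothetical data: n points in d dimensions and two nearby query points.
n, d = 1000, 50
X = rng.standard_normal((n, d))
p = rng.standard_normal(d)
q = p + 0.05 * rng.standard_normal(d)
print("query distance:", query_distance(X, p, q))
```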