
    Learning with SGD and Random Features

    Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large-scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient descent with mini-batches and random features. The latter can be seen as a form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as the number of features, iterations, step-size and mini-batch size, control the learning properties of the solutions. We do this by deriving optimal finite-sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
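
    A minimal illustration of the setup above, assuming only numpy and a synthetic one-dimensional regression task: random Fourier features stand in for a Gaussian kernel, and plain mini-batch SGD with no explicit penalty fits the weights, so the number of features, step size, mini-batch size and iteration count are the only tuning knobs. This is a sketch of the general idea, not the paper's estimator or experimental protocol.

        # Mini-batch SGD on random Fourier features (approximate kernel method).
        # No explicit penalty: feature count, step size, batch size and number of
        # iterations act as the implicit regularizers.
        import numpy as np

        rng = np.random.default_rng(0)

        # Synthetic 1-D regression data (illustrative assumption).
        n, d, n_features = 2000, 1, 300
        X = rng.uniform(-3, 3, size=(n, d))
        y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(n)

        # Random Fourier features approximating a Gaussian kernel of bandwidth sigma.
        sigma = 0.5
        W = rng.standard_normal((d, n_features)) / sigma
        b = rng.uniform(0, 2 * np.pi, n_features)
        Phi = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

        # Plain mini-batch SGD on the squared loss with a decaying step size.
        theta = np.zeros(n_features)
        step, batch, n_iter = 1.0, 32, 5000
        for t in range(n_iter):
            idx = rng.integers(0, n, batch)
            grad = Phi[idx].T @ (Phi[idx] @ theta - y[idx]) / batch
            theta -= step / np.sqrt(t + 1) * grad

        print("train MSE:", np.mean((Phi @ theta - y) ** 2))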

    Practical sketching algorithms for low-rank matrix approximation

    This paper describes a suite of algorithms for constructing low-rank approximations of an input matrix from a random linear image of the matrix, called a sketch. These methods can preserve structural properties of the input matrix, such as positive semidefiniteness, and they can produce approximations with a user-specified rank. The algorithms are simple, accurate, numerically stable, and provably correct. Moreover, each method is accompanied by an informative error bound that allows users to select parameters a priori to achieve a given approximation quality. These claims are supported by numerical experiments with real and synthetic data.
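
    As a hedged illustration of the sketch-and-reconstruct paradigm (not the paper's specific algorithms or parameter recommendations), the snippet below forms two random sketches of a matrix in a single pass and rebuilds a low-rank approximation from the sketches alone; a fixed-rank approximation could then be obtained by truncating the SVD of the small factor.

        # Two-sided sketching for low-rank approximation: Y = A @ Omega captures the
        # range, W = Psi @ A the co-range; the reconstruction uses only Y and W.
        import numpy as np

        rng = np.random.default_rng(1)
        m, n, r = 500, 400, 10

        # Test matrix with rapidly decaying spectrum plus a small noise floor.
        U, _ = np.linalg.qr(rng.standard_normal((m, r)))
        V, _ = np.linalg.qr(rng.standard_normal((n, r)))
        A = U @ np.diag(np.logspace(0, -3, r)) @ V.T + 1e-6 * rng.standard_normal((m, n))

        k, l = 2 * r + 1, 4 * r + 3           # illustrative sketch sizes, k < l
        Omega = rng.standard_normal((n, k))   # right test matrix
        Psi = rng.standard_normal((l, m))     # left test matrix
        Y, W = A @ Omega, Psi @ A             # the only access to A

        Q, _ = np.linalg.qr(Y)                # orthonormal basis for the range sketch
        X, *_ = np.linalg.lstsq(Psi @ Q, W, rcond=None)   # solve (Psi Q) X ~= W
        A_hat = Q @ X                         # low-rank approximation of A

        print("relative error:", np.linalg.norm(A - A_hat) / np.linalg.norm(A))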

    Sublinear Time Numerical Linear Algebra for Structured Matrices

    We show how to solve a number of problems in numerical linear algebra, such as least squares regression, $\ell_p$-regression for any $p \geq 1$, low rank approximation, and kernel regression, in time $T(A) \cdot \mathrm{poly}(\log(nd))$, where for a given input matrix $A \in \mathbb{R}^{n \times d}$, $T(A)$ is the time needed to compute $A \cdot y$ for an arbitrary vector $y \in \mathbb{R}^d$. Since $T(A) \leq O(\mathrm{nnz}(A))$, where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$, the time is no worse, up to polylogarithmic factors, than all of the recent advances for such problems that run in input-sparsity time. However, for many applications, $T(A)$ can be much smaller than $\mathrm{nnz}(A)$, yielding significantly sublinear time algorithms. For example, in the overconstrained $(1+\epsilon)$-approximate polynomial interpolation problem, $A$ is a Vandermonde matrix and $T(A) = O(n \log n)$; in this case our running time is $n \cdot \mathrm{poly}(\log n) + \mathrm{poly}(d/\epsilon)$ and we recover the results of \cite{avron2013sketching} as a special case. For overconstrained autoregression, which is a common problem arising in dynamical systems, $T(A) = O(n \log n)$, and we immediately obtain $n \cdot \mathrm{poly}(\log n) + \mathrm{poly}(d/\epsilon)$ time. For kernel autoregression, we significantly improve the running time of prior algorithms for general kernels. For the important case of autoregression with the polynomial kernel and arbitrary target vector $b \in \mathbb{R}^n$, we obtain even faster algorithms. Our algorithms show that, perhaps surprisingly, most of these optimization problems do not require much more time than that of a polylogarithmic number of matrix-vector multiplications.
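
    The theme that the cost is driven by matrix-vector multiplication time can be illustrated with a hedged example: below, a circulant operator (whose matvec costs O(n log n) via the FFT) stands in for the structured families mentioned in the abstract, and a Krylov solver touches the matrix only through matvecs. This illustrates the access model, not the paper's algorithms.

        # Solve a regression problem while accessing the matrix only through fast
        # matrix-vector products, so the cost scales with T(A) rather than nnz(A).
        import numpy as np
        from scipy.sparse.linalg import LinearOperator, lsqr

        rng = np.random.default_rng(2)
        n = 4096
        c = 0.9 ** np.arange(n)   # first column of a well-conditioned circulant matrix
        fc = np.fft.fft(c)        # its spectrum

        def matvec(x):            # C @ x in O(n log n) via the FFT
            return np.real(np.fft.ifft(fc * np.fft.fft(x)))

        def rmatvec(x):           # C.T @ x uses the conjugate spectrum
            return np.real(np.fft.ifft(np.conj(fc) * np.fft.fft(x)))

        C = LinearOperator((n, n), matvec=matvec, rmatvec=rmatvec)

        x_true = rng.standard_normal(n)
        b = C.matvec(x_true)
        x_hat = lsqr(C, b, atol=1e-12, btol=1e-12)[0]

        print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))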

    Network Sketching: Exploiting Binary Structure in Deep CNNs

    Convolutional neural networks (CNNs) with deep architectures have substantially advanced the state of the art in computer vision tasks. However, deep networks are typically resource-intensive and thus difficult to deploy on mobile devices. Recently, CNNs with binary weights have shown compelling efficiency, whereas the accuracy of such models is usually unsatisfactory in practice. In this paper, we introduce network sketching as a novel technique for obtaining binary-weight CNNs, targeting more faithful inference and a better trade-off for practical applications. Our basic idea is to exploit binary structure directly in pre-trained filter banks and produce binary-weight models via tensor expansion. The whole process can be treated as a coarse-to-fine model approximation, akin to the pencil-drawing steps of outlining and shading. To further speed up the generated models, namely the sketches, we also propose an associative implementation of binary tensor convolutions. Experimental results demonstrate that a proper sketch of AlexNet (or ResNet) outperforms the existing binary-weight models by large margins on the ImageNet large-scale classification task, while the committed memory for network parameters increases only slightly. Comment: To appear in CVPR201
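
    The "exploit binary structure in pre-trained filter banks via tensor expansion" idea can be sketched, under simplifying assumptions, as a greedy residual fit: approximate a real weight tensor by a short sum of scaled {-1, +1} tensors. The paper additionally refines the scales jointly and proposes an associative implementation of the binary convolutions; none of that is reproduced here.

        # Greedy binary expansion of a pre-trained filter bank: W ~= sum_i a_i * B_i
        # with each B_i a {-1, +1} tensor (coarse-to-fine approximation).
        import numpy as np

        def binary_sketch(W, num_terms=4):
            residual = W.copy()
            scales, binaries = [], []
            for _ in range(num_terms):
                B = np.sign(residual)
                B[B == 0] = 1.0                 # avoid zero entries in the binary tensor
                a = np.abs(residual).mean()     # least-squares scale for a +/-1 tensor
                scales.append(a)
                binaries.append(B)
                residual -= a * B
            return scales, binaries

        rng = np.random.default_rng(3)
        W = rng.standard_normal((64, 3, 3, 3))  # toy convolutional filter bank

        scales, binaries = binary_sketch(W)
        W_hat = sum(a * B for a, B in zip(scales, binaries))
        print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))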

    FedNS: A Fast Sketching Newton-Type Algorithm for Federated Learning

    Recent Newton-type federated learning algorithms have demonstrated linear convergence with respect to the communication rounds. However, communicating Hessian matrices is often infeasible due to their quadratic communication complexity. In this paper, we introduce a novel approach to tackle this issue while still achieving fast convergence rates. Our proposed method, named Federated Newton Sketch (FedNS), approximates the centralized Newton's method by communicating the sketched square-root Hessian instead of the exact Hessian. To enhance communication efficiency, we reduce the sketch size to match the effective dimension of the Hessian matrix. We provide a convergence analysis, based on statistical learning theory, for the federated Newton sketch approaches. Specifically, our approaches reach super-linear convergence rates w.r.t. the communication rounds for the first time. We validate the effectiveness of our algorithms through various experiments, which are consistent with our theoretical findings. Comment: Accepted at AAAI 202
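
    A toy sketch of the communication pattern, under strong simplifying assumptions (linear least-squares clients, Gaussian sketches, full participation); it is not the FedNS algorithm itself. Each client sends an exact gradient plus a sketched square-root Hessian of small sketch size; the server rebuilds an approximate Hessian from the sketches and takes a Newton-type step.

        # Federated Newton-type step from sketched square-root Hessians (toy example).
        import numpy as np

        rng = np.random.default_rng(4)
        d, clients, n_k, sketch_size, lam = 20, 5, 400, 60, 1e-2

        theta_true = rng.standard_normal(d)
        data = []
        for _ in range(clients):
            X = rng.standard_normal((n_k, d))
            y = X @ theta_true + 0.05 * rng.standard_normal(n_k)
            data.append((X, y))

        theta = np.zeros(d)
        for rnd in range(5):
            grads, roots = [], []
            for X, y in data:
                grads.append(X.T @ (X @ theta - y) / n_k + lam * theta)
                S = rng.standard_normal((sketch_size, n_k)) / np.sqrt(sketch_size)
                roots.append(S @ X / np.sqrt(n_k))      # sketched square-root Hessian
            grad = np.mean(grads, axis=0)
            H = sum(R.T @ R for R in roots) / clients + lam * np.eye(d)
            theta = theta - np.linalg.solve(H, grad)    # Newton-type update
            print(f"round {rnd}: parameter error = {np.linalg.norm(theta - theta_true):.2e}")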

    Training (Overparametrized) Neural Networks in Near-Linear Time

    The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size $n$), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an $O(mn^2)$-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width $m$. We show how to speed up the algorithm of [CGH+19], achieving an $\tilde{O}(mn)$-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension ($mn$) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of $M$, allowing us to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra -- which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) -- can be carried over to the realm of deep learning as well.
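
    A hedged illustration of the sketch-to-precondition step, with a plain Gaussian sketch standing in for the Fast-JL transform and a generic ill-conditioned least-squares problem standing in for the Gauss-Newton $\ell_2$-regression: QR-factorize the sketched matrix, use the R factor as a right preconditioner, and let a first-order Krylov solver (LSQR) converge in a handful of iterations.

        # Sketch-and-precondition for an l2-regression step: condition the problem
        # with the R factor of a QR of the sketched matrix, then run LSQR.
        import numpy as np
        from scipy.sparse.linalg import LinearOperator, lsqr

        rng = np.random.default_rng(5)
        n, d = 10000, 100
        A = rng.standard_normal((n, d)) * np.logspace(0, -4, d)    # ill-conditioned
        x_true = rng.standard_normal(d)
        b = A @ x_true + 1e-3 * rng.standard_normal(n)

        s = 4 * d                                      # sketch size (illustrative)
        S = rng.standard_normal((s, n)) / np.sqrt(s)   # Gaussian sketch (Fast-JL stand-in)
        _, R = np.linalg.qr(S @ A)                     # R acts as a right preconditioner

        M = LinearOperator((n, d),
                           matvec=lambda v: A @ np.linalg.solve(R, v),
                           rmatvec=lambda u: np.linalg.solve(R.T, A.T @ u))

        out = lsqr(M, b, atol=1e-12, btol=1e-12)
        x_hat = np.linalg.solve(R, out[0])             # undo the preconditioning
        print("LSQR iterations:", out[2])
        print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))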