Learning with SGD and Random Features
Sketching and stochastic gradient methods are arguably the most common
techniques to derive efficient large-scale learning algorithms. In this paper,
we investigate their application in the context of nonparametric statistical
learning. More precisely, we study the estimator defined by stochastic gradient
with mini-batches and random features. The latter can be seen as a form of
nonlinear sketching and used to define approximate kernel methods. The
considered estimator is not explicitly penalized/constrained and regularization
is implicit. Indeed, our study highlights how different parameters, such as the
number of features, iterations, step size, and mini-batch size, control the
learning properties of the solutions. We do this by deriving optimal finite
sample bounds, under standard assumptions. The obtained results are
corroborated and illustrated by numerical experiments.
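As a rough illustration of the kind of estimator studied here, the following sketch trains an unpenalized random-features model with mini-batch SGD on synthetic data. The bandwidth, feature count, step size, batch size, and epoch count are illustrative choices, not the paper's tuned values; regularization comes only from those knobs, as the abstract describes.

```python
import numpy as np

def rff(X, W, b):
    """Random Fourier features approximating the Gaussian (RBF) kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
n, d, m = 2000, 5, 300                   # samples, input dim, random features
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

W = rng.standard_normal((d, m))          # unit bandwidth, illustrative choice
b = rng.uniform(0.0, 2.0 * np.pi, m)
Z = rff(X, W, b)                         # n x m nonlinear sketch of the data

theta = np.zeros(m)
step_size, batch, epochs = 1.0, 32, 50   # the knobs that regularize implicitly
for _ in range(epochs):
    for idx in np.array_split(rng.permutation(n), n // batch):
        residual = Z[idx] @ theta - y[idx]
        theta -= step_size * Z[idx].T @ residual / len(idx)  # no explicit penalty

print("train MSE:", np.mean((Z @ theta - y) ** 2))
```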
Practical sketching algorithms for low-rank matrix approximation
This paper describes a suite of algorithms for constructing low-rank
approximations of an input matrix from a random linear image of the matrix,
called a sketch. These methods can preserve structural properties of the input
matrix, such as positive-semidefiniteness, and they can produce approximations
with a user-specified rank. The algorithms are simple, accurate, numerically
stable, and provably correct. Moreover, each method is accompanied by an
informative error bound that allows users to select parameters a priori to
achieve a given approximation quality. These claims are supported by numerical
experiments with real and synthetic data.
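The basic one-sided variant of this idea is simple to state: sketch the range of the matrix with a random test matrix, orthonormalize, and project. The sketch below follows that recipe with a Gaussian test matrix and a user-specified target rank; the paper's suite additionally covers single-pass and structure-preserving (e.g., PSD) variants not shown here, and this version revisits the input once to form the projection.

```python
import numpy as np

def sketch_low_rank(A, rank, oversample=10, rng=None):
    """One-sided randomized low-rank approximation: capture the range of A
    from a random linear image (sketch) Y = A @ Omega, project A onto that
    range, then truncate to the requested rank via a small SVD."""
    rng = rng or np.random.default_rng()
    Omega = rng.standard_normal((A.shape[1], rank + oversample))
    Y = A @ Omega                         # the sketch
    Q, _ = np.linalg.qr(Y)                # orthonormal basis for the range
    B = Q.T @ A                           # small projected matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U[:, :rank], s[:rank], Vt[:rank]   # A ~= (QU) diag(s) Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 80)) @ rng.standard_normal((80, 400))  # rank 80
U, s, Vt = sketch_low_rank(A, rank=80, rng=rng)
err = np.linalg.norm(A - U @ (s[:, None] * Vt)) / np.linalg.norm(A)
print(f"relative error: {err:.2e}")
```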
Sublinear Time Numerical Linear Algebra for Structured Matrices
We show how to solve a number of problems in numerical linear algebra, such
as least squares regression, $\ell_p$-regression for any $p \geq 1$, low rank
approximation, and kernel regression, in time $T(A) \cdot \poly(\log(nd))$,
where for a given input matrix $A \in \mathbb{R}^{n \times d}$, $T(A)$ is the
time needed to compute $Ay$ for an arbitrary vector $y \in \mathbb{R}^d$. Since
$T(A) \leq O(\nnz(A))$, where $\nnz(A)$ denotes the number of non-zero entries
of $A$, the time is no worse, up to polylogarithmic factors, than all of the
recent advances for such problems that run in input-sparsity time. However, for
many applications, $T(A)$ can be much smaller than $\nnz(A)$, yielding
significantly sublinear time algorithms. For example, in the overconstrained
$(1+\epsilon)$-approximate polynomial interpolation problem, $A$ is a
Vandermonde matrix and $T(A) = O(n \log n)$; in this case our running time is
$n \cdot \poly(\log n) + \poly(d/\epsilon)$ and we recover the results of
\cite{avron2013sketching} as a special case. For overconstrained
autoregression, which is a common problem arising in dynamical systems,
$T(A) = O(n \log n)$, and we immediately obtain $n \cdot \poly(\log n) +
\poly(d/\epsilon)$ time. For kernel autoregression, we significantly improve
the running time of prior algorithms for general kernels. For the important
case of autoregression with the polynomial kernel and arbitrary target vector
$b$, we obtain even faster algorithms. Our algorithms show that,
perhaps surprisingly, most of these optimization problems do not require much
more time than that of a polylogarithmic number of matrix-vector
multiplications.
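To make the access model concrete, here is a minimal sketch-and-solve least-squares routine that touches $A$ only through matrix-vector products, the only operation the abstract assumes. A dense Gaussian sketch is used for simplicity, so the number of products is $O(d/\epsilon)$ rather than the polylogarithmic counts the paper achieves with structured transforms; the Vandermonde example below uses plain dense products where a fast $O(n \log n)$ matvec would apply.

```python
import numpy as np

def sketched_lstsq(matvec_AT, b, n, d, eps=0.25, rng=None):
    """Sketch-and-solve least squares, accessing A only via products with A^T:
    row i of the sketch S @ A is A^T applied to row i of S. A Gaussian sketch
    stands in here for the paper's structured (FFT-based) transforms."""
    rng = rng or np.random.default_rng()
    m = int(np.ceil(4 * d / eps))                   # illustrative sketch size
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SA = np.stack([matvec_AT(S[i]) for i in range(m)])  # m products with A^T
    x, *_ = np.linalg.lstsq(SA, S @ b, rcond=None)
    return x

# Overconstrained polynomial interpolation: A is a Vandermonde matrix.
rng = np.random.default_rng(0)
n, d = 4000, 8
A = np.vander(np.linspace(-1, 1, n), d, increasing=True)
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

x_hat = sketched_lstsq(lambda v: A.T @ v, b, n, d, rng=rng)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
print("residual ratio:",
      np.linalg.norm(A @ x_hat - b) / np.linalg.norm(A @ x_opt - b))
```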
Network Sketching: Exploiting Binary Structure in Deep CNNs
Convolutional neural networks (CNNs) with deep architectures have
substantially advanced the state-of-the-art in computer vision tasks. However,
deep networks are typically resource-intensive and thus difficult to deploy on
mobile devices. Recently, CNNs with binary weights have shown compelling
efficiency, but the accuracy of such models is usually unsatisfactory in
practice. In this paper, we introduce network sketching as a novel technique
for pursuing binary-weight CNNs, targeting more faithful inference and a better
trade-off for practical applications. Our basic idea is to exploit binary
structure directly in pre-trained filter banks and produce binary-weight models
via tensor expansion. The whole process can be treated as a coarse-to-fine
model approximation, akin to the pencil drawing steps of outlining and shading.
To further speed up the generated models, namely the sketches, we also propose
an associative implementation of binary tensor convolutions. Experimental
results demonstrate that a proper sketch of AlexNet (or ResNet) outperforms the
existing binary-weight models by large margins on the ImageNet large-scale
classification task, while requiring only slightly more memory for network
parameters.
Comment: To appear in CVPR201
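A minimal sketch of the coarse-to-fine idea, assuming the simple greedy expansion: each binary tensor is fitted to the residual left by the previous terms, with a closed-form scale. The filter shape and term counts below are illustrative; the paper's refined variants (e.g., refitting all scales jointly) and the associative convolution speedup are not shown.

```python
import numpy as np

def binary_sketch(W, num_terms):
    """Greedily expand a real-valued filter tensor W as a sum of scaled binary
    tensors, W ~= sum_j a_j * B_j with B_j in {-1, +1}: each term fits the
    residual left by the previous ones (outlining, then shading)."""
    R = W.copy()
    scales, binaries = [], []
    for _ in range(num_terms):
        B = np.where(R >= 0, 1.0, -1.0)   # binary tensor closest to residual
        a = np.abs(R).mean()              # least-squares scale: <B,R> / <B,B>
        scales.append(a)
        binaries.append(B)
        R -= a * B                        # refine on what is still unexplained
    return np.array(scales), np.stack(binaries)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 3, 3, 3))    # e.g., a pre-trained conv filter bank
for m in (1, 2, 3, 4):
    a, B = binary_sketch(W, m)
    approx = np.tensordot(a, B, axes=1)   # sum_j a_j * B_j
    err = np.linalg.norm(W - approx) / np.linalg.norm(W)
    print(f"{m} binary terms -> relative error {err:.3f}")
```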
FedNS: A Fast Sketching Newton-Type Algorithm for Federated Learning
Recent Newton-type federated learning algorithms have demonstrated linear
convergence with respect to the communication rounds. However, communicating
Hessian matrices is often infeasible due to their quadratic communication
complexity. In this paper, we introduce a novel approach to tackle this issue
while still achieving fast convergence rates. Our proposed method, named
Federated Newton Sketch (FedNS), approximates the centralized Newton's method
by communicating the sketched square-root Hessian instead of the exact Hessian.
To enhance communication efficiency, we reduce the sketch size to match the
effective dimension of the Hessian matrix. We provide a convergence analysis
based on statistical learning for the federated Newton sketch approaches.
Specifically, our approaches reach super-linear convergence rates with respect
to the communication rounds for the first time. We validate the effectiveness
of our algorithms through various experiments, which corroborate our
theoretical findings.
Comment: Accepted at AAAI 202
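As a rough, hypothetical rendering of the communication pattern (not the authors' implementation), the following simulates federated Newton steps for ridge-regularized least squares: each client uploads its local gradient and an $m \times d$ Gaussian sketch of its square-root Hessian, and the server assembles an approximate Hessian from the sketches. The sketch size `m` stands in for the effective dimension.

```python
import numpy as np

def client_message(X, y, w, m, lam, rng):
    """One client's upload: the local gradient plus a sketched square-root
    Hessian. For ridge-regularized least squares the square-root Hessian is X
    itself, so an m x d sketch S @ X is sent instead of a d x d exact Hessian."""
    n = X.shape[0]
    grad = X.T @ (X @ w - y) / n + lam * w
    S = rng.standard_normal((m, n)) / np.sqrt(m * n)  # Gaussian sketch (assumed)
    return grad, S @ X

rng = np.random.default_rng(0)
d, m, lam = 20, 60, 1e-2        # sketch size m plays the role of the eff. dim.
w_true = rng.standard_normal(d)
clients = []
for _ in range(4):
    X = rng.standard_normal((500, d))
    clients.append((X, X @ w_true + 0.1 * rng.standard_normal(500)))

w = np.zeros(d)
for _ in range(10):
    msgs = [client_message(X, y, w, m, lam, rng) for X, y in clients]
    grad = np.mean([g for g, _ in msgs], axis=0)
    # Server: rebuild an approximate Hessian from the sketched square roots
    H = np.mean([SX.T @ SX for _, SX in msgs], axis=0) + lam * np.eye(d)
    w -= np.linalg.solve(H, grad)          # approximate Newton step
print("distance to w_true:", np.linalg.norm(w - w_true))
```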
Training (Overparametrized) Neural Networks in Near-Linear Time
The slow convergence rate and pathological curvature issues of first-order
gradient methods for training deep neural networks, initiated an ongoing effort
for developing faster second-order optimization algorithms beyond SGD, without
compromising the generalization error. Despite their remarkable convergence
rate (independent of the training batch size $n$), second-order algorithms
incur a daunting slowdown in the cost per iteration (inverting the Hessian
matrix of the loss function), which renders them impractical. Very recently,
this computational overhead was mitigated by the works of [ZMG19, CGH+19],
yielding an $O(mn^2)$-time second-order algorithm for training two-layer
overparametrized neural networks of polynomial width $m$.
We show how to speed up the algorithm of [CGH+19], achieving an
$\widetilde{O}(mn)$-time backpropagation algorithm for training (mildly
overparametrized) ReLU networks, which is near-linear in the dimension ($mn$)
of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to
reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and
then use a Fast-JL type dimension reduction to precondition the underlying Gram
matrix in time independent of the condition number, allowing to find a
sufficiently good approximate solution via first-order conjugate gradient. Our
result provides a proof-of-concept that advanced machinery from randomized
linear algebra -- which led to recent breakthroughs in convex optimization
(ERM, LPs, Regression) -- can be carried over to the realm of deep learning as
well.
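The sketch-to-precondition pipeline at the heart of this result is classical enough to demonstrate in a few lines: QR-factor a sketched copy of the matrix once, then run conjugate gradient on the preconditioned system, whose condition number is $O(1)$. In the sketch below, a Gaussian sketch stands in for the Fast-JL transform and a plain least-squares problem stands in for the Gauss-Newton regression step.

```python
import numpy as np
from scipy.linalg import solve_triangular

def sketch_precond_lstsq(A, b, iters=25, rng=None):
    """Least squares as l2-regression solved by sketch-preconditioned CG:
    QR-factor S @ A once, so that M = A @ R^{-1} is nearly orthonormal and CG
    on M^T M z = M^T b converges in a few iterations; return x = R^{-1} z."""
    rng = rng or np.random.default_rng()
    n, d = A.shape
    S = rng.standard_normal((4 * d, n)) / np.sqrt(4 * d)
    _, R = np.linalg.qr(S @ A)                     # sketch-based preconditioner

    M = lambda z: A @ solve_triangular(R, z)               # M = A R^{-1}
    Mt = lambda v: solve_triangular(R, A.T @ v, trans='T') # M^T = R^{-T} A^T

    z = np.zeros(d)
    r = Mt(b)                                      # residual of normal equations
    p, rs = r.copy(), r @ r
    for _ in range(iters):
        Mp = Mt(M(p))
        alpha = rs / (p @ Mp)
        z += alpha * p
        r -= alpha * Mp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return solve_triangular(R, z)

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 50))
b = rng.standard_normal(5000)
x = sketch_precond_lstsq(A, b, rng=rng)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print("relative error:", np.linalg.norm(x - x_ref) / np.linalg.norm(x_ref))
```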