592 research outputs found
Improved bounds on sample size for implicit matrix trace estimators
This article is concerned with Monte-Carlo methods for the estimation of the
trace of an implicitly given matrix whose information is only available
through matrix-vector products. Such a method approximates the trace by an
average of expressions of the form w^T A w, with random vectors w drawn
from an appropriate distribution. We prove, discuss and experiment
with bounds on the number of realizations required in order to guarantee a
probabilistic bound on the relative error of the trace estimation upon
employing Rademacher (Hutchinson), Gaussian and uniform unit vector (with and
without replacement) probability distributions.
In total, one necessary bound and six sufficient bounds are proved, improving
upon and extending similar estimates obtained in the seminal work of Avron and
Toledo (2011) in several dimensions. We first improve their bound on the
sample size required for the Hutchinson method, dropping a term that relates
to rank(A) and making the bound comparable with that for the Gaussian estimator.
We further prove new sufficient bounds for the Hutchinson, Gaussian and the
unit vector estimators, as well as a necessary bound for the Gaussian
estimator, which depend more specifically on properties of the matrix A. As
such they may suggest for what type of matrices one distribution or another
provides a particularly effective or relatively ineffective stochastic
estimation method.
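As a rough illustration of the estimators discussed above (a minimal sketch, not the paper's code), the following Python snippet approximates tr(A) from matrix-vector products only, using either Rademacher (Hutchinson) or Gaussian probe vectors; the test matrix, sample size, and random seed are illustrative choices.

```python
import numpy as np

def hutchinson_trace(matvec, n, num_samples, dist="rademacher", seed=None):
    """Estimate tr(A) using only matrix-vector products w -> A @ w.

    matvec      : callable returning A @ w for a length-n vector w
    num_samples : number of random probe vectors (realizations)
    dist        : 'rademacher' (Hutchinson) or 'gaussian' probe distribution
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        if dist == "rademacher":
            w = rng.choice([-1.0, 1.0], size=n)
        else:
            w = rng.standard_normal(n)
        total += w @ matvec(w)          # one realization of w^T A w
    return total / num_samples

# Illustrative check on an explicit SPSD matrix (in practice A is implicit).
rng = np.random.default_rng(0)
B = rng.standard_normal((500, 500))
A = B @ B.T
est = hutchinson_trace(lambda w: A @ w, n=500, num_samples=100, seed=1)
print(abs(est - np.trace(A)) / np.trace(A))   # relative error of the estimate
```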
Schur properties of convolutions of gamma random variables
Sufficient conditions for comparing the convolutions of heterogeneous gamma
random variables in terms of the usual stochastic order are established. Such
comparisons are characterized by the Schur convexity properties of the
cumulative distribution function of the convolutions. Some examples of the
practical applications of our results are given.
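The following Monte Carlo sketch (illustrative parameters only, not the paper's proof technique) compares two convolutions of heterogeneous gamma random variables by their empirical survival functions, which is the quantity ordered under the usual stochastic order characterized above via Schur convexity.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

def gamma_convolution(shapes, rates):
    """Samples of a sum of independent Gamma(shape_i, rate_i) variables."""
    return sum(rng.gamma(a, 1.0 / lam, size=N) for a, lam in zip(shapes, rates))

# Two illustrative rate vectors with equal sums, one majorizing the other.
X = gamma_convolution(shapes=[1.0, 1.0], rates=[1.0, 3.0])
Y = gamma_convolution(shapes=[1.0, 1.0], rates=[2.0, 2.0])

# X <=_st Y in the usual stochastic order would mean P(X > t) <= P(Y > t) for all t.
for t in [0.5, 1.0, 2.0, 4.0]:
    print(t, (X > t).mean(), (Y > t).mean())
```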
Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information
We consider variants of trust-region and cubic regularization methods for
non-convex optimization, in which the Hessian matrix is approximated. Under
mild conditions on the inexact Hessian, and using approximate solution of the
corresponding sub-problems, we provide iteration complexity to achieve ε-approximate second-order optimality, which has been shown to be tight.
Our Hessian approximation conditions constitute a major relaxation over the
existing ones in the literature. Consequently, we are able to show that such
mild conditions allow for the construction of the approximate Hessian through
various random sampling methods. In this light, we consider the canonical
problem of finite-sum minimization, provide appropriate uniform and non-uniform
sub-sampling strategies to construct such Hessian approximations, and obtain
optimal iteration complexity for the corresponding sub-sampled trust-region and
cubic regularization methods. Comment: 32 pages.
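As a hedged sketch of the kind of Hessian approximation the abstract describes, the snippet below sub-samples component Hessians of a finite-sum problem uniformly and uses the resulting inexact Hessian inside a simplified trust-region step (a Cauchy-point step, a stand-in for the full sub-problem solver); the quadratic test problem, batch size, and fixed trust-region radius are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 10_000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def full_gradient(x):
    # f(x) = (1/2n) * sum_i (a_i^T x - b_i)^2; exact gradient of the finite sum
    return A.T @ (A @ x - b) / n

def subsampled_hessian(batch_size):
    # Uniformly sub-sample component Hessians a_i a_i^T to form an inexact Hessian.
    idx = rng.choice(n, size=batch_size, replace=False)
    return A[idx].T @ A[idx] / batch_size

def cauchy_step(g, H, radius):
    # Minimize the quadratic model g^T s + 0.5 s^T H s along -g within the radius.
    t = radius / np.linalg.norm(g)
    gHg = g @ H @ g
    if gHg > 0:
        t = min(t, (g @ g) / gHg)
    return -t * g

x, radius = np.zeros(d), 1.0
for _ in range(20):
    g = full_gradient(x)
    H = subsampled_hessian(batch_size=200)   # inexact Hessian from sub-sampling
    x = x + cauchy_step(g, H, radius)
print(np.linalg.norm(full_gradient(x)))      # gradient norm after 20 steps
```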
Implicit Langevin Algorithms for Sampling From Log-concave Densities
For sampling from a log-concave density, we study implicit integrators
resulting from θ-method discretization of the overdamped Langevin
diffusion stochastic differential equation. Theoretical and algorithmic
properties of the resulting sampling methods for θ ∈ [0, 1] and a
range of step sizes are established. Our results generalize and extend prior
works in several directions. In particular, for θ ≥ 1/2, we prove
geometric ergodicity and stability of the resulting methods for all step sizes.
We show that obtaining subsequent samples amounts to solving a strongly-convex
optimization problem, which is readily achievable using one of numerous
existing methods. Numerical examples supporting our theoretical analysis are
also presented.
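A minimal sketch of one θ-method step is given below, assuming the standard overdamped Langevin SDE dX = -∇f(X) dt + √2 dW; the target potential, step size, θ = 1, and the use of scipy.optimize.minimize as the inner solver are illustrative choices, the point being that the implicit update reduces to a strongly-convex minimization as stated in the abstract.

```python
import numpy as np
from scipy.optimize import minimize

# Target: log-concave density pi(x) proportional to exp(-f(x)), f strongly convex.
def f(x):
    return 0.5 * np.sum(x ** 2) + np.logaddexp(0.0, x[0])

def grad_f(x):
    g = x.copy()
    g[0] += 1.0 / (1.0 + np.exp(-x[0]))
    return g

def theta_method_step(x, h, theta, rng):
    """One step of x_new = x - h*[theta*grad_f(x_new) + (1-theta)*grad_f(x)] + sqrt(2h)*xi.

    The implicit equation is the optimality condition of the strongly convex problem
        min_z  theta*h*f(z) + 0.5*||z - v||^2,
    with v = x - h*(1-theta)*grad_f(x) + sqrt(2h)*xi.
    """
    xi = rng.standard_normal(x.shape)
    v = x - h * (1.0 - theta) * grad_f(x) + np.sqrt(2.0 * h) * xi
    obj = lambda z: theta * h * f(z) + 0.5 * np.sum((z - v) ** 2)
    jac = lambda z: theta * h * grad_f(z) + (z - v)
    return minimize(obj, x0=v, jac=jac, method="L-BFGS-B").x

rng = np.random.default_rng(3)
x, samples = np.zeros(2), []
for _ in range(5_000):
    x = theta_method_step(x, h=0.5, theta=1.0, rng=rng)   # fully implicit step
    samples.append(x.copy())
print(np.mean(samples, axis=0))   # crude check of the empirical mean
```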
Invariance of Weight Distributions in Rectified MLPs
An interesting approach to analyzing neural networks that has received
renewed attention is to examine the equivalent kernel of the neural network.
This is based on the fact that a fully connected feedforward network with one
hidden layer, a certain weight distribution, an activation function, and an
infinite number of neurons can be viewed as a mapping into a Hilbert space. We
derive the equivalent kernels of MLPs with ReLU or Leaky ReLU activations for
all rotationally-invariant weight distributions, generalizing a previous result
that required Gaussian weight distributions. Additionally, the Central Limit
Theorem is used to show that for certain activation functions, kernels
corresponding to layers with weight distributions having zero mean and finite
absolute third moment are asymptotically universal, and are well approximated
by the kernel corresponding to layers with spherical Gaussian weights. In deep
networks, as depth increases the equivalent kernel approaches a pathological
fixed point, which can be used to argue why training randomly initialized
networks can be difficult. Our results also have implications for weight
initialization. Comment: ICML 2018.
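For the special case of standard Gaussian weights, the single-hidden-layer ReLU equivalent kernel has the known closed form of the degree-1 arc-cosine kernel; the sketch below (illustrative dimensions and widths, not the paper's code) compares that closed form with a finite-width Monte Carlo estimate built from random ReLU features.

```python
import numpy as np

def relu_equiv_kernel(x, y):
    """Degree-1 arc-cosine kernel: E[relu(w.x) relu(w.y)] for w ~ N(0, I)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    t = np.arccos(cos_t)
    return nx * ny * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2.0 * np.pi)

def finite_width_kernel(x, y, num_neurons, rng):
    """Monte Carlo estimate from a finite hidden layer with Gaussian weights."""
    W = rng.standard_normal((num_neurons, x.size))
    return np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))

rng = np.random.default_rng(4)
x, y = rng.standard_normal(10), rng.standard_normal(10)
print(relu_equiv_kernel(x, y), finite_width_kernel(x, y, num_neurons=200_000, rng=rng))
```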
Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables
This article considers stochastic algorithms for efficiently solving a class
of large scale non-linear least squares (NLS) problems which frequently arise
in applications. We propose eight variants of a practical randomized algorithm
where the uncertainties in the major stochastic steps are quantified. Such
stochastic steps involve approximating the NLS objective function using
Monte-Carlo methods, and this is equivalent to the estimation of the trace of
corresponding symmetric positive semi-definite (SPSD) matrices. For the latter,
we prove tight necessary and sufficient conditions on the sample size (which
translates to cost) to satisfy the prescribed probabilistic accuracy. We show
that these conditions are practically computable and yield small sample sizes.
They are then incorporated in our stochastic algorithm to quantify the
uncertainty in each randomized step. The bounds we use are applications of more
general results regarding extremal tail probabilities of linear combinations of
gamma distributed random variables. We derive and prove new results concerning
the maximal and minimal tail probabilities of such linear combinations, which
can be considered independently of the rest of this paper.
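The connection between the Monte Carlo objective approximation and SPSD trace estimation can be illustrated as follows (the random residual matrix and Rademacher mixing vectors are illustrative assumptions, not the paper's full algorithm): the NLS misfit, a sum of squared residuals, equals tr(R^T R) and is estimated by averaging ||R w||^2 over random mixing vectors w.

```python
import numpy as np

rng = np.random.default_rng(5)
n_sources, n_data = 1_000, 300
R = rng.standard_normal((n_data, n_sources))   # column i: residual for source i

exact = np.sum(R ** 2)                         # full misfit = trace(R^T R)

def mc_misfit(num_samples):
    # Average ||R w||^2 over Rademacher mixing vectors w: the Hutchinson
    # trace estimator applied to the SPSD matrix R^T R.
    est = 0.0
    for _ in range(num_samples):
        w = rng.choice([-1.0, 1.0], size=n_sources)
        est += np.sum((R @ w) ** 2)
    return est / num_samples

print(exact, mc_misfit(num_samples=50))
```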
Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study
While first-order optimization methods such as stochastic gradient descent
(SGD) are popular in machine learning (ML), they come with well-known
deficiencies, including relatively-slow convergence, sensitivity to the
settings of hyper-parameters such as learning rate, stagnation at high training
errors, and difficulty in escaping flat regions and saddle points. These issues
are particularly acute in highly non-convex settings such as those arising in
neural networks. Motivated by this, there has been recent interest in
second-order methods that aim to alleviate these shortcomings by capturing
curvature information. In this paper, we report detailed empirical evaluations
of a class of Newton-type methods, namely sub-sampled variants of trust region
(TR) and adaptive regularization with cubics (ARC) algorithms, for non-convex
ML problems. In doing so, we demonstrate that these methods not only can be
computationally competitive with hand-tuned SGD with momentum, obtaining
comparable or better generalization performance, but also they are highly
robust to hyper-parameter settings. Further, in contrast to SGD with momentum,
we show that the manner in which these Newton-type methods employ curvature
information allows them to seamlessly escape flat regions and saddle points.Comment: 21 pages, 11 figures. Restructure the paper and add experiment