Improved bounds on sample size for implicit matrix trace estimators
This article is concerned with Monte-Carlo methods for the estimation of the
trace of an implicitly given matrix whose information is only available
through matrix-vector products. Such a method approximates the trace by an
average of expressions of the form $w^T (A w)$, with random vectors
$w$ drawn from an appropriate distribution. We prove, discuss and experiment
with bounds on the number of realizations required in order to guarantee a
probabilistic bound on the relative error of the trace estimation upon
employing Rademacher (Hutchinson), Gaussian and uniform unit vector (with and
without replacement) probability distributions.
In total, one necessary bound and six sufficient bounds are proved, improving
upon and extending similar estimates obtained in the seminal work of Avron and
Toledo (2011) in several dimensions. We first improve their bound on the
required sample size for the Hutchinson method, dropping a term that relates
to the rank of the matrix and making the bound comparable with that for the
Gaussian estimator.
We further prove new sufficient bounds for the Hutchinson, Gaussian and the
unit vector estimators, as well as a necessary bound for the Gaussian
estimator, which depend more specifically on properties of the matrix $A$. As
such they may suggest for what type of matrices one distribution or another
provides a particularly effective or relatively ineffective stochastic
estimation method.
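For concreteness, here is a minimal sketch of this kind of estimator, assuming NumPy; the matrix is accessed only through matrix-vector products (simulated below by an explicit SPSD matrix), and the function name, sample size, and distribution switch are illustrative rather than the paper's bounds.

```python
import numpy as np

def hutchinson_trace(matvec, n, num_samples, dist="rademacher", rng=None):
    """Estimate trace(A) from matrix-vector products w -> A @ w by
    averaging realizations of w^T (A w) over random probe vectors w."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(num_samples):
        if dist == "rademacher":                    # Hutchinson estimator
            w = rng.choice([-1.0, 1.0], size=n)
        elif dist == "gaussian":                    # Gaussian estimator
            w = rng.standard_normal(n)
        else:
            raise ValueError("unknown distribution")
        total += w @ matvec(w)                      # one realization w^T (A w)
    return total / num_samples

# Illustrative check on an explicit SPSD matrix (in practice A is implicit).
rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
A = B @ B.T                                         # symmetric positive semi-definite
print(hutchinson_trace(lambda w: A @ w, 200, num_samples=100), np.trace(A))
```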
Schur properties of convolutions of gamma random variables
Sufficient conditions for comparing the convolutions of heterogeneous gamma
random variables in terms of the usual stochastic order are established. Such
comparisons are characterized by the Schur convexity properties of the
cumulative distribution function of the convolutions. Some examples of the
practical applications of our results are given.
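For reference, the two standard notions used above can be written as follows (textbook definitions, not the paper's specific sufficient conditions):

```latex
% Usual stochastic order between random variables X and Y:
X \le_{\mathrm{st}} Y
\;\Longleftrightarrow\;
\Pr(X > t) \le \Pr(Y > t) \quad \text{for all } t \in \mathbb{R}.

% Schur convexity of a symmetric function \phi of the parameter vector:
\boldsymbol{a} \prec \boldsymbol{b} \ \text{(majorization)}
\;\Longrightarrow\;
\phi(\boldsymbol{a}) \le \phi(\boldsymbol{b}),
% so Schur convexity (or concavity) of the convolution's cdf in the gamma
% parameters translates into usual stochastic ordering of the convolutions.
```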
Optimization Methods for Inverse Problems
Optimization plays an important role in solving many inverse problems.
Indeed, the task of inversion often either involves or is fully cast as a
solution of an optimization problem. In this light, the sheer non-linear,
non-convex, and large-scale nature of many of these inversions gives rise to
some very challenging optimization problems. The inverse problem community has
long been developing various techniques for solving such optimization tasks.
However, other, seemingly disjoint communities, such as that of machine
learning, have developed, almost in parallel, interesting alternative methods
which might have stayed under the radar of the inverse problem community. In
this survey, we aim to change that. In doing so, we first discuss current
state-of-the-art optimization methods widely used in inverse problems. We then
survey recent related advances in addressing similar challenges in problems
faced by the machine learning community, and discuss their potential advantages
for solving inverse problems. By highlighting the similarities among the
optimization challenges faced by the inverse problem and the machine learning
communities, we hope that this survey can serve as a bridge in bringing
together these two communities and encourage cross-fertilization of ideas.
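As an illustrative instance of what "casting inversion as an optimization problem" means (a generic Tikhonov-type formulation, not one taken from the survey):

```latex
% Recover model parameters m from observed data d, given a (typically
% non-linear, often PDE-based) forward operator F, by minimizing a data
% misfit plus a regularizer R weighted by \alpha > 0:
\min_{m}\; \tfrac{1}{2}\,\bigl\| F(m) - d \bigr\|_2^2 \;+\; \alpha\, R(m).
```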
Invariance of Weight Distributions in Rectified MLPs
An interesting approach to analyzing neural networks that has received
renewed attention is to examine the equivalent kernel of the neural network.
This is based on the fact that a fully connected feedforward network with one
hidden layer, a certain weight distribution, an activation function, and an
infinite number of neurons can be viewed as a mapping into a Hilbert space. We
derive the equivalent kernels of MLPs with ReLU or Leaky ReLU activations for
all rotationally-invariant weight distributions, generalizing a previous result
that required Gaussian weight distributions. Additionally, the Central Limit
Theorem is used to show that for certain activation functions, kernels
corresponding to layers with weight distributions having zero mean and finite
absolute third moment are asymptotically universal, and are well approximated
by the kernel corresponding to layers with spherical Gaussian weights. In deep
networks, as depth increases the equivalent kernel approaches a pathological
fixed point, which can be used to argue why training randomly initialized
networks can be difficult. Our results also have implications for weight
initialization.
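A small numerical sketch of the equivalent-kernel idea in the Gaussian special case; the closed-form first-order arc-cosine kernel used here is a previously known result for Gaussian weights, which the paper generalizes to rotationally-invariant distributions. Function names and sizes are illustrative.

```python
import numpy as np

def relu_kernel_gaussian(x, y):
    """Equivalent kernel of a ReLU layer with standard Gaussian weights:
    E_w[relu(w.x) * relu(w.y)] for w ~ N(0, I) (first-order arc-cosine kernel)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return nx * ny * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def relu_kernel_wide_layer(x, y, width=200_000, rng=None):
    """Finite-width approximation: average of relu(Wx) * relu(Wy) over the
    hidden units of a random ReLU layer with spherical Gaussian weights."""
    rng = np.random.default_rng(rng)
    W = rng.standard_normal((width, x.size))
    return np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))

x = np.array([1.0, -0.5, 2.0])
y = np.array([0.3, 1.5, -1.0])
print(relu_kernel_gaussian(x, y), relu_kernel_wide_layer(x, y, rng=0))
```

As the width grows, the empirical average converges to the closed form, which is the sense in which an infinitely wide random layer defines a mapping into a Hilbert space with this equivalent kernel.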
Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study
While first-order optimization methods such as stochastic gradient descent
(SGD) are popular in machine learning (ML), they come with well-known
deficiencies, including relatively slow convergence, sensitivity to the
settings of hyper-parameters such as learning rate, stagnation at high training
errors, and difficulty in escaping flat regions and saddle points. These issues
are particularly acute in highly non-convex settings such as those arising in
neural networks. Motivated by this, there has been recent interest in
second-order methods that aim to alleviate these shortcomings by capturing
curvature information. In this paper, we report detailed empirical evaluations
of a class of Newton-type methods, namely sub-sampled variants of trust region
(TR) and adaptive regularization with cubics (ARC) algorithms, for non-convex
ML problems. In doing so, we demonstrate that these methods not only can be
computationally competitive with hand-tuned SGD with momentum, obtaining
comparable or better generalization performance, but also they are highly
robust to hyper-parameter settings. Further, in contrast to SGD with momentum,
we show that the manner in which these Newton-type methods employ curvature
information allows them to seamlessly escape flat regions and saddle points.
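As a rough illustration of the sub-sampled Newton-type machinery involved (not the paper's exact TR or ARC algorithms): below is a sketch of a trust-region subproblem solved with Steihaug-style truncated CG, where curvature enters only through Hessian-vector products on a random subsample. The interface is assumed, the acceptance test and radius update of a full TR loop are omitted, and ARC would replace the ball constraint with a cubic penalty.

```python
import numpy as np

def subsampled_tr_step(grad, hess_vec_batch, x, radius, sample_idx, cg_iters=20):
    """Approximately minimize the quadratic model g^T p + 0.5 p^T H_S p over
    ||p|| <= radius, where H_S is the Hessian on the subsample `sample_idx`
    and is touched only through Hessian-vector products (Steihaug CG)."""
    g = grad(x)                                      # full (or large-batch) gradient
    hv = lambda v: hess_vec_batch(x, v, sample_idx)  # sub-sampled Hessian-vector product
    p = np.zeros_like(x)
    r, d = g.copy(), -g.copy()
    for _ in range(cg_iters):
        Hd = hv(d)
        dHd = d @ Hd
        if dHd <= 0.0:                               # negative curvature: follow d to the boundary
            return p + _to_boundary(p, d, radius) * d
        alpha = (r @ r) / dHd
        if np.linalg.norm(p + alpha * d) >= radius:  # step leaves the trust region
            return p + _to_boundary(p, d, radius) * d
        p = p + alpha * d
        r_new = r + alpha * Hd
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

def _to_boundary(p, d, radius):
    """Positive tau such that ||p + tau * d|| = radius."""
    a, b, c = d @ d, 2.0 * (p @ d), p @ p - radius ** 2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
```

A full method would then accept or reject the step based on the ratio of actual to predicted reduction and grow or shrink the radius accordingly.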
Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables
This article considers stochastic algorithms for efficiently solving a class
of large scale non-linear least squares (NLS) problems which frequently arise
in applications. We propose eight variants of a practical randomized algorithm
where the uncertainties in the major stochastic steps are quantified. Such
stochastic steps involve approximating the NLS objective function using
Monte-Carlo methods, and this is equivalent to the estimation of the trace of
corresponding symmetric positive semi-definite (SPSD) matrices. For the latter,
we prove tight necessary and sufficient conditions on the sample size (which
translates to cost) to satisfy the prescribed probabilistic accuracy. We show
that these conditions are practically computable and yield small sample sizes.
They are then incorporated in our stochastic algorithm to quantify the
uncertainty in each randomized step. The bounds we use are applications of more
general results regarding extremal tail probabilities of linear combinations of
gamma distributed random variables. We derive and prove new results concerning
the maximal and minimal tail probabilities of such linear combinations, which
can be considered independently of the rest of this paper.
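To make the link between the NLS misfit and SPSD trace estimation concrete, here is an illustrative sketch (not the paper's algorithm or its sample-size bounds), assuming the per-experiment residuals form the columns of a matrix R(m): the misfit equals trace(R^T R) and can be estimated from a few aggregated residual evaluations R w with random probe vectors w.

```python
import numpy as np

def misfit_exact(R):
    """Full NLS misfit: sum_i ||r_i||^2 = ||R||_F^2 = trace(R^T R)."""
    return np.sum(R ** 2)

def misfit_estimate(residual_matvec, num_experiments, num_samples, rng=None):
    """Monte-Carlo misfit: average of ||R w||^2 over Rademacher probes w,
    i.e. a Hutchinson-type estimate of trace(R^T R) that only needs the
    combined residual R w instead of one residual per experiment."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(num_samples):
        w = rng.choice([-1.0, 1.0], size=num_experiments)
        total += np.sum(residual_matvec(w) ** 2)
    return total / num_samples

# Illustrative check with an explicit residual matrix (in practice each column
# of R requires an expensive forward simulation, so only products R w are formed).
rng = np.random.default_rng(1)
R = rng.standard_normal((500, 64))                   # 500 data points, 64 experiments
print(misfit_exact(R), misfit_estimate(lambda w: R @ w, 64, num_samples=16))
```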
GIANT: Globally Improved Approximate Newton Method for Distributed Optimization
For distributed computing environments, we consider the empirical risk
minimization problem and propose a distributed and communication-efficient
Newton-type optimization method. At every iteration, each worker locally finds
an Approximate NewTon (ANT) direction, which is sent to the main driver. The
main driver, then, averages all the ANT directions received from workers to
form a Globally Improved ANT (GIANT) direction. GIANT is highly
communication efficient and naturally exploits the trade-offs between local
computations and global communications in that more local computations result
in fewer overall rounds of communications. Theoretically, we show that GIANT
enjoys an improved convergence rate as compared with first-order methods and
existing distributed Newton-type methods. Further, and in sharp contrast with
many existing distributed Newton-type methods, as well as popular first-order
methods, a highly advantageous practical feature of GIANT is that it only
involves one tuning parameter. We conduct large-scale experiments on a computer
cluster and, empirically, demonstrate the superior performance of GIANT.
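A minimal single-machine simulation of the GIANT idea (illustrative and simplified relative to the paper, e.g. exact local solves instead of conjugate gradient, a fixed unit step, and an l2-regularized least-squares loss): each worker solves its local Hessian system against the shared global gradient, and the driver averages the resulting ANT directions.

```python
import numpy as np

def giant_direction(X_shards, y_shards, w, lam):
    """One GIANT-style direction for l2-regularized least squares: form the
    global gradient, let each worker solve its local Newton system
    H_i p_i = g, and average the Approximate NewTon (ANT) directions."""
    n_total = sum(X.shape[0] for X in X_shards)
    g = sum(X.T @ (X @ w - y) for X, y in zip(X_shards, y_shards)) / n_total + lam * w
    directions = []
    for X, y in zip(X_shards, y_shards):             # done in parallel on real workers
        H_local = X.T @ X / X.shape[0] + lam * np.eye(w.size)
        directions.append(np.linalg.solve(H_local, g))
    return np.mean(directions, axis=0)               # globally improved ANT direction

# Illustrative usage: a synthetic regression problem split across 4 "workers".
rng = np.random.default_rng(2)
X, y = rng.standard_normal((4000, 20)), rng.standard_normal(4000)
X_shards, y_shards = np.array_split(X, 4), np.array_split(y, 4)
w = np.zeros(20)
for _ in range(5):                                   # Newton-type iterations
    w = w - giant_direction(X_shards, y_shards, w, lam=1e-3)
```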