Stochastic Frank-Wolfe Methods for Nonconvex Optimization
We study Frank-Wolfe methods for nonconvex stochastic and finite-sum
optimization problems. Frank-Wolfe methods (in the convex case) have gained
tremendous recent interest in machine learning and optimization communities due
to their projection-free property and their ability to exploit structured
constraints. However, our understanding of these algorithms in the nonconvex
setting is fairly limited. In this paper, we propose nonconvex stochastic
Frank-Wolfe methods and analyze their convergence properties. For objective
functions that decompose into a finite-sum, we leverage ideas from variance
reduction techniques for convex optimization to obtain new variance-reduced
nonconvex Frank-Wolfe methods that have provably faster convergence than the
classical Frank-Wolfe method. Finally, we show that the faster convergence
rates of our variance-reduced methods also translate into improved convergence
rates for the stochastic setting.
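For illustration only (this sketch is not taken from the paper), a projection-free stochastic Frank-Wolfe step replaces projection with a linear minimization oracle over the constraint set; here the oracle is for the l1 ball, and grad_batch, the batch size, and the step-size schedule are placeholder assumptions.

```python
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle: argmin over the l1 ball of <grad, s>
    is attained at a signed, scaled coordinate vertex."""
    s = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    s[i] = -radius * np.sign(grad[i])
    return s

def stochastic_frank_wolfe(grad_batch, x0, n_iters=1000, batch_size=64, radius=1.0):
    """Sketch of a stochastic Frank-Wolfe loop; x0 must lie in the l1 ball.
    grad_batch(x, b) is assumed to return a mini-batch gradient estimate."""
    x = x0.copy()
    for t in range(n_iters):
        g = grad_batch(x, batch_size)    # stochastic gradient estimate
        s = lmo_l1_ball(g, radius)       # projection-free oracle call
        gamma = 2.0 / (t + 2)            # a common diminishing step size
        x = (1 - gamma) * x + gamma * s  # convex combination stays feasible
    return x
```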
On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives
Nonparametric two-sample testing deals with the question of consistently
deciding if two distributions are different, given samples from both, without
making any parametric assumptions about the form of the distributions. The
current literature is split into two kinds of tests: those which are
consistent without any assumptions about how the distributions may differ
(\textit{general} alternatives), and those which are designed to specifically
test easier alternatives, like a difference in means (\textit{mean-shift}
alternatives).
The main contribution of this paper is to explicitly characterize the power
of a popular nonparametric two-sample test, designed for general alternatives,
under a mean-shift alternative in the high-dimensional setting. Specifically,
we explicitly derive the power of the linear-time Maximum Mean Discrepancy
statistic using the Gaussian kernel, where the dimension and sample size can
both tend to infinity at any rate, and the two distributions differ in their
means. As a corollary, we find that if the signal-to-noise ratio is held
constant, then the test's power goes to one if the number of samples increases
faster than the dimension increases. This is the first explicit power
derivation for a general nonparametric test in the high-dimensional setting,
and also the first analysis of how tests designed for general alternatives
perform when faced with easier ones.
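For context, a minimal sketch of the linear-time MMD estimator with a Gaussian kernel (a standard form of the statistic; the paper's exact conventions, e.g. the bandwidth choice, may differ): the statistic averages a kernel h-statistic over disjoint pairs of samples, so it needs only O(n) kernel evaluations.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

def linear_time_mmd(X, Y, bandwidth=1.0):
    """Linear-time MMD estimate (sketch): average
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    over disjoint pairs of rows of X and Y."""
    n = (min(len(X), len(Y)) // 2) * 2
    vals = []
    for i in range(0, n, 2):
        x1, x2, y1, y2 = X[i], X[i + 1], Y[i], Y[i + 1]
        vals.append(gaussian_kernel(x1, x2, bandwidth)
                    + gaussian_kernel(y1, y2, bandwidth)
                    - gaussian_kernel(x1, y2, bandwidth)
                    - gaussian_kernel(x2, y1, bandwidth))
    return float(np.mean(vals))
```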
On the Decreasing Power of Kernel and Distance based Nonparametric Hypothesis Tests in High Dimensions
This paper is about two related decision theoretic problems, nonparametric
two-sample testing and independence testing. There is a belief that two
recently proposed solutions, based on kernels and distances between pairs of
points, behave well in high-dimensional settings. We identify different sources
of misconception that give rise to the above belief. Specifically, we
differentiate the hardness of estimation of test statistics from the hardness
of testing whether these statistics are zero or not, and explicitly discuss a
notion of "fair" alternative hypotheses for these problems as dimension
increases. We then demonstrate that the power of these tests actually drops
polynomially with increasing dimension against fair alternatives. We end with
some theoretical insights and shed light on the \textit{median heuristic} for
kernel bandwidth selection. Our work advances the current understanding of the
power of modern nonparametric hypothesis tests in high dimensions.
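As an illustrative aside (not the paper's code), the median heuristic mentioned above typically sets the Gaussian kernel bandwidth to the median pairwise distance in the pooled sample:

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_bandwidth(X, Y):
    """Median heuristic (sketch): bandwidth = median Euclidean distance
    over all pairs of points in the pooled sample; variants differ in
    whether squared distances or an extra scaling factor are used."""
    Z = np.vstack([X, Y])
    return float(np.median(pdist(Z, metric="euclidean")))
```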
Escaping Saddle Points with Adaptive Gradient Methods
Adaptive methods such as Adam and RMSProp are widely used in deep learning
but are not well understood. In this paper, we seek a crisp, clean and precise
characterization of their behavior in nonconvex settings. To this end, we first
provide a novel view of adaptive methods as preconditioned SGD, where the
preconditioner is estimated in an online manner. By studying the preconditioner
on its own, we elucidate its purpose: it rescales the stochastic gradient noise
to be isotropic near stationary points, which helps escape saddle points.
Furthermore, we show that adaptive methods can efficiently estimate the
aforementioned preconditioner. By gluing together these two components, we
provide the first (to our knowledge) second-order convergence result for any
adaptive method. The key insight from our analysis is that, compared to SGD,
adaptive methods escape saddle points faster, and can converge faster overall
to second-order stationary points.
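To make the preconditioned-SGD view concrete (an illustrative sketch under that interpretation, not the authors' exact algorithm), an RMSProp-style update can be written as SGD with a diagonal preconditioner estimated online from squared gradients:

```python
import numpy as np

def rmsprop_as_preconditioned_sgd(grad_fn, x0, lr=1e-3, beta=0.999,
                                  eps=1e-8, n_iters=1000):
    """RMSProp-style update viewed as preconditioned SGD (sketch):
    v is an online estimate of the gradient's second moment, and
    diag(1 / (sqrt(v) + eps)) acts as the preconditioner."""
    x = x0.copy()
    v = np.zeros_like(x0)
    for _ in range(n_iters):
        g = grad_fn(x)                      # stochastic gradient
        v = beta * v + (1 - beta) * g ** 2  # online second-moment estimate
        precond = 1.0 / (np.sqrt(v) + eps)  # diagonal preconditioner
        x = x - lr * precond * g            # preconditioned SGD step
    return x
```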