On the convergence of reduction-based and model-based methods in proof theory
In the recent past, the reduction-based and the model-based methods to prove
cut elimination have converged, so that they now appear just as two sides of
the same coin. This paper details some of the steps of this transformation.
Why gradient clipping accelerates training: A theoretical justification for adaptivity
We provide a theoretical explanation for the effectiveness of gradient
clipping in training deep neural networks. The key ingredient is a new
smoothness condition derived from practical neural network training examples.
We observe that gradient smoothness, a concept central to the analysis of
first-order optimization algorithms that is often assumed to be a constant,
demonstrates significant variability along the training trajectory of deep
neural networks. Further, this smoothness positively correlates with the
gradient norm, and contrary to standard assumptions in the literature, it can
grow with the norm of the gradient. These empirical observations limit the
applicability of existing theoretical analyses of algorithms that rely on a
fixed bound on smoothness. These observations motivate us to introduce a novel
relaxation of gradient smoothness that is weaker than the commonly used
Lipschitz smoothness assumption. Under the new condition, we prove that two
popular methods, namely, \emph{gradient clipping} and \emph{normalized
gradient}, converge arbitrarily faster than gradient descent with fixed
stepsize. We further explain why such adaptively scaled gradient methods can
accelerate empirical convergence and verify our results empirically in popular
neural network training settings.
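As a rough illustration of the clipping mechanism (not the authors' code, and on a toy quadratic rather than a neural network; the step size and threshold are arbitrary):

```python
import numpy as np

# Ill-conditioned quadratic test objective f(x) = 0.5 * ||A x||^2.
A = np.diag([1.0, 10.0])

def grad_f(x):
    return A.T @ (A @ x)

x = np.array([1.0, 1.0])
eta, threshold = 0.01, 1.0        # illustrative, untuned values

for _ in range(100):
    g = grad_f(x)
    gnorm = np.linalg.norm(g)
    # Clipping rescales the update whenever the gradient is large, so the
    # effective step size adapts to the locally varying smoothness.
    x = x - eta * min(1.0, threshold / (gnorm + 1e-12)) * g

print(x)  # approaches the minimizer at the origin
```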
Stochastic Zeroth-order Optimization via Variance Reduction method
Derivative-free optimization has become an important technique used in
machine learning for optimizing black-box models. To conduct updates without
explicitly computing the gradient, most current approaches iteratively sample a
random search direction from a Gaussian distribution and compute the estimated
gradient along that direction. However, due to the variance in the search
direction, the convergence rates and query complexities of existing methods
suffer from a factor of $d$, where $d$ is the problem dimension. In this paper,
we introduce a novel Stochastic Zeroth-order method with Variance Reduction
under Gaussian smoothing (SZVR-G) and establish its query complexity for optimizing
non-convex problems. With variance reduction on both sample space and search
space, the complexity of our algorithm is sublinear in $d$ and is strictly
better than current approaches, in both smooth and non-smooth cases. Moreover,
we extend the proposed method to the mini-batch version. Our experimental
results demonstrate the superior performance of the proposed method over
existing derivative-free optimization techniques. Furthermore, we successfully
apply our method to conduct a universal black-box attack to deep neural
networks and present some interesting results.
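The base estimator behind such methods averages finite differences along random Gaussian directions; a minimal sketch of that estimator (the variance-reduction layer of SZVR-G is not reproduced here, and the smoothing radius and direction count are arbitrary):

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, n_dirs=10, rng=np.random.default_rng(0)):
    """Gaussian-smoothing gradient estimate: average finite differences
    of f along random Gaussian search directions u."""
    d = x.size
    est = np.zeros(d)
    for _ in range(n_dirs):
        u = rng.standard_normal(d)
        est += (f(x + mu * u) - f(x)) / mu * u
    return est / n_dirs

# Toy usage on f(x) = ||x||^2, whose true gradient at x is 2x.
f = lambda x: float(x @ x)
print(zo_gradient(f, np.ones(5)))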
Exponential Convergence of Online Enrichment in Localized Reduced Basis Methods
Online enrichment is the extension of a reduced solution space based on the
solution of the reduced model. Procedures for online enrichment have been
published for many localized model order reduction techniques. We show that
residual-based online enrichment on overlapping domains converges exponentially.
Furthermore, we present an optimal enrichment strategy which couples the global
reduced space with a local fine space. Numerical experiments on the two
dimensional stationary heat equation with high contrast and channels confirm
and illustrate the results.
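As a schematic of residual-based enrichment (global rather than localized, and purely illustrative; the random system stands in for a discretized PDE):

```python
import numpy as np

def solve_reduced(A, b, V):
    """Galerkin-reduced solve: project A x = b onto the span of the basis V."""
    Ar, br = V.T @ A @ V, V.T @ b
    return V @ np.linalg.solve(Ar, br)

rng = np.random.default_rng(1)
n = 50
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # toy full-order operator
b = rng.standard_normal(n)

V = np.linalg.qr(rng.standard_normal((n, 2)))[0]    # initial reduced basis
for it in range(10):
    x_r = solve_reduced(A, b, V)
    r = b - A @ x_r                  # residual of the reduced solution
    if np.linalg.norm(r) < 1e-10:
        break
    # Online enrichment: extend the basis with the residual direction and
    # re-orthonormalize; localized variants enrich on subdomains instead.
    V = np.linalg.qr(np.column_stack([V, r]))[0]
```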
Necessary Conditions and Tight Two-level Convergence Bounds for Parareal and Multigrid Reduction in Time
Parareal and multigrid reduction in time (MGRiT) are two of the most popular
parallel-in-time methods. The idea is to treat time integration in a parallel
context by using a multigrid method in time. If $\Phi$ is a (fine-grid)
time-stepping scheme, let $\Psi$ denote a "coarse-grid" time-stepping scheme
chosen to approximate $k$ steps of $\Phi$, $k \geq 1$. In particular, $\Psi$
defines the coarse-grid correction, and evaluating $\Psi$ should be
(significantly) cheaper than evaluating $\Phi^k$.
A number of papers have studied the convergence of Parareal and MGRiT.
However, there have yet to be general conditions developed on the convergence
of Parareal or MGRiT that answer simple questions such as, (i) for a given
$\Phi$ and $k$, what is the best $\Psi$, or (ii) can Parareal/MGRiT converge
for my problem? This work derives necessary and sufficient conditions for the
convergence of Parareal and MGRiT applied to linear problems, along with tight
two-level convergence bounds. Results rest on the introduction of a "temporal
approximation property" (TAP) that indicates how must approximate the
action of on different vectors. Loosely, for unitarily diagonalizable
operators, the TAP indicates that fine-grid and coarse-grid time integration
schemes must integrate geometrically smooth spatial components similarly, and
less so for geometrically high frequency. In the (non-unitarily) diagonalizable
setting, the conditioning of each eigenvector, $v$, must also be
reflected in how well $\Psi v \approx \Phi^k v$. In general,
worst-case convergence bounds are exactly given by the smallest $\varphi$ such that
an inequality along the lines of $\|(\Psi - \Phi^k)v\| \leq \varphi\,\|(I - \Psi)v\|$
holds for all $v$. Such inequalities are
formalized as different realizations of the TAP, and form the basis for
convergence of MGRiT and Parareal.
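As a rough illustration of the $\Phi$/$\Psi$ interplay (not the paper's analysis), a minimal Parareal sketch for a scalar test ODE with backward Euler as both fine and coarse propagator; all parameter values are arbitrary:

```python
import numpy as np

# Parareal for u' = lam * u on [0, T], with fine propagator Phi^k and
# coarse propagator Psi in the notation of the abstract.
lam, T, N, k = -1.0, 1.0, 10, 10
dT = T / N                      # coarse step; fine step is dT / k

def phi_k(u):                   # k fine backward-Euler steps
    for _ in range(k):
        u = u / (1 - lam * dT / k)
    return u

def psi(u):                     # one coarse backward-Euler step
    return u / (1 - lam * dT)

u = np.ones(N + 1)              # initial guess: sequential coarse sweep
for n in range(N):
    u[n + 1] = psi(u[n])

for _ in range(5):              # Parareal iterations
    u_new = u.copy()
    for n in range(N):
        # coarse prediction on the new iterate + fine-minus-coarse correction
        u_new[n + 1] = psi(u_new[n]) + phi_k(u[n]) - psi(u[n])
    u = u_new

print(u[-1], np.exp(lam * T))   # converges toward the fine-grid solution
```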
A Generalized Approximate Control Variate Framework for Multifidelity Uncertainty Quantification
We describe and analyze a variance reduction approach for Monte Carlo (MC)
sampling that accelerates the estimation of statistics of computationally
expensive simulation models using an ensemble of models with lower cost. These
lower cost models, which are typically lower fidelity with unknown
statistics, are used to reduce the variance in statistical estimators
relative to a MC estimator with equivalent cost. We derive the conditions under
which our proposed approximate control variate framework recovers existing
multi-model variance reduction schemes as special cases. We demonstrate that
these existing strategies rely on recursive sampling, and as a result,
their maximum possible variance reduction is limited to that of a control
variate algorithm that uses only a single low-fidelity model with known mean.
This theoretical result holds regardless of the number of low-fidelity models
and/or samples used to build the estimator. We then derive new sampling
strategies within our framework that circumvent this limitation to make
efficient use of all available information sources. In particular, we
demonstrate that a significant gap can exist, of orders of magnitude in some
cases, between the variance reduction achievable by using a single low-fidelity
model and our non-recursive approach. We also present initial sample allocation
approaches for exploiting this gap. They yield the greatest benefit when
augmenting the high-fidelity model evaluations is impractical because, for
instance, they arise from a legacy database. Several analytic examples and an
example with a hyperbolic PDE describing elastic wave propagation in
heterogeneous media are used to illustrate the main features of the
methodology.
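To make the control-variate mechanism concrete, here is a minimal sketch with a single toy low-fidelity model whose mean is itself estimated from extra cheap samples (the "approximate" part); the models, sample sizes, and weight are illustrative, not the paper's allocation strategy:

```python
import numpy as np

rng = np.random.default_rng(2)

def f_hi(x):                       # "expensive" high-fidelity model (toy)
    return np.sin(x) + 0.05 * x**2

def f_lo(x):                       # cheap, correlated low-fidelity model
    return np.sin(x)

x = rng.standard_normal(100)            # shared samples for both models
x_cheap = rng.standard_normal(10_000)   # extra low-fidelity-only samples

y_hi, y_lo = f_hi(x), f_lo(x)
c = np.cov(y_hi, y_lo)
alpha = -c[0, 1] / c[1, 1]         # variance-minimizing control weight
mu_lo = f_lo(x_cheap).mean()       # estimated (not exactly known) low-fi mean

q_mc = y_hi.mean()                                    # plain Monte Carlo
q_acv = y_hi.mean() + alpha * (y_lo.mean() - mu_lo)   # control-variate estimate
print(q_mc, q_acv)
```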
Convergence of Adaptive Finite Element Approximations for Nonlinear Eigenvalue Problems
In this paper, we study an adaptive finite element method for a class of
nonlinear eigenvalue problems that may arise from nonconvex energy functionals and
consider its applications to quantum chemistry. We prove the convergence of
adaptive finite element approximations and present several numerical examples
of microstructure-of-matter calculations that support our theory.
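The convergence proofs in this literature concern the solve-estimate-mark-refine loop; a schematic of that skeleton (shown here for 1D adaptive interpolation of a sharp function rather than an actual eigenvalue solve, with an arbitrary Dörfler parameter of 0.5):

```python
import numpy as np

f = lambda x: np.arctan(50 * (x - 0.5))   # function with a sharp layer
mesh = np.linspace(0.0, 1.0, 5)

for _ in range(15):
    mids = 0.5 * (mesh[:-1] + mesh[1:])
    # "Estimate": local interpolation error on each element
    eta = np.abs(f(mids) - 0.5 * (f(mesh[:-1]) + f(mesh[1:])))
    # "Mark": Doerfler marking -- smallest set carrying 50% of the error
    order = np.argsort(eta)[::-1]
    csum = np.cumsum(eta[order])
    n_marked = np.searchsorted(csum, 0.5 * eta.sum()) + 1
    # "Refine": bisect the marked elements
    mesh = np.sort(np.concatenate([mesh, mids[order[:n_marked]]]))

print(mesh.size, "nodes, concentrated near the layer at x = 0.5")
```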
Optimized Signal Distortion for PAPR Reduction of OFDM Signals with IFFT/FFT Complexity via ADMM Approaches
In this paper, we propose two low-complexity optimization methods to reduce
peak-to-average power ratio (PAPR) values of orthogonal frequency division
multiplexing (OFDM) signals via alternating direction method of multipliers
(ADMM). First, we formulate a non-convex signal distortion optimization model
based on minimizing data carrier distortion such that the constraints are
placed on PAPR and the power of free carriers. Second, to obtain the model's
approximate optimal solution efficiently, we design two low-complexity ADMM
algorithms, named ADMM-Direct and ADMM-Relax, respectively. Third, we show that,
in ADMM-Direct/-Relax, all the optimization subproblems can be solved
semi-analytically and the computational complexity in each iteration is roughly
$\mathcal{O}(lN\log_2(lN))$, where $l$ and $N$ are the oversampling factor and the carrier number,
respectively. Moreover, we show that the resulting solution of ADMM-Direct is
guaranteed to be some Karush-Kuhn-Tucker (KKT) point of the non-convex model
when the iterative algorithm converges. For ADMM-Relax, we prove that it
has theoretically guaranteed convergence and can approach arbitrarily close to
some KKT point of the model if proper parameters are chosen. Simulation results
demonstrate the effectiveness of the proposed approaches.
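For reference, a minimal sketch of the PAPR metric itself and the oversampled IFFT that makes each iteration cheap (not the ADMM algorithms; the QPSK data and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, l = 64, 4                      # carrier number and oversampling factor

# Random QPSK data on the N carriers of one OFDM symbol.
X = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)

# Oversampled IFFT via zero-padding in frequency: the O(lN log2(lN)) kernel
# that dominates the per-iteration cost quoted in the abstract.
Xp = np.concatenate([X[: N // 2], np.zeros((l - 1) * N), X[N // 2 :]])
x = np.fft.ifft(Xp) * l           # time-domain OFDM signal

papr_db = 10 * np.log10(np.max(np.abs(x) ** 2) / np.mean(np.abs(x) ** 2))
print(f"PAPR = {papr_db:.2f} dB")
```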
Bayesian Compressed Regression
As an alternative to variable selection or shrinkage in high dimensional
regression, we propose to randomly compress the predictors prior to analysis.
This dramatically reduces storage and computational bottlenecks, performing
well when the predictors can be projected to a low dimensional linear subspace
with minimal loss of information about the response. As opposed to existing
Bayesian dimensionality reduction approaches, the exact posterior distribution
conditional on the compressed data is available analytically, speeding up
computation by many orders of magnitude while also bypassing robustness issues
due to convergence and mixing problems with MCMC. Model averaging is used to
reduce sensitivity to the random projection matrix, while accommodating
uncertainty in the subspace dimension. Strong theoretical support is provided
for the approach by showing near parametric convergence rates for the
predictive density in the large-$p$, small-$n$ asymptotic paradigm. Practical
performance relative to competitors is illustrated in simulations and real data
applications.
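A minimal sketch of the compress-then-regress idea with a single random projection and a conjugate Gaussian prior (the paper additionally averages over projections and subspace dimensions; all dimensions and variances here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m = 100, 1000, 20           # samples, predictors, compressed dimension

X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 1.0
y = X @ beta + 0.1 * rng.standard_normal(n)

# Randomly compress the predictors, then do conjugate Bayesian linear
# regression in the m-dimensional space: the posterior is analytic, no MCMC.
Phi = rng.standard_normal((p, m)) / np.sqrt(m)
Z = X @ Phi

tau2, sigma2 = 1.0, 0.01          # prior and noise variances (illustrative)
S = np.linalg.inv(Z.T @ Z / sigma2 + np.eye(m) / tau2)   # posterior covariance
mu = S @ Z.T @ y / sigma2                                # posterior mean

y_pred = Z @ mu                   # in-sample predictive mean
print(np.mean((y - y_pred) ** 2))
```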
Efficient Rank Reduction of Correlation Matrices
Geometric optimisation algorithms are developed that efficiently find the
nearest low-rank correlation matrix. We show, in numerical tests, that our
methods compare favourably to the existing methods in the literature. The
connection with the Lagrange multiplier method is established, along with an
identification of whether a local minimum is a global minimum. An additional
benefit of the geometric approach is that any weighted norm can be applied. The
problem of finding the nearest low-rank correlation matrix occurs as part of
the calibration of multi-factor interest rate market models to correlation.
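For orientation, a common spectral baseline for the same problem, which the paper's geometric optimisation improves upon (a heuristic sketch, not the authors' method):

```python
import numpy as np

def lowrank_corr(C, k):
    """Spectral heuristic for a rank-k correlation matrix: truncate the
    eigendecomposition, then rescale rows to restore a unit diagonal."""
    w, V = np.linalg.eigh(C)
    idx = np.argsort(w)[::-1][:k]
    B = V[:, idx] * np.sqrt(np.maximum(w[idx], 0))   # n-by-k factor
    B /= np.linalg.norm(B, axis=1, keepdims=True)    # unit rows => unit diagonal
    return B @ B.T

C = np.array([[1.0, 0.9, 0.6],
              [0.9, 1.0, 0.5],
              [0.6, 0.5, 1.0]])
Ck = lowrank_corr(C, 2)
print(np.round(Ck, 3), np.linalg.matrix_rank(Ck))
```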