On the convergence of reduction-based and model-based methods in proof theory
In the recent past, the reduction-based and the model-based methods to prove
cut elimination have converged, so that they now appear just as two sides of
the same coin. This paper details some of the steps of this transformation.
Why gradient clipping accelerates training: A theoretical justification for adaptivity
We provide a theoretical explanation for the effectiveness of gradient
clipping in training deep neural networks. The key ingredient is a new
smoothness condition derived from practical neural network training examples.
We observe that gradient smoothness, a concept central to the analysis of
first-order optimization algorithms that is often assumed to be a constant,
demonstrates significant variability along the training trajectory of deep
neural networks. Further, this smoothness positively correlates with the
gradient norm, and contrary to standard assumptions in the literature, it can
grow with the norm of the gradient. These empirical observations limit the
applicability of existing theoretical analyses of algorithms that rely on a
fixed bound on smoothness. These observations motivate us to introduce a novel
relaxation of gradient smoothness that is weaker than the commonly used
Lipschitz smoothness assumption. Under the new condition, we prove that two
popular methods, namely, \emph{gradient clipping} and \emph{normalized
gradient}, converge arbitrarily faster than gradient descent with fixed
stepsize. We further explain why such adaptively scaled gradient methods can
accelerate empirical convergence and verify our results empirically in popular
neural network training settings.
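As a rough illustration of the clipping mechanism (not the authors' code, and on a toy quadratic rather than a neural network; the step size and threshold are arbitrary):

```python
import numpy as np

# Ill-conditioned quadratic test objective f(x) = 0.5 * ||A x||^2.
A = np.diag([1.0, 10.0])

def grad_f(x):
    return A.T @ (A @ x)

x = np.array([1.0, 1.0])
eta, threshold = 0.01, 1.0        # illustrative, untuned values

for _ in range(100):
    g = grad_f(x)
    gnorm = np.linalg.norm(g)
    # Clipping rescales the update whenever the gradient is large, so the
    # effective step size adapts to the locally varying smoothness.
    x = x - eta * min(1.0, threshold / (gnorm + 1e-12)) * g

print(x)  # approaches the minimizer at the origin
```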
Stochastic Zeroth-order Optimization via Variance Reduction method
Derivative-free optimization has become an important technique used in
machine learning for optimizing black-box models. To conduct updates without
explicitly computing the gradient, most current approaches iteratively sample a
random search direction from a Gaussian distribution and compute the estimated
gradient along that direction. However, due to the variance in the search
direction, the convergence rates and query complexities of existing methods
suffer from a factor of $d$, where $d$ is the problem dimension. In this paper,
we introduce a novel Stochastic Zeroth-order method with Variance Reduction
under Gaussian smoothing (SZVR-G) and establish its query complexity for optimizing
non-convex problems. With variance reduction on both sample space and search
space, the complexity of our algorithm is sublinear in $d$ and is strictly
better than current approaches, in both smooth and non-smooth cases. Moreover,
we extend the proposed method to the mini-batch version. Our experimental
results demonstrate the superior performance of the proposed method over
existing derivative-free optimization techniques. Furthermore, we successfully
apply our method to conduct a universal black-box attack to deep neural
networks and present some interesting results.
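The base estimator behind such methods averages finite differences along random Gaussian directions; a minimal sketch of that estimator (the variance-reduction layer of SZVR-G is not reproduced here, and the smoothing radius and direction count are arbitrary):

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, n_dirs=10, rng=np.random.default_rng(0)):
    """Gaussian-smoothing gradient estimate: average finite differences
    of f along random Gaussian search directions u."""
    d = x.size
    est = np.zeros(d)
    for _ in range(n_dirs):
        u = rng.standard_normal(d)
        est += (f(x + mu * u) - f(x)) / mu * u
    return est / n_dirs

# Toy usage on f(x) = ||x||^2, whose true gradient at x is 2x.
f = lambda x: float(x @ x)
print(zo_gradient(f, np.ones(5)))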
Exponential Convergence of Online Enrichment in Localized Reduced Basis Methods
Online enrichment is the extension of a reduced solution space based on the
solution of the reduced model. Procedures for online enrichment have been
published for many localized model order reduction techniques. We show that
residual-based online enrichment on overlapping domains converges exponentially.
Furthermore, we present an optimal enrichment strategy which couples the global
reduced space with a local fine space. Numerical experiments on the two
dimensional stationary heat equation with high contrast and channels confirm
and illustrate the results.
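As a schematic of residual-based enrichment (global rather than localized, and purely illustrative; the random system stands in for a discretized PDE):

```python
import numpy as np

def solve_reduced(A, b, V):
    """Galerkin-reduced solve: project A x = b onto the span of the basis V."""
    Ar, br = V.T @ A @ V, V.T @ b
    return V @ np.linalg.solve(Ar, br)

rng = np.random.default_rng(1)
n = 50
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # toy full-order operator
b = rng.standard_normal(n)

V = np.linalg.qr(rng.standard_normal((n, 2)))[0]    # initial reduced basis
for it in range(10):
    x_r = solve_reduced(A, b, V)
    r = b - A @ x_r                  # residual of the reduced solution
    if np.linalg.norm(r) < 1e-10:
        break
    # Online enrichment: extend the basis with the residual direction and
    # re-orthonormalize; localized variants enrich on subdomains instead.
    V = np.linalg.qr(np.column_stack([V, r]))[0]
```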
Necessary Conditions and Tight Two-level Convergence Bounds for Parareal and Multigrid Reduction in Time
Parareal and multigrid reduction in time (MGRiT) are two of the most popular
parallel-in-time methods. The idea is to treat time integration in a parallel
context by using a multigrid method in time. If $\Phi$ is a (fine-grid)
time-stepping scheme, let $\Psi$ denote a "coarse-grid" time-stepping scheme
chosen to approximate $k$ steps of $\Phi$, $k \geq 1$. In particular, $\Psi$
defines the coarse-grid correction, and evaluating $\Psi$ should be
(significantly) cheaper than evaluating $\Phi^k$.
A number of papers have studied the convergence of Parareal and MGRiT.
However, there have yet to be general conditions developed on the convergence
of Parareal or MGRiT that answer simple questions such as, (i) for a given
$\Phi$ and $k$, what is the best $\Psi$, or (ii) can Parareal/MGRiT converge
for my problem? This work derives necessary and sufficient conditions for the
convergence of Parareal and MGRiT applied to linear problems, along with tight
two-level convergence bounds. Results rest on the introduction of a "temporal
approximation property" (TAP) that indicates how must approximate the
action of on different vectors. Loosely, for unitarily diagonalizable
operators, the TAP indicates that fine-grid and coarse-grid time integration
schemes must integrate geometrically smooth spatial components similarly, and
less so for geometrically high frequency. In the (non-unitarily) diagonalizable
setting, the conditioning of each eigenvector, $v$, must also be
reflected in how well $\Psi v \approx \Phi^k v$. In general,
worst-case convergence bounds are exactly given by the smallest $\varphi$ such that
an inequality along the lines of $\|(\Psi - \Phi^k)v\| \leq \varphi\,\|(I - \Psi)v\|$
holds for all $v$. Such inequalities are
formalized as different realizations of the TAP, and form the basis for
convergence of MGRiT and Parareal.
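As a rough illustration of the $\Phi$/$\Psi$ interplay (not the paper's analysis), a minimal Parareal sketch for a scalar test ODE with backward Euler as both fine and coarse propagator; all parameter values are arbitrary:

```python
import numpy as np

# Parareal for u' = lam * u on [0, T], with fine propagator Phi^k and
# coarse propagator Psi in the notation of the abstract.
lam, T, N, k = -1.0, 1.0, 10, 10
dT = T / N                      # coarse step; fine step is dT / k

def phi_k(u):                   # k fine backward-Euler steps
    for _ in range(k):
        u = u / (1 - lam * dT / k)
    return u

def psi(u):                     # one coarse backward-Euler step
    return u / (1 - lam * dT)

u = np.ones(N + 1)              # initial guess: sequential coarse sweep
for n in range(N):
    u[n + 1] = psi(u[n])

for _ in range(5):              # Parareal iterations
    u_new = u.copy()
    for n in range(N):
        # coarse prediction on the new iterate + fine-minus-coarse correction
        u_new[n + 1] = psi(u_new[n]) + phi_k(u[n]) - psi(u[n])
    u = u_new

print(u[-1], np.exp(lam * T))   # converges toward the fine-grid solution
```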
A Generalized Approximate Control Variate Framework for Multifidelity Uncertainty Quantification
We describe and analyze a variance reduction approach for Monte Carlo (MC)
sampling that accelerates the estimation of statistics of computationally
expensive simulation models using an ensemble of models with lower cost. These
lower cost models, which are typically lower fidelity with unknown
statistics, are used to reduce the variance in statistical estimators
relative to a MC estimator with equivalent cost. We derive the conditions under
which our proposed approximate control variate framework recovers existing
multi-model variance reduction schemes as special cases. We demonstrate that
these existing strategies rely on recursive sampling, and as a result,
their maximum possible variance reduction is limited to that of a control
variate algorithm that uses only a single low-fidelity model with known mean.
This theoretical result holds regardless of the number of low-fidelity models
and/or samples used to build the estimator. We then derive new sampling
strategies within our framework that circumvent this limitation to make
efficient use of all available information sources. In particular, we
demonstrate that a significant gap can exist, of orders of magnitude in some
cases, between the variance reduction achievable by using a single low-fidelity
model and our non-recursive approach. We also present initial sample allocation
approaches for exploiting this gap. They yield the greatest benefit when
augmenting the high-fidelity model evaluations is impractical because, for
instance, they arise from a legacy database. Several analytic examples and an
example with a hyperbolic PDE describing elastic wave propagation in
heterogeneous media are used to illustrate the main features of the
methodology.
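To make the control-variate mechanism concrete, here is a minimal sketch with a single toy low-fidelity model whose mean is itself estimated from extra cheap samples (the "approximate" part); the models, sample sizes, and weight are illustrative, not the paper's allocation strategy:

```python
import numpy as np

rng = np.random.default_rng(2)

def f_hi(x):                       # "expensive" high-fidelity model (toy)
    return np.sin(x) + 0.05 * x**2

def f_lo(x):                       # cheap, correlated low-fidelity model
    return np.sin(x)

x = rng.standard_normal(100)            # shared samples for both models
x_cheap = rng.standard_normal(10_000)   # extra low-fidelity-only samples

y_hi, y_lo = f_hi(x), f_lo(x)
c = np.cov(y_hi, y_lo)
alpha = -c[0, 1] / c[1, 1]         # variance-minimizing control weight
mu_lo = f_lo(x_cheap).mean()       # estimated (not exactly known) low-fi mean

q_mc = y_hi.mean()                                    # plain Monte Carlo
q_acv = y_hi.mean() + alpha * (y_lo.mean() - mu_lo)   # control-variate estimate
print(q_mc, q_acv)
```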
Convergence of Adaptive Finite Element Approximations for Nonlinear Eigenvalue Problems
In this paper, we study an adaptive finite element method for a class of
nonlinear eigenvalue problems that may arise from nonconvex energy functionals and
consider its applications to quantum chemistry. We prove the convergence of
adaptive finite element approximations and present several numerical examples
of microstructure-of-matter calculations that support our theory.
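The convergence proofs in this literature concern the solve-estimate-mark-refine loop; a schematic of that skeleton (shown here for 1D adaptive interpolation of a sharp function rather than an actual eigenvalue solve, with an arbitrary Dörfler parameter of 0.5):

```python
import numpy as np

f = lambda x: np.arctan(50 * (x - 0.5))   # function with a sharp layer
mesh = np.linspace(0.0, 1.0, 5)

for _ in range(15):
    mids = 0.5 * (mesh[:-1] + mesh[1:])
    # "Estimate": local interpolation error on each element
    eta = np.abs(f(mids) - 0.5 * (f(mesh[:-1]) + f(mesh[1:])))
    # "Mark": Doerfler marking -- smallest set carrying 50% of the error
    order = np.argsort(eta)[::-1]
    csum = np.cumsum(eta[order])
    n_marked = np.searchsorted(csum, 0.5 * eta.sum()) + 1
    # "Refine": bisect the marked elements
    mesh = np.sort(np.concatenate([mesh, mids[order[:n_marked]]]))

print(mesh.size, "nodes, concentrated near the layer at x = 0.5")
```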
Optimized Signal Distortion for PAPR Reduction of OFDM Signals with IFFT/FFT Complexity via ADMM Approaches
In this paper, we propose two low-complexity optimization methods to reduce
peak-to-average power ratio (PAPR) values of orthogonal frequency division
multiplexing (OFDM) signals via alternating direction method of multipliers
(ADMM). First, we formulate a non-convex signal distortion optimization model
based on minimizing data carrier distortion such that the constraints are
placed on PAPR and the power of free carriers. Second, to obtain the model's
approximate optimal solution efficiently, we design two low-complexity ADMM
algorithms, named ADMM-Direct and ADMM-Relax, respectively. Third, we show that,
in ADMM-Direct/-Relax, all the optimization subproblems can be solved
semi-analytically and the computational complexity in each iteration is roughly
$\mathcal{O}(lN\log_2(lN))$, where $l$ and $N$ are the oversampling factor and the carrier number,
respectively. Moreover, we show that the resulting solution of ADMM-Direct is
guaranteed to be some Karush-Kuhn-Tucker (KKT) point of the non-convex model
when the iterative algorithm converges. For ADMM-Relax, we prove that it
has theoretically guaranteed convergence and can approach arbitrarily close to
some KKT point of the model if proper parameters are chosen. Simulation results
demonstrate the effectiveness of the proposed approaches.
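For reference, a minimal sketch of the PAPR metric itself and the oversampled IFFT that makes each iteration cheap (not the ADMM algorithms; the QPSK data and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, l = 64, 4                      # carrier number and oversampling factor

# Random QPSK data on the N carriers of one OFDM symbol.
X = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)

# Oversampled IFFT via zero-padding in frequency: the O(lN log2(lN)) kernel
# that dominates the per-iteration cost quoted in the abstract.
Xp = np.concatenate([X[: N // 2], np.zeros((l - 1) * N), X[N // 2 :]])
x = np.fft.ifft(Xp) * l           # time-domain OFDM signal

papr_db = 10 * np.log10(np.max(np.abs(x) ** 2) / np.mean(np.abs(x) ** 2))
print(f"PAPR = {papr_db:.2f} dB")
```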
Bayesian Compressed Regression
As an alternative to variable selection or shrinkage in high dimensional
regression, we propose to randomly compress the predictors prior to analysis.
This dramatically reduces storage and computational bottlenecks, performing
well when the predictors can be projected to a low dimensional linear subspace
with minimal loss of information about the response. As opposed to existing
Bayesian dimensionality reduction approaches, the exact posterior distribution
conditional on the compressed data is available analytically, speeding up
computation by many orders of magnitude while also bypassing robustness issues
due to convergence and mixing problems with MCMC. Model averaging is used to
reduce sensitivity to the random projection matrix, while accommodating
uncertainty in the subspace dimension. Strong theoretical support is provided
for the approach by showing near parametric convergence rates for the
predictive density in the large-$p$, small-$n$ asymptotic paradigm. Practical
performance relative to competitors is illustrated in simulations and real data
applications.
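A minimal sketch of the compress-then-regress idea with a single random projection and a conjugate Gaussian prior (the paper additionally averages over projections and subspace dimensions; all dimensions and variances here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m = 100, 1000, 20           # samples, predictors, compressed dimension

X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 1.0
y = X @ beta + 0.1 * rng.standard_normal(n)

# Randomly compress the predictors, then do conjugate Bayesian linear
# regression in the m-dimensional space: the posterior is analytic, no MCMC.
Phi = rng.standard_normal((p, m)) / np.sqrt(m)
Z = X @ Phi

tau2, sigma2 = 1.0, 0.01          # prior and noise variances (illustrative)
S = np.linalg.inv(Z.T @ Z / sigma2 + np.eye(m) / tau2)   # posterior covariance
mu = S @ Z.T @ y / sigma2                                # posterior mean

y_pred = Z @ mu                   # in-sample predictive mean
print(np.mean((y - y_pred) ** 2))
```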
Efficient Rank Reduction of Correlation Matrices
Geometric optimisation algorithms are developed that efficiently find the
nearest low-rank correlation matrix. We show, in numerical tests, that our
methods compare favourably to the existing methods in the literature. The
connection with the Lagrange multiplier method is established, along with an
identification of whether a local minimum is a global minimum. An additional
benefit of the geometric approach is that any weighted norm can be applied. The
problem of finding the nearest low-rank correlation matrix occurs as part of
the calibration of multi-factor interest rate market models to correlation.
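For orientation, a common spectral baseline for the same problem, which the paper's geometric optimisation improves upon (a heuristic sketch, not the authors' method):

```python
import numpy as np

def lowrank_corr(C, k):
    """Spectral heuristic for a rank-k correlation matrix: truncate the
    eigendecomposition, then rescale rows to restore a unit diagonal."""
    w, V = np.linalg.eigh(C)
    idx = np.argsort(w)[::-1][:k]
    B = V[:, idx] * np.sqrt(np.maximum(w[idx], 0))   # n-by-k factor
    B /= np.linalg.norm(B, axis=1, keepdims=True)    # unit rows => unit diagonal
    return B @ B.T

C = np.array([[1.0, 0.9, 0.6],
              [0.9, 1.0, 0.5],
              [0.6, 0.5, 1.0]])
Ck = lowrank_corr(C, 2)
print(np.round(Ck, 3), np.linalg.matrix_rank(Ck))
```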