
    On the convergence of reduction-based and model-based methods in proof theory

    In the recent past, the reduction-based and the model-based methods to prove cut elimination have converged, so that they now appear just as two sides of the same coin. This paper details some of the steps of this transformation.

    Why gradient clipping accelerates training: A theoretical justification for adaptivity

    We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These observations limit the applicability of existing theoretical analyses that rely on a fixed bound on smoothness, and they motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely gradient clipping and normalized gradient, converge arbitrarily faster than gradient descent with fixed step size. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.
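    The contrast between a fixed-step update and the two adaptively scaled updates described above can be sketched in a few lines. The toy objective, step size, and clipping threshold below are illustrative assumptions, not taken from the paper; the point is only that when local smoothness grows with the gradient norm, a fixed step can blow up while clipped and normalized steps stay stable.

```python
import numpy as np

# Toy objective f(x) = x**4 / 4 with gradient x**3: its local smoothness (second
# derivative 3*x**2) grows with the gradient norm, the regime the abstract studies.
def grad(x):
    return x ** 3

def run(update, x0=10.0, steps=200):
    x = x0
    for _ in range(steps):
        x = update(x, grad(x))
        if not np.isfinite(x) or abs(x) > 1e8:
            return float("inf")           # treat blow-up as divergence
    return x ** 4 / 4                     # final objective value

eta, tau = 0.05, 10.0                     # illustrative step size and clip threshold

# Fixed-step gradient descent: diverges here, the step is too large for the curvature at x0.
gd = run(lambda x, g: x - eta * g)

# Gradient clipping: rescale the step whenever |g| exceeds the threshold tau.
clipped = run(lambda x, g: x - eta * min(1.0, tau / (abs(g) + 1e-12)) * g)

# Normalized gradient: always move a fixed distance eta in the gradient direction.
normalized = run(lambda x, g: x - eta * g / (abs(g) + 1e-12))

print(f"fixed-step GD: {gd}, clipped: {clipped:.2e}, normalized: {normalized:.2e}")
```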

    Stochastic Zeroth-order Optimization via Variance Reduction method

    Derivative-free optimization has become an important technique used in machine learning for optimizing black-box models. To conduct updates without explicitly computing the gradient, most current approaches iteratively sample a random search direction from a Gaussian distribution and compute the estimated gradient along that direction. However, due to the variance in the search direction, the convergence rates and query complexities of existing methods suffer from a factor of $d$, where $d$ is the problem dimension. In this paper, we introduce a novel Stochastic Zeroth-order method with Variance Reduction under Gaussian smoothing (SZVR-G) and establish the complexity for optimizing non-convex problems. With variance reduction on both the sample space and the search space, the complexity of our algorithm is sublinear in $d$ and is strictly better than current approaches, in both smooth and non-smooth cases. Moreover, we extend the proposed method to the mini-batch version. Our experimental results demonstrate the superior performance of the proposed method over existing derivative-free optimization techniques. Furthermore, we successfully apply our method to conduct a universal black-box attack on deep neural networks and present some interesting results.
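    For context, the sketch below implements the basic two-point Gaussian-smoothing gradient estimator that methods of this kind build on, and uses it inside a plain zeroth-order descent loop on a toy quadratic. It is not the SZVR-G algorithm itself; the function name, smoothing radius, direction count, and step size are illustrative choices.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, num_dirs=20, rng=None):
    """Two-point Gaussian-smoothing gradient estimate of f at x.

    Averages (f(x + mu*u) - f(x)) / mu * u over random Gaussian directions u;
    only function values are queried, never an explicit gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros(x.size)
    fx = f(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.size)
        g += (f(x + mu * u) - fx) / mu * u
    return g / num_dirs

# Example: zeroth-order gradient descent on a simple quadratic black box.
f = lambda x: np.sum((x - 1.0) ** 2)
x = np.zeros(50)
rng = np.random.default_rng(0)
for _ in range(300):
    x -= 0.05 * zo_gradient(f, x, rng=rng)
print("final objective:", f(x))
```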

    Exponential Convergence of Online Enrichment in Localized Reduced Basis Methods

    Online enrichment is the extension of a reduced solution space based on the solution of the reduced model. Procedures for online enrichment have been published for many localized model order reduction techniques. We show that residual-based online enrichment on overlapping domains converges exponentially. Furthermore, we present an optimal enrichment strategy which couples the global reduced space with a local fine space. Numerical experiments on the two-dimensional stationary heat equation with high contrast and channels confirm and illustrate the results.
    Comment: 5 pages, 7 figures, 2 algorithms
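    A generic flavour of residual-based online enrichment can be sketched for a plain linear system: whenever the reduced solution leaves a large full-order residual, the (orthogonalized) residual is appended to the reduced basis and the reduced problem is solved again. The localization, the overlapping domains, and the coupling with a local fine space from the paper are not modelled; the toy operator, initial basis, and tolerance below are assumptions for illustration only.

```python
import numpy as np

def enrich_online(A, b, V, tol=1e-8, max_dim=50):
    """Residual-based online enrichment for A x = b (dense toy version)."""
    while True:
        u_red = np.linalg.solve(V.T @ A @ V, V.T @ b)   # Galerkin reduced solve
        x = V @ u_red                                   # reduced solution in the full space
        residual = b - A @ x                            # full-order residual
        if np.linalg.norm(residual) < tol or V.shape[1] >= max_dim:
            return V, x
        new = residual - V @ (V.T @ residual)           # re-orthogonalize against the basis
        V = np.column_stack([V, new / np.linalg.norm(new)])

rng = np.random.default_rng(0)
n = 200
A = np.diag(np.linspace(1.0, 10.0, n))                  # stand-in for a discretized operator
b = rng.standard_normal(n)
V0, _ = np.linalg.qr(rng.standard_normal((n, 3)))       # small initial reduced basis
V, x_rb = enrich_online(A, b, V0)
print("enriched basis size:", V.shape[1])
print("error vs direct solve:", np.linalg.norm(x_rb - np.linalg.solve(A, b)))
```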

    Necessary Conditions and Tight Two-level Convergence Bounds for Parareal and Multigrid Reduction in Time

    Parareal and multigrid reduction in time (MGRiT) are two of the most popular parallel-in-time methods. The idea is to treat time integration in a parallel context by using a multigrid method in time. If $\Phi$ is a (fine-grid) time-stepping scheme, let $\Psi$ denote a "coarse-grid" time-stepping scheme chosen to approximate $k$ steps of $\Phi$, $k \geq 1$. In particular, $\Psi$ defines the coarse-grid correction, and evaluating $\Psi$ should be (significantly) cheaper than evaluating $\Phi^k$. A number of papers have studied the convergence of Parareal and MGRiT. However, there have yet to be general conditions developed on the convergence of Parareal or MGRiT that answer simple questions such as: (i) for a given $\Phi$ and $k$, what is the best $\Psi$, or (ii) can Parareal/MGRiT converge for my problem? This work derives necessary and sufficient conditions for the convergence of Parareal and MGRiT applied to linear problems, along with tight two-level convergence bounds. Results rest on the introduction of a "temporal approximation property" (TAP) that indicates how $\Phi^k$ must approximate the action of $\Psi$ on different vectors. Loosely, for unitarily diagonalizable operators, the TAP indicates that fine-grid and coarse-grid time integration schemes must integrate geometrically smooth spatial components similarly, and less so for geometrically high frequency. In the (non-unitarily) diagonalizable setting, the conditioning of each eigenvector, $\mathbf{v}_i$, must also be reflected in how well $\Psi\mathbf{v}_i \sim \Phi^k\mathbf{v}_i$. In general, worst-case convergence bounds are exactly given by $\min \varphi < 1$ such that an inequality along the lines of $\|(\Psi-\Phi^k)\mathbf{v}\| \leq \varphi \|(I - \Psi)\mathbf{v}\|$ holds for all $\mathbf{v}$. Such inequalities are formalized as different realizations of the TAP, and form the basis for convergence of MGRiT and Parareal.
    Comment: 37 pages, accepted in SIMA
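    The basic two-level iteration that both methods build on is easy to write down for a scalar test problem. The sketch below runs Parareal on $y' = \lambda y$ with backward Euler used both as the fine propagator $\Phi$ (many small steps) and as the coarse propagator $\Psi$ (one large step); the problem, schemes, and parameter values are illustrative assumptions, and the MGRiT relaxation variants are not shown.

```python
import numpy as np

lam, T, N, m, K = -1.0, 2.0, 10, 20, 6     # y' = lam*y on [0, T], N coarse intervals,
dT = T / N                                 # m fine steps per interval, K Parareal iterations

def fine(u):                               # Phi^m: m backward Euler steps of size dT/m
    dt = dT / m
    for _ in range(m):
        u = u / (1.0 - lam * dt)
    return u

def coarse(u):                             # Psi: one backward Euler step of size dT
    return u / (1.0 - lam * dT)

# Reference: the serial fine-grid solution that Parareal converges to.
ref = 1.0
for _ in range(N):
    ref = fine(ref)

# Initialization: one sequential coarse sweep.
U = np.zeros(N + 1)
U[0] = 1.0
for n in range(N):
    U[n + 1] = coarse(U[n])

# Parareal update: U[n+1] <- coarse(U_new[n]) + fine(U_old[n]) - coarse(U_old[n]).
for k in range(K):
    F_old = [fine(U[n]) for n in range(N)]     # independent across intervals (parallel stage)
    G_old = [coarse(U[n]) for n in range(N)]
    U_new = U.copy()
    for n in range(N):                          # cheap sequential correction sweep
        U_new[n + 1] = coarse(U_new[n]) + F_old[n] - G_old[n]
    U = U_new
    print(f"iteration {k + 1}: |U(T) - fine solution| = {abs(U[-1] - ref):.2e}")
```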

    A Generalized Approximate Control Variate Framework for Multifidelity Uncertainty Quantification

    We describe and analyze a variance reduction approach for Monte Carlo (MC) sampling that accelerates the estimation of statistics of computationally expensive simulation models using an ensemble of models with lower cost. These lower-cost models, which are typically lower fidelity with unknown statistics, are used to reduce the variance in statistical estimators relative to an MC estimator with equivalent cost. We derive the conditions under which our proposed approximate control variate framework recovers existing multi-model variance reduction schemes as special cases. We demonstrate that these existing strategies use recursive sampling strategies and, as a result, their maximum possible variance reduction is limited to that of a control variate algorithm that uses only a single low-fidelity model with known mean. This theoretical result holds regardless of the number of low-fidelity models and/or samples used to build the estimator. We then derive new sampling strategies within our framework that circumvent this limitation to make efficient use of all available information sources. In particular, we demonstrate that a significant gap can exist, of orders of magnitude in some cases, between the variance reduction achievable by using a single low-fidelity model and our non-recursive approach. We also present initial sample allocation approaches for exploiting this gap. They yield the greatest benefit when augmenting the high-fidelity model evaluations is impractical because, for instance, they arise from a legacy database. Several analytic examples and an example with a hyperbolic PDE describing elastic wave propagation in heterogeneous media are used to illustrate the main features of the methodology.
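    The limiting case mentioned above, a single low-fidelity model used as a control variate with its mean estimated from extra cheap samples, can be sketched directly. The two model functions, their correlation, and the sample sizes below are made-up illustrations and are not part of the paper's framework or its sample allocation strategies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical high- and low-fidelity models of the same quantity of interest.
def f_hi(x): return np.exp(x) + 0.3 * np.sin(5 * x)
def f_lo(x): return 1.0 + x + 0.5 * x ** 2            # cheap surrogate of exp(x)

N, M = 100, 10_000                                    # expensive and cheap sample counts
x_hi = rng.uniform(0, 1, N)
x_lo = rng.uniform(0, 1, M)                           # extra cheap-only samples

hi = f_hi(x_hi)
lo_paired = f_lo(x_hi)                                # low-fidelity model at the shared inputs
lo_extra = f_lo(x_lo)

# Approximate control variate: the unknown low-fidelity mean is replaced by an
# independent Monte Carlo estimate built from the additional cheap samples.
alpha = np.cov(hi, lo_paired)[0, 1] / np.var(lo_paired, ddof=1)
acv = hi.mean() + alpha * (lo_extra.mean() - lo_paired.mean())

print(f"plain MC: {hi.mean():.4f}, approximate control variate: {acv:.4f}")
```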

    Convergence of Adaptive Finite Element Approximations for Nonlinear Eigenvalue Problems

    In this paper, we study an adaptive finite element method for a class of nonlinear eigenvalue problems whose energy functional may be nonconvex, and consider its applications to quantum chemistry. We prove the convergence of adaptive finite element approximations and present several numerical examples of micro-structure of matter calculations that support our theory.
    Comment: 24 pages, 12 figures

    Optimized Signal Distortion for PAPR Reduction of OFDM Signals with IFFT/FFT Complexity via ADMM Approaches

    In this paper, we propose two low-complexity optimization methods to reduce peak-to-average power ratio (PAPR) values of orthogonal frequency division multiplexing (OFDM) signals via the alternating direction method of multipliers (ADMM). First, we formulate a non-convex signal distortion optimization model based on minimizing data-carrier distortion such that constraints are placed on the PAPR and on the power of free carriers. Second, to obtain the model's approximate optimal solution efficiently, we design two low-complexity ADMM algorithms, named ADMM-Direct and ADMM-Relax, respectively. Third, we show that, in ADMM-Direct/-Relax, all the optimization subproblems can be solved semi-analytically and the computational complexity in each iteration is roughly $O(lN\log_2(lN))$, where $l$ and $N$ are the oversampling factor and the number of carriers, respectively. Moreover, we show that the resulting solution of ADMM-Direct is guaranteed to be some Karush-Kuhn-Tucker (KKT) point of the non-convex model when the iterative algorithm converges. For ADMM-Relax, we prove that it has theoretically guaranteed convergence and can approach arbitrarily close to some KKT point of the model if proper parameters are chosen. Simulation results demonstrate the effectiveness of the proposed approaches.
    Comment: 15 pages, 7 figures
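    As a small companion to the abstract above, the snippet below only computes the PAPR of an oversampled OFDM symbol, i.e. the quantity the two ADMM algorithms constrain; it does not implement ADMM-Direct or ADMM-Relax. The QPSK mapping, carrier count, and oversampling factor are arbitrary illustrative choices.

```python
import numpy as np

def papr_db(freq_symbols, oversample=4):
    """Peak-to-average power ratio (in dB) of one OFDM symbol.

    The frequency-domain symbols are zero-padded onto an oversampled grid and
    transformed with an IFFT to approximate the continuous-time envelope.
    """
    n = freq_symbols.size
    padded = np.zeros(oversample * n, dtype=complex)
    padded[:n // 2] = freq_symbols[:n // 2]           # keep the spectrum centered
    padded[-(n // 2):] = freq_symbols[n // 2:]
    time = np.fft.ifft(padded) * oversample
    power = np.abs(time) ** 2
    return 10 * np.log10(power.max() / power.mean())

rng = np.random.default_rng(0)
N = 256                                               # number of data carriers
qpsk = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)
print(f"PAPR of a random QPSK OFDM symbol: {papr_db(qpsk):.2f} dB")
```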

    Bayesian Compressed Regression

    As an alternative to variable selection or shrinkage in high-dimensional regression, we propose to randomly compress the predictors prior to analysis. This dramatically reduces storage and computational bottlenecks, performing well when the predictors can be projected to a low-dimensional linear subspace with minimal loss of information about the response. As opposed to existing Bayesian dimensionality reduction approaches, the exact posterior distribution conditional on the compressed data is available analytically, speeding up computation by many orders of magnitude while also bypassing robustness issues due to convergence and mixing problems with MCMC. Model averaging is used to reduce sensitivity to the random projection matrix, while accommodating uncertainty in the subspace dimension. Strong theoretical support is provided for the approach by showing near-parametric convergence rates for the predictive density in the large $p$, small $n$ asymptotic paradigm. Practical performance relative to competitors is illustrated in simulations and real data applications.
    Comment: 29 pages, 4 figures
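    A bare-bones version of the idea, random compression of the predictors followed by a conjugate Gaussian posterior on the compressed coefficients and an average over several projections, is sketched below. The projection scaling, the known noise variance, the equal-weight averaging, and all the synthetic data are simplifying assumptions; the paper's posterior-weighted model averaging over subspace dimensions is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def compressed_fit(X, y, m, sigma2=0.01, tau2=1.0):
    """Bayesian linear regression on randomly compressed predictors.

    Compresses the p predictors to m dimensions with a Gaussian projection and
    returns the projection plus the conjugate posterior mean of the compressed
    coefficients (zero-mean Gaussian prior, known noise variance).
    """
    p = X.shape[1]
    Phi = rng.standard_normal((p, m)) / np.sqrt(p)     # random compression matrix
    Z = X @ Phi
    post_mean = np.linalg.solve(Z.T @ Z + (sigma2 / tau2) * np.eye(m), Z.T @ y)
    return Phi, post_mean

# Synthetic data whose predictors lie near a low-dimensional subspace.
n, p, k_lat = 200, 2000, 4
U = rng.standard_normal((n, k_lat))
W = rng.standard_normal((k_lat, p))
X = U @ W
coef = np.array([1.0, -2.0, 0.5, 1.5])
y = U @ coef + 0.1 * rng.standard_normal(n)

U_new = rng.standard_normal((50, k_lat))
X_new = U_new @ W

# Equal-weight average over several random projections (stand-in for model averaging).
preds = []
for _ in range(10):
    Phi, w = compressed_fit(X, y, m=30)
    preds.append(X_new @ Phi @ w)
pred = np.mean(preds, axis=0)
print("predictive RMSE vs noiseless truth:", np.sqrt(np.mean((pred - U_new @ coef) ** 2)))
```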

    Efficient Rank Reduction of Correlation Matrices

    Geometric optimisation algorithms are developed that efficiently find the nearest low-rank correlation matrix. We show, in numerical tests, that our methods compare favourably to the existing methods in the literature. The connection with the Lagrange multiplier method is established, along with an identification of whether a local minimum is a global minimum. An additional benefit of the geometric approach is that any weighted norm can be applied. The problem of finding the nearest low-rank correlation matrix occurs as part of the calibration of multi-factor interest rate market models to correlation.
    Comment: First version: 20 pages, 4 figures. Second version [changed content]: 21 pages, 6 figures
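    For orientation, a common baseline for this problem, alternating a truncated eigendecomposition with restoration of the unit diagonal and finishing with a rescaled rank-k factor, is sketched below. It is not the geometric (Riemannian) optimisation method of the paper and uses only the unweighted Frobenius norm; the example matrix and target rank are arbitrary.

```python
import numpy as np

def low_rank_corr(C, k, iters=100):
    """Rank-k correlation matrix close to C (simple alternating heuristic).

    Alternates projection onto rank-k positive semidefinite matrices (truncated
    eigendecomposition) with restoration of the unit diagonal, then rescales a
    final rank-k factor so the result has exactly unit diagonal and rank k.
    """
    X = C.copy()
    for _ in range(iters):
        vals, vecs = np.linalg.eigh(X)            # eigenvalues in ascending order
        vals[:-k] = 0.0                           # keep only the k largest
        vals = np.clip(vals, 0.0, None)
        X = (vecs * vals) @ vecs.T                # nearest rank-k PSD matrix
        np.fill_diagonal(X, 1.0)                  # put the unit diagonal back
    vals, vecs = np.linalg.eigh(X)
    B = vecs[:, -k:] * np.sqrt(np.clip(vals[-k:], 0.0, None))
    B /= np.linalg.norm(B, axis=1, keepdims=True)  # unit row norms -> unit diagonal
    return B @ B.T

# Target: a valid full-rank correlation matrix with exponentially decaying entries.
n = 8
C = np.fromfunction(lambda i, j: 0.9 ** np.abs(i - j), (n, n))
C2 = low_rank_corr(C, k=2)
print("rank:", np.linalg.matrix_rank(C2))
print("diagonal:", np.round(np.diag(C2), 8))
print("Frobenius distance to C:", round(np.linalg.norm(C - C2), 4))
```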