6,837 research outputs found

Hybrid optimization and Bayesian inference techniques for a non-smooth radiation detection problem

Author: Hite Jason
Mattingly John
Schmidt Kathleen
Smith Ralph
Stefanescu Razvan
Publication date: 01/07/2016

In this investigation, we propose several algorithms to recover the location and intensity of a radiation source located in a simulated 250 m x 180 m block in an urban center based on synthetic measurements. Radioactive decay and detection are Poisson random processes, so we employ likelihood functions based on this distribution. Due to the domain geometry and the proposed response model, the negative logarithm of the likelihood is only piecewise continuously differentiable, and it has multiple local minima. To address these difficulties, we investigate three hybrid algorithms composed of mixed optimization techniques. For global optimization, we consider Simulated Annealing (SA), Particle Swarm (PS) and Genetic Algorithm (GA), which rely solely on objective function evaluations; i.e., they do not evaluate the gradient of the objective function. By employing early stopping criteria for the global optimization methods, a pseudo-optimum point is obtained. This is subsequently used as the initial value by the deterministic Implicit Filtering method (IF), which is able to find local extrema in non-smooth functions, to finish the search in a narrow domain. These new hybrid techniques combining global optimization and Implicit Filtering address difficulties associated with the non-smooth response, and they are shown to significantly reduce the computational time relative to the global optimization methods alone. To quantify uncertainties associated with the source location and intensity, we employ the Delayed Rejection Adaptive Metropolis (DRAM) and DiffeRential Evolution Adaptive Metropolis (DREAM) algorithms. Marginal densities of the source properties are obtained, and the means of the chains agree closely with the estimates produced by the hybrid algorithms. Comment: 36 pages, 14 figures
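
The two-stage idea in this abstract (a cheap, early-stopped global search whose best point seeds a derivative-free local refinement of a Poisson negative log-likelihood) can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the inverse-square response model, the background rate, the detector layout, and all function names are hypothetical, uniform random sampling stands in for SA/PS/GA, and a simple coordinate stencil search with step halving stands in for Implicit Filtering.

```python
import math
import random

def expected_counts(src, detectors, background=5.0):
    """Hypothetical response model: inverse-square falloff plus a constant background."""
    x, y, intensity = src
    return [background + intensity / ((x - dx) ** 2 + (y - dy) ** 2 + 1.0)
            for dx, dy in detectors]

def poisson_nll(src, detectors, counts):
    """Poisson negative log-likelihood, dropping the constant log(k!) terms."""
    lam = expected_counts(src, detectors)
    return sum(l - k * math.log(l) for l, k in zip(lam, counts))

def random_search(f, bounds, n_evals, rng):
    """Stand-in for an early-stopped global method: best of n uniform samples."""
    best, best_val = None, math.inf
    for _ in range(n_evals):
        p = [rng.uniform(lo, hi) for lo, hi in bounds]
        v = f(p)
        if v < best_val:
            best, best_val = p, v
    return best, best_val

def stencil_search(f, start, bounds, step_frac=0.1, min_frac=1e-4):
    """Derivative-free refinement: probe +/- one step per coordinate and
    halve the step whenever no probe improves the objective."""
    p, val = list(start), f(start)
    frac = step_frac
    while frac > min_frac:
        improved = False
        for i, (lo, hi) in enumerate(bounds):
            for sign in (1, -1):
                q = list(p)
                q[i] = min(hi, max(lo, q[i] + sign * frac * (hi - lo)))
                v = f(q)
                if v < val:
                    p, val, improved = q, v, True
        if not improved:
            frac /= 2
    return p, val

# Synthetic setup (all values are illustrative assumptions).
detectors = [(x, y) for x in (0, 125, 250) for y in (0, 90, 180)]
true_src = (120.0, 60.0, 4000.0)
counts = [round(l) for l in expected_counts(true_src, detectors)]
bounds = [(0.0, 250.0), (0.0, 180.0), (1.0, 10000.0)]
objective = lambda p: poisson_nll(p, detectors, counts)

coarse, coarse_val = random_search(objective, bounds, 200, random.Random(0))
refined, refined_val = stencil_search(objective, coarse, bounds)
```

The refinement only accepts strict improvements, so `refined_val` can never exceed `coarse_val`. The toy objective here is smooth, unlike the piecewise-differentiable likelihood described in the abstract.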

arXiv.org e-Print Archive

Online Expectation-Maximisation

Author: Cappé Olivier
Publication date: 08/11/2010

Tutorial chapter on the Online EM algorithm to appear in the volume 'Mixtures' edited by Kerrie Mengersen, Mike Titterington and Christian P. Robert

arXiv.org e-Print Archive

Cost-Sensitive Approach to Batch Size Adaptation for Gradient Descent

Author: Pirotta Matteo
Restelli Marcello
Publication date: 09/12/2017

In this paper, we propose a novel approach to automatically determine the batch size in stochastic gradient descent methods. The choice of the batch size induces a trade-off between the accuracy of the gradient estimate and the cost in terms of samples of each update. We propose to determine the batch size by optimizing the ratio between a lower bound to a linear or quadratic Taylor approximation of the expected improvement and the number of samples used to estimate the gradient. The performance of the proposed approach is empirically compared with related methods on popular classification tasks. The work was presented at the NIPS workshop on Optimizing the Optimizers, Barcelona, Spain, 2016.

arXiv.org e-Print Archive

A Well-Tempered Landscape for Non-convex Robust Subspace Recovery

Author: Lerman Gilad
Maunu Tyler
Zhang Teng
Publication date: 28/02/2019

We present a mathematical analysis of a non-convex energy landscape for robust subspace recovery. We prove that an underlying subspace is the only stationary point and local minimizer in a specified neighborhood under a deterministic condition on a dataset. If the deterministic condition is satisfied, we further show that a geodesic gradient descent method over the Grassmannian manifold can exactly recover the underlying subspace when the method is properly initialized. Proper initialization by principal component analysis is guaranteed with a simple deterministic condition. Under slightly stronger assumptions, the gradient descent method with a piecewise constant step-size scheme achieves linear convergence. The practicality of the deterministic condition is demonstrated on some statistical models of data, and the method achieves almost state-of-the-art recovery guarantees on the Haystack Model for different regimes of sample size and ambient dimension. In particular, when the ambient dimension is fixed and the sample size is large enough, we show that our gradient method can exactly recover the underlying subspace for any fixed fraction of outliers (less than 1). Comment: 58 pages, 6 figures, 1 table

arXiv.org e-Print Archive

Adaptive Tuning Of Hamiltonian Monte Carlo Within Sequential Monte Carlo

Author: Buchholz Alexander
Chopin Nicolas
Jacob Pierre E.
Publication date: 12/02/2020

Sequential Monte Carlo (SMC) samplers form an attractive alternative to MCMC for Bayesian computation. However, their performance depends strongly on the Markov kernels used to rejuvenate particles. We discuss how to automatically calibrate Hamiltonian Monte Carlo kernels within SMC using the current particles. To do so, we build upon the adaptive SMC approach of Fearnhead and Taylor (2013), and we also suggest alternative methods. We illustrate the advantages of using HMC kernels within an SMC sampler via an extensive numerical study.

arXiv.org e-Print Archive

Black-Box Optimization in Machine Learning with Trust Region Based Derivative Free Algorithm

Author: Ghanbari Hiva
Scheinberg Katya
Publication date: 20/03/2017

In this work, we utilize a Trust Region based Derivative Free Optimization (DFO-TR) method to directly maximize the Area Under Receiver Operating Characteristic Curve (AUC), which is a nonsmooth, noisy function. We show that AUC is a smooth function, in expectation, if the distributions of the positive and negative data points obey a jointly normal distribution. The practical performance of this algorithm is compared to three prominent Bayesian optimization methods and random search. The presented numerical results show that DFO-TR surpasses Bayesian optimization and random search on various black-box optimization problems, such as maximizing AUC and hyperparameter tuning.
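
To see why the empirical AUC is nonsmooth in the classifier parameters, it helps to write it as a count over positive/negative pairs (the Wilcoxon-Mann-Whitney form). The sketch below is only an illustration of that property, not the DFO-TR method from the abstract; the linear scorer, the function names, and the toy data are assumptions made for the example.

```python
def empirical_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 1/2."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def auc_of_weights(w, X_pos, X_neg):
    """AUC of a hypothetical linear scorer x -> w . x as a function of w."""
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    return empirical_auc([score(x) for x in X_pos], [score(x) for x in X_neg])

# Toy data (illustrative only).
X_pos = [(1.0, 0.0), (0.0, 1.0)]
X_neg = [(0.6, 0.1), (0.1, 0.3)]

auc_a = auc_of_weights((1.0, 0.0), X_pos, X_neg)   # 0.5
auc_b = auc_of_weights((1.0, 0.01), X_pos, X_neg)  # 0.5, unchanged by a small step
auc_c = auc_of_weights((1.0, 1.0), X_pos, X_neg)   # 1.0, after a ranking flips
```

Because the value only changes when some pair swaps order, the function is piecewise constant in `w`: its gradient is zero almost everywhere, which is why methods that rely only on function values are a natural fit.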

arXiv.org e-Print Archive

A Cost-based Optimizer for Gradient Descent Optimization

Author: Abadi M.
Agrawal D.
Ben-David S.
Bottou L.
Bousquet O.
Johnson R.
Kraska T.
Liu J.
Recht B.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 27/03/2017

As the use of machine learning (ML) permeates diverse application domains, there is an urgent need to support a declarative framework for ML. Ideally, a user will specify an ML task in a high-level and easy-to-use language and the framework will invoke the appropriate algorithms and system configurations to execute it. An important observation towards designing such a framework is that many ML tasks can be expressed as mathematical optimization problems, which take a specific form. Furthermore, these optimization problems can be efficiently solved using variations of the gradient descent (GD) algorithm. Thus, to decouple a user specification of an ML task from its execution, a key component is a GD optimizer. We propose a cost-based GD optimizer that selects the best GD plan for a given ML task. To build our optimizer, we introduce a set of abstract operators for expressing GD algorithms and propose a novel approach to estimate the number of iterations a GD algorithm requires to converge. Extensive experiments on real and synthetic datasets show that our optimizer not only chooses the best GD plan but also allows for optimizations that achieve orders of magnitude performance speed-up. Comment: Accepted at SIGMOD 201

arXiv.org e-Print Archive

LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning

Author: Simeone Osvaldo
Zhang Jingjing
Publication date: 03/04/2020

Gradient-based distributed learning in Parameter Server (PS) computing architectures is subject to random delays due to straggling worker nodes, as well as to possible communication bottlenecks between PS and workers. Solutions have been recently proposed to separately address these impairments based on the ideas of gradient coding, worker grouping, and adaptive worker selection. This paper provides a unified analysis of these techniques in terms of wall-clock time, communication, and computation complexity measures. Furthermore, in order to combine the benefits of gradient coding and grouping in terms of robustness to stragglers with the communication and computation load gains of adaptive selection, novel strategies, named Lazily Aggregated Gradient Coding (LAGC) and Grouped-LAG (G-LAG), are introduced. Analysis and results show that G-LAG provides the best wall-clock time and communication performance, while maintaining a low computational cost, for two representative distributions of the computing times of the worker nodes. Comment: Submitte

arXiv.org e-Print Archive

King's Research Portal

High Throughput Synchronous Distributed Stochastic Gradient Descent

Author: Teng Michael
Wood Frank
Publication date: 12/03/2018

We introduce a new, high-throughput, synchronous, distributed, data-parallel, stochastic-gradient-descent learning algorithm. This algorithm uses amortized inference in a compute-cluster-specific, deep, generative, dynamical model to perform joint posterior predictive inference of the mini-batch gradient computation times of all worker-nodes in a parallel computing cluster. We show that a synchronous parameter server can, by utilizing such a model, choose an optimal cutoff time that maximizes overall mini-batch gradient computations per second; mini-batch gradient messages from slow workers arriving after this cutoff are ignored. In keeping with earlier findings we observe that, under realistic conditions, eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but actually increases the overall rate of convergence as a function of wall-clock time by virtue of eliminating idleness. The principal novel contribution and finding of this work goes beyond this by demonstrating that using the predicted run-times from a generative model of cluster worker performance to dynamically adjust the cutoff improves substantially over the static-cutoff prior art, leading to, among other things, significantly reduced deep neural net training times on large computer clusters.
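
The cutoff criterion described here, gradients completed per second, can be illustrated with a small sketch. It assumes the per-worker run-times for one step are already available as plain numbers (in the abstract they would be predictions from the generative model, which is not reproduced here), and the function name is hypothetical.

```python
def best_cutoff(run_times):
    """Return (cutoff, rate) maximizing completed gradients per second.

    Only the observed run-times need to be tried as cutoffs: waiting longer
    without admitting another worker can only lower the rate.
    """
    best_c, best_rate = None, 0.0
    for k, c in enumerate(sorted(run_times), start=1):
        rate = k / c  # k workers have finished by time c
        if rate > best_rate:
            best_c, best_rate = c, rate
    return best_c, best_rate

# One straggler at 5.0 s among three fast workers (illustrative numbers).
cutoff, rate = best_cutoff([1.0, 1.1, 1.2, 5.0])
```

With these numbers the chosen cutoff is 1.2 s: waiting for the fourth worker would drop the rate from about 2.5 to 0.8 gradients per second.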

arXiv.org e-Print Archive

Convergence of Contrastive Divergence Algorithm in Exponential Family

Author: Jiang Bai
Jin Yifan
Wong Wing H.
Wu Tung-Yu
Publication date: 27/02/2018

The Contrastive Divergence (CD) algorithm has achieved notable success in training energy-based models including Restricted Boltzmann Machines and played a key role in the emergence of deep learning. The idea of this algorithm is to approximate the intractable term in the exact gradient of the log-likelihood function by using short Markov chain Monte Carlo (MCMC) runs. The approximate gradient is computationally cheap but biased. Whether and why the CD algorithm provides an asymptotically consistent estimate are still open questions. This paper studies the asymptotic properties of the CD algorithm in canonical exponential families, which are special cases of the energy-based model. Suppose the CD algorithm runs $m$ MCMC transition steps at each iteration $t$ and iteratively generates a sequence of parameter estimates $\{\theta_t\}_{t \ge 0}$ given an i.i.d. data sample $\{X_i\}_{i=1}^n \sim p_{\theta_\star}$. Under conditions which are commonly obeyed by the CD algorithm in practice, we prove the existence of some bounded $m$ such that any limit point of the time average $\left. \sum_{s=0}^{t-1} \theta_s \right/ t$ as $t \to \infty$ is a consistent estimate for the true parameter $\theta_\star$. Our proof is based on the fact that $\{\theta_t\}_{t \ge 0}$ is a homogeneous Markov chain conditional on the data sample $\{X_i\}_{i=1}^n$. This chain meets the Foster-Lyapunov drift criterion and converges to a random walk around the Maximum Likelihood Estimate. The range of the random walk shrinks to zero at rate $\mathcal{O}(1/\sqrt[3]{n})$ as the sample size $n \to \infty$.
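
The CD iteration and its time average can be made concrete on a toy canonical exponential family. The sketch below is an illustration under stated assumptions, not the setting analysed in the paper: the model is a unit-variance Gaussian with unknown mean (sufficient statistic x), the Markov kernel is an autoregressive update that leaves N(theta, 1) invariant (chosen for brevity in place of a Metropolis or Gibbs sampler), and the function name, learning rate, and other settings are hypothetical.

```python
import math
import random

def cd_time_average(data, m=1, lr=0.1, n_iters=2000, rho=0.5, seed=0):
    """Run CD with m kernel steps per iteration; return the average of theta_0..theta_{t-1}."""
    rng = random.Random(seed)
    n = len(data)
    data_mean = sum(data) / n            # data-side term of the gradient
    noise_scale = math.sqrt(1.0 - rho ** 2)
    theta, running_sum = 0.0, 0.0
    for _ in range(n_iters):
        running_sum += theta
        # Start a short chain at each data point and run m kernel steps.
        total = 0.0
        for x in data:
            for _ in range(m):
                x = theta + rho * (x - theta) + noise_scale * rng.gauss(0.0, 1.0)
            total += x
        # CD update: data statistic minus the m-step sample statistic.
        theta += lr * (data_mean - total / n)
    return running_sum / n_iters

# Synthetic data from the model with true mean 1.5 (illustrative).
data_rng = random.Random(1)
data = [data_rng.gauss(1.5, 1.0) for _ in range(200)]
mle = sum(data) / len(data)              # the MLE of the mean is the sample mean
theta_bar = cd_time_average(data, m=1)
```

In this toy case the expected update is proportional to `data_mean - theta`, so the iterates fluctuate around the sample mean and their time average settles close to it, mirroring the random walk around the Maximum Likelihood Estimate described above.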

arXiv.org e-Print Archive