Hybrid optimization and Bayesian inference techniques for a non-smooth radiation detection problem
In this investigation, we propose several algorithms to recover the location
and intensity of a radiation source located in a simulated 250 m x 180 m block
in an urban center based on synthetic measurements. Radioactive decay and
detection are Poisson random processes, so we employ likelihood functions based
on this distribution. Due to the domain geometry and the proposed response
model, the negative logarithm of the likelihood is only piecewise continuously
differentiable, and it has multiple local minima. To address these
difficulties, we investigate three hybrid algorithms comprised of mixed
optimization techniques. For global optimization, we consider Simulated
Annealing (SA), Particle Swarm (PS) and Genetic Algorithm (GA), which rely
solely on objective function evaluations; i.e., they do not evaluate the
gradient of the objective function. By employing early stopping criteria for
the global optimization methods, a pseudo-optimum point is obtained. This is
subsequently utilized as the initial value by the deterministic Implicit
Filtering method (IF), which is able to find local extrema in non-smooth
functions, to finish the search in a narrow domain. These new hybrid techniques
combining global optimization and Implicit Filtering address difficulties
associated with the non-smooth response, and they are shown to significantly
reduce computational time relative to the global optimization methods
alone. To quantify uncertainties associated with the source location
and intensity, we employ the Delayed Rejection Adaptive Metropolis (DRAM) and
DiffeRential Evolution Adaptive Metropolis (DREAM) algorithms. Marginal
densities of the source properties are obtained, and the means of the chains
compare accurately with the estimates produced by the hybrid algorithms.
Comment: 36 pages, 14 figure
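The two-stage idea in this abstract (an early-stopped global search followed by a derivative-free local search on a Poisson negative log-likelihood) can be illustrated with a minimal Python sketch. This is not the authors' implementation: plain random search stands in for SA/PS/GA, a shrinking-stencil coordinate search stands in for Implicit Filtering, and the response model, function names, and parameters are all hypothetical.

```python
import math
import random

def poisson_nll(counts, rates):
    """Negative log-likelihood of independent Poisson counts with mean rates."""
    return sum(r - k * math.log(r) + math.lgamma(k + 1)
               for k, r in zip(counts, rates))

def expected_rates(source, detectors, background=1.0):
    """Hypothetical response model (an assumption, not the paper's):
    smooth inverse-square-like falloff plus a constant background rate."""
    x, y, intensity = source
    return [background + intensity / (1.0 + (x - dx) ** 2 + (y - dy) ** 2)
            for dx, dy in detectors]

def hybrid_minimize(objective, bounds, n_global=200, step=8.0, tol=1e-3, seed=0):
    """Global phase: random search with a fixed evaluation budget.
    Local phase: coordinate search whose stencil shrinks when no probe improves."""
    rng = random.Random(seed)
    # Global phase, stopped early after n_global evaluations.
    best = min(([rng.uniform(lo, hi) for lo, hi in bounds]
                for _ in range(n_global)), key=objective)
    best_val = objective(best)
    # Local phase: probe +/- step along each axis, clamped to the bounds.
    while step > tol:
        improved = False
        for i, (lo, hi) in enumerate(bounds):
            for delta in (step, -step):
                trial = list(best)
                trial[i] = min(hi, max(lo, trial[i] + delta))
                val = objective(trial)
                if val < best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step *= 0.5  # refine the stencil, as a filtering-style method would
    return best, best_val
```

A call such as `hybrid_minimize(lambda p: poisson_nll(counts, expected_rates(p, detectors)), bounds)` returns the best point found and its objective value; neither phase uses gradients, which is the property the abstract relies on for a non-smooth likelihood.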
Online Expectation-Maximisation
Tutorial chapter on the Online EM algorithm to appear in the volume
'Mixtures' edited by Kerrie Mengersen, Mike Titterington and Christian P.
Robert
Cost-Sensitive Approach to Batch Size Adaptation for Gradient Descent
In this paper, we propose a novel approach to automatically determine the
batch size in stochastic gradient descent methods. The choice of the batch size
induces a trade-off between the accuracy of the gradient estimate and the cost
in terms of samples of each update. We propose to determine the batch size by
optimizing the ratio between a lower bound to a linear or quadratic Taylor
approximation of the expected improvement and the number of samples used to
estimate the gradient. The performance of the proposed approach is empirically
compared with related methods on popular classification tasks.
The work was presented at the NIPS workshop on Optimizing the Optimizers.
Barcelona, Spain, 2016.
A Well-Tempered Landscape for Non-convex Robust Subspace Recovery
We present a mathematical analysis of a non-convex energy landscape for
robust subspace recovery. We prove that an underlying subspace is the only
stationary point and local minimizer in a specified neighborhood under a
deterministic condition on a dataset. If the deterministic condition is
satisfied, we further show that a geodesic gradient descent method over the
Grassmannian manifold can exactly recover the underlying subspace when the
method is properly initialized. Proper initialization by principal component
analysis is guaranteed with a simple deterministic condition. Under slightly
stronger assumptions, the gradient descent method with a piecewise constant
step-size scheme achieves linear convergence. The practicality of the
deterministic condition is demonstrated on some statistical models of data, and
the method achieves almost state-of-the-art recovery guarantees on the Haystack
Model for different regimes of sample size and ambient dimension. In
particular, when the ambient dimension is fixed and the sample size is large
enough, we show that our gradient method can exactly recover the underlying
subspace for any fixed fraction of outliers (less than 1).
Comment: 58 pages, 6 figures, 1 tabl
Adaptive Tuning Of Hamiltonian Monte Carlo Within Sequential Monte Carlo
Sequential Monte Carlo (SMC) samplers form an attractive alternative to MCMC
for Bayesian computation. However, their performance depends strongly on the
Markov kernels used to rejuvenate particles. We discuss how to calibrate
automatically (using the current particles) Hamiltonian Monte Carlo kernels
within SMC. To do so, we build upon the adaptive SMC approach of Fearnhead and
Taylor (2013), and we also suggest alternative methods. We illustrate the
advantages of using HMC kernels within an SMC sampler via an extensive
numerical study.
Black-Box Optimization in Machine Learning with Trust Region Based Derivative Free Algorithm
In this work, we utilize a Trust Region based Derivative Free Optimization
(DFO-TR) method to directly maximize the Area Under Receiver Operating
Characteristic Curve (AUC), which is a nonsmooth, noisy function. We show that
AUC is a smooth function, in expectation, if the distributions of the positive
and negative data points obey a jointly normal distribution. The practical
performance of this algorithm is compared to three prominent Bayesian
optimization methods and random search. The presented numerical results show
that DFO-TR surpasses Bayesian optimization and random search on various
black-box optimization problems, such as maximizing AUC and hyperparameter
tuning.
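To see why the empirical AUC is a nonsmooth objective, it helps to write it out as a count over (positive, negative) pairs. The sketch below is illustrative only; the function names and the linear scoring model are assumptions, not taken from the paper.

```python
def auc(scores_pos, scores_neg):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly,
    with ties counted as one half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def linear_auc(w, pos, neg):
    """AUC of a hypothetical linear scorer x -> w . x on two labeled point sets."""
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    return auc([score(x) for x in pos], [score(x) for x in neg])
```

Because the value depends on `w` only through the ordering of the scores, `linear_auc` is piecewise constant in `w`: small perturbations usually leave it unchanged and its gradient is zero wherever it exists. That is why derivative-free methods are a natural fit for maximizing it directly.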
A Cost-based Optimizer for Gradient Descent Optimization
As the use of machine learning (ML) permeates into diverse application
domains, there is an urgent need to support a declarative framework for ML.
Ideally, a user will specify an ML task in a high-level and easy-to-use
language and the framework will invoke the appropriate algorithms and system
configurations to execute it. An important observation towards designing such a
framework is that many ML tasks can be expressed as mathematical optimization
problems, which take a specific form. Furthermore, these optimization problems
can be efficiently solved using variations of the gradient descent (GD)
algorithm. Thus, to decouple a user specification of an ML task from its
execution, a key component is a GD optimizer. We propose a cost-based GD
optimizer that selects the best GD plan for a given ML task. To build our
optimizer, we introduce a set of abstract operators for expressing GD
algorithms and propose a novel approach to estimate the number of iterations a
GD algorithm requires to converge. Extensive experiments on real and synthetic
datasets show that our optimizer not only chooses the best GD plan but also
allows for optimizations that achieve orders of magnitude performance speed-up.
Comment: Accepted at SIGMOD 201
LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning
Gradient-based distributed learning in Parameter Server (PS) computing
architectures is subject to random delays due to straggling worker nodes, as
well as to possible communication bottlenecks between PS and workers. Solutions
have been recently proposed to separately address these impairments based on
the ideas of gradient coding, worker grouping, and adaptive worker selection.
This paper provides a unified analysis of these techniques in terms of
wall-clock time, communication, and computation complexity measures.
Furthermore, in order to combine the benefits of gradient coding and grouping
in terms of robustness to stragglers with the communication and computation
load gains of adaptive selection, novel strategies, named Lazily Aggregated
Gradient Coding (LAGC) and Grouped-LAG (G-LAG), are introduced. Analysis and
results show that G-LAG provides the best wall-clock time and communication
performance, while maintaining a low computational cost, for two representative
distributions of the computing times of the worker nodes.
Comment: Submitte
High Throughput Synchronous Distributed Stochastic Gradient Descent
We introduce a new, high-throughput, synchronous, distributed, data-parallel,
stochastic-gradient-descent learning algorithm. This algorithm uses amortized
inference in a compute-cluster-specific, deep, generative, dynamical model to
perform joint posterior predictive inference of the mini-batch gradient
computation times of all worker-nodes in a parallel computing cluster. We show
that a synchronous parameter server can, by utilizing such a model, choose a
cutoff time, beyond which mini-batch gradient messages from slow workers are
ignored, that maximizes overall mini-batch gradient computations per second.
In keeping with earlier findings we observe that, under realistic conditions,
eagerly discarding the mini-batch gradient computations of stragglers not only
increases throughput but actually increases the overall rate of convergence as
a function of wall-clock time by virtue of eliminating idleness. The principal
novel contribution and finding of this work goes beyond this by demonstrating
that using the predicted run-times from a generative model of cluster worker
performance to dynamically adjust the cutoff improves substantially over the
static-cutoff prior art, leading to, among other things, significantly reduced
deep neural net training times on large computer clusters.
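The cutoff selection described in this abstract can be sketched in a few lines, assuming the predicted per-worker run-times are already available as a list. In the paper they come from a learned generative model; here they are simply given, and the function name `best_cutoff` is hypothetical. Throughput is taken to be the number of workers finishing by the cutoff divided by the cutoff, ignoring communication and aggregation overheads.

```python
def best_cutoff(predicted_times):
    """Return (cutoff, throughput) maximizing gradients completed per second.

    Throughput can only peak at a worker's finish time, so those are the
    only candidate cutoffs that need to be checked.
    """
    times = sorted(predicted_times)
    best = (times[0], 1.0 / times[0])
    for k, c in enumerate(times, start=1):
        throughput = k / c  # k workers have finished by time c
        if throughput > best[1]:
            best = (c, throughput)
    return best
```

With predicted times of 1.0, 1.1, 1.2 and 5.0 seconds, the sketch picks a cutoff of 1.2 seconds and drops the straggler, since waiting for it would lower the number of gradients per second.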
Convergence of Contrastive Divergence Algorithm in Exponential Family
The Contrastive Divergence (CD) algorithm has achieved notable success in
training energy-based models including Restricted Boltzmann Machines and played
a key role in the emergence of deep learning. The idea of this algorithm is to
approximate the intractable term in the exact gradient of the log-likelihood
function by using short Markov chain Monte Carlo (MCMC) runs. The approximate
gradient is computationally cheap but biased. Whether and why the CD algorithm
provides an asymptotically consistent estimate are still open questions. This
paper studies the asymptotic properties of the CD algorithm in canonical
exponential families, which are special cases of the energy-based model.
Suppose the CD algorithm runs $m$ MCMC transition steps at each iteration $t$
and iteratively generates a sequence of parameter estimates $\{\theta_t\}_{t \ge 0}$
given an i.i.d. data sample $\{X_i\}_{i=1}^n$.
Under conditions which are commonly obeyed by the CD algorithm in practice, we
prove the existence of some bounded $m$ such that any limit point of the time
average $\frac{1}{t}\sum_{s=0}^{t-1} \theta_s$ as $t \to \infty$ is a
consistent estimate for the true parameter $\theta_\star$. Our proof is based
on the fact that $\{\theta_t\}_{t \ge 0}$ is a homogeneous Markov chain
conditional on the data sample $\{X_i\}_{i=1}^n$. This chain meets the
Foster-Lyapunov drift criterion and converges to a random walk around the
Maximum Likelihood Estimate. The range of the random walk shrinks to zero at
rate $\mathcal{O}(1/\sqrt[3]{n})$ as the sample size $n \to \infty$.
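For orientation, the approximation this abstract refers to can be written out for a canonical exponential family. The notation below (sufficient statistic $\phi$, log-partition function $A$, step size $\eta_t$) is assumed for illustration and is not taken from the listing.

```latex
% Canonical exponential family and its average log-likelihood gradient
p_\theta(x) = \exp\{\theta^\top \phi(x) - A(\theta)\}, \qquad
\nabla_\theta \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)
  = \frac{1}{n}\sum_{i=1}^n \phi(X_i) - \mathbb{E}_\theta[\phi(X)].

% CD with m MCMC steps: the intractable expectation is replaced by an
% average over samples X_i^{(m)}, each obtained by running m steps of an
% MCMC kernel targeting p_{\theta_t}, started from the data point X_i
\theta_{t+1} = \theta_t + \eta_t \left[
  \frac{1}{n}\sum_{i=1}^n \phi(X_i)
  - \frac{1}{n}\sum_{i=1}^n \phi\big(X_i^{(m)}\big) \right].
```

The second average is a biased estimate of $\mathbb{E}_{\theta_t}[\phi(X)]$ because the chains are run for only $m$ steps, which is the source of the bias mentioned in the abstract.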