Practical recommendations for gradient-based training of deep architectures
Learning algorithms related to artificial neural networks and in particular
for Deep Learning may seem to involve many bells and whistles, called
hyper-parameters. This chapter is meant as a practical guide with
recommendations for some of the most commonly used hyper-parameters, in
particular in the context of learning algorithms based on back-propagated
gradient and gradient-based optimization. It also discusses how to deal with
the fact that more interesting results can be obtained when allowing one to
adjust many hyper-parameters. Overall, it describes elements of the practice
used to successfully and efficiently train and debug large-scale and often deep
multi-layer neural networks. It closes with open questions about the training
difficulties observed with deeper architectures.
Bayesian Optimization for Adaptive MCMC
This paper proposes a new randomized strategy for adaptive MCMC using
Bayesian optimization. This approach applies to non-differentiable objective
functions and trades off exploration and exploitation to reduce the number of
potentially costly objective function evaluations. We demonstrate the strategy
in the complex setting of sampling from constrained, discrete and densely
connected probabilistic graphical models where, for each variation of the
problem, one needs to adjust the parameters of the proposal mechanism
automatically to ensure efficient mixing of the Markov chains.
Comment: This paper contains 12 pages and 6 figures. A similar version of this
paper has been submitted to AISTATS 2012 and is currently under review.
No More Pesky Learning Rates
The performance of stochastic gradient descent (SGD) depends critically on
how learning rates are tuned and decreased over time. We propose a method to
automatically adjust multiple learning rates so as to minimize the expected
error at any one time. The method relies on local gradient variations across
samples. In our approach, learning rates can increase as well as decrease,
making it suitable for non-stationary problems. Using a number of convex and
non-convex learning tasks, we show that the resulting algorithm matches the
performance of SGD or other adaptive approaches with their best settings
obtained through systematic search, and effectively removes the need for
learning rate tuning.