89 research outputs found
Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients
Recent work has established an empirically successful framework for adapting
learning rates in stochastic gradient descent (SGD). This effectively removes
the need for tuning, while automatically reducing learning rates over time on
stationary problems and permitting learning rates to grow appropriately on
non-stationary tasks. Here, we extend the idea in three directions: proper
minibatch parallelization, reweighted updates for sparse or orthogonal
gradients, and improved robustness on non-smooth loss functions; in the
process, we replace the diagonal Hessian estimation procedure, which may not
always be available, with a robust finite-difference approximation. The final
algorithm integrates all these components, has linear complexity, and is
hyper-parameter free.
Comment: Published at the First International Conference on Learning
Representations (ICLR-2013). Public reviews are available at
http://openreview.net/document/c14f2204-fd66-4d91-bed4-153523694041#c14f2204-fd66-4d91-bed4-15352369404
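To make the finite-difference curvature idea concrete, here is a minimal
sketch in Python of estimating diagonal curvature from two gradient
evaluations, as a stand-in for an analytic diagonal Hessian. The function
names (fd_diag_curvature, grad_fn), the probe step size eps, and the
safeguards are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fd_diag_curvature(grad_fn, theta, g, eps=1e-4, floor=1e-8):
    """Finite-difference diagonal-curvature estimate (illustrative sketch).

    grad_fn: callable returning the minibatch gradient at a point.
    theta:   current parameter vector.
    g:       gradient already computed at theta, reused as the probe direction.
    """
    # Take a small probe step along the current gradient direction.
    delta = eps * g
    g_probe = grad_fn(theta + delta)
    # Elementwise secant slope |delta g_i| / |delta theta_i|; the absolute
    # value keeps the estimate nonnegative even on non-smooth losses.
    h = np.abs(g_probe - g) / (np.abs(delta) + floor)
    # Floor the estimate so downstream learning rates stay bounded.
    return np.maximum(h, floor)
```

Two gradient evaluations per step keep the cost linear in the number of
parameters, consistent with the linear-complexity claim in the abstract.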
No More Pesky Learning Rates
The performance of stochastic gradient descent (SGD) depends critically on
how learning rates are tuned and decreased over time. We propose a method to
automatically adjust multiple learning rates so as to minimize the expected
error at any one time. The method relies on local gradient variations across
samples. In our approach, learning rates can increase as well as decrease,
making it suitable for non-stationary problems. Using a number of convex and
non-convex learning tasks, we show that the resulting algorithm matches the
performance of SGD or other adaptive approaches with their best settings
obtained through systematic search, and effectively removes the need for
learning rate tuning.
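As a rough illustration of the method described above, the sketch below
adapts per-parameter learning rates from running estimates of the gradient
mean and variance, in the spirit of the paper's update rule. The variable
names, the externally supplied curvature estimate h (e.g., the
finite-difference sketch above), and the numerical safeguards are my
assumptions, not the published implementation.

```python
import numpy as np

def adaptive_sgd_step(theta, grad, state, eps=1e-12):
    """One SGD step with per-parameter adaptive learning rates (sketch).

    state holds, per parameter: gbar (running mean of the gradient),
    vbar (running mean of the squared gradient), tau (memory length),
    and h (a diagonal curvature estimate supplied by the caller).
    """
    gbar, vbar, tau, h = state["gbar"], state["vbar"], state["tau"], state["h"]
    # Running averages with per-parameter time constants tau.
    gbar = (1.0 - 1.0 / tau) * gbar + grad / tau
    vbar = (1.0 - 1.0 / tau) * vbar + grad ** 2 / tau
    # Adaptive rate: large when gradients agree across samples
    # (gbar^2 close to vbar), small when they are mostly noise.
    eta = gbar ** 2 / (h * vbar + eps)
    # Memory shrinks when gradients are consistent and grows when they
    # are noisy, so the averages can track non-stationary problems and
    # the rates can increase again after a shift.
    tau = (1.0 - gbar ** 2 / (vbar + eps)) * tau + 1.0
    state.update(gbar=gbar, vbar=vbar, tau=tau)
    return theta - eta * grad, state
```

In practice the running statistics would be initialized from a small
bootstrap set of gradients before the rates are trusted; that warm-up is
also an assumption of this sketch.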
- …