Extreme Tensoring for Low-Memory Preconditioning
State-of-the-art models are now trained with billions of parameters, reaching
hardware limits in terms of memory consumption. This has created new
demand for memory-efficient optimizers. To this end, we investigate the limits
and performance tradeoffs of memory-efficient adaptively preconditioned
gradient methods. We propose extreme tensoring for high-dimensional stochastic
optimization, showing that an optimizer needs very little memory to benefit
from adaptive preconditioning. Our technique applies to arbitrary models (not
necessarily with tensor-shaped parameters), and is accompanied by regret and
convergence guarantees, which shed light on the tradeoffs between
preconditioner quality and expressivity. On a large-scale NLP model, we reduce
the optimizer memory overhead by three orders of magnitude, without degrading
performance.
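
The abstract describes the technique only at a high level. As a rough illustration of what tensor-factored diagonal preconditioning can look like (one plausible reading of "extreme tensoring", not the paper's exact construction), the sketch below reshapes the flat parameter vector into a k-way tensor and keeps one accumulator vector per axis, so optimizer state grows with the sum of the axis lengths rather than their product. Function names, shapes, and step sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tensor_factored_adagrad_step(param, grad, axis_accumulators, shape,
                                 lr=0.1, eps=1e-8):
    """One illustrative step of tensor-factored adaptive preconditioning.

    The flat parameter vector is viewed as a tensor of shape `shape`
    (np.prod(shape) == param.size). Instead of a full per-coordinate
    AdaGrad accumulator of size prod(shape), one vector is kept per
    tensor axis (total memory sum(shape)), and each gradient entry is
    scaled by the product of the matching axis accumulators.
    """
    k = len(shape)
    g = grad.reshape(shape)
    g2 = g ** 2
    denom = np.ones(shape)
    for axis, acc in enumerate(axis_accumulators):
        # Accumulate squared gradients summed over all other axes (in place).
        other_axes = tuple(a for a in range(k) if a != axis)
        acc += g2.sum(axis=other_axes)
        # Broadcast this axis's accumulator across the full tensor shape.
        bshape = [1] * k
        bshape[axis] = shape[axis]
        denom = denom * acc.reshape(bshape)
    # Each factor effectively enters with exponent -1/(2k), so the overall
    # scaling behaves like an inverse square root of accumulated curvature.
    step = g / (denom ** (1.0 / (2 * k)) + eps)
    return param - lr * step.reshape(param.shape)

# Hypothetical usage: ~1M parameters viewed as a 32x32x32x32 tensor, so the
# optimizer state is 4 * 32 floats rather than ~1M.
shape = (32, 32, 32, 32)
rng = np.random.default_rng(0)
param = rng.standard_normal(int(np.prod(shape)))
accumulators = [np.zeros(s) for s in shape]
grad = rng.standard_normal(param.size)
param = tensor_factored_adagrad_step(param, grad, accumulators, shape)
```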
Lecture Notes: Optimization for Machine Learning
Lecture notes on optimization for machine learning, derived from a course at
Princeton University and tutorials given at MLSS, Buenos Aires, as well as at
the Simons Foundation, Berkeley.