Randomized Automatic Differentiation
The successes of deep learning, variational inference, and many other fields
have been aided by specialized implementations of reverse-mode automatic
differentiation (AD) to compute gradients of mega-dimensional objectives. The
AD techniques underlying these tools were designed to compute exact gradients
to numerical precision, but modern machine learning models are almost always
trained with stochastic gradient descent. Why spend computation and memory on
exact (minibatch) gradients only to use them for stochastic optimization? We
develop a general framework and approach for randomized automatic
differentiation (RAD), which can allow unbiased gradient estimates to be
computed with reduced memory in return for variance. We examine limitations of
the general approach, and argue that we must leverage problem-specific
structure to realize benefits. We develop RAD techniques for a variety of
simple neural network architectures, and show that for a fixed memory budget,
RAD converges in fewer iterations than using a small batch size for feedforward
networks, and in a similar number of iterations for recurrent networks. We also show that RAD
can be applied to scientific computing, and use it to develop a low-memory
stochastic gradient method for optimizing the control parameters of a linear
reaction-diffusion PDE representing a fission reactor.

Comment: ICLR 2021
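
To make the memory-for-variance trade concrete, here is a minimal NumPy sketch. It is illustrative only: the function names and the element-wise sparsification scheme are assumptions for exposition, not the paper's specific RAD construction. The idea shown is that the backward pass for a linear layer only needs a stored copy of the layer input, so storing a randomly subsampled, inverse-probability-scaled copy yields an unbiased (but higher-variance) weight-gradient estimate at lower memory cost.

import numpy as np

rng = np.random.default_rng(0)

def forward_store_sparse(x, W, keep_prob=0.25):
    """Linear-layer forward pass that stores a randomly sparsified,
    rescaled copy of the input for use in the backward pass."""
    y = x @ W
    mask = rng.random(x.shape) < keep_prob
    # Inverse-probability scaling keeps the gradient estimate unbiased:
    # E[x_saved] = x.
    x_saved = np.where(mask, x / keep_prob, 0.0)
    return y, x_saved  # x_saved is mostly zeros and can be stored sparsely

def backward_weight_grad(x_saved, grad_y):
    """Unbiased estimate of dL/dW = x^T grad_y using the sparse copy."""
    return x_saved.T @ grad_y

# Tiny usage example
x = rng.standard_normal((8, 4))   # minibatch of layer inputs
W = rng.standard_normal((4, 3))
y, x_saved = forward_store_sparse(x, W)
grad_y = np.ones_like(y)          # stand-in for the upstream gradient
g_est = backward_weight_grad(x_saved, grad_y)
g_exact = x.T @ grad_y            # exact gradient, for comparison

Because the rescaled masked copy equals the dense input in expectation, the estimated gradient matches the exact one in expectation; the cost is added variance, which is the trade-off the abstract describes.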