Gradient Descent: The Ultimate Optimizer
Working with any gradient-based machine learning algorithm involves the
tedious task of tuning the optimizer's hyperparameters, such as the learning
rate. There exist many techniques for automated hyperparameter optimization,
but they typically introduce even more hyperparameters to control the
hyperparameter optimization process. We propose to instead learn the
hyperparameters themselves by gradient descent, and furthermore to learn the
hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As
these towers of gradient-based optimizers grow, they become significantly less
sensitive to the choice of top-level hyperparameters, hence decreasing the
burden on the user to search for optimal values.
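The core idea above can be sketched on a toy quadratic loss. This is a minimal illustration, not the paper's code: the loss L(w) = 0.5 * w**2 and all names are assumptions, and only one level of the hyperparameter tower is shown. By the chain rule, dL/d(alpha) = dL/dw_t * dw_t/d(alpha) = grad_t * (-grad_{t-1}), since w_t = w_{t-1} - alpha * grad_{t-1}.

```python
# Minimal sketch of learning the learning rate itself by gradient descent
# (one level of the hyperparameter tower); toy loss L(w) = 0.5 * w**2.

w = 5.0        # model parameter
alpha = 0.01   # learning rate, itself learned by gradient descent
kappa = 1e-3   # hyper-learning rate (the remaining top-level hyperparameter)

prev_grad = 0.0
for _ in range(100):
    grad = w                              # dL/dw for L(w) = 0.5 * w**2
    alpha -= kappa * (-grad * prev_grad)  # hypergradient step on alpha
    w -= alpha * grad                     # ordinary gradient step on w
    prev_grad = grad
```

In the paper's scheme, kappa could itself be learned by a further level of the same update, which is why the top-level choice matters less as the tower grows.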
Alternating Back-Propagation for Generator Network
This paper proposes an alternating back-propagation algorithm for learning
the generator network model. The model is a non-linear generalization of factor
analysis. In this model, the mapping from the continuous latent factors to the
observed signal is parametrized by a convolutional neural network. The
alternating back-propagation algorithm iterates the following two steps: (1)
Inferential back-propagation, which infers the latent factors by Langevin
dynamics or gradient descent. (2) Learning back-propagation, which updates the
parameters given the inferred latent factors by gradient descent. The gradient
computations in both steps are powered by back-propagation, and they share most
of their code in common. We show that the alternating back-propagation
algorithm can learn realistic generator models of natural images, video
sequences, and sounds. Moreover, it can also be used to learn from incomplete
or indirect training data.
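The two alternating steps can be sketched on a toy model. This is an illustrative assumption, not the paper's setup: a linear generator x = W z stands in for the ConvNet mapping, and plain gradient steps stand in for Langevin dynamics (adding noise to step (1) would recover Langevin sampling).

```python
import numpy as np

# Toy sketch of alternating back-propagation with a linear generator.
rng = np.random.default_rng(0)
d_z, d_x, n = 2, 5, 200
W_true = rng.normal(size=(d_x, d_z))
X = W_true @ rng.normal(size=(d_z, n)) + 0.1 * rng.normal(size=(d_x, n))

W = rng.normal(size=(d_x, d_z))   # generator parameters
Z = np.zeros((d_z, n))            # latent factors, one column per example

for _ in range(500):
    # (1) Inferential back-propagation: infer Z given the current W
    for _ in range(5):
        Z -= 0.05 * (W.T @ (W @ Z - X) + 0.01 * Z)  # 0.01*Z: Gaussian prior
    # (2) Learning back-propagation: update W given the inferred Z
    W -= 0.05 * ((W @ Z - X) @ Z.T) / n

recon_err = np.mean((X - W @ Z) ** 2)
```

Both updates are gradients of the same reconstruction objective, which is why the two steps can share most of their back-propagation code.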
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models
Transformers are remarkably good at in-context learning (ICL) -- learning
from demonstrations without parameter updates -- but how they perform ICL
remains a mystery. Recent work suggests that Transformers may learn in-context
by internally running Gradient Descent, a first-order optimization method. In
this paper, we instead demonstrate that Transformers learn to implement
higher-order optimization methods to perform ICL. Focusing on in-context linear
regression, we show that Transformers learn to implement an algorithm very
similar to Iterative Newton's Method, a higher-order optimization method,
rather than Gradient Descent. Empirically, we show that predictions from
successive Transformer layers closely match different iterations of Newton's
Method linearly, with each middle layer roughly computing 3 iterations. In
contrast, exponentially more Gradient Descent steps are needed to match an
additional Transformer layer; this suggests that Transformers have a rate of
convergence comparable to higher-order methods such as Iterative Newton, which
are exponentially faster than Gradient Descent. We also show that
Transformers can learn in-context on ill-conditioned data, a setting where
Gradient Descent struggles but Iterative Newton succeeds. Finally, we show
theoretical results which support our empirical findings and have a close
correspondence with them: we prove that Transformers can implement k
iterations of Newton's method with O(k) layers.
Transformers learn to implement preconditioned gradient descent for in-context learning
Motivated by the striking ability of transformers for in-context learning,
several works demonstrate that transformers can implement algorithms like
gradient descent. By a careful construction of weights, these works show that
multiple layers of transformers are expressive enough to simulate gradient
descent iterations. Going beyond the question of expressivity, we ask: Can
transformers learn to implement such algorithms by training over random problem
instances? To our knowledge, we make the first theoretical progress toward this
question via analysis of the loss landscape for linear transformers trained
over random instances of linear regression. For a single attention layer, we
prove the global minimum of the training objective implements a single
iteration of preconditioned gradient descent. Notably, the preconditioning
matrix not only adapts to the input distribution but also to the variance
induced by data inadequacy. For a transformer with L attention layers, we
prove certain critical points of the training objective implement L
iterations of preconditioned gradient descent. Our results call for future
theoretical studies on learning algorithms by training transformers.
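A single iteration of preconditioned gradient descent for in-context linear regression can be written in a few lines. The choice of preconditioner below (the inverse empirical covariance) is an illustrative assumption; the paper's learned preconditioner also adapts to the variance induced by data inadequacy.

```python
import numpy as np

# One preconditioned gradient step w <- w - P * grad on in-context
# linear regression (noiseless, for illustration).
rng = np.random.default_rng(1)
n, d = 50, 4
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star

w = np.zeros(d)
grad = X.T @ (X @ w - y) / n        # gradient of the in-context loss
P = np.linalg.inv(X.T @ X / n)      # preconditioner (illustrative choice)
w_next = w - P @ grad               # one preconditioned step

err = np.linalg.norm(w_next - w_star)
```

With this particular P, a single step from w = 0 already recovers the least-squares solution exactly, which illustrates why preconditioning is so much more expressive than a plain gradient step of the same depth.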
Cooperative Reinforcement Learning Using an Expert-Measuring Weighted Strategy with WoLF
Gradient descent learning algorithms have proven effective in solving mixed strategy games. The policy hill climbing (PHC) variants of WoLF (Win or Learn Fast) and PDWoLF (Policy Dynamics based WoLF) have both shown rapid convergence to equilibrium solutions by increasing the accuracy of their gradient parameters over standard Q-learning. Likewise, cooperative learning techniques using weighted strategy sharing (WSS) and expertness measurements improve agent performance when multiple agents are solving a common goal. By combining these cooperative techniques with fast gradient descent learning, an agent's performance converges to a solution at an even faster rate. This statement is verified in a stochastic grid world environment using a limited visibility hunter-prey model with random and intelligent prey. Among five different expertness measurements, cooperative learning using each PHC algorithm converges faster than independent learning when agents strictly learn from better performing agents.
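The WoLF-PHC update at the heart of this abstract uses two learning rates: a small one when the agent is winning (its current policy outperforms its average policy) and a large one when losing. The sketch below is a minimal single-state version with illustrative names and a two-armed bandit in place of the paper's hunter-prey grid world; the cooperative WSS machinery is omitted.

```python
import random

def wolf_phc_step(Q, pi, avg_pi, counts, a, r, alpha=0.1,
                  delta_w=0.01, delta_l=0.04):
    # Q-learning update (single state, so no bootstrap term)
    Q[a] += alpha * (r - Q[a])
    # Incrementally track the average policy
    counts[0] += 1
    for i in range(len(pi)):
        avg_pi[i] += (pi[i] - avg_pi[i]) / counts[0]
    # "Win or Learn Fast": small step when winning, large when losing
    winning = sum(p * q for p, q in zip(pi, Q)) > \
              sum(p * q for p, q in zip(avg_pi, Q))
    delta = delta_w if winning else delta_l
    # Hill-climb toward the greedy action, then renormalize onto the simplex
    best = max(range(len(Q)), key=lambda i: Q[i])
    for i in range(len(pi)):
        if i == best:
            pi[i] = min(1.0, pi[i] + delta)
        else:
            pi[i] = max(0.0, pi[i] - delta / (len(pi) - 1))
    s = sum(pi)
    for i in range(len(pi)):
        pi[i] /= s

# Usage: a two-armed bandit where action 0 pays 1 and action 1 pays 0
random.seed(0)
Q, pi, avg_pi, counts = [0.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0]
for _ in range(500):
    a = 0 if random.random() < pi[0] else 1
    r = 1.0 if a == 0 else 0.0
    wolf_phc_step(Q, pi, avg_pi, counts, a, r)
```

The variable learning rate is what gives WoLF its fast convergence: an agent that is losing moves its policy aggressively, while a winning agent changes slowly, stabilizing play near equilibrium.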