86,135 research outputs found
Cooperative Reinforcement Learning Using an Expert-Measuring Weighted Strategy with WoLF
Gradient descent learning algorithms have proven effective in solving mixed strategy games. The policy hill climbing (PHC) variants of WoLF (Win or Learn Fast) and PDWoLF (Policy Dynamics based WoLF) have both shown rapid convergence to equilibrium solutions by increasing the accuracy of their gradient parameters over standard Q-learning. Likewise, cooperative learning techniques using weighted strategy sharing (WSS) and expertness measurements improve agent performance when multiple agents are solving a common goal. By combining these cooperative techniques with fast gradient descent learning, an agent’s performance converges to a solution at an even faster rate. This statement is verified in a stochastic grid world environment using a limited visibility hunter-prey model with random and intelligent prey. Among five different expertness measurements, cooperative learning using each PHC algorithm converges faster than independent learning when agents strictly learn from better performing agents
Learning-to-Learn Stochastic Gradient Descent with Biased Regularization
We study the problem of learning-to-learn: inferring a learning algorithm
that works well on tasks sampled from an unknown distribution. As class of
algorithms we consider Stochastic Gradient Descent on the true risk regularized
by the square euclidean distance to a bias vector. We present an average excess
risk bound for such a learning algorithm. This result quantifies the potential
benefit of using a bias vector with respect to the unbiased case. We then
address the problem of estimating the bias from a sequence of tasks. We propose
a meta-algorithm which incrementally updates the bias, as new tasks are
observed. The low space and time complexity of this approach makes it appealing
in practice. We provide guarantees on the learning ability of the
meta-algorithm. A key feature of our results is that, when the number of tasks
grows and their variance is relatively small, our learning-to-learn approach
has a significant advantage over learning each task in isolation by Stochastic
Gradient Descent without a bias term. We report on numerical experiments which
demonstrate the effectiveness of our approach.Comment: 37 pages, 8 figure
Trajectory-Based Off-Policy Deep Reinforcement Learning
Policy gradient methods are powerful reinforcement learning algorithms and
have been demonstrated to solve many complex tasks. However, these methods are
also data-inefficient, afflicted with high variance gradient estimates, and
frequently get stuck in local optima. This work addresses these weaknesses by
combining recent improvements in the reuse of off-policy data and exploration
in parameter space with deterministic behavioral policies. The resulting
objective is amenable to standard neural network optimization strategies like
stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo.
Incorporation of previous rollouts via importance sampling greatly improves
data-efficiency, whilst stochastic optimization schemes facilitate the escape
from local optima. We evaluate the proposed approach on a series of continuous
control benchmark tasks. The results show that the proposed algorithm is able
to successfully and reliably learn solutions using fewer system interactions
than standard policy gradient methods.Comment: Includes appendix. Accepted for ICML 201
- …