845 research outputs found
Variational Bayesian Reinforcement Learning with Regret Bounds
We consider the exploration-exploitation trade-off in reinforcement learning
and we show that an agent imbued with an epistemic-risk-seeking utility
function is able to explore efficiently, as measured by regret. The parameter
that controls how risk-seeking the agent is can be optimized to minimize
regret, or annealed according to a schedule. We call the resulting algorithm
K-learning and we show that the K-values that the agent maintains are
optimistic for the expected optimal Q-values at each state-action pair. The
utility function approach induces a natural Boltzmann exploration policy for
which the 'temperature' parameter is equal to the risk-seeking parameter. This
policy achieves a Bayesian regret bound of $\tilde{O}(L^{3/2} \sqrt{S A T})$,
where L is the time horizon, S is the number of states, A is the number of
actions, and T is the total number of elapsed time-steps. K-learning can be
interpreted as mirror descent in the policy space, and it is similar to other
well-known methods in the literature, including Q-learning, soft-Q-learning,
and maximum entropy policy gradient. K-learning is simple to implement, as it
only requires adding a bonus to the reward at each state-action pair and then
solving a Bellman equation. We conclude with a numerical example demonstrating
that K-learning is competitive with other state-of-the-art algorithms in
practice.
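The abstract above says only that K-learning adds a bonus to the reward at each state-action pair, solves a Bellman equation, and acts with a Boltzmann policy whose temperature equals the risk-seeking parameter. A minimal sketch of that recipe, under the extra assumptions of a known finite-horizon tabular MDP (`P`, `R`) and a precomputed `bonus` array (none of which are specified in the abstract), might look like:

```python
import numpy as np

def k_learning_sketch(P, R, bonus, tau, horizon):
    """Hedged sketch of a K-learning-style backup (finite-horizon, tabular).

    Assumptions beyond the abstract: a known transition tensor P[s, a, s'],
    mean rewards R[s, a], and a per-(s, a) exploration bonus are given.
    tau is the risk-seeking parameter, reused as the Boltzmann temperature.
    """
    S, A, _ = P.shape
    K = np.zeros((horizon + 1, S, A))  # K-values; terminal layer is zero
    for h in range(horizon - 1, -1, -1):
        # Soft value of the next step: tau * log-sum-exp over actions.
        V = tau * np.log(np.exp(K[h + 1] / tau).sum(axis=1))  # shape (S,)
        # Bonus-augmented Bellman backup: reward + bonus + expected soft value.
        K[h] = R + bonus + P @ V  # (S, A, S) @ (S,) -> (S, A)
    # Boltzmann exploration policy at the first step, temperature tau.
    logits = K[0] / tau
    policy = np.exp(logits - logits.max(axis=1, keepdims=True))
    policy /= policy.sum(axis=1, keepdims=True)
    return K, policy
```

The log-sum-exp backup is what makes the resulting policy a softmax over K-values, matching the abstract's claim that the utility-function approach induces Boltzmann exploration with temperature equal to the risk parameter.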
Reinforcement Learning, Bit by Bit
Reinforcement learning agents have demonstrated remarkable achievements in
simulated environments. Data efficiency poses an impediment to carrying this
success over to real environments. The design of data-efficient agents calls
for a deeper understanding of information acquisition and representation. We
develop concepts and establish a regret bound that together offer principled
guidance. The bound sheds light on questions of what information to seek, how
to seek that information, and what information to retain. To illustrate these
concepts, we design simple agents that build on them and present computational
results that demonstrate improvements in data efficiency.
- …