Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization
Exploration remains a key challenge in deep reinforcement learning (RL).
Optimism in the face of uncertainty is a well-known heuristic with theoretical
guarantees in the tabular setting, but how best to translate the principle to
deep reinforcement learning, which involves online stochastic gradients and
deep network function approximators, is not fully understood. In this paper we
propose a new, differentiable optimistic objective that, when optimized, yields a
policy that provably explores efficiently, with guarantees even under function
approximation. Our new objective is a zero-sum two-player game derived from
endowing the agent with an epistemic-risk-seeking utility function, which
converts uncertainty into value and encourages the agent to explore uncertain
states. We show that the solution to this game minimizes an upper bound on the
regret, with the 'players' each attempting to minimize one component of a
particular regret decomposition. We derive a new model-free algorithm which we
call 'epistemic-risk-seeking actor-critic' (ERSAC), which is simply an
application of simultaneous stochastic gradient ascent-descent to the game.
Finally, we discuss a recipe for incorporating off-policy data and show that
combining the risk-seeking objective with replay data yields a double benefit
in terms of statistical efficiency. We conclude with results showing that a deep RL
agent using this technique performs well on the challenging 'DeepSea' environment,
with significant gains even over other efficient-exploration techniques, as well as
improved performance on the Atari benchmark.
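
As a hedged illustration of the central mechanism, the exponential (risk-seeking) utility that converts epistemic uncertainty into value, here is a minimal numpy sketch. It is not the paper's ERSAC algorithm; the function name, `tau`, and the Gaussian posterior samples are illustrative assumptions. It shows that, for the same posterior mean, a more uncertain Q-estimate receives a higher utility, which is what pushes the agent toward uncertain states.

```python
import numpy as np

def risk_seeking_utility(q_samples, tau):
    """Exponential (risk-seeking) utility of sampled Q-values.

    For tau > 0 this is the certainty equivalent
        u = (1 / tau) * log E[exp(tau * Q)],
    which is always >= E[Q] by Jensen's inequality, so epistemic
    spread in the Q-samples becomes an exploration bonus.
    """
    # log-mean-exp, computed stably
    m = np.max(tau * q_samples)
    return (m + np.log(np.mean(np.exp(tau * q_samples - m)))) / tau

rng = np.random.default_rng(0)
# hypothetical posterior samples of Q(s, a) for two actions:
# same mean, different epistemic uncertainty
q_certain = rng.normal(loc=1.0, scale=0.1, size=10_000)
q_uncertain = rng.normal(loc=1.0, scale=1.0, size=10_000)

tau = 1.0
print(risk_seeking_utility(q_certain, tau))    # ~1.005
print(risk_seeking_utility(q_uncertain, tau))  # ~1.5
```

For Gaussian samples the certainty equivalent is approximately mean + tau * variance / 2, so the uncertain action scores roughly 1.5 against 1.005 for the near-certain one, despite identical means.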
Variational Bayesian Reinforcement Learning with Regret Bounds
We consider the exploration-exploitation trade-off in reinforcement learning
and we show that an agent imbued with an epistemic-risk-seeking utility
function is able to explore efficiently, as measured by regret. The parameter
that controls how risk-seeking the agent is can be optimized to minimize
regret, or annealed according to a schedule. We call the resulting algorithm
K-learning and we show that the K-values that the agent maintains are
optimistic for the expected optimal Q-values at each state-action pair. The
utility function approach induces a natural Boltzmann exploration policy for
which the 'temperature' parameter is equal to the risk-seeking parameter. This
policy achieves a Bayesian regret bound of $\tilde{O}(L^{3/2}\sqrt{SAT})$,
where L is the time horizon, S is the number of states, A is the number of
actions, and T is the total number of elapsed time-steps. K-learning can be
interpreted as mirror descent in the policy space, and it is similar to other
well-known methods in the literature, including Q-learning, soft-Q-learning,
and maximum entropy policy gradient. K-learning is simple to implement, as it
only requires adding a bonus to the reward at each state-action and then
solving a Bellman equation. We conclude with a numerical example demonstrating
that K-learning is competitive with other state-of-the-art algorithms in
practice.
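
Since the abstract spells out the recipe (add a bonus to the reward, solve a Bellman equation, act with a Boltzmann policy whose temperature equals the risk-seeking parameter), a tabular sketch is easy to give. The code below is a minimal finite-horizon version under stated assumptions: the `bonus` array is a generic stand-in for the paper's uncertainty bonus, and all names are illustrative rather than the paper's API.

```python
import numpy as np

def k_learning_sketch(P, R, bonus, tau, horizon):
    """Hedged tabular sketch of a K-learning-style backup.

    P:     (S, A, S) transition probabilities
    R:     (S, A) expected rewards
    bonus: (S, A) optimism bonus added to the reward
           (a stand-in for the paper's uncertainty bonus)
    tau:   risk-seeking parameter, also the Boltzmann temperature
    Returns K-values (H+1, S, A) and the Boltzmann policy (H, S, A).
    """
    S, A = R.shape
    K = np.zeros((horizon + 1, S, A))
    pi = np.zeros((horizon, S, A))
    for h in reversed(range(horizon)):
        # soft (log-sum-exp) value of the next stage at temperature tau
        V_next = tau * np.log(np.sum(np.exp(K[h + 1] / tau), axis=1))  # (S,)
        # Bellman equation with bonus-augmented reward
        K[h] = R + bonus + P @ V_next
        # Boltzmann exploration policy with temperature tau
        logits = K[h] / tau
        logits -= logits.max(axis=1, keepdims=True)
        pi[h] = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return K, pi

# toy usage with made-up numbers
S, A, H = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # (S, A, S)
R = rng.uniform(size=(S, A))
bonus = 0.1 * np.ones((S, A))                # stand-in uncertainty bonus
K, pi = k_learning_sketch(P, R, bonus, tau=0.5, horizon=H)
```

The design point the abstract emphasises survives in the sketch: the only non-standard ingredients are the bonus added to R and the log-sum-exp backup, after which the exploration policy falls out as a softmax of the K-values at temperature tau.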
On the connection between Bregman divergence and value in regularized Markov decision processes
In this short note we derive a relationship between the Bregman divergence
from the current policy to the optimal policy and the suboptimality of the
current value function in a regularized Markov decision process. This result
has implications for multi-task reinforcement learning, offline reinforcement
learning, and regret analysis under function approximation, among others.
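
The abstract does not reproduce the identity itself, but the object it is built from is concrete. Below is a small sketch using the generic definition D_omega(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>, verifying the standard fact that for the negative-entropy regularizer (the one behind entropy-regularized MDPs) the Bregman divergence between two policies at a state is exactly the KL divergence. The policies `p` and `q` are made-up examples, not from the note.

```python
import numpy as np

def bregman(omega, grad_omega, x, y):
    """Bregman divergence D_omega(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>."""
    return omega(x) - omega(y) - grad_omega(y) @ (x - y)

# negative entropy, the regularizer behind entropy-regularized MDPs
neg_entropy = lambda p: np.sum(p * np.log(p))
grad_neg_entropy = lambda p: np.log(p) + 1.0

p = np.array([0.7, 0.2, 0.1])   # e.g. an optimal policy at one state
q = np.array([0.4, 0.4, 0.2])   # e.g. the current policy at that state

kl = np.sum(p * np.log(p / q))
print(bregman(neg_entropy, grad_neg_entropy, p, q))  # equals the KL divergence
print(kl)
```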
Cohabitation in Ireland: evidence from survey data
Cohabitation has grown strongly in Ireland over the last decade. We use large-scale surveys to characterise its extent and
nature. We find it has almost tripled in incidence between 1994 and 2002. It is associated with being young, urban and in
the labour market. Most cohabitations are short, and a high proportion end in marriage. Over 40% of new marriages are
now preceded by cohabitation, making it close to a majority practice rather than the deviant behaviour it would have been
a generation ago. In this respect it seems to be developing as an adaptation of marriage rather than an alternative to it.