2 research outputs found
Curiosity Killed the Cat and the Asymptotically Optimal Agent
Reinforcement learners are agents that learn to pick actions that lead to
high reward. Ideally, the value of a reinforcement learner's policy approaches
optimality--where the optimal informed policy is the one which maximizes
reward. Unfortunately, we show that if an agent is guaranteed to be
"asymptotically optimal" in any (stochastically computable) environment, then
subject to an assumption about the true environment, this agent will be either
destroyed or incapacitated with probability 1; both of these are forms of traps
as understood in the Markov Decision Process literature. Environments with
traps pose a well-known problem for agents, but we are unaware of other work
which shows that traps are not only a risk, but a certainty, for agents of a
certain caliber. Much work in reinforcement learning uses an ergodicity
assumption to avoid this problem. Often, doing theoretical research under
simplifying assumptions prepares us to provide practical solutions even in the
absence of those assumptions, but the ergodicity assumption in reinforcement
learning may have led us entirely astray in preparing safe and effective
exploration strategies for agents in dangerous environments. Rather than
assuming away the problem, we present an agent with the modest guarantee of
approaching the performance of a mentor, doing safe exploration instead of
reckless exploration.

Comment: 8 pages, with 4-page appendix; 1 figure
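To make the closing idea concrete, here is a minimal Python sketch of a mentor-guided learner. It is an illustration under stated assumptions, not the paper's construction: mentor_policy, q_estimate, and the square-root decay schedule are all hypothetical placeholders.

```python
import random

def mentor_step(state, actions, q_estimate, mentor_policy, t, beta=1.0):
    """One decision of a mentor-guided learner (illustrative sketch only).

    Rather than exploring on its own, which can walk the agent into a
    trap, the agent sometimes copies a known-safe mentor. The deferral
    probability decays with the timestep t, so the agent's own greedy
    policy takes over in the limit.
    """
    defer_prob = min(1.0, beta / (t + 1) ** 0.5)  # hypothetical decay schedule
    if random.random() < defer_prob:
        return mentor_policy(state)  # safe exploration: imitate the mentor
    # Otherwise exploit current value estimates.
    return max(actions, key=lambda a: q_estimate(state, a))
```

The point of the sketch is the asymmetry: exploration is outsourced to the mentor, so the agent never has to try dangerous actions on its own, at the cost of only guaranteeing mentor-level performance rather than asymptotic optimality.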
Pessimism About Unknown Unknowns Inspires Conservatism
If we could define the set of all bad outcomes, we could hard-code an agent
which avoids them; however, in sufficiently complex environments, this is
infeasible. We do not know of any general-purpose approaches in the literature
to avoiding novel failure modes. Motivated by this, we define an idealized
Bayesian reinforcement learner which follows a policy that maximizes the
worst-case expected reward over a set of world-models. We call this agent
pessimistic, since it optimizes assuming the worst case. A scalar parameter
tunes the agent's pessimism by changing the size of the set of world-models
taken into account. Our first main contribution is: given an assumption about
the agent's model class, a sufficiently pessimistic agent does not cause
"unprecedented events" with probability , whether or not designers
know how to precisely specify those precedents they are concerned with. Since
pessimism discourages exploration, at each timestep, the agent may defer to a
mentor, who may be a human or some known-safe policy we would like to improve.
Our other main contribution is that the agent's policy's value approaches at
least that of the mentor, while the probability of deferring to the mentor goes
to 0. In high-stakes environments, we might like advanced artificial agents to
pursue goals cautiously, which is a non-trivial problem even if the agent were
allowed arbitrary computing power; we present a formal solution.

Comment: 12 pages, plus 16-page appendix; to be published in COLT 2020 proceedings
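As a companion illustration, the min-max rule at the heart of the pessimistic agent can be sketched in a few lines of Python. The credible-set construction below (keep the highest-posterior world-models until their mass reaches a threshold beta) and all names are assumptions for illustration; the paper's formal definitions differ in detail.

```python
def pessimistic_action(posterior, value, actions, beta=0.9):
    """Choose the action with the best worst-case value (illustrative sketch).

    posterior: dict mapping world-model id -> posterior weight (sums to 1)
    value:     function (model, action) -> expected return under that model
    beta:      pessimism knob in (0, 1]; larger beta keeps more models,
               so the worst case is taken over a bigger set
    """
    # Keep the highest-weight models until their cumulative mass reaches beta.
    ranked = sorted(posterior, key=posterior.get, reverse=True)
    credible, mass = [], 0.0
    for model in ranked:
        credible.append(model)
        mass += posterior[model]
        if mass >= beta:
            break
    # Score each action by its worst case over the credible set, then maximize.
    return max(actions, key=lambda a: min(value(m, a) for m in credible))

# Toy usage: "go" looks great under the most likely model but is
# catastrophic under a plausible one, so the pessimist stays put.
post = {"m1": 0.6, "m2": 0.3, "m3": 0.1}
v = {("m1", "go"): 1.0, ("m1", "stay"): 0.4,
     ("m2", "go"): -5.0, ("m2", "stay"): 0.3,
     ("m3", "go"): 2.0, ("m3", "stay"): 0.2}
print(pessimistic_action(post, lambda m, a: v[(m, a)], ["go", "stay"]))  # stay
```

Since this kind of worst-case scoring discourages trying anything new, the abstract pairs it with the option to defer to a mentor, which supplies the exploration that pessimism suppresses.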