199 research outputs found
Examining average and discounted reward optimality criteria in reinforcement learning
In reinforcement learning (RL), the goal is to obtain an optimal policy, for
which the optimality criterion is fundamentally important. Two major optimality
criteria are average and discounted rewards, where the later is typically
considered as an approximation to the former. While the discounted reward is
more popular, it is problematic to apply in environments that have no natural
notion of discounting. This motivates us to revisit a) the progression of
optimality criteria in dynamic programming, b) justification for and
complication of an artificial discount factor, and c) benefits of directly
maximizing the average reward. Our contributions include a thorough examination
of the relationship between average and discounted rewards, as well as a
discussion of their pros and cons in RL. We emphasize that average-reward RL
methods possess the ingredient and mechanism for developing the general
discounting-free optimality criterion (Veinott, 1969) in RL.Comment: 14 pages, 3 figures, 10-page main conten
Bridging RL Theory and Practice with the Effective Horizon
Deep reinforcement learning (RL) works impressively in some environments and
fails catastrophically in others. Ideally, RL theory should be able to provide
an understanding of why this is, i.e. bounds predictive of practical
performance. Unfortunately, current theory does not quite have this ability. We
compare standard deep RL algorithms to prior sample complexity prior bounds by
introducing a new dataset, BRIDGE. It consists of 155 MDPs from common deep RL
benchmarks, along with their corresponding tabular representations, which
enables us to exactly compute instance-dependent bounds. We find that prior
bounds do not correlate well with when deep RL succeeds vs. fails, but discover
a surprising property that does. When actions with the highest Q-values under
the random policy also have the highest Q-values under the optimal policy, deep
RL tends to succeed; when they don't, deep RL tends to fail. We generalize this
property into a new complexity measure of an MDP that we call the effective
horizon, which roughly corresponds to how many steps of lookahead search are
needed in order to identify the next optimal action when leaf nodes are
evaluated with random rollouts. Using BRIDGE, we show that the effective
horizon-based bounds are more closely reflective of the empirical performance
of PPO and DQN than prior sample complexity bounds across four metrics. We also
show that, unlike existing bounds, the effective horizon can predict the
effects of using reward shaping or a pre-trained exploration policy
Accelerating decision making under partial observability using learned action priors
Thesis (M.Sc.)--University of the Witwatersrand, Faculty of Science, School of Computer Science and Applied Mathematics, 2017.Partially Observable Markov Decision Processes (POMDPs) provide a principled mathematical
framework allowing a robot to reason about the consequences of actions and
observations with respect to the agent's limited perception of its environment. They
allow an agent to plan and act optimally in uncertain environments. Although they
have been successfully applied to various robotic tasks, they are infamous for their high
computational cost. This thesis demonstrates the use of knowledge transfer, learned
from previous experiences, to accelerate the learning of POMDP tasks. We propose
that in order for an agent to learn to solve these tasks quicker, it must be able to generalise
from past behaviours and transfer knowledge, learned from solving multiple tasks,
between di erent circumstances. We present a method for accelerating this learning
process by learning the statistics of action choices over the lifetime of an agent, known
as action priors. Action priors specify the usefulness of actions in situations and allow
us to bias exploration, which in turn improves the performance of the learning process.
Using navigation domains, we study the degree to which transferring knowledge
between tasks in this way results in a considerable speed up in solution times.
This thesis therefore makes the following contributions. We provide an algorithm
for learning action priors from a set of approximately optimal value functions and two
approaches with which a prior knowledge over actions can be used in a POMDP context.
As such, we show that considerable gains in speed can be achieved in learning subsequent
tasks using prior knowledge rather than learning from scratch. Learning with
action priors can particularly be useful in reducing the cost of exploration in the early
stages of the learning process as the priors can act as mechanism that allows the agent
to select more useful actions given particular circumstances. Thus, we demonstrate how
the initial losses associated with unguided exploration can be alleviated through the
use of action priors which allow for safer exploration. Additionally, we illustrate that
action priors can also improve the computation speeds of learning feasible policies in a
shorter period of time.MT201
Improving the Practicality of Model-Based Reinforcement Learning: An Investigation into Scaling up Model-Based Methods in Online Settings
This thesis is a response to the current scarcity of practical model-based control algorithms in the reinforcement learning (RL) framework. As of yet there is no consensus on how best to integrate imperfect transition models into RL whilst mitigating policy improvement instabilities in online settings. Current state-of-the-art policy learning algorithms that surpass human performance often rely on model-free approaches that enjoy unmitigated sampling of transition data. Model-based RL (MBRL) instead attempts to distil experience into transition models that allow agents to plan new policies without needing to return to the environment and sample more data. The initial focus of this investigation is on kernel conditional mean embeddings (CMEs) (Song et al., 2009) deployed in an approximate policy iteration (API) algorithm (Grünewälder et al., 2012a). This existing MBRL algorithm boasts theoretically stable policy updates in continuous state and discrete action spaces. The Bellman operator’s value function and (transition) conditional expectation are modelled and embedded respectively as functions in a reproducing kernel Hilbert space (RKHS). The resulting finite-induced approximate pseudo-MDP (Yao et al., 2014a) can be solved exactly in a dynamic programming algorithm with policy improvement suboptimality guarantees. However model construction and policy planning scale cubically and quadratically respectively with the training set size, rendering the CME impractical for sampleabundant tasks in online settings. Three variants of CME API are investigated to strike a balance between stable policy updates and reduced computational complexity. The first variant models the value function and state-action representation explicitly in a parametric CME (PCME) algorithm with favourable computational complexity. However a soft conservative policy update technique is developed to mitigate policy learning oscillations in the planning process. The second variant returns to the non-parametric embedding and contributes (along with external work) to the compressed CME (CCME); a sparse and computationally more favourable CME. The final variant is a fully end-to-end differentiable embedding trained with stochastic gradient updates. The value function remains modelled in an RKHS such that backprop is driven by a non-parametric RKHS loss function. Actively compressed CME (ACCME) satisfies the pseudo-MDP contraction constraint using a sparse softmax activation function. The size of the pseudo-MDP (i.e. the size of the embedding’s last layer) is controlled by sparsifying the last layer weight matrix by extending the truncated gradient method (Langford et al., 2009) with group lasso updates in a novel ‘use it or lose it’ neuron pruning mechanism. Surprisingly this technique does not require extensive fine-tuning between control tasks
- …