Estimating the Maximum Expected Value: An Analysis of (Nested) Cross Validation and the Maximum Sample Average
We investigate the accuracy of the two most common estimators for the maximum expected value of a general set of random variables: a generalization of the maximum sample average, and cross validation. No unbiased estimator exists, and we show that it is non-trivial to select a good estimator without knowledge about the distributions of the random variables. We investigate and bound the bias and variance of the aforementioned estimators and prove consistency. The variance of cross validation can be significantly reduced, but not without risking a large bias. The bias and variance of different variants of cross validation are shown to be very problem-dependent, and a wrong choice can lead to very inaccurate estimates.
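A minimal sketch of the two estimators (Python/NumPy; the function names and the particular two-fold split are illustrative assumptions, not the paper's code). The maximum sample average takes the largest sample mean directly, while the cross-validation variant uses one fold to select the maximizing variable and the other fold to estimate its value:

import numpy as np

def max_sample_average(samples):
    # samples: list of 1-D arrays, one array of draws per random variable.
    # Return the largest of the sample means.
    return max(s.mean() for s in samples)

def cross_validation_estimate(samples, rng):
    # Two-fold variant: fold A selects the argmax variable, fold B
    # estimates its mean; roles are then swapped and the two
    # resulting estimates averaged.
    folds = [np.array_split(rng.permutation(s), 2) for s in samples]
    estimates = []
    for pick, evaluate in ((0, 1), (1, 0)):
        best = int(np.argmax([f[pick].mean() for f in folds]))
        estimates.append(folds[best][evaluate].mean())
    return float(np.mean(estimates))

rng = np.random.default_rng(0)
# Ten variables that all have true mean 0, so the true maximum
# expected value is 0.
samples = [rng.normal(0.0, 1.0, size=100) for _ in range(10)]
print(max_sample_average(samples))              # tends to overestimate 0
print(cross_validation_estimate(samples, rng))  # tends to underestimate 0

On variables with equal means, the maximum sample average is biased upwards and the cross-validation estimate downwards, matching the bias trade-off the abstract describes.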
Reinforcement learning in continuous state and action spaces
Many traditional reinforcement-learning algorithms have been designed for problems with small finite state and action spaces. Learning in such discrete problems can be difficult, due to noise and delayed reinforcements. However, many real-world problems have continuous state or action spaces, which can make learning a good decision policy even more involved. In this chapter we discuss how to automatically find good decision policies in continuous domains.
Because analytically computing a good policy from a continuous model can be infeasible, in this chapter we mainly focus on methods that explicitly update a representation of a value function, a policy or both. We discuss considerations in choosing an appropriate representation for these functions and discuss gradient-based and gradient-free ways to update the parameters. We show how to apply these methods to reinforcement-learning problems and discuss many specific algorithms. Amongst others, we cover gradient-based temporal-difference learning, evolutionary strategies, policy-gradient algorithms and actor-critic methods. We discuss the advantages of different approaches and empirically compare the performance of a state-of-the-art actor-critic method and a state-of-the-art evolutionary strategy.
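As one concrete instance of the gradient-based temporal-difference methods the chapter covers, here is a minimal sketch of semi-gradient TD(0) with a linear value function (Python/NumPy; the function name, feature vectors, discount and step size are illustrative assumptions, not the chapter's code):

import numpy as np

def td0_linear_update(theta, phi_s, reward, phi_next, gamma=0.99, alpha=0.05):
    # The value estimate is linear in the state features: V(s) = theta . phi(s).
    td_error = reward + gamma * (theta @ phi_next) - theta @ phi_s
    # Semi-gradient step: adjust the weights along the current features.
    return theta + alpha * td_error * phi_s

The same parameter vector could instead be updated gradient-free, e.g. by an evolutionary strategy that perturbs theta and keeps the best-performing candidates, which is the comparison the chapter makes empirically.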
Deep Reinforcement Learning with Double Q-learning
The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
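The adaptation amounts to a change in the bootstrap target: the online network selects the greedy next action and the target network evaluates it. A minimal sketch of the two targets (Python/NumPy; the array names and discount are illustrative):

import numpy as np

def dqn_target(reward, q_next_target, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the
    # next action; this max is the source of the upward bias.
    return reward + gamma * np.max(q_next_target)

def double_dqn_target(reward, q_next_online, q_next_target, gamma=0.99):
    # Double DQN: decouple selection (online network) from
    # evaluation (target network).
    a = int(np.argmax(q_next_online))
    return reward + gamma * q_next_target[a]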
Double Q-learning
In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values, which result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.
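In Q-learning terms, the double estimator keeps two independent tables and lets one select the greedy action while the other evaluates it. A minimal tabular sketch of the update (Python/NumPy; the table layout, discount and step size are illustrative assumptions):

import numpy as np

def double_q_update(qa, qb, s, a, reward, s_next, rng, gamma=0.99, alpha=0.1):
    # qa, qb: arrays of shape (num_states, num_actions).
    # On each step, update one table, using the other table to evaluate
    # the greedy action of the table being updated.
    if rng.random() < 0.5:
        a_star = int(np.argmax(qa[s_next]))
        qa[s, a] += alpha * (reward + gamma * qb[s_next, a_star] - qa[s, a])
    else:
        b_star = int(np.argmax(qb[s_next]))
        qb[s, a] += alpha * (reward + gamma * qa[s_next, b_star] - qb[s, a])

Because the evaluating table's errors are independent of the selecting table's errors, the positive bias of the single max is removed, at the cost of a possible underestimate.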
Chaining Value Functions for Off-Policy Learning
To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn ‘off-policy’ about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated out of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent.

In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges.

We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore, it can be interpreted as estimating a novel objective – that we call a ‘k-step expedition’ – of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird’s counterexample and observe favourable results.
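A minimal linear-TD reading of this construction (Python/NumPy; the importance weight rho, feature vectors, discount and step size are illustrative assumptions, not the paper's exact algorithm): each link V_{k+1} regresses towards a one-step importance-weighted target that bootstraps on the frozen previous link V_k, so every link is a stable supervised-style update.

import numpy as np

def chain_link_update(theta_next, theta_prev, phi_s, reward, phi_s2,
                      rho, gamma=0.99, alpha=0.05):
    # theta_prev parameterizes the frozen previous link V_k;
    # theta_next parameterizes the link V_{k+1} being learned.
    # rho = pi(a|s) / mu(a|s) reweights behaviour data towards
    # one step of the target policy.
    target = rho * (reward + gamma * (theta_prev @ phi_s2))
    td_error = target - theta_next @ phi_s
    return theta_next + alpha * td_error * phi_s

Because the bootstrap target uses the frozen theta_prev rather than theta_next itself, each link avoids the deadly combination of off-policy sampling and self-bootstrapping that destabilizes plain off-policy TD.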