The Sample-Complexity of General Reinforcement Learning
We present a new algorithm for general reinforcement learning where the true
environment is known to belong to a finite class of N arbitrary models. The
algorithm is shown to be near-optimal for all but O(N log^2 N) time-steps with
high probability. Infinite classes are also considered, where we show that
compactness is a key criterion for determining the existence of uniform
sample-complexity bounds. A matching lower bound is given for the finite case.
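To make the finite-class setting concrete, here is a minimal, hypothetical Python sketch of elimination-style learning over a finite class of candidate models; it is an illustration of the general idea, not the paper's algorithm, and the `value_estimate`, `greedy_action`, `prediction_error`, and `env.step` interfaces are assumed for the sake of the example:

```python
def eliminate_and_act(models, env, horizon, tol=0.1):
    """Hypothetical sketch: `models` is a finite list of candidate
    environment models; the true environment is assumed to be among them.
    Act greedily w.r.t. the most promising surviving model, and discard
    models whose predictions are contradicted by observations."""
    surviving = list(models)
    history = []
    for _ in range(horizon):
        # Follow the surviving model that promises the most value.
        best = max(surviving, key=lambda m: m.value_estimate(history))
        action = best.greedy_action(history)
        obs, reward = env.step(action)
        history.append((action, obs, reward))
        # Eliminate models whose predictions deviate too far from the data.
        surviving = [m for m in surviving
                     if m.prediction_error(history) <= tol]
    return history
```

Intuitively, a wrong model either gets contradicted and eliminated, or keeps making accurate-enough predictions, in which case following it costs little; bounding how often eliminations can happen for N models yields bounds of the O(N log^2 N) flavour stated above.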
Extreme State Aggregation Beyond MDPs
We consider a Reinforcement Learning setup where an agent interacts with an
environment in observation-reward-action cycles without any (esp. MDP)
assumptions on the environment. State aggregation and more generally feature
reinforcement learning is concerned with mapping histories/raw-states to
reduced/aggregated states. The idea behind both is that the resulting reduced
process (approximately) forms a small stationary finite-state MDP, which can
then be efficiently solved or learnt. We considerably generalize existing
aggregation results by showing that even if the reduced process is not an MDP,
the (q-)value functions and (optimal) policies of an associated MDP with the
same state-space size solve the original problem, as long as the solution can
approximately be represented as a function of the reduced states. This implies
an upper bound on the required state space size that holds uniformly for all RL
problems. It may also explain why RL algorithms designed for MDPs sometimes
perform well beyond MDPs.
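As an illustration of the feature-RL setup described above, the following hedged Python sketch (my illustration, not the paper's construction) runs ordinary tabular Q-learning on reduced states: a map `phi` aggregates the raw interaction history of a possibly non-MDP environment into a small set of states, which are then treated as if they formed an MDP; `env.reset`, `env.step`, and `phi` are assumed interfaces:

```python
from collections import defaultdict
import random

def q_learning_on_aggregated(env, phi, episodes, n_actions,
                             alpha=0.1, gamma=0.99, eps=0.1):
    """Assumed interfaces: env.reset() -> obs,
    env.step(a) -> (obs, reward, done); phi maps the full interaction
    history (a tuple) to a small hashable reduced state."""
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        obs = env.reset()
        history = (obs,)          # raw history of the non-MDP environment
        done = False
        while not done:
            s = phi(history)      # aggregated/reduced state
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            obs, r, done = env.step(a)
            history = history + (a, r, obs)
            s2 = phi(history)
            # Ordinary Q-learning update on the reduced process, treating
            # it as if it were a stationary finite-state MDP.
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

The abstract's result suggests why such a procedure can work even when the reduced process is not an MDP: it suffices that the original problem's solution is approximately representable as a function of the reduced states phi(history).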
A dual process theory of optimistic cognition
Optimism is a prevalent bias in human cognition, with variations including self-serving beliefs, illusions of control, and overly positive views of one's own future. Further, optimism has been linked with both success and happiness; indeed, it has been described as part of human mental well-being, even though well-being has otherwise been assumed to require being connected to reality. Paradoxically, only people suffering from depression tend to perceive reality accurately. Here we study a formalization of optimism within a dual-process framework and examine its usefulness beyond human needs, in a way that also applies to artificial reinforcement learning agents. Optimism enables systematic exploration, which is essential in a (partially) unknown world. The key property of an optimistic hypothesis is that if it is not contradicted when one acts greedily with respect to it, then one is well rewarded even if the hypothesis is wrong.
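The exploration property described in the last sentence can be demonstrated with a standard technique from the RL literature, optimistic initialization, used here purely as an illustration rather than as the paper's own formalization; the bandit setup and function below are my example:

```python
import random

def greedy_bandit_with_optimism(true_means, steps, init=1.0, alpha=0.1):
    """Purely greedy play on a Bernoulli bandit, starting from an
    optimistic value hypothesis for every arm (illustrative only)."""
    n = len(true_means)
    estimates = [init] * n        # optimistic hypothesis for each arm
    counts = [0] * n
    for _ in range(steps):
        arm = max(range(n), key=lambda i: estimates[i])   # greedy choice
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # A fixed step size lets data erode the optimism only gradually,
        # so every arm keeps being tried until experience contradicts
        # the optimistic hypothesis about it.
        estimates[arm] += alpha * (reward - estimates[arm])
    return estimates, counts
```

For example, `greedy_bandit_with_optimism([0.2, 0.5, 0.8], 2000)` samples every arm despite never choosing randomly: an arm's optimism is either contradicted (and the agent moves on) or survives because the arm is in fact well rewarded, which is exactly the key property stated above.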
Settling the Reward Hypothesis
The reward hypothesis posits that, "all of what we mean by goals and purposes
can be well thought of as maximization of the expected value of the cumulative
sum of a received scalar signal (reward)." We aim to fully settle this
hypothesis. Our treatment does not conclude with a simple affirmation or
refutation, but rather completely specifies the implicit requirements on goals
and purposes under which the hypothesis holds.
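For reference, the maximization quoted in the hypothesis is conventionally formalized as the expected discounted return; this is the standard textbook objective, not a construction specific to this paper:

```latex
% One standard formalization of "maximizing the expected value of the
% cumulative sum of a received scalar signal": the expected discounted
% return under policy \pi, with scalar rewards R_t and a discount
% factor 0 <= gamma < 1 ensuring the sum is well defined.
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t}\right]
```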
Near-optimal PAC bounds for discounted MDPs
We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends linearly on the number of non-zero transition probabilities. The lower bound strengthens previous work by being both more general (it applies to all policies) and tighter. The upper and lower bounds match up to logarithmic factors, provided the transition matrix is not too dense.
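Schematically, the kind of bound described has the following shape; this rendering omits constants and logarithmic factors, and the exact statement and conditions are in the paper:

```latex
% Schematic shape of the sample-complexity bound (constants and log
% factors omitted). Here T is the number of non-zero transition
% probabilities (giving the linear dependence noted above), epsilon the
% accuracy, delta the failure probability, and 1/(1-gamma) the effective
% horizon, so (1-\gamma)^{-3} is the cubic horizon dependence.
\tilde{O}\!\left( \frac{T}{\varepsilon^{2}(1-\gamma)^{3}}
  \,\log\frac{1}{\delta} \right)
\ \text{time-steps that may be non-}\varepsilon\text{-optimal}
```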