7 research outputs found
Characterizing predictable classes of processes
The problem is sequence prediction in the following setting. A sequence
of discrete-valued observations is generated according to
some unknown probabilistic law (measure) \mu. After observing each outcome,
it is required to give the conditional probabilities of the next observation.
The measure \mu belongs to an arbitrary class \C of stochastic processes.
We are interested in predictors whose conditional probabilities converge
to the "true" \mu-conditional probabilities if any \mu\in\C is chosen to
generate the data. We show that if such a predictor exists, then a predictor
can also be obtained as a convex combination of countably many elements of
\C. In other words, it can be obtained as a Bayesian predictor whose prior is
concentrated on a countable set. This result is established for two very
different measures of performance of prediction, one of which is very strong,
namely, total variation, and the other is very weak, namely, prediction in
expected average Kullback-Leibler divergence.
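
The Bayesian predictor described in this abstract can be illustrated with a minimal sketch. It assumes a binary alphabet and truncates the countable class to a short list of Bernoulli models. The names `bernoulli` and `mixture_predict` are hypothetical and chosen for illustration only; this is not the construction used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

# A model maps a history of bits to P(next bit = 1 | history).
Model = Callable[[Sequence[int]], float]


def bernoulli(p: float) -> Model:
    """Hypothetical helper: an i.i.d. model that ignores the history."""
    return lambda history: p


def mixture_predict(
    models: List[Model], weights: List[float], history: Sequence[int]
) -> Tuple[float, List[float]]:
    """Predict with the convex combination sum_i w_i * mu_i.

    Returns P(next = 1 | history) under the mixture together with the
    posterior weights after conditioning on the history.
    """
    posterior = list(weights)
    for t, x in enumerate(history):
        past = history[:t]
        # Multiply each weight by the probability its model gave to x.
        for i, model in enumerate(models):
            p1 = model(past)
            posterior[i] *= p1 if x == 1 else 1.0 - p1
        # Renormalize at every step to avoid numerical underflow.
        total = sum(posterior)
        posterior = [w / total for w in posterior]
    p_next = sum(w * m(history) for w, m in zip(posterior, models))
    return p_next, posterior


if __name__ == "__main__":
    models = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
    prior = [1 / 3, 1 / 3, 1 / 3]
    print(mixture_predict(models, prior, [1] * 50))
```

With a uniform prior the first prediction is the plain average of the three models. After a long run of ones the posterior mass moves to the model with p = 0.8, so the mixture's conditional probability approaches that model's.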
Discrete MDL Predicts in Total Variation
The Minimum Description Length (MDL) principle selects the model that has the
shortest code for data plus model. We show that for a countable class of
models, MDL predictions are close to the true distribution in a strong sense.
The result is completely general. No independence, ergodicity, stationarity,
identifiability, or other assumption on the model class needs to be made. More
formally, we show that for any countable class of models, the distributions
selected by MDL (or MAP) asymptotically predict (merge with) the true measure
in the class in total variation distance. Implications for non-i.i.d. domains
like time-series forecasting, discriminative learning, and reinforcement
learning are discussed.
Comment: 15 LaTeX pages
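
The selection rule in this abstract (shortest code for data plus model) can be sketched as follows. The sketch assumes a binary alphabet, a finite list standing in for the countable class, and a model code length of -log2 of its prior weight. The names `bernoulli` and `mdl_select` are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence

# A model maps a history of bits to P(next bit = 1 | history).
Model = Callable[[Sequence[int]], float]


def bernoulli(p: float) -> Model:
    """Hypothetical helper: an i.i.d. model that ignores the history."""
    return lambda history: p


def mdl_select(
    models: List[Model], weights: List[float], history: Sequence[int]
) -> Optional[int]:
    """Return the index minimizing -log2 w_i - log2 mu_i(history).

    This two-part code length is minimized by the same model that
    maximizes the posterior weight w_i * mu_i(history) (MAP).
    Returns None if every model assigns probability zero to the data.
    """
    best_index, best_length = None, math.inf
    for i, (model, w) in enumerate(zip(models, weights)):
        length = -math.log2(w)  # bits to describe the model
        for t, x in enumerate(history):
            p1 = model(history[:t])
            p = p1 if x == 1 else 1.0 - p1
            if p == 0.0:
                length = math.inf  # the model rules out the data
                break
            length -= math.log2(p)  # bits to describe x given the past
        if length < best_length:
            best_index, best_length = i, length
    return best_index


if __name__ == "__main__":
    models = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
    prior = [0.5, 0.25, 0.25]
    data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    print(mdl_select(models, prior, data))
```

The MDL prediction for the next symbol is then the conditional probability given by the selected model alone, in contrast to a Bayesian mixture, which averages over all models.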
On Finding Predictors for Arbitrary Families of Processes
The problem is sequence prediction in the following setting. A sequence of discrete-valued observations is generated according to some unknown probabilistic law (measure) \mu. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure \mu belongs to an arbitrary but known class \C of stochastic process measures. We are interested in predictors whose conditional probabilities converge (in some sense) to the "true" \mu-conditional probabilities if any \mu\in\C is chosen to generate the sequence. The contribution of this work is in characterizing the families \C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family \C, as well as in terms of local behaviour of the measures in \C, which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family.
Nonparametric General Reinforcement Learning
Reinforcement learning problems are often phrased in terms of
Markov decision processes (MDPs). In this thesis we go beyond
MDPs and consider reinforcement learning in environments that are
non-Markovian, non-ergodic and only partially observable. Our
focus is not on practical algorithms, but rather on the
fundamental underlying problems: How do we balance exploration
and exploitation? How do we explore optimally? When is an agent
optimal? We follow the nonparametric realizable paradigm: we
assume the data is drawn from an unknown source that belongs to a
known countable class of candidates.
First, we consider the passive (sequence prediction) setting,
learning from data that is not independent and identically
distributed. We collect results from artificial intelligence,
algorithmic information theory, and game theory and put them in a
reinforcement learning context: they demonstrate how an agent can
learn the value of its own policy.
Next, we establish negative results on Bayesian reinforcement
learning agents, in particular AIXI. We show that unlucky or
adversarial choices of the prior cause the agent to misbehave
drastically. Therefore Legg-Hutter intelligence and balanced
Pareto optimality, which depend crucially on the choice of the
prior, are entirely subjective. Moreover, in the class of all
computable environments every policy is Pareto optimal. This
undermines all existing optimality properties for AIXI.
However, there are Bayesian approaches to general reinforcement
learning that satisfy objective optimality guarantees: We prove
that Thompson sampling
is asymptotically optimal in stochastic environments in the sense
that its value converges to the value of the optimal policy. We
connect asymptotic optimality to regret
given a recoverability assumption on the environment that allows
the agent to recover from mistakes. Hence Thompson sampling
achieves sublinear regret in these environments.
AIXI is known to be incomputable. We quantify this using the
arithmetical hierarchy, and establish upper and corresponding
lower bounds for incomputability. Further, we show that AIXI is
not limit computable and thus cannot be approximated using finite
computation. However, there are limit computable ε-optimal
approximations to AIXI. We also derive computability bounds for
knowledge-seeking agents, and give a limit computable weakly
asymptotically optimal reinforcement learning agent.
Finally, our results culminate in a formal solution to the grain
of truth problem: A Bayesian agent acting in a multi-agent
environment learns to predict the other agents' policies if its
prior assigns positive probability to them (the prior contains a
grain of truth). We construct a large but limit computable class
containing a grain of truth
and show that agents based on Thompson sampling over this class
converge to play ε-Nash equilibria in arbitrary unknown
computable multi-agent environments.
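
The Thompson sampling idea mentioned in this abstract (sample an environment from the posterior, act optimally for it, then update) can be illustrated on a toy problem. The sketch below assumes a Bernoulli bandit with a finite class of candidate environments and resamples at every step. The thesis treats general history-dependent environments, so this is only an illustration under those simplifying assumptions, and the name `thompson_bandit` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def thompson_bandit(
    candidates: List[Sequence[float]],
    prior: List[float],
    true_env: Sequence[float],
    steps: int,
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Thompson sampling over a finite class of Bernoulli bandits.

    Each candidate is a tuple of arm means; true_env plays the role of
    the unknown environment and is assumed to be in the class.
    Returns the pull count per arm and the final posterior.
    """
    rng = random.Random(seed)
    posterior = list(prior)
    pulls = [0] * len(true_env)
    for _ in range(steps):
        # Sample one candidate environment from the current posterior.
        env = rng.choices(candidates, weights=posterior, k=1)[0]
        # Act optimally for the sampled environment.
        arm = max(range(len(env)), key=lambda a: env[a])
        reward = 1 if rng.random() < true_env[arm] else 0
        pulls[arm] += 1
        # Bayes update: weight each candidate by the reward likelihood.
        posterior = [
            w * (c[arm] if reward == 1 else 1.0 - c[arm])
            for w, c in zip(posterior, candidates)
        ]
        total = sum(posterior)
        posterior = [w / total for w in posterior]
    return pulls, posterior


if __name__ == "__main__":
    candidates = [(0.2, 0.8), (0.8, 0.2)]
    print(thompson_bandit(candidates, [0.5, 0.5], candidates[0], steps=200))
```

In this toy class every pull is informative about which candidate is true, so the posterior concentrates quickly and the agent settles on the better arm.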