Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight
This paper studies the sample-efficiency of learning in Partially Observable
Markov Decision Processes (POMDPs), a challenging problem in reinforcement
learning that is known to be exponentially hard in the worst-case. Motivated by
real-world settings such as loading saved games in game playing, we propose an enhanced
feedback model called ``multiple observations in hindsight'', where after each
episode of interaction with the POMDP, the learner may collect multiple
additional observations emitted from the encountered latent states, but may not
observe the latent states themselves. We show that sample-efficient learning
under this feedback model is possible for two new subclasses of POMDPs:
\emph{multi-observation revealing POMDPs} and \emph{distinguishable POMDPs}.
Both subclasses generalize and substantially relax \emph{revealing POMDPs} -- a
widely studied subclass for which sample-efficient learning is possible under
standard trajectory feedback. Notably, distinguishable POMDPs only require the
emission distributions from different latent states to be \emph{different}
instead of \emph{linearly independent} as required in revealing POMDPs.
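To make the distinction concrete, here is a small numerical sketch (illustrative only, not taken from the paper): with three latent states but only two observations, an emission matrix can have pairwise-distinct rows, so the POMDP is distinguishable, even though three rows of a two-dimensional space can never be linearly independent, so it is not revealing.

import numpy as np

# Emission matrix O: rows = latent states, columns = observations.
O = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.1, 0.9],
])

rank = np.linalg.matrix_rank(O)               # 2 < 3 latent states: not revealing
tv = lambda p, q: 0.5 * np.abs(p - q).sum()   # total-variation distance
gaps = [tv(O[i], O[j]) for i in range(3) for j in range(i + 1, 3)]

print(rank)        # 2
print(min(gaps))   # 0.4 > 0: emissions from different states are pairwise distinct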
Offline RL with Observation Histories: Analyzing and Improving Sample Complexity
Offline reinforcement learning (RL) can in principle synthesize more optimal
behavior from a dataset consisting only of suboptimal trials. One way that this
can happen is by "stitching" together the best parts of otherwise suboptimal
trajectories that overlap on similar states, to create new behaviors where each
individual state is in-distribution, but the overall returns are higher.
However, in many interesting and complex applications, such as autonomous
navigation and dialogue systems, the state is partially observed. Even worse,
the state representation is unknown or not easy to define. In such cases,
policies and value functions are often conditioned on observation histories
instead of states. It is then unclear whether the same kind of
"stitching" is feasible at the level of observation histories, since two
different trajectories would always have different histories, and thus "similar
states" that might lead to effective stitching cannot be leveraged.
Theoretically, we show that standard offline RL algorithms conditioned on
observation histories suffer from poor sample complexity, in accordance with
the above intuition. We then identify sufficient conditions under which offline
RL can still be efficient -- intuitively, it needs to learn a compact
representation of history comprising only features relevant for action
selection. We introduce a bisimulation loss that captures the extent to which
this happens, and propose that offline RL can explicitly optimize this loss to
aid worst-case sample complexity. Empirically, we show that across a variety of
tasks either our proposed loss improves performance, or the value of this loss
is already minimized as a consequence of standard offline RL, indicating that
it correlates well with good performance.
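Where a compact history representation matters, a bisimulation-style objective can be written down directly; the following is a generic PyTorch sketch of one common instantiation (pairwise, with an L1 metric and a stop-gradient target), not necessarily the exact loss proposed in the paper.

import torch

def bisimulation_loss(phi, batch1, batch2, gamma=0.99):
    # phi maps a fixed-length history vector to a compact feature vector.
    # The loss asks that distances between encoded histories track the
    # difference in immediate rewards plus the discounted distance between
    # the encoded successor histories.
    h1, r1, h1_next = batch1
    h2, r2, h2_next = batch2
    z1, z2 = phi(h1), phi(h2)
    with torch.no_grad():                       # stop-gradient on the target
        zn1, zn2 = phi(h1_next), phi(h2_next)
        target = (r1 - r2).abs() + gamma * (zn1 - zn2).abs().sum(-1)
    dist = (z1 - z2).abs().sum(-1)              # L1 distance between encodings
    return ((dist - target) ** 2).mean()

# Toy usage with an MLP encoder over 32-dimensional history vectors.
phi = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 8))
b1 = (torch.randn(16, 32), torch.randn(16), torch.randn(16, 32))
b2 = (torch.randn(16, 32), torch.randn(16), torch.randn(16, 32))
bisimulation_loss(phi, b1, b2).backward()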
Partially Observable Multi-agent RL with (Quasi-)Efficiency: The Blessing of Information Sharing
We study provable multi-agent reinforcement learning (MARL) in the general
framework of partially observable stochastic games (POSGs). To circumvent the
known hardness results and the use of computationally intractable oracles, we
advocate leveraging the potential \emph{information-sharing} among agents, a
common practice in empirical MARL, and a standard model for multi-agent control
systems with communications. We first establish several computation complexity
results to justify the necessity of information-sharing, as well as the
observability assumption that has enabled quasi-efficient single-agent RL with
partial observations, for computational efficiency in solving POSGs. We then
propose to further \emph{approximate} the shared common information to
construct an \emph{approximate model} of the POSG, in which planning an approximate
equilibrium (in terms of solving the original POSG) can be quasi-efficient,
i.e., of quasi-polynomial-time, under the aforementioned assumptions.
Furthermore, we develop a partially observable MARL algorithm that is both
statistically and computationally quasi-efficient. We hope our study may open
up the possibilities of leveraging and even designing different
\emph{information structures}, for developing both sample- and
computation-efficient partially observable MARL.
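One simple way to picture the "approximate common information" idea (our illustration, not the paper's construction) is a finite-memory compression of the shared history: each agent acts on a compressed common statistic plus its own private observation, and the coarser the compression, the cheaper planning in the induced approximate model becomes, at the price of a worse approximate equilibrium.

def compress_common_info(shared_history, k=3):
    # Keep only the last k jointly observed signals as the approximate
    # common information (a finite-memory truncation; purely illustrative).
    return tuple(shared_history[-k:])

def agent_action(policy_table, shared_history, private_obs, k=3):
    # Each agent conditions on (compressed common info, private observation).
    key = (compress_common_info(shared_history, k), private_obs)
    return policy_table.get(key, "default_action")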
Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs)
with general function approximation. Existing methods such as sequential
importance sampling estimators and fitted-Q evaluation suffer from the curse of
horizon in POMDPs. To circumvent this problem, we develop a novel model-free
OPE method by introducing future-dependent value functions that take future
proxies as inputs. Future-dependent value functions play similar roles as
classical value functions in fully-observable MDPs. We derive a new Bellman
equation for future-dependent value functions as conditional moment equations
that use history proxies as instrumental variables. We further propose a
minimax learning method to learn future-dependent value functions using the new
Bellman equation. We obtain a PAC result, which implies our OPE estimator is
consistent as long as futures and histories contain sufficient information
about latent states and Bellman completeness holds. Finally, we extend our
methods to learning of dynamics and establish the connection between our
approach and the well-known spectral learning methods in POMDPs.
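Schematically, and only as a sketch of the kind of object described above (the notation is ours, written for the on-policy case and omitting the importance weights that handle the off-policy correction), a future-dependent value function $g$ takes a future proxy $F_t$ as input and is required to satisfy a conditional moment restriction in which the history proxy $H_t$ plays the role of an instrumental variable,
\[
\mathbb{E}\big[\, R_t + \gamma\, g(F_{t+1}) - g(F_t) \,\big|\, H_t \,\big] = 0 ,
\]
and a standard minimax estimator replaces the conditioning with adversarial test functions $f$ of the history,
\[
\hat g \in \arg\min_{g \in \mathcal{G}} \max_{f \in \mathcal{F}} \; \mathbb{E}_n\Big[ \big(R_t + \gamma\, g(F_{t+1}) - g(F_t)\big)\, f(H_t) - \tfrac{1}{2} f(H_t)^2 \Big],
\]
which is the generic conditional-moment / instrumental-variable form; the estimator in the paper differs in its details.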
Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations
In real-world reinforcement learning (RL) systems, various forms of impaired
observability can complicate matters. These situations arise when an agent is
unable to observe the most recent state of the system due to latency or lossy
channels, yet the agent must still make real-time decisions. This paper
introduces a theoretical investigation into efficient RL in control systems
where agents must act with delayed and missing state observations. We establish
near-optimal regret bounds for RL in both the delayed and missing observation settings.
Despite impaired observability posing significant challenges to the policy
class and planning, our results demonstrate that learning remains efficient,
with the regret bound optimally depending on the state-action size of the
original system. Additionally, we provide a characterization of the performance
of the optimal policy under impaired observability, comparing it to the optimal
value obtained with full observability.
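As a rough illustration of what acting under delayed observations does to the policy class (a generic construction, not the paper's algorithm), the agent can only condition on the freshest state it has actually received together with the actions it has issued since that state was generated, i.e., on an augmented information state:

class DelayedObsAgent:
    # Acts under delayed state observations. The decision input is the
    # freshest received state plus the actions taken since it was generated,
    # which is the natural information state under delay. Illustrative only.
    def __init__(self, policy):
        self.policy = policy          # maps (state, tuple_of_actions) -> action
        self.last_state = None
        self.actions_since = []       # actions the observed state has not yet "seen"

    def observe(self, state, age):
        # Receive a state observation that is `age` actions old.
        self.last_state = state
        self.actions_since = self.actions_since[-age:] if age > 0 else []

    def act(self):
        action = self.policy(self.last_state, tuple(self.actions_since))
        self.actions_since.append(action)
        return action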
Posterior Sampling for Competitive RL: Function Approximation and Partial Observation
This paper investigates posterior sampling algorithms for competitive
reinforcement learning (RL) in the context of general function approximations.
Focusing on zero-sum Markov games (MGs) under two critical settings, namely
self-play and adversarial learning, we first propose the self-play and
adversarial generalized eluder coefficient (GEC) as complexity measures for
function approximation, capturing the exploration-exploitation trade-off in
MGs. Based on self-play GEC, we propose a model-based self-play posterior
sampling method to control both players to learn Nash equilibrium, which can
successfully handle the partial observability of states. Furthermore, we
identify a set of partially observable MG models fitting MG learning with the
adversarial policies of the opponent. Incorporating the adversarial GEC, we
propose a model-based posterior sampling method for learning adversarial MG
with potential partial observability. We further provide low regret bounds for
the proposed algorithms that scale sublinearly with the proposed GEC and the
number of episodes. To the best of our knowledge, this is the first work to
develop generic model-based posterior sampling algorithms for competitive RL
that can be applied to a majority of tractable zero-sum MG classes in both
fully observable and partially observable MGs with self-play and adversarial
learning.
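As a very small illustration of the posterior-sampling idea in a competitive setting (a toy zero-sum matrix game rather than a Markov game, and not the paper's algorithm), one can maintain a posterior over the payoff matrix, sample a matrix each episode, let both players play a Nash equilibrium of the sampled game, and update the posterior with the observed payoff:

import numpy as np
from scipy.optimize import linprog

def nash_of_zero_sum(A):
    # Max-player equilibrium strategy of the zero-sum matrix game A via an LP:
    # maximize v subject to (x^T A)_j >= v for all columns j, x a distribution.
    m, n = A.shape
    c = np.zeros(m + 1); c[-1] = -1.0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return np.clip(res.x[:m], 0.0, None)

rng = np.random.default_rng(0)
true_A = np.array([[0.2, 0.8], [0.7, 0.3]])                  # unknown payoff matrix
mean, count = np.zeros_like(true_A), np.ones_like(true_A)    # crude Gaussian-style posterior

for episode in range(200):
    sampled = mean + rng.normal(size=true_A.shape) / np.sqrt(count)  # Thompson sample
    x = nash_of_zero_sum(sampled)          # max player's strategy in the sampled game
    y = nash_of_zero_sum(-sampled.T)       # min player's strategy (its own zero-sum LP)
    i = rng.choice(2, p=x / x.sum())
    j = rng.choice(2, p=y / y.sum())
    payoff = true_A[i, j] + 0.1 * rng.normal()               # noisy observed payoff
    mean[i, j] += (payoff - mean[i, j]) / count[i, j]        # running-mean update
    count[i, j] += 1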
Blessing from Experts: Super Reinforcement Learning in Confounded Environments
We introduce super reinforcement learning in the batch setting, which takes
the observed action as input for enhanced policy learning. In the presence of
unmeasured confounders, the recommendations from human experts recorded in the
observed data allow us to recover certain unobserved information. By including
this information in the policy search, the proposed super reinforcement
learning yields a super-policy that is guaranteed to outperform both the
standard optimal policy and the behavior one (e.g., the expert's
recommendation). Furthermore, to address the issue of unmeasured confounding in
finding super-policies, a number of non-parametric identification results are
established. Finally, we develop two super-policy learning algorithms and
derive their corresponding finite-sample regret guarantees.
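A toy confounded example (ours, not the paper's) makes the claim concrete: an unmeasured confounder U in {0, 1} is seen by the expert but not by the learner, the expert always plays A = U, and the reward is 1 when the chosen action differs from U and 0.3 otherwise. The expert then earns 0.3, the best policy that ignores the expert earns 0.65, and a super-policy that takes the expert's recommendation as input and plays the opposite action earns 1.0, outperforming both.

import numpy as np

rng = np.random.default_rng(0)
U = rng.integers(0, 2, size=100_000)        # unmeasured confounder, seen only by the expert
expert_action = U                           # behavior policy: the expert follows the confounder

def reward(action, u):
    return np.where(action != u, 1.0, 0.3)

behavior_value   = reward(expert_action, U).mean()             # ~0.3
best_blind_value = max(reward(np.zeros_like(U), U).mean(),     # best policy ignoring the expert
                       reward(np.ones_like(U), U).mean())      # ~0.65
super_value      = reward(1 - expert_action, U).mean()         # condition on the expert: 1.0

print(behavior_value, best_blind_value, super_value)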
Evolutionary reinforcement learning for vision-based general video game playing.
Over the past decade, video games have become increasingly utilised for research in artificial intelligence. Perhaps the most extensive use of video games has been as benchmark problems in the field of reinforcement learning. Part of the reason for this is that video games are designed to challenge humans, and as a result, developing methods capable of mastering them is considered a stepping stone to achieving human-level performance in real-world tasks. Of particular interest are vision-based general video game playing (GVGP) methods. These are methods that learn from pixel inputs and can be applied, without modification, across sets of games. One of the challenges in evolutionary computing is scaling up neuroevolution methods, which have proven effective at solving simpler reinforcement learning problems in the past, to tasks with high-dimensional input spaces, such as video games. This thesis proposes a novel method for vision-based GVGP that combines the representational learning power of deep neural networks and the policy learning benefits of neuroevolution. This is achieved by separating state representation and policy learning and applying neuroevolution only to the latter. The method, AutoEncoder-augmented NeuroEvolution of Augmented Topologies (AE-NEAT), uses a deep autoencoder to learn compact state representations that are used as input for policy networks evolved using NEAT. Experiments on a selection of Atari games showed that this approach can successfully evolve high-performing agents and scale neuroevolution methods that evolve both weights and topology to domains with high-dimensional inputs. Overall, the experiments and results demonstrate a proof-of-concept of this separated state representation and policy learning approach and show that hybrid deep learning and neuroevolution-based GVGP methods are a promising avenue for future research.
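The separation the thesis describes can be sketched end to end in a few lines; the following toy uses PCA as a stand-in for the deep autoencoder, a simple evolution strategy as a stand-in for NEAT, and synthetic "frames" in place of Atari pixels, so it only illustrates the two-stage structure, not AE-NEAT itself.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic "frames": high-dimensional observations driven by a few latent factors.
true_latent = rng.normal(size=(500, 4))
frames = true_latent @ rng.normal(size=(4, 256)) + 0.05 * rng.normal(size=(500, 256))

# Stage 1: representation learning (PCA as a stand-in for the deep autoencoder).
centered = frames - frames.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
codes = centered @ Vt[:4].T                                  # 256-d frame -> 4-d code

# Stage 2: policy learning by evolution (a simple ES as a stand-in for NEAT).
target_behaviour = true_latent @ rng.normal(size=(4, 2))     # desired 2-d action per frame

def fitness(policy_weights):
    actions = codes @ policy_weights                         # linear policy on learned codes
    return -np.mean((actions - target_behaviour) ** 2)       # higher is better

population = [rng.normal(size=(4, 2)) for _ in range(20)]
for generation in range(50):
    parents = sorted(population, key=fitness, reverse=True)[:5]
    population = parents + [p + 0.1 * rng.normal(size=p.shape)
                            for p in parents for _ in range(3)]

print("best fitness:", fitness(max(population, key=fitness)))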
Over the past decade, video games have become increasingly utilised for research in artificial intelligence. Perhaps the most extensive use of video games has been as benchmark problems in the field of reinforcement learning. Part of the reason for this is because video games are designed to challenge humans, and as a result, developing methods capable of mastering them is considered a stepping stone to achieving human-level per- formance in real-world tasks. Of particular interest are vision-based general video game playing (GVGP) methods. These are methods that learn from pixel inputs and can be applied, without modification, across sets of games. One of the challenges in evolutionary computing is scaling up neuroevolution methods, which have proven effective at solving simpler reinforcement learning problems in the past, to tasks with high- dimensional input spaces, such as video games. This thesis proposes a novel method for vision-based GVGP that combines the representational learning power of deep neural networks and the policy learning benefits of neuroevolution. This is achieved by separating state representation and policy learning and applying neuroevolution only to the latter. The method, AutoEncoder-augmented NeuroEvolution of Augmented Topologies (AE-NEAT), uses a deep autoencoder to learn compact state representations that are used as input for policy networks evolved using NEAT. Experiments on a selection of Atari games showed that this approach can successfully evolve high-performing agents and scale neuroevolution methods that evolve both weights and topology to do- mains with high-dimensional inputs. Overall, the experiments and results demonstrate a proof-of-concept of this separated state representation and policy learning approach and show that hybrid deep learning and neuroevolution-based GVGP methods are a promising avenue for future research