Performance Guarantees for Homomorphisms Beyond Markov Decision Processes
Most real-world problems have huge state and/or action spaces. Therefore, a
naive application of existing tabular solution methods is not tractable on such
problems. Nonetheless, these solution methods are quite useful if an agent has
access to a relatively small state-action space homomorphism of the true
environment and near-optimal performance is guaranteed by the map. A plethora
of research is focused on the case when the homomorphism is a Markovian
representation of the underlying process. However, we show that near-optimal
performance is sometimes guaranteed even if the homomorphism is non-Markovian.
Moreover, we can aggregate significantly more states by lifting the Markovian
requirement without compromising on performance. In this work, we extend the
Extreme State Aggregation (ESA) framework to joint state-action aggregations.
We also lift the policy uniformity condition for aggregation in ESA, which allows
even coarser modeling of the true environment.
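To make the idea of a joint state-action aggregation concrete, here is a minimal sketch. It is not the paper's construction or its actual conditions: it only assumes that a map `phi` (a hypothetical name) sends each (state, action) pair to an abstract pair, and it checks how far the given Q-values inside each abstract class are spread apart. The helper names and the epsilon criterion are illustrative assumptions.

```python
from collections import defaultdict


def q_spread_per_class(q_values, phi):
    """Return, per abstract pair, the gap between the largest and smallest Q-value.

    q_values: dict mapping (state, action) -> float   (assumed to be given)
    phi:      dict mapping (state, action) -> abstract (state, action) pair
    """
    groups = defaultdict(list)
    for pair, q in q_values.items():
        groups[phi[pair]].append(q)
    return {abstract: max(qs) - min(qs) for abstract, qs in groups.items()}


def is_epsilon_aggregation(q_values, phi, epsilon):
    """True if every abstract class merges only pairs whose Q-values differ by <= epsilon."""
    spreads = q_spread_per_class(q_values, phi)
    return all(spread <= epsilon for spread in spreads.values())


# Hypothetical toy data: three ground pairs merged into two abstract pairs.
q_values = {("s1", "a"): 1.0, ("s2", "b"): 1.05, ("s3", "a"): 0.2}
phi = {("s1", "a"): ("x", "u"), ("s2", "b"): ("x", "u"), ("s3", "a"): ("y", "u")}
```

Note that the map above merges pairs with different ground actions ("a" and "b") into one abstract action, which is something a state-only aggregation cannot express.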
On overfitting and asymptotic bias in batch reinforcement learning with partial observability
This paper provides an analysis of the tradeoff between asymptotic bias
(suboptimality with unlimited data) and overfitting (additional suboptimality
due to limited data) in the context of reinforcement learning with partial
observability. Our theoretical analysis formally characterizes that while
potentially increasing the asymptotic bias, a smaller state representation
decreases the risk of overfitting. This analysis relies on expressing the
quality of a state representation by bounding L1 error terms of the associated
belief states. Theoretical results are empirically illustrated when the state
representation is a truncated history of observations, both on synthetic POMDPs
and on a large-scale POMDP in the context of smartgrids, with real-world data.
Finally, similarly to known results in the fully observable setting, we also
briefly discuss and empirically illustrate how using function approximators and
adapting the discount factor may enhance the tradeoff between asymptotic bias
and overfitting in the partially observable context.
Comment: Accepted at the Journal of Artificial Intelligence Research (JAIR) -
31 page
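The truncated-history representation mentioned in this abstract can be sketched as follows. This is an illustrative assumption about the setup, not code from the paper: the state is taken to be the last `h` observations, and the helper `count_representations` (a hypothetical name) shows how the number of distinct states grows with `h`. A larger `h` gives a richer representation, but one that needs more data to estimate.

```python
def truncated_history(observations, h):
    """Represent a history by its last h observations (empty tuple if h <= 0)."""
    if h <= 0:
        return ()
    return tuple(observations[-h:])


def count_representations(trajectories, h):
    """Count the distinct truncated histories seen along a batch of trajectories."""
    seen = set()
    for trajectory in trajectories:
        for t in range(1, len(trajectory) + 1):
            seen.add(truncated_history(trajectory[:t], h))
    return len(seen)


# Hypothetical batch of two short observation sequences.
batch = [[0, 1, 0, 1], [1, 1, 0, 0]]
```

With a fixed batch, a small `h` leaves more samples per distinct state, while a large `h` spreads the same data over more states.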
Abstractions of General Reinforcement Learning
The field of artificial intelligence (AI) is devoted to the creation of artificial decision-makers that can perform (at least) on par with their human counterparts on a domain of interest. Unlike the agents in traditional AI, the agents in artificial general intelligence (AGI) are required to replicate human intelligence in almost every domain of interest. Moreover, an AGI agent should be able to achieve this without (virtually any) further changes, retraining, or fine-tuning of the parameters. The real world is non-stationary, non-ergodic, and non-Markovian: we humans cannot revisit our past, nor are our most recent observations sufficient statistics for acting optimally. Yet, we excel at a variety of complex tasks. Many of these tasks require long-term planning. We can attribute this success to our natural faculty to abstract away task-irrelevant information from our overwhelming sensory experience. We make task-specific mental models of the world without much effort. Due to this ability to abstract, we can plan on a significantly more compact representation of a task without much loss of performance. Not only this, we also abstract our actions to produce high-level plans: the level of action-abstraction can be anywhere from small muscle movements to a mental notion of "doing an action". It is natural to assume that any AGI agent competing with humans (in every plausible domain) should also have these abilities to abstract its experiences and actions. This thesis is an inquiry into the existence of such abstractions, which aid efficient planning for a wide range of domains. Most importantly, these abstractions come with some optimality guarantees. We use a history-based reinforcement learning (RL) setup, appropriately called general reinforcement learning (GRL), to model such general-purpose decision-makers. We show that if such GRL agents have access to appropriate abstractions then they can perform optimally in a huge set of domains.
That is, we argue that GRL with abstractions, called abstraction reinforcement learning (ARL), is an appropriate framework to model and analyze AGI agents. This work uses and extends a powerful class of (state-only) abstractions called extreme state abstractions (ESA). We analyze a variety of such extreme abstractions, both state-only and state-action abstractions, to formally establish the representation and convergence guarantees. We also make many minor contributions to the ARL framework along the way. Last but not least, we collect a series of ideas that lay the foundations for designing (extreme) abstraction learning algorithms.
An Analysis of Model-Based Reinforcement Learning From Abstracted Observations
Many methods for model-based reinforcement learning (MBRL) in Markov decision
processes (MDPs) provide guarantees for both the accuracy of the model they can
deliver and the learning efficiency. At the same time, state abstraction
techniques allow for a reduction of the size of an MDP while maintaining a
bounded loss with respect to the original problem. Therefore, it may come as a
surprise that no such guarantees are available when combining both techniques,
i.e., where MBRL merely observes abstract states. Our theoretical analysis
shows that abstraction can introduce a dependence between samples collected
online (e.g., in the real world). That means that, without taking this
dependence into account, results for MBRL do not directly extend to this
setting. Our result shows that we can use concentration inequalities for
martingales to overcome this problem. This result makes it possible to extend
the guarantees of existing MBRL algorithms to the setting with abstraction. We
illustrate this by combining R-MAX, a prototypical MBRL algorithm, with
abstraction, thus producing the first performance guarantees for model-based
'RL from Abstracted Observations': model-based reinforcement learning with an
abstract model.
Comment: 36 pages, 2 figures, published in Transactions on Machine Learning
Research (TMLR) 202
Average-energy games
Two-player quantitative zero-sum games provide a natural framework to
synthesize controllers with performance guarantees for reactive systems within
an uncontrollable environment. Classical settings include mean-payoff games,
where the objective is to optimize the long-run average gain per action, and
energy games, where the system has to avoid running out of energy.
We study average-energy games, where the goal is to optimize the long-run
average of the accumulated energy. We show that this objective arises naturally
in several applications, and that it yields interesting connections with
previous concepts in the literature. We prove that deciding the winner in such
games is in NP ∩ coNP and at least as hard as solving mean-payoff games,
and we establish that memoryless strategies suffice to win. We also consider
the case where the system has to minimize the average-energy while maintaining
the accumulated energy within predefined bounds at all times: this corresponds
to operating with a finite-capacity storage for energy. We give results for
one-player and two-player games, and establish complexity bounds and memory
requirements.
Comment: In Proceedings GandALF 2015, arXiv:1509.0685
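The three objectives named in this abstract can be compared on a finite prefix of a play. The sketch below rests on assumptions: a play is a list of integer edge weights, the energy level after n steps is the running sum of the weights, the mean payoff is the average weight per step, and the average-energy is the average of the energy levels. In the infinite-play setting these averages become limits; the function names are illustrative.

```python
from itertools import accumulate


def energy_levels(weights, initial=0):
    """Accumulated energy after each step of the play prefix."""
    return [initial + total for total in accumulate(weights)]


def mean_payoff(weights):
    """Average gain per action over the prefix."""
    return sum(weights) / len(weights)


def average_energy(weights, initial=0):
    """Average of the accumulated energy levels over the prefix."""
    levels = energy_levels(weights, initial)
    return sum(levels) / len(levels)


def stays_within_bounds(weights, lower, upper, initial=0):
    """Check the finite-capacity storage constraint: lower <= energy <= upper at every step."""
    return all(lower <= level <= upper for level in energy_levels(weights, initial))


# Hypothetical cycle of weights with total weight zero.
cycle = [1, -1, 2, -2]
```

For the cycle above, the mean payoff is 0 while the average-energy is 0.75, so two plays with equal mean payoff can still differ in how much energy they hold on average.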
Scalable methods for computing state similarity in deterministic Markov Decision Processes
We present new algorithms for computing and approximating bisimulation
metrics in Markov Decision Processes (MDPs). Bisimulation metrics are an
elegant formalism that captures behavioral equivalence between states and
provides strong theoretical guarantees on differences in optimal behavior.
Unfortunately, their computation is expensive and requires a tabular
representation of the states, which has thus far rendered them impractical for
large problems. In this paper we present a new version of the metric that is
tied to a behavior policy in an MDP, along with an analysis of its theoretical
properties. We then present two new algorithms for approximating bisimulation
metrics in large, deterministic MDPs. The first does so via sampling and is
guaranteed to converge to the true metric. The second is a differentiable loss
which allows us to learn an approximation even for continuous state MDPs, which
prior to this work had not been possible.
Comment: To appear in Proceedings of the Thirty-Fourth AAAI Conference on
Artificial Intelligence (AAAI-20
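For a small, tabular, deterministic MDP, a bisimulation-style metric can be computed by plain fixed-point iteration. The sketch below is not one of the two algorithms the abstract describes (the sampling-based method or the differentiable loss). It assumes the usual simplification for deterministic transitions, where the distance between next-state distributions reduces to the distance between the two next states: d(s, t) = max over actions a of |R(s, a) - R(t, a)| + gamma * d(N(s, a), N(t, a)). All names and the toy MDP are hypothetical.

```python
def bisimulation_metric(rewards, next_state, gamma=0.5, tol=1e-8, max_iters=10_000):
    """Iterate the deterministic bisimulation update until it stops changing.

    rewards[s][a]    : reward for taking action a in state s
    next_state[s][a] : index of the (unique) successor state
    """
    n = len(rewards)
    n_actions = len(rewards[0])
    d = [[0.0] * n for _ in range(n)]
    for _ in range(max_iters):
        new_d = [
            [
                max(
                    abs(rewards[s][a] - rewards[t][a])
                    + gamma * d[next_state[s][a]][next_state[t][a]]
                    for a in range(n_actions)
                )
                for t in range(n)
            ]
            for s in range(n)
        ]
        # Largest change across all state pairs in this sweep.
        delta = max(abs(new_d[s][t] - d[s][t]) for s in range(n) for t in range(n))
        d = new_d
        if delta < tol:
            break
    return d


# Hypothetical 3-state, 1-action MDP: every state loops to itself.
rewards = [[1.0], [1.0], [0.0]]
next_state = [[0], [1], [2]]
metric = bisimulation_metric(rewards, next_state, gamma=0.5)
```

States 0 and 1 behave identically, so their distance is 0. State 2 differs from state 0 by a reward gap of 1 at every step, giving a distance of 1 / (1 - 0.5) = 2. Each sweep visits every pair of states, which illustrates why this tabular computation does not scale to large problems.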