1,518 research outputs found
Scalable Planning and Learning for Multiagent POMDPs: Extended Version
Online, sample-based planning algorithms for POMDPs have shown great promise
in scaling to problems with large state spaces, but they become intractable for
large action and observation spaces. This is particularly problematic in
multiagent POMDPs where the action and observation space grows exponentially
with the number of agents. To combat this intractability, we propose a novel
scalable approach based on sample-based planning and factored value functions
that exploits structure present in many multiagent settings. This approach
applies not only in the planning case, but also in the Bayesian reinforcement
learning setting. Experimental results show that we are able to provide high
quality solutions to large multiagent planning and learning problems
Near-Optimal BRL using Optimistic Local Transitions
Model-based Bayesian Reinforcement Learning (BRL) allows a found
formalization of the problem of acting optimally while facing an unknown
environment, i.e., avoiding the exploration-exploitation dilemma. However,
algorithms explicitly addressing BRL suffer from such a combinatorial explosion
that a large body of work relies on heuristic algorithms. This paper introduces
BOLT, a simple and (almost) deterministic heuristic algorithm for BRL which is
optimistic about the transition function. We analyze BOLT's sample complexity,
and show that under certain parameters, the algorithm is near-optimal in the
Bayesian sense with high probability. Then, experimental results highlight the
key differences of this method compared to previous work.Comment: ICML201
Recommended from our members
Understanding Model-Based Reinforcement Learning and its Application in Safe Reinforcement Learning
Model-based reinforcement learning algorithms have been shown to achieve successful results on various continuous control benchmarks, but the understanding of model-based methods is limited. We try to interpret how model-based method works through novel experiments on state-of-the-art algorithms with an emphasis on the model learning part. We evaluate the role of the model learning in policy optimization and propose methods to learn a more accurate model. With a better understanding of model-based reinforcement learning, we then apply model-based methods to solve safe reinforcement learning (RL) problems with near-zero violation of hard constraints throughout training. Drawing an analogy with how humans and animals learn to perform safe actions, we break down the safe RL problem into three stages. First, we train agents in a constraint-free environment to learn a performant policy for reaching high rewards, and simultaneously learn a model of the dynamics. Second, we use model-based methods to plan safe actions and train a safeguarding policy from these actions through imitation. Finally, we propose a factored framework to train an overall policy that mixes the performant policy and the safeguarding policy. This three-step curriculum ensures near-zero violation of safety constraints at all times. As an advantage of model-based method, the sample complexity required at the second and third steps of the process is significantly lower than model-free methods and can enable online safe learning. We demonstrate the effectiveness of our methods in various continuous control problems and analyze the advantages over state-of-the-art approaches
Bayesian Nonparametric Feature and Policy Learning for Decision-Making
Learning from demonstrations has gained increasing interest in the recent
past, enabling an agent to learn how to make decisions by observing an
experienced teacher. While many approaches have been proposed to solve this
problem, there is only little work that focuses on reasoning about the observed
behavior. We assume that, in many practical problems, an agent makes its
decision based on latent features, indicating a certain action. Therefore, we
propose a generative model for the states and actions. Inference reveals the
number of features, the features, and the policies, allowing us to learn and to
analyze the underlying structure of the observed behavior. Further, our
approach enables prediction of actions for new states. Simulations are used to
assess the performance of the algorithm based upon this model. Moreover, the
problem of learning a driver's behavior is investigated, demonstrating the
performance of the proposed model in a real-world scenario
- …