A Theory of Cheap Control in Embodied Systems
We present a framework for designing cheap control architectures for embodied
agents. Our derivation is guided by the classical problem of universal
approximation, whereby we explore the possibility of exploiting the agent's
embodiment for a new and more efficient universal approximation of behaviors
generated by sensorimotor control. This embodied universal approximation is
compared with the classical non-embodied universal approximation. To exemplify
our approach, we present a detailed quantitative case study for policy models
defined in terms of conditional restricted Boltzmann machines. In contrast to
non-embodied universal approximation, which requires an exponential number of
parameters, in the embodied setting we are able to generate all possible
behaviors with a drastically smaller model, thus obtaining cheap universal
approximation. We test and corroborate the theory experimentally with a
six-legged walking machine. The experiments show that the sufficient controller
complexity predicted by our theory is tight, which means that the theory has
direct practical implications.
Keywords: cheap design, embodiment, sensorimotor loop, universal approximation, conditional restricted Boltzmann machine
Comment: 27 pages, 10 figures
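The conditional restricted Boltzmann machine policy described above can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the paper's trained model: layer sizes, random weights, and the use of a few Gibbs steps to draw an action. Sensor values condition the hidden layer, and a binary motor vector is sampled from the resulting conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical tiny conditional RBM policy: hidden units h mediate between
# sensor values s (conditioned on) and binary motor values a.
n_s, n_h, n_a = 6, 4, 6                      # sizes are illustrative only
W_sh = rng.normal(size=(n_s, n_h)) * 0.1     # sensor-to-hidden weights
W_ah = rng.normal(size=(n_a, n_h)) * 0.1     # action-to-hidden weights
b_h = np.zeros(n_h)                          # hidden biases
b_a = np.zeros(n_a)                          # action biases

def sample_action(s, n_gibbs=10):
    """Draw a binary motor vector a ~ p(a | s) by alternating Gibbs sampling."""
    a = rng.integers(0, 2, size=n_a).astype(float)
    for _ in range(n_gibbs):
        h = (rng.random(n_h) < sigmoid(s @ W_sh + a @ W_ah + b_h)).astype(float)
        a = (rng.random(n_a) < sigmoid(h @ W_ah.T + b_a)).astype(float)
    return a

s = rng.integers(0, 2, size=n_s).astype(float)
print(sample_action(s))  # a binary motor vector drawn from p(a | s)
```

The number of parameters in such a model grows with the product of layer sizes rather than with the number of distinct behaviors, which is the kind of saving the abstract's "cheap universal approximation" argument quantifies.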
Learning from Scarce Experience
Searching the space of policies directly for the optimal policy has been one
popular method for solving partially observable reinforcement learning
problems. Typically, with each change of the target policy, its value is
estimated from the results of following that very policy. This requires a large
number of interactions with the environment as different policies are
considered. We present a family of algorithms based on likelihood ratio
estimation that use data gathered when executing one policy (or collection of
policies) to estimate the value of a different policy. The algorithms combine
estimation and optimization stages. The former utilizes experience to build a
non-parametric representation of an optimized function. The latter performs
optimization on this estimate. We show positive empirical results and provide
a sample complexity bound.
Comment: 8 pages, 4 figures
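The estimation stage described above rests on likelihood-ratio (importance-sampling) reweighting of returns gathered under one policy to estimate the value of another. A minimal sketch, assuming policies are exposed as functions returning action probabilities and trajectories are lists of (observation, action, reward) tuples; all names are illustrative:

```python
def is_value_estimate(trajectories, pi_behavior, pi_target):
    """Estimate the value of pi_target from trajectories gathered under
    pi_behavior, via per-trajectory likelihood ratios."""
    total, weights = 0.0, 0.0
    for traj in trajectories:            # traj: list of (obs, action, reward)
        w, ret = 1.0, 0.0
        for obs, act, rew in traj:
            # ratio of the probability each policy assigns to the logged action
            w *= pi_target(act, obs) / pi_behavior(act, obs)
            ret += rew
        total += w * ret
        weights += w
    return total / weights               # self-normalized (weighted) estimator

# Toy usage: when target and behavior coincide, all ratios are 1 and the
# estimate is just the average return.
uniform = lambda act, obs: 0.5
trajs = [[(0, 0, 1.0)], [(0, 1, 3.0)]]
print(is_value_estimate(trajs, uniform, uniform))
```

The self-normalized form (dividing by the summed weights rather than the trajectory count) trades a small bias for much lower variance, a common design choice in likelihood-ratio estimators of this kind.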
Improved Memory-Bounded Dynamic Programming for Decentralized POMDPs
Memory-Bounded Dynamic Programming (MBDP) has proved extremely effective in
solving decentralized POMDPs with large horizons. We generalize the algorithm
and improve its scalability by reducing the complexity with respect to the
number of observations from exponential to polynomial. We derive error bounds
on solution quality with respect to this new approximation and analyze the
convergence behavior. To evaluate the effectiveness of the improvements, we
introduce a new, larger benchmark problem. Experimental results show that
despite the high complexity of decentralized POMDPs, scalable solution
techniques such as MBDP perform surprisingly well.Comment: Appears in Proceedings of the Twenty-Third Conference on Uncertainty
in Artificial Intelligence (UAI2007
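The core of Memory-Bounded Dynamic Programming is that each backup keeps only a fixed number of policy trees per agent, chosen at heuristically sampled belief states. A minimal sketch of one such backup for a single agent, with `evaluate`, `beliefs`, and the tree encoding all hypothetical; note that the candidate set built here is still exponential in the number of observations, which is exactly the cost the paper's generalization reduces to polynomial:

```python
import itertools

def mbdp_backup(trees_i, trees_j, actions, observations,
                evaluate, max_trees, beliefs):
    """One memory-bounded backup step (sketch): build candidate policy trees
    for agent i as (action, observation -> kept subtree) pairs, then retain
    at most max_trees of them, each the best at one sampled belief state."""
    candidates = []
    for a in actions:
        # every mapping from observations to previously kept subtrees
        for subtrees in itertools.product(trees_i, repeat=len(observations)):
            candidates.append((a, dict(zip(observations, subtrees))))
    kept = []
    for b in beliefs[:max_trees]:
        best = max(candidates, key=lambda t: evaluate(t, trees_j, b))
        if best not in kept:             # deduplicate identical selections
            kept.append(best)
    return kept
```

Because `kept` never exceeds `max_trees`, memory stays bounded across horizon steps even though a full dynamic-programming backup would grow the tree set doubly exponentially.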
Understanding Model-Based Reinforcement Learning and its Application in Safe Reinforcement Learning
Model-based reinforcement learning algorithms have been shown to achieve successful results on various continuous control benchmarks, but the understanding of model-based methods remains limited. We attempt to interpret how model-based methods work through novel experiments on state-of-the-art algorithms, with an emphasis on the model learning component. We evaluate the role of model learning in policy optimization and propose methods to learn a more accurate model. With this better understanding of model-based reinforcement learning, we then apply model-based methods to solve safe reinforcement learning (RL) problems with near-zero violation of hard constraints throughout training. Drawing an analogy with how humans and animals learn to perform safe actions, we break the safe RL problem down into three stages. First, we train agents in a constraint-free environment to learn a performant policy for reaching high rewards while simultaneously learning a model of the dynamics. Second, we use model-based methods to plan safe actions and train a safeguarding policy from these actions through imitation. Finally, we propose a factored framework to train an overall policy that mixes the performant policy and the safeguarding policy. This three-stage curriculum ensures near-zero violation of safety constraints at all times. As an advantage of model-based methods, the sample complexity required in the second and third stages is significantly lower than that of model-free methods, which can enable online safe learning. We demonstrate the effectiveness of our methods on various continuous control problems and analyze the advantages over state-of-the-art approaches.
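The mixing of a performant policy with a safeguarding policy in the final stage can be illustrated by a simple action-selection rule. This is a minimal sketch, assuming the learned dynamics model is exposed as a `model_rollout` function and safety is checked by a per-state predicate `is_safe`; both names, and the look-ahead rule itself, are illustrative rather than the paper's factored framework:

```python
def mixed_action(obs, performant, safeguard, model_rollout, is_safe, horizon=5):
    """Factored policy (sketch): take the performant action unless a short
    model rollout predicts a constraint violation, then defer to safeguard."""
    a = performant(obs)
    # use the learned dynamics model to look ahead before committing
    future = model_rollout(obs, a, horizon)
    if all(is_safe(s) for s in future):
        return a
    return safeguard(obs)

# Toy usage: states drift by the chosen action each step; states >= 3 violate
# the constraint, so near the boundary the safeguard takes over.
performant = lambda o: 1
safeguard = lambda o: 0
rollout = lambda o, a, h: [o + a * (k + 1) for k in range(h)]
is_safe = lambda s: s < 3
print(mixed_action(0, performant, safeguard, rollout, is_safe))
```

Because the safety check runs on model predictions rather than real interactions, the safeguard can intervene before a violation occurs, which is how the model-based setting supports near-zero constraint violation during training.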