Universal Reinforcement Learning Algorithms: Survey and Experiments
Many state-of-the-art reinforcement learning (RL) algorithms typically assume
that the environment is an ergodic Markov Decision Process (MDP). In contrast,
the field of universal reinforcement learning (URL) is concerned with
algorithms that make as few assumptions as possible about the environment. The
universal Bayesian agent AIXI and a family of related URL algorithms have been
developed in this setting. While numerous theoretical optimality results have
been proven for these agents, there has been no empirical investigation of
their behavior to date. We present a short and accessible survey of these URL
algorithms under a unified notation and framework, along with results of some
experiments that qualitatively illustrate some properties of the resulting
policies, and their relative performance on partially-observable gridworld
environments. We also present an open-source reference implementation of the
algorithms which we hope will facilitate further understanding of, and
experimentation with, these ideas. Comment: 8 pages, 6 figures, Twenty-sixth International Joint Conference on
Artificial Intelligence (IJCAI-17)
Convergence of a Reinforcement Learning Algorithm in Continuous Domains
In the field of Reinforcement Learning, Markov Decision Processes with a finite number of states and actions have been well studied, and there exist algorithms capable of producing a sequence of policies which converge to an optimal policy with probability one. Convergence guarantees for problems with continuous states also exist. Until recently, no online algorithm for continuous states and continuous actions had been proven to produce optimal policies. This dissertation contains the results of research into reinforcement learning algorithms for problems in which both the state and action spaces are continuous. The problems to be solved are introduced formally as Markov Decision Processes. Also introduced is a value-function solution method known as Q-learning. The primary result of this dissertation is the presentation of a Q-learning-type algorithm adapted for continuous states and actions, and the proof that it asymptotically learns an optimal policy with probability one. While the algorithm is intended to advance the theory of continuous-domain reinforcement learning, an example is given to show that with appropriate exploration policies, it can produce satisfactory solutions to non-trivial benchmark problems. Kernel-regression-based algorithms have excellent theoretical properties, but have high computational cost and do not adapt well to high-dimensional problems. A class of batch-mode regression-tree-based algorithms is introduced. These algorithms are modular in the sense that different methods for partitioning, performing local regression, and choosing representative actions can be chosen. Experiments demonstrate superior performance over kernel methods. Batch algorithms possess superior computational efficiency, but pay the price of not being able to use past observations to inform exploration. A data structure useful for limited learning during the exploration phase is introduced.
It is then demonstrated that this limited learning can outperform batch algorithms using totally random action exploration.
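For orientation, the value-function method named in this abstract can be illustrated by the standard finite-state, discounted Q-learning update. This is a textbook form given as background only. The symbols (step size $\alpha_t$, discount factor $\gamma$, reward $r_{t+1}$) are conventional notation assumed here, and the abstract does not state the exact form of the dissertation's continuous-domain algorithm.

```latex
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t)
  + \alpha_t \Bigl[ r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \Bigr]
```

In the finite case the maximum over $a'$ is a table lookup. With continuous actions it becomes an optimization problem in its own right, which may be one reason the abstract mentions choosing representative actions.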
Energy Sharing for Multiple Sensor Nodes with Finite Buffers
We consider the problem of finding optimal energy sharing policies that
maximize the network performance of a system comprising multiple sensor
nodes and a single energy harvesting (EH) source. Sensor nodes periodically
sense the random field and generate data, which is stored in the corresponding
data queues. The EH source harnesses energy from ambient energy sources and the
generated energy is stored in an energy buffer. Sensor nodes receive energy for
data transmission from the EH source. The EH source has to efficiently share
the stored energy among the nodes in order to minimize the long-run average
delay in data transmission. We formulate the problem of energy sharing between
the nodes in the framework of average cost infinite-horizon Markov decision
processes (MDPs). We develop efficient energy sharing algorithms, namely a
Q-learning algorithm with exploration mechanisms based on the ε-greedy
method as well as upper confidence bound (UCB). We extend these algorithms by
incorporating state and action space aggregation to tackle state-action space
explosion in the MDP. We also develop a cross entropy based method that
incorporates policy parameterization in order to find near optimal energy
sharing policies. Through simulations, we show that our algorithms yield energy
sharing policies that outperform the heuristic greedy method. Comment: 38 pages, 10 figures
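The two exploration mechanisms named in this abstract can be sketched for a small tabular problem as follows. This is a minimal illustration, not the authors' implementation. It assumes reward maximization with a discounted update, whereas the paper uses an average-cost formulation, and the function names and the parameters `alpha`, `gamma`, and `c` are hypothetical.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ucb(q_values, counts, t, c=1.0):
    """Pick the action maximizing value plus an upper-confidence bonus.

    counts[a] is how often action a was tried in this state; t is the
    total number of visits to the state (assumed >= 1).
    """
    # Try every action once before trusting the bonus term.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]),
    )


def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One discounted Q-learning step on a table q[state][action] (assumed form)."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


if __name__ == "__main__":
    rng = random.Random(0)
    q = [[0.0, 0.0], [1.0, 2.0]]
    a = epsilon_greedy(q[0], epsilon=0.1, rng=rng)
    q_update(q, s=0, a=a, r=1.0, s_next=1)
    print(q)
```

ε-greedy explores uniformly at a fixed rate, while UCB directs exploration toward actions that have been tried less often in the current state.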
Learning from humans: combining imitation and deep reinforcement learning to accomplish human-level performance on a virtual foraging task
We develop a method to learn bio-inspired foraging policies using human data.
We conduct an experiment where humans are virtually immersed in an open field
foraging environment and are trained to collect the highest amount of rewards.
A Markov Decision Process (MDP) framework is introduced to model the human
decision dynamics. Then, Imitation Learning (IL) based on maximum likelihood
estimation is used to train Neural Networks (NN) that map human decisions to
observed states. The results show that passive imitation substantially
underperforms humans. We further refine the human-inspired policies via
Reinforcement Learning (RL), using on-policy algorithms that are more suitable
to learn from pre-trained networks. We show that the combination of IL and RL
can match human results and that good performance strongly depends on an
egocentric representation of the environment. The developed methodology can be
used to efficiently learn policies for unmanned vehicles which have to solve
missions in an open field environment. Comment: 24 pages, 15 figures
Deep Reinforcement Learning for 2D Physics-Based Object Manipulation in Clutter
Deep Reinforcement Learning (DRL) is a quickly evolving research field rooted
in operations research and behavioural psychology, with potential applications
extending across various domains, including robotics. This thesis delineates
the background of modern Reinforcement Learning (RL), starting with the
framework constituted by the Markov decision processes, Markov properties,
goals and rewards, agent-environment interactions, and policies. We explain the
main types of algorithms commonly used in RL, including value-based, policy
gradient, and actor-critic methods, with a special emphasis on DQN, A2C and
PPO. We then give a short literature review on some widely adopted frameworks
for implementing RL algorithms and environments. Subsequently, we present
Bidimensional Gripper Environment (BGE), a virtual simulator based on the
Pymunk physics engine we developed to analyse top-down bidimensional object
manipulation. The methodology section frames our agent-environment interaction
as a Markov decision process, such that we can apply our RL algorithms. We list
various goal formulation strategies, including reward shaping and curriculum
learning. We also employ different steps of observation preprocessing to reduce
the computational workload required. In the experimental phase, we run through
a series of scenarios of increasing difficulty. We start with a simple static
scenario and then gradually increase the amount of stochasticity. Whenever the
agents show difficulty in learning, we counteract by increasing the degree of
reward shaping and curriculum learning. These experiments demonstrate the
substantial limitations and pitfalls of model-free algorithms under changing
dynamics. In conclusion, we present a summary of our findings and remarks. We
then outline potential future work to improve our methodology and possibly
expand to real-world systems.
Active Markov Information-Theoretic Path Planning for Robotic Environmental Sensing
Recent research in multi-robot exploration and mapping has focused on
sampling environmental fields, which are typically modeled using the Gaussian
process (GP). Existing information-theoretic exploration strategies for
learning GP-based environmental field maps adopt the non-Markovian problem
structure and consequently scale poorly with the length of history of
observations. Hence, it becomes computationally impractical to use these
strategies for in situ, real-time active sampling. To ease this computational
burden, this paper presents a Markov-based approach to efficient
information-theoretic path planning for active sampling of GP-based fields. We
analyze the time complexity of solving the Markov-based path planning problem,
and demonstrate analytically that it scales better than that of deriving the
non-Markovian strategies with increasing length of planning horizon. For a
class of exploration tasks called the transect sampling task, we provide
theoretical guarantees on the active sampling performance of our Markov-based
policy, from which ideal environmental field conditions and sampling task
settings can be established to limit its performance degradation due to
violation of the Markov assumption. Empirical evaluation on real-world
temperature and plankton density field data shows that our Markov-based policy
can generally achieve active sampling performance comparable to that of the
widely-used non-Markovian greedy policies under less favorable realistic field
conditions and task settings while enjoying significant computational gain over
them. Comment: 10th International Conference on Autonomous Agents and Multiagent
Systems (AAMAS 2011), Extended version with proofs, 11 pages