
    Universal Reinforcement Learning Algorithms: Survey and Experiments

    Many state-of-the-art reinforcement learning (RL) algorithms typically assume that the environment is an ergodic Markov Decision Process (MDP). In contrast, the field of universal reinforcement learning (URL) is concerned with algorithms that make as few assumptions as possible about the environment. The universal Bayesian agent AIXI and a family of related URL algorithms have been developed in this setting. While numerous theoretical optimality results have been proven for these agents, there has been no empirical investigation of their behavior to date. We present a short and accessible survey of these URL algorithms under a unified notation and framework, along with results of some experiments that qualitatively illustrate some properties of the resulting policies, and their relative performance on partially-observable gridworld environments. We also present an open-source reference implementation of the algorithms which we hope will facilitate further understanding of, and experimentation with, these ideas. Comment: 8 pages, 6 figures, Twenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

    Convergence of a Reinforcement Learning Algorithm in Continuous Domains

    In the field of Reinforcement Learning, Markov Decision Processes with a finite number of states and actions have been well studied, and there exist algorithms capable of producing a sequence of policies which converge to an optimal policy with probability one. Convergence guarantees for problems with continuous states also exist. Until recently, no online algorithm for continuous states and continuous actions has been proven to produce optimal policies. This dissertation contains the results of research into reinforcement learning algorithms for problems in which both the state and action spaces are continuous. The problems to be solved are introduced formally as Markov Decision Processes. Also introduced is a value-function solution method known as Q-learning. The primary result of this dissertation is the presentation of a Q-learning type algorithm adapted for continuous states and actions, and the proof that it asymptotically learns an optimal policy with probability one. While the algorithm is intended to advance the theory of continuous domain reinforcement learning, an example is given to show that with appropriate exploration policies, it can produce satisfactory solutions to non-trivial benchmark problems. Kernel regression based algorithms have excellent theoretical properties, but have high computational cost and do not adapt well to high-dimensional problems. A class of batch-mode regression tree-based algorithms is introduced. These algorithms are modular in the sense that different methods for partitioning, performing local regression, and choosing representative actions can be chosen. Experiments demonstrate superior performance over kernel methods. Batch algorithms possess superior computational efficiency, but pay the price of not being able to use past observations to inform exploration. A data structure useful for limited learning during the exploration phase is introduced. It is then demonstrated that this limited learning can outperform batch algorithms using totally random action exploration.
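For orientation, the Q-learning method named in the abstract above can be illustrated in its standard finite (tabular) form. This is a minimal sketch of the textbook update only, not the dissertation's continuous-state, continuous-action algorithm; the function name `q_learning_update`, the table layout, and the values of `alpha` and `gamma` are assumptions made for illustration.

```python
import numpy as np


def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error.

    q is a (num_states, num_actions) array; alpha is the step size and
    gamma the discount factor (both values are illustrative assumptions).
    """
    # Bootstrapped target: immediate reward plus discounted best next value.
    td_target = reward + gamma * np.max(q[next_state])
    td_error = td_target - q[state, action]
    # Move the current estimate a fraction alpha toward the target.
    q[state, action] += alpha * td_error
    return td_error


# Usage: a hypothetical 2-state, 2-action problem with a single transition.
q_table = np.zeros((2, 2))
error = q_learning_update(q_table, state=0, action=1, reward=1.0, next_state=1)
```

The table lookup `q[next_state]` is exactly what stops scaling once states and actions are continuous, which is the gap the abstract says function approximation (kernel regression or regression trees) is meant to fill.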

    Energy Sharing for Multiple Sensor Nodes with Finite Buffers

    We consider the problem of finding optimal energy sharing policies that maximize the network performance of a system comprising multiple sensor nodes and a single energy harvesting (EH) source. Sensor nodes periodically sense the random field and generate data, which is stored in the corresponding data queues. The EH source harnesses energy from ambient energy sources and the generated energy is stored in an energy buffer. Sensor nodes receive energy for data transmission from the EH source. The EH source has to efficiently share the stored energy among the nodes in order to minimize the long-run average delay in data transmission. We formulate the problem of energy sharing between the nodes in the framework of average cost infinite-horizon Markov decision processes (MDPs). We develop efficient energy sharing algorithms, namely a Q-learning algorithm with exploration mechanisms based on the ϵ-greedy method as well as upper confidence bound (UCB). We extend these algorithms by incorporating state and action space aggregation to tackle state-action space explosion in the MDP. We also develop a cross entropy based method that incorporates policy parameterization in order to find near optimal energy sharing policies. Through simulations, we show that our algorithms yield energy sharing policies that outperform the heuristic greedy method. Comment: 38 pages, 10 figures
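The two exploration mechanisms named in the abstract above can be sketched generically as follows. This is a hedged illustration of standard ϵ-greedy and UCB action selection, not the paper's energy sharing algorithm: the function names, the exploration constant `c`, and the reward-maximizing convention are assumptions (the paper itself minimizes an average delay cost, so a faithful version would negate the values or take a minimum).

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ucb(q_values, counts, t, c=1.0):
    """Pick the action maximizing value estimate plus an exploration bonus.

    counts[a] is how often action a has been tried and t is the total
    number of selections so far; c (assumed) scales the bonus.
    """
    # Try every action once before applying the bonus formula.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]),
    )


# Usage with hypothetical value estimates for three actions.
rng = random.Random(0)
greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
ucb_choice = ucb([0.5, 0.4], counts=[100, 1], t=101)
```

In the second call the rarely tried action wins despite its lower estimate, because its bonus term is larger; this is the sense in which UCB directs exploration, whereas ϵ-greedy explores uniformly at random.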

    Learning from humans: combining imitation and deep reinforcement learning to accomplish human-level performance on a virtual foraging task

    We develop a method to learn bio-inspired foraging policies using human data. We conduct an experiment where humans are virtually immersed in an open field foraging environment and are trained to collect the highest amount of rewards. A Markov Decision Process (MDP) framework is introduced to model the human decision dynamics. Then, Imitation Learning (IL) based on maximum likelihood estimation is used to train Neural Networks (NN) that map human decisions to observed states. The results show that passive imitation substantially underperforms humans. We further refine the human-inspired policies via Reinforcement Learning (RL), using on-policy algorithms that are more suitable to learn from pre-trained networks. We show that the combination of IL and RL can match human results and that good performance strongly depends on an egocentric representation of the environment. The developed methodology can be used to efficiently learn policies for unmanned vehicles which have to solve missions in an open field environment. Comment: 24 pages, 15 figures

    Deep Reinforcement Learning for 2D Physics-Based Object Manipulation in Clutter

    Deep Reinforcement Learning (DRL) is a quickly evolving research field rooted in operations research and behavioural psychology, with potential applications extending across various domains, including robotics. This thesis delineates the background of modern Reinforcement Learning (RL), starting with the framework constituted by the Markov decision processes, Markov properties, goals and rewards, agent-environment interactions, and policies. We explain the main types of algorithms commonly used in RL, including value-based, policy gradient, and actor-critic methods, with a special emphasis on DQN, A2C and PPO. We then give a short literature review on some widely adopted frameworks for implementing RL algorithms and environments. Subsequently, we present Bidimensional Gripper Environment (BGE), a virtual simulator based on the Pymunk physics engine we developed to analyse top-down bidimensional object manipulation. The methodology section frames our agent-environment interaction as a Markov decision process, such that we can apply our RL algorithms. We list various goal formulation strategies, including reward shaping and curriculum learning. We also employ different steps of observation preprocessing to reduce the computational workload required. In the experimental phase, we run through a series of scenarios of increasing difficulty. We start with a simple static scenario and then gradually increase the amount of stochasticity. Whenever the agents show difficulty in learning, we counteract by increasing the degree of reward shaping and curriculum learning. These experiments demonstrate the substantial limitations and pitfalls of model-free algorithms under changing dynamics. In conclusion, we present a summary of our findings and remarks. We then outline potential future work to improve our methodology and possibly expand to real-world systems.

    Active Markov Information-Theoretic Path Planning for Robotic Environmental Sensing

    Recent research in multi-robot exploration and mapping has focused on sampling environmental fields, which are typically modeled using the Gaussian process (GP). Existing information-theoretic exploration strategies for learning GP-based environmental field maps adopt the non-Markovian problem structure and consequently scale poorly with the length of history of observations. Hence, it becomes computationally impractical to use these strategies for in situ, real-time active sampling. To ease this computational burden, this paper presents a Markov-based approach to efficient information-theoretic path planning for active sampling of GP-based fields. We analyze the time complexity of solving the Markov-based path planning problem, and demonstrate analytically that it scales better than that of deriving the non-Markovian strategies with increasing length of planning horizon. For a class of exploration tasks called the transect sampling task, we provide theoretical guarantees on the active sampling performance of our Markov-based policy, from which ideal environmental field conditions and sampling task settings can be established to limit its performance degradation due to violation of the Markov assumption. Empirical evaluation on real-world temperature and plankton density field data shows that our Markov-based policy can generally achieve active sampling performance comparable to that of the widely-used non-Markovian greedy policies under less favorable realistic field conditions and task settings while enjoying significant computational gain over them. Comment: 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2011), Extended version with proofs, 11 pages