630 research outputs found

    Online Meta-learning by Parallel Algorithm Competition

    Full text link
    The efficiency of reinforcement learning algorithms depends critically on a few meta-parameters that modulates the learning updates and the trade-off between exploration and exploitation. The adaptation of the meta-parameters is an open question in reinforcement learning, which arguably has become more of an issue recently with the success of deep reinforcement learning in high-dimensional state spaces. The long learning times in domains such as Atari 2600 video games makes it not feasible to perform comprehensive searches of appropriate meta-parameter values. We propose the Online Meta-learning by Parallel Algorithm Competition (OMPAC) method. In the OMPAC method, several instances of a reinforcement learning algorithm are run in parallel with small differences in the initial values of the meta-parameters. After a fixed number of episodes, the instances are selected based on their performance in the task at hand. Before continuing the learning, Gaussian noise is added to the meta-parameters with a predefined probability. We validate the OMPAC method by improving the state-of-the-art results in stochastic SZ-Tetris and in standard Tetris with a smaller, 10×\times10, board, by 31% and 84%, respectively, and by improving the results for deep Sarsa(λ\lambda) agents in three Atari 2600 games by 62% or more. The experiments also show the ability of the OMPAC method to adapt the meta-parameters according to the learning progress in different tasks.Comment: 15 pages, 10 figures. arXiv admin note: text overlap with arXiv:1702.0311

    Reinforcement Learning for the Unit Commitment Problem

    Full text link
    In this work we solve the day-ahead unit commitment (UC) problem, by formulating it as a Markov decision process (MDP) and finding a low-cost policy for generation scheduling. We present two reinforcement learning algorithms, and devise a third one. We compare our results to previous work that uses simulated annealing (SA), and show a 27% improvement in operation costs, with running time of 2.5 minutes (compared to 2.5 hours of existing state-of-the-art).Comment: Accepted and presented in IEEE PES PowerTech, Eindhoven 2015, paper ID 46273

    Reinforcement learning in continuous state- and action-space

    Get PDF
    Reinforcement learning in the continuous state-space poses the problem of the inability to store the values of all state-action pairs in a lookup table, due to both storage limitations and the inability to visit all states sufficiently often to learn the correct values. This can be overcome with the use of function approximation techniques with generalisation capability, such as artificial neural networks, to store the value function. When this is applied we can select the optimal action by comparing the values of each possible action; however, when the action-space is continuous this is not possible. In this thesis we investigate methods to select the optimal action when artificial neural networks are used to approximate the value function, through the application of numerical optimization techniques. Although it has been stated in the literature that gradient-ascent methods can be applied to the action selection [47], it is also stated that solving this problem would be infeasible, and therefore, is claimed that it is necessary to utilise a second artificial neural network to approximate the policy function [21, 55]. The major contributions of this thesis include the investigation of the applicability of action selection by numerical optimization methods, including gradient-ascent along with other derivative-based and derivative-free numerical optimization methods,and the proposal of two novel algorithms which are based on the application of two alternative action selection methods: NM-SARSA [40] and NelderMead-SARSA. We empirically compare the proposed methods to state-of-the-art methods from the literature on three continuous state- and action-space control benchmark problems from the literature: minimum-time full swing-up of the Acrobot; Cart-Pole balancing problem; and a double pole variant. We also present novel results from the application of the existing direct policy search method genetic programming to the Acrobot benchmark problem [12, 14]

    Study on stock trading and portfolio optimization using genetic network programming

    Get PDF
    制度:新 ; 報告番号:甲3002号 ; 学位の種類:博士(工学) ; 授与年月日: 2010/3/15 ; 早大学位記番号:新525

    Exploration of genetic network programming with two-stage reinforcement learning for mobile robot

    Get PDF
    This paper observes the exploration of Genetic Network Programming Two-Stage Reinforcement Learning for mobile robot navigation. The proposed method aims to observe its exploration when inexperienced environments used in the implementation. In order to deal with this situation, individuals are trained firstly in the training phase, that is, they learn the environment with ϵ-greedy policy and learning rate α parameters. Here, two cases are studied, i.e., case A for low exploration and case B for high exploration. In the implementation, the individuals implemented to get experience and learn a new environment on-line. Then, the performance of learning processes are observed due to the environmental changes
    corecore