
    Discretized Bayesian pursuit – A new scheme for reinforcement learning

    The success of Learning Automata (LA)-based estimator algorithms over the classical Linear Reward-Inaction (LRI)-like schemes can be explained by their ability to pursue the actions with the highest reward probability estimates. Without access to reward probability estimates, it makes sense for schemes like the LRI to first make large exploring steps, and then to gradually turn exploration into exploitation by making progressively smaller learning steps. However, this behavior becomes counter-intuitive when pursuing actions based on their estimated reward probabilities. Learning should then ideally proceed in progressively larger steps, as the reward probability estimates become more accurate. This paper introduces a new estimator algorithm, the Discretized Bayesian Pursuit Algorithm (DBPA), that achieves this. The DBPA is implemented by linearly discretizing the action probability space of the Bayesian Pursuit Algorithm (BPA) [1]. The key innovation is that the linear discrete updating rules mitigate the counter-intuitive behavior of the corresponding linear continuous updating rules by augmenting them with the reward probability estimates. Extensive experimental results show the superiority of the DBPA over previous estimator algorithms. Indeed, the DBPA is probably the fastest reported LA to date.
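    To make the scheme concrete, the following is a minimal sketch of a discretized pursuit update driven by Bayesian reward estimates, assuming Bernoulli feedback and a uniform Beta prior. The class name, the resolution parameter, and the use of the posterior mean (the BPA proper uses a posterior quantile) are illustrative assumptions, not the authors' reference implementation.

        import random

        class DiscretizedBayesianPursuit:
            """Illustrative sketch of a DBPA-style learner, not the authors' code."""

            def __init__(self, r, resolution):
                self.p = [1.0 / r] * r                # action probability vector
                self.delta = 1.0 / (r * resolution)   # fixed discretization step
                self.wins = [0] * r                   # Bernoulli successes per action
                self.losses = [0] * r                 # Bernoulli failures per action

            def choose(self):
                # Sample an action according to the current probability vector.
                u, acc = random.random(), 0.0
                for i, pi in enumerate(self.p):
                    acc += pi
                    if u <= acc:
                        return i
                return len(self.p) - 1

            def update(self, action, reward):
                if reward:
                    self.wins[action] += 1
                else:
                    self.losses[action] += 1
                # Posterior-mean reward estimates from Beta(wins + 1, losses + 1).
                est = [(w + 1) / (w + l + 2) for w, l in zip(self.wins, self.losses)]
                m = est.index(max(est))
                # Discretized pursuit: fixed-size steps toward the best estimated
                # action, so the increment does not shrink near unity.
                for j in range(len(self.p)):
                    if j != m:
                        self.p[j] = max(self.p[j] - self.delta, 0.0)
                self.p[m] = 1.0 - sum(pj for j, pj in enumerate(self.p) if j != m)

    A typical driver loop calls choose(), obtains a 0/1 reward from the environment, and passes both back into update() until one component of the probability vector reaches unity.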

    On incorporating the paradigms of discretization and Bayesian estimation to create a new family of pursuit learning automata

    There are currently two fundamental paradigms that have been used to enhance the convergence speed of Learning Automata (LA). The first involves the concept of utilizing the estimates of the reward probabilities, while the second involves discretizing the probability space in which the LA operates. This paper demonstrates how both of these can be simultaneously utilized, in particular by using the family of Bayesian estimates that have been proven to have distinct advantages over their maximum likelihood counterparts. The success of LA-based estimator algorithms over the classical Linear Reward-Inaction (LRI)-like schemes can be explained by their ability to pursue the actions with the highest reward probability estimates. Without access to reward probability estimates, it makes sense for schemes like the LRI to first make large exploring steps, and then to gradually turn exploration into exploitation by making progressively smaller learning steps. However, this behavior becomes counter-intuitive when pursuing actions based on their estimated reward probabilities. Learning should then ideally proceed in progressively larger steps, as the reward probability estimates become more accurate. This paper introduces a new estimator algorithm, the Discretized Bayesian Pursuit Algorithm (DBPA), that achieves this by incorporating both of the above paradigms. The DBPA is implemented by linearly discretizing the action probability space of the Bayesian Pursuit Algorithm (BPA) (Zhang et al. in IEA-AIE 2011, Springer, New York, pp. 608-620, 2011). The key innovation of this paper is that the linear discrete updating rules mitigate the counter-intuitive behavior of the corresponding linear continuous updating rules by augmenting them with the reward probability estimates. Extensive experimental results show the superiority of the DBPA over previous estimator algorithms. Indeed, the DBPA is probably the fastest reported LA to date. Apart from the rigorous experimental demonstration of the strength of the DBPA, the paper also briefly records the proofs of why the BPA and the DBPA are ε-optimal in stationary environments.
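    The advantage of Bayesian estimates over maximum-likelihood ones is easy to see in code. The sketch below assumes Bernoulli feedback with a conjugate uniform Beta prior; the use of an upper posterior quantile is one choice that has been reported for the BPA and is treated here as an assumption.

        from scipy.stats import beta

        def mle_estimate(successes, trials):
            # Undefined before the first trial; conventionally 0 here.
            return successes / trials if trials else 0.0

        def bayesian_estimate(successes, failures, quantile=0.95):
            # Beta(successes + 1, failures + 1) posterior under a uniform prior;
            # an upper quantile gives an optimistic, always-defined estimate.
            return beta.ppf(quantile, successes + 1, failures + 1)

    With zero observations the Bayesian estimate is still well defined (0.95 under the uniform prior), which is what lets the pursuit behave sensibly before every action has been sampled.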

    The Hierarchical Discrete Learning Automaton Suitable for Environments with Many Actions and High Accuracy Requirements

    Since its early beginnings, the paradigm of Learning Automata (LA) has attracted much interest. Over the last decades, new concepts and various improvements have been introduced to increase the LA's speed and accuracy, including employing probability updating functions, discretizing the probability space, and implementing the "Pursuit" concept. The concept of incorporating "structure" into the ordering of the LA's actions is one of the latest advancements in the field, leading to the ϵ-optimal Hierarchical Continuous Pursuit LA (HCPA) that has superior performance to other LA variants when the number of actions is large. Although the previously proposed HCPA is powerful, its speed suffers when the required action probability of an action approaches unity. The reason for this slow convergence is that the learning parameter operates in a multiplicative manner within the probability space, making the increment of the action probability smaller as that probability comes close to unity. Therefore, we propose the novel Hierarchical Discrete Learning Automata (HDPA) in this paper, which does not possess the same impediment as the HCPA. The proposed machine infuses the principle of discretization into the action probability vector's updating functionality, where this type of updating is invoked recursively at every depth within a hierarchical tree structure, and the best estimated action is pursued in all iterations through utilization of the Estimator phenomenon. The proposed machine is ϵ-optimal, and our experimental results demonstrate that the number of iterations required before convergence is significantly reduced for the HDPA, when compared with the HCPA.
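    The recursive structure can be sketched as a balanced binary tree in which every internal node runs a two-action discretized update, and the root-to-leaf path selects one of the 2^depth actions. This is a simplified illustration under stated assumptions (a full HDPA also keeps reward estimates at the leaves and pursues the best estimated leaf along its entire path), not the paper's algorithm verbatim.

        import random

        class TreeNode:
            """Two-action discretized chooser at one level of the hierarchy."""

            def __init__(self, resolution=100):
                self.p_left = 0.5
                self.delta = 1.0 / resolution  # fixed step, never multiplicative

            def choose_left(self):
                return random.random() < self.p_left

            def pursue(self, best_is_left):
                # Move a constant-size step toward the better subtree, so the
                # increment stays the same even when p_left is near 0 or 1.
                step = self.delta if best_is_left else -self.delta
                self.p_left = min(max(self.p_left + step, 0.0), 1.0)

    Selecting an action walks choose_left() from the root to a leaf; after the environment responds, pursue() is applied at every node on the path toward the subtree containing the best estimated action.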

    Distributed learning automata-based scheme for classification using novel pursuit scheme

    This is a post-peer-review, pre-copyedit version of an article published in Applied Intelligence. The final authenticated version is available online at http://dx.doi.org/10.1007/s10489-019-01627-w.

    The Hierarchical Discrete Pursuit Learning Automaton: A Novel Scheme With Fast Convergence and Epsilon-Optimality

    Since the early 1960s, the paradigm of learning automata (LA) has experienced abundant interest. Arguably, it has also served as the foundation for the phenomenon and field of reinforcement learning (RL). Over the decades, new concepts and fundamental principles have been introduced to increase the LA's speed and accuracy. These include using probability updating functions, discretizing the probability space, and using the "Pursuit" concept. Very recently, the concept of incorporating "structure" into the ordering of the LA's actions has improved both the speed and accuracy of the corresponding hierarchical machines when the number of actions is large. This has led to the ϵ-optimal hierarchical continuous pursuit LA (HCPA). This article pioneers the inclusion of all the above-mentioned phenomena into a new single LA, leading to the novel hierarchical discretized pursuit LA (HDPA). Indeed, although the previously proposed HCPA is powerful, its speed is impeded when any action probability is close to unity, because the updates to the components of the probability vector become correspondingly smaller as that probability approaches unity. We propose here the novel HDPA, where we infuse the phenomenon of discretization into the action probability vector's updating functionality, invoked recursively at every stage of the machine's hierarchical structure. This discretized functionality does not possess the same impediment, because discretization prohibits it. We demonstrate the HDPA's robustness and validity by formally proving its ϵ-optimality by utilizing the moderation property. We also invoke the submartingale characteristic at every level to prove that the action probability of the optimal action converges to unity as time goes to infinity. Apart from the new machine being ϵ-optimal, the numerical results demonstrate that the number of iterations required for convergence is significantly reduced.
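    The impediment can be made concrete with the standard pursuit updates. Toward the best estimated action m, a continuous scheme with learning rate \lambda updates (a sketch in conventional LA notation, not necessarily the article's exact equations):

        p_m(t+1) = p_m(t) + \lambda \bigl(1 - p_m(t)\bigr), \qquad
        p_j(t+1) = (1 - \lambda)\, p_j(t) \quad (j \neq m),

    so the increment \lambda (1 - p_m(t)) vanishes as p_m(t) \to 1. The discretized counterpart with fixed step \Delta,

        p_j(t+1) = \max\bigl(p_j(t) - \Delta,\, 0\bigr) \quad (j \neq m), \qquad
        p_m(t+1) = 1 - \sum_{j \neq m} p_j(t+1),

    keeps a constant step size near unity, which is exactly what removes the slowdown.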

    Towards Thompson Sampling for Complex Bayesian Reasoning

    Thompson Sampling (TS) is a state-of-the-art algorithm for bandit problems set in a Bayesian framework. Both the theoretical foundation and the empirical efficiency of TS are well explored for plain bandit problems. However, the Bayesian underpinning of TS means that TS could potentially be applied to other, more complex problems as well, beyond the bandit problem, if suitable Bayesian structures can be found. The objective of this thesis is the development and analysis of TS-based schemes for more complex optimization problems, founded on Bayesian reasoning. We address several complex optimization problems where the previous state of the art relies on a relatively myopic perspective on the problem. These include stochastic searching on the line, the Goore game, the knapsack problem, travel time estimation, and equipartitioning. Instead of employing Bayesian reasoning to obtain a solution, these approaches rely on carefully engineered rules. In brief, we recast each of these optimization problems in a Bayesian framework, introducing dedicated TS-based solution schemes. For all of the addressed problems, the results show that besides being more effective, the TS-based approaches we introduce are also capable of solving more adverse versions of the problems, such as dealing with stochastic liars.
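    For reference, the plain Bernoulli-bandit version of TS that the thesis builds on can be written in a few lines. This is the standard textbook scheme with Beta(1, 1) priors; the function name and the environment callback are illustrative.

        import random

        def thompson_sampling(pull, n_arms, horizon):
            """Bernoulli Thompson Sampling with uniform Beta(1, 1) priors."""
            a = [1] * n_arms  # posterior alphas (successes + 1)
            b = [1] * n_arms  # posterior betas  (failures + 1)
            for _ in range(horizon):
                # Draw one plausible reward probability per arm from its
                # posterior, then play the arm whose draw is largest.
                draws = [random.betavariate(a[i], b[i]) for i in range(n_arms)]
                arm = draws.index(max(draws))
                if pull(arm):  # environment callback returning True/False
                    a[arm] += 1
                else:
                    b[arm] += 1
            return a, b

    For example, thompson_sampling(lambda i: random.random() < (0.4, 0.6)[i], 2, 1000) concentrates its pulls on the second arm. The thesis replaces this flat arm structure with Bayesian models dedicated to each of the problems listed above.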

    Simultaneous Process Design and Control Optimization using Reinforcement Learning

    With an ever-increasing population and rising standards of healthcare, the demand for energy and natural resources will inevitably grow. It is therefore important to design highly efficient and sustainable chemical processes. The performance of a chemical plant is strongly affected by its design and control: a design cannot be evaluated without its controls, and vice versa. To address design and control optimally and simultaneously, one must formulate a bi-level mixed-integer nonlinear program with a dynamic optimization problem as the inner problem; this is intractable. However, by computing an optimal policy using reinforcement learning, a controller with a closed-form expression can be found and embedded into the mathematical program. In this work, an approach is proposed that uses a policy gradient method together with mathematical programming to solve the problem simultaneously. The approach was tested in two case studies, and the performance of the controller was evaluated. It was shown that the proposed approach outperforms current state-of-the-art control strategies. This opens a whole new range of possibilities for addressing the simultaneous design and control of engineering systems.
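    The policy-gradient core of such an approach can be sketched as a REINFORCE update for a linear state-feedback controller with Gaussian exploration noise. The controller form, the unit noise variance, and the hyperparameters are assumptions for illustration, not the paper's formulation.

        import numpy as np

        def reinforce_step(theta, rollout, lr=0.01):
            """One REINFORCE update for the policy u = theta @ x + N(0, 1).

            rollout: list of (x, u, ret) tuples -- state, control, return-to-go.
            """
            grad = np.zeros_like(theta)
            for x, u, ret in rollout:
                # Gradient of log pi(u | x) for a unit-variance Gaussian policy,
                # weighted by the return-to-go of that step.
                grad += (u - theta @ x) * x * ret
            return theta + lr * grad / max(len(rollout), 1)

    Once training converges, theta defines the closed-form controller u = theta @ x that can be embedded as an algebraic constraint in the outer design program.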

    Demonstrating Advantages of Neuromorphic Computation: A Pilot Study

    Neuromorphic devices represent an attempt to mimic aspects of the brain's architecture and dynamics with the aim of replicating its hallmark functional capabilities in terms of computational power, robust learning and energy efficiency. We employ a single-chip prototype of the BrainScaleS 2 neuromorphic system to implement a proof-of-concept demonstration of reward-modulated spike-timing-dependent plasticity in a spiking network that learns to play the Pong video game by smooth pursuit. This system combines an electronic mixed-signal substrate for emulating neuron and synapse dynamics with an embedded digital processor for on-chip learning, which in this work also serves to simulate the virtual environment and learning agent. The analog emulation of neuronal membrane dynamics enables a 1000-fold acceleration with respect to biological real-time, with the entire chip operating on a power budget of 57 mW. Compared to an equivalent simulation using state-of-the-art software, the on-chip emulation is at least one order of magnitude faster and three orders of magnitude more energy-efficient. We demonstrate how on-chip learning can mitigate the effects of fixed-pattern noise, which is unavoidable in analog substrates, while making use of temporal variability for action exploration. Learning compensates for imperfections of the physical substrate, as manifested in neuronal parameter variability, by adapting synaptic weights to match the respective excitability of individual neurons.
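    Reward-modulated STDP of the kind demonstrated here is commonly written as a two-factor rule: an STDP-shaped eligibility trace per synapse, gated by a reward signal. The sketch below is a generic pair-based version with illustrative time constants, not the BrainScaleS 2 on-chip kernel.

        import numpy as np

        def stdp_eligibility(pre_times, post_times, tau=20.0, a_plus=1.0, a_minus=1.0):
            """Pair-based STDP eligibility for one synapse (spike times in ms).
            Pre-before-post potentiates, post-before-pre depresses."""
            e = 0.0
            for t_pre in pre_times:
                for t_post in post_times:
                    dt = t_post - t_pre
                    if dt > 0:
                        e += a_plus * np.exp(-dt / tau)
                    elif dt < 0:
                        e -= a_minus * np.exp(dt / tau)
            return e

        def r_stdp_update(w, eligibility, reward, baseline, lr=0.005):
            # The correlation trace only becomes a weight change when gated
            # by the reward-prediction error (reward minus its baseline).
            return w + lr * (reward - baseline) * eligibility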