    Reinforcement learning in continuous state- and action-space

    Reinforcement learning in a continuous state-space poses the problem that the values of all state-action pairs cannot be stored in a lookup table, both because of storage limitations and because it is impossible to visit every state sufficiently often to learn the correct values. This can be overcome by using function approximation techniques with generalisation capability, such as artificial neural networks, to store the value function. The optimal action can then be selected by comparing the values of each possible action; however, when the action-space is continuous this is not possible. In this thesis we investigate methods to select the optimal action when artificial neural networks are used to approximate the value function, through the application of numerical optimisation techniques. Although it has been stated in the literature that gradient-ascent methods can be applied to action selection [47], it has also been stated that solving this problem would be infeasible, and it is therefore claimed that a second artificial neural network must be utilised to approximate the policy function [21, 55]. The major contributions of this thesis are the investigation of action selection by numerical optimisation methods, including gradient ascent along with other derivative-based and derivative-free numerical optimisation methods, and the proposal of two novel algorithms based on two alternative action selection methods: NM-SARSA [40] and NelderMead-SARSA. We empirically compare the proposed methods to state-of-the-art methods from the literature on three continuous state- and action-space control benchmark problems: minimum-time full swing-up of the Acrobot, the Cart-Pole balancing problem, and a double-pole variant. We also present novel results from the application of the existing direct policy search method, genetic programming, to the Acrobot benchmark problem [12, 14].
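    To make the action selection idea concrete, the following is a minimal sketch (not the thesis code) of choosing the greedy action by gradient ascent on a learned value function Q(s, a), assuming a smooth approximator and a box-constrained action-space; the quadratic q_value here is a hypothetical stand-in for a trained neural network, and the gradient is taken by finite differences.

        import numpy as np

        def q_value(state, action):
            # Hypothetical smooth Q-function standing in for a trained network.
            return -np.sum((action - 0.3 * state) ** 2)

        def greedy_action(state, a_min, a_max, steps=50, lr=0.1, eps=1e-5):
            # Gradient-ascent search for argmax_a Q(s, a) over a bounded
            # continuous action-space, using central finite differences.
            a = (a_min + a_max) / 2.0          # start at the centre of the range
            for _ in range(steps):
                grad = np.zeros_like(a)
                for i in range(a.size):        # finite-difference gradient w.r.t. a
                    da = np.zeros_like(a)
                    da[i] = eps
                    grad[i] = (q_value(state, a + da)
                               - q_value(state, a - da)) / (2 * eps)
                a = np.clip(a + lr * grad, a_min, a_max)   # ascend, respect bounds
            return a

        s = np.array([1.0, -2.0])
        print(greedy_action(s, np.array([-1.0, -1.0]), np.array([1.0, 1.0])))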

    A comparison of action selection methods for implicit policy method reinforcement learning in continuous action-space

    In this paper I investigate methods of applying reinforcement learning to continuous state- and action-space problems without a policy function. I compare the performance of four methods: one is discretisation of the action-space, and the other three are optimisation techniques applied to finding the greedy action without discretisation. The optimisation methods I apply are gradient descent, Nelder-Mead and Newton's Method. The action selection methods are applied in conjunction with the SARSA algorithm, with a multilayer perceptron utilised to approximate the value function. The approaches are applied to two simulated continuous state- and action-space control problems: Cart-Pole and double Cart-Pole. The results are compared both in terms of action selection time and the number of trials required to train on the benchmark problems.
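    As an illustration of the derivative-free route, here is a minimal sketch (assuming SciPy is available) of greedy action selection by Nelder-Mead: the optimiser is applied to the negated value function, so only evaluations of Q are needed; the q_value below is a hypothetical stand-in for the trained multilayer perceptron.

        import numpy as np
        from scipy.optimize import minimize

        def q_value(state, action):
            # Stand-in for a trained multilayer perceptron Q(s, a).
            return -(action[0] - np.tanh(state).sum()) ** 2

        def greedy_action_nelder_mead(state, a0):
            # Nelder-Mead is derivative-free: it needs only Q evaluations.
            res = minimize(lambda a: -q_value(state, a), a0, method="Nelder-Mead")
            return res.x

        state = np.array([0.2, -0.5])
        print(greedy_action_nelder_mead(state, a0=np.zeros(1)))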

    An echo state model of non-Markovian reinforcement learning

    There exists a growing need for intelligent, autonomous control strategies that operate in real-world domains. Theoretically the state-action space must exhibit the Markov property for reinforcement learning to be applicable. Empirical evidence, however, suggests that reinforcement learning also applies to domains where the state-action space is only approximately Markovian, a relaxation required for the overwhelming majority of real-world domains. These domains, termed non-Markovian reinforcement learning domains, raise a unique set of practical challenges. The reconstruction dimension required to approximate a Markovian state-space is unknown a priori and can potentially be large. Further, the spatial complexity of local function approximation of the reinforcement learning domain grows exponentially with the reconstruction dimension. Parameterized dynamic systems alleviate both embedding-length and state-space dimensionality concerns by reconstructing an approximate Markovian state-space via a compact, recurrent representation. Yet this representation extracts a cost: modeling reinforcement learning domains via adaptive, parameterized dynamic systems is characterized by instability, slow convergence, and high computational or spatial training complexity. The objective of this research is to demonstrate a stable, convergent, accurate, and scalable model of non-Markovian reinforcement learning domains. This objective is fulfilled via fixed point analysis of the dynamics underlying the reinforcement learning domain and the Echo State Network, a class of parameterized dynamic system. Understanding models of non-Markovian reinforcement learning domains requires understanding the interactions between learning domains and their models. Fixed point analysis of the Mountain Car Problem reinforcement learning domain, for both local and nonlocal function approximations, suggests a close relationship between the locality of the approximation and the number and severity of bifurcations of the fixed point structure. This research suggests the likely cause of this relationship: reinforcement learning domains exist within a dynamic feature space in which trajectories are analogous to states. The fixed point structure maps dynamic space onto state-space. This explanation suggests two testable hypotheses. First, reinforcement learning is sensitive to state-space locality because states cluster as trajectories in time rather than space. Second, models using trajectory-based features should exhibit good modeling performance and few changes in fixed point structure. Analysis of the performance of a lookup table, a feedforward neural network, and an Echo State Network (ESN) on the Mountain Car Problem confirms these hypotheses. The ESN is a large, sparse, randomly-generated, unadapted recurrent neural network, which adapts a linear projection of the target domain onto the hidden layer. ESN modeling results on reinforcement learning domains show it achieves performance comparable to lookup table and neural network architectures on the Mountain Car Problem with minimal changes to fixed point structure. The ESN also achieves lookup-table-caliber performance when modeling Acrobot, a four-dimensional control problem, but is less successful modeling the lower-dimensional Modified Mountain Car Problem. These performance discrepancies are attributed to the ESN's excellent ability to represent complex short-term dynamics and its inability to consolidate long temporal dependencies into a static memory. Without memory consolidation, reinforcement learning domains exhibiting attractors with multiple dynamic scales are unlikely to be well-modeled via an ESN. To mediate this problem, a simple ESN memory consolidation method is presented and tested on stationary dynamic systems. These results indicate the potential to improve modeling performance in reinforcement learning domains via memory consolidation.
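    For readers unfamiliar with the architecture, the following is a minimal echo state network sketch in numpy: a fixed, random, unadapted reservoir with only a linear ridge-regression readout trained, here on one-step-ahead prediction of a sine wave. The sizes, spectral radius and ridge constant are illustrative assumptions, not the dissertation's settings.

        import numpy as np

        rng = np.random.default_rng(0)
        n_res, washout, ridge = 200, 100, 1e-6

        # Fixed, unadapted reservoir: rescale a random matrix to spectral radius 0.9.
        W = rng.standard_normal((n_res, n_res))
        W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
        W_in = rng.uniform(-0.5, 0.5, size=(n_res, 1))

        u = np.sin(np.linspace(0, 20 * np.pi, 2000))[:, None]   # input signal
        states = np.zeros((len(u), n_res))
        x = np.zeros(n_res)
        for t in range(len(u) - 1):
            x = np.tanh(W @ x + W_in @ u[t])                    # reservoir update
            states[t] = x

        # Train only the linear readout (ridge regression): state -> next input.
        X, Y = states[washout:-1], u[washout + 1:]
        W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
        print("train MSE:", np.mean((X @ W_out - Y) ** 2))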

    A gene regulatory network model for control

    The activity of a biological cell is regulated by interactions between genes and proteins. In artificial intelligence, this has led to the creation of developmental gene regulatory network (GRN) models, which aim to exploit these mechanisms to algorithmically build complex designs. The emerging field of GRNs for control aims instead to exploit these natural mechanisms, and their ability to encode a large variety of behaviours within a single evolvable genetic program, for the solution of control problems. This work aims to extend the application domain of GRN models to previously unsolved control problems; the focus here will be on reinforcement learning problems, in which the dynamics of the controlled system are hidden from the controller and only sparse feedback is given to it. This category of problems closely matches the challenges faced by natural evolution in generating biological GRNs. Starting with an existing GRN model, the fractal GRN (FGRN) model, a successful application to a standard control problem will be presented, followed by multiple improvements to the FGRN model and its associated genetic algorithm, resulting in better performance in terms of both reliability and speed. Limitations will be identified in the FGRN model, leading to the introduction of the Input-Merge-Regulate-Output (IMRO) architecture for GRN models, an implementation of which will show both quantitative and qualitative improvements over the FGRN model, solving harder control problems. The resulting model also displays useful features which should facilitate further extension and real-world use of the system.
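    As a rough illustration of the kind of dynamics such controllers build on (generic GRN-style dynamics, not the FGRN or IMRO models themselves), each gene's expression level can be updated through a saturating function of weighted regulatory inputs, with some genes driven by sensors and others read out as control signals; all constants below are illustrative assumptions.

        import numpy as np

        def grn_step(expr, W, inputs, tau=0.2):
            # One update of gene expression levels: positive entries of the
            # regulation matrix W enhance expression, negative entries inhibit it.
            drive = W @ expr + inputs
            return (1 - tau) * expr + tau / (1.0 + np.exp(-drive))

        rng = np.random.default_rng(1)
        n_genes = 8
        W = rng.normal(0.0, 1.0, (n_genes, n_genes))     # the evolvable "genome"
        expr = np.full(n_genes, 0.5)
        for _ in range(50):
            sensor = np.r_[0.7, np.zeros(n_genes - 1)]   # sensor input drives gene 0
            expr = grn_step(expr, W, sensor)
        print("control outputs:", expr[-2:])             # output genes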

    Enhanced Bees Algorithm with fuzzy logic and Kalman filtering

    The Bees Algorithm is a new population-based optimisation procedure which employs a combination of global exploratory and local exploitatory search. This thesis introduces an enhanced version of the Bees Algorithm which implements a fuzzy logic system for greedy selection of local search sites. The proposed fuzzy greedy selection system reduces the number of parameters needed to run the Bees Algorithm. The proposed algorithm has been applied to a number of benchmark function optimisation problems to demonstrate its robustness and self-organising ability. The Bees Algorithm in both its basic and enhanced forms has been used to optimise the parameters of a fuzzy logic controller. The purpose of the controller is to stabilise and balance an under-actuated two-link acrobatic robot (ACROBOT) in the upright position. Kalman filtering, as a fast convergence gradient-based optimisation method, is introduced as an alternative to random neighbourhood search to guide worker bees speedily towards the optima of local search sites. The proposed method has been used to tune membership functions for a fuzzy logic system. Finally, the fuzzy greedy selection system is enhanced by using multiple independent criteria to select local search sites. The enhanced fuzzy selection system has again been used with Kalman filtering to speed up the Bees Algorithm. The resulting algorithm has been applied to train a Radial Basis Function (RBF) neural network for wood defect identification. The results obtained show that the changes made to the Bees Algorithm in this research have significantly improved its performance. This is because these enhancements maintain the robust global search attribute of the Bees Algorithm and improve its local search procedure.
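    For reference, here is a compact sketch of the basic Bees Algorithm on a benchmark function (the sphere function), showing the global-scout / local-recruit structure; the parameter values are illustrative, and the fuzzy greedy site selection and Kalman-filter guidance proposed in the thesis are not reproduced here.

        import numpy as np

        rng = np.random.default_rng(2)

        def f(x):                  # benchmark to minimise: sphere function
            return np.sum(x ** 2)

        dim, lo, hi = 5, -5.0, 5.0
        n_scouts, n_sites, n_recruits, ngh, iters = 30, 5, 10, 0.5, 100

        bees = rng.uniform(lo, hi, (n_scouts, dim))
        for _ in range(iters):
            bees = bees[np.argsort([f(b) for b in bees])]   # rank by fitness
            for i in range(n_sites):                        # local search at best sites
                patch = bees[i] + rng.uniform(-ngh, ngh, (n_recruits, dim))
                best = min(np.clip(patch, lo, hi), key=f)
                if f(best) < f(bees[i]):                    # greedy site update
                    bees[i] = best
            # remaining bees scout the search space globally
            bees[n_sites:] = rng.uniform(lo, hi, (n_scouts - n_sites, dim))

        print("best fitness:", min(f(b) for b in bees))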

    Intelligent model-based control of complex multi-link mechanisms

    A complex under-actuated multi-link mechanism is a system whose number of control inputs is smaller than the dimension of its configuration space. The ability to control such a system through the manipulation of its natural dynamics would allow for the design of more energy-efficient machines capable of smooth motions similar to those found in the natural world. This research aims to understand the complex nature of the Robogymnast, a triple-link underactuated pendulum built at Cardiff University for studying the behaviour of non-linear systems, and the challenges in developing its control system. A mathematical model of the robot was derived from the Euler-Lagrange equations. The design of the control system was based on the discrete-time linear model around the downward position and a sampling time of 2.5 milliseconds. Firstly, Invasive Weed Optimization (IWO) was used to optimise the swing-up motion of the robot by determining the optimum values of the parameters that control the input signals of the Robogymnast's two motors. The values obtained from IWO were then applied in both simulation and experiment. The results showed that the swing-up motion of the Robogymnast from the stable downward position to the inverted configuration was successfully achieved. Secondly, due to the complex nature and nonlinearity of the Robogymnast, a novel approach of modelling the Robogymnast using a multi-layered Elman neural network (ENN) was proposed. The ENN model was then tested with various inputs and its outputs were analysed. The results showed the ENN model to be capable of providing a better representation of the actual system than the mathematical model. Thirdly, IWO was used to investigate the optimum Q values of the Linear Quadratic Regulator (LQR) for inverted balance control of the Robogymnast, i.e. the Q values required by the LQR to maintain the Robogymnast in an upright configuration. Two fitness criteria were investigated: cost function J and settling time T. A controller was developed using the values obtained from each fitness criterion. The results showed that LQRT performed faster but LQRJ was capable of stabilising the Robogymnast from larger deflection angles. Finally, fitness criteria J and T were used simultaneously to obtain the optimal Q values for the LQR. For this purpose, two multi-objective optimisation methods based on the IWO, namely the Weighted Criteria Method IWO (WCMIWO) and the Fuzzy Logic IWO Hybrid (FLIWOH), were developed. Two LQR controllers were first developed using the parameters obtained from the two optimisation methods. The same process was then repeated with disturbance applied to the Robogymnast states to develop another two LQR controllers. The responses of the controllers were then tested in different scenarios using simulation, and their performance was evaluated. The results showed that all four controllers were able to balance the Robogymnast, with the fastest settling time achieved by WCMIWO with disturbance, followed in ascending order by FLIWOH with disturbance, FLIWOH, and WCMIWO.
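    As a sketch of the balance-control step described above: given a discrete-time linearisation (A, B) of the robot about the upright position, the LQR gain can be obtained by iterating the discrete Riccati equation, with Q playing the role of the matrix tuned by IWO. The A and B below are illustrative placeholders, not the Robogymnast model.

        import numpy as np

        def dlqr(A, B, Q, R, iters=500):
            # Discrete-time LQR gain via fixed-point iteration of the
            # Riccati equation: P <- Q + A'P(A - BK).
            P = Q.copy()
            for _ in range(iters):
                K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
                P = Q + A.T @ P @ (A - B @ K)
            return K

        dt = 0.0025                                # 2.5 ms sampling time
        A = np.array([[1.0, dt], [0.02, 1.0]])     # placeholder linearised dynamics
        B = np.array([[0.0], [dt]])
        Q = np.diag([100.0, 1.0])                  # candidate Q from the optimiser
        R = np.array([[0.1]])
        K = dlqr(A, B, Q, R)
        x = np.array([0.05, 0.0])                  # small deflection from upright
        print("gain:", K, "control:", -K @ x)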

    A reinforcement learning design for HIV clinical trials

    A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science, Johannesburg, 2014. Determining effective treatment strategies for life-threatening illnesses such as HIV is a significant problem in clinical research. Currently, HIV treatment involves using combinations of anti-HIV drugs to inhibit the formation of drug-resistant strains. From a clinician's perspective, this usually requires careful selection of drugs on the basis of an individual's immune responses at a particular time. As the number of drugs available for treatment increases, this task becomes difficult. In a clinical trial setting, the task is even more challenging since experience using new drugs is limited. For these reasons, this research examines whether machine learning techniques, and more specifically batch reinforcement learning, can be used to determine the appropriate treatment for an HIV-infected patient at a particular time. To do so, we consider fitted Q-iteration with extremely randomized trees, neural fitted Q-iteration and least squares policy iteration. The use of batch reinforcement learning means that samples of patient data are captured prior to learning, to avoid imposing risks on a patient. Because samples are re-used, these methods are data-efficient and particularly suited to situations where large amounts of data are unavailable. We apply each of these learning methods to both numerically generated and real data sets. Results from this research highlight the advantages and disadvantages associated with each learning technique. Real data testing has revealed that these batch reinforcement learning techniques can suggest treatments that are reasonably consistent with those prescribed by clinicians. The inclusion of additional state variables describing more about an individual's health could further improve this learning process. Ultimately, the use of such reinforcement learning methods could be coupled with a clinician's knowledge for enhanced treatment design.
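    To make the batch setting concrete, here is a minimal sketch of fitted Q-iteration with extremely randomized trees (assuming scikit-learn is available); the toy batch of (s, a, r, s') transitions and the two-action set are hypothetical stand-ins for logged clinical-trial data.

        import numpy as np
        from sklearn.ensemble import ExtraTreesRegressor

        rng = np.random.default_rng(3)
        actions = np.array([0.0, 1.0])        # e.g. withhold / administer a drug
        n, gamma = 500, 0.95

        S = rng.normal(size=(n, 2))                   # toy states
        A = rng.choice(actions, size=(n, 1))          # logged actions
        R = -np.abs(S[:, 0] - A[:, 0])                # toy reward
        S2 = S + rng.normal(scale=0.1, size=S.shape)  # toy next states

        q = None
        for _ in range(20):                           # fitted Q-iteration loop
            if q is None:
                target = R                            # first iteration: Q1 = r
            else:
                # Bellman target: r + gamma * max over the finite action set
                qs = [q.predict(np.hstack([S2, np.full((n, 1), a)]))
                      for a in actions]
                target = R + gamma * np.max(qs, axis=0)
            q = ExtraTreesRegressor(n_estimators=50, random_state=0)
            q.fit(np.hstack([S, A]), target)

        scores = [q.predict(np.hstack([S[:1], [[a]]]))[0] for a in actions]
        print("greedy action for first state:", actions[int(np.argmax(scores))])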