5 research outputs found

    A Finite Time Analysis of Two Time-Scale Actor Critic Methods

    Full text link
    Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms: the actor uses the policy gradient to improve the policy, while the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods are largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods in the non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$, with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing a finite-time analysis and sample complexity bound for two time-scale actor-critic methods.
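    To illustrate the update structure the abstract refers to, the sketch below shows a minimal two time-scale actor-critic loop: the critic performs linear TD(0) updates on a faster time scale (larger step sizes), while the actor takes slower policy-gradient steps using the TD error as an advantage estimate. The environment interface, feature maps, and step-size schedules are hypothetical placeholders; the paper's exact assumptions and analysis are not reproduced here.

```python
# Minimal sketch of a two time-scale actor-critic loop.
# env, phi_sa (policy features), and phi_s (critic features) are
# hypothetical placeholders; env.step(a) is assumed to return
# (next_state, reward, done).
import numpy as np

def two_timescale_ac(env, phi_sa, phi_s, n_actions, d_pi, d_v,
                     T=100_000, gamma=0.99):
    theta = np.zeros(d_pi)   # actor parameters (softmax policy)
    w = np.zeros(d_v)        # critic parameters, V_w(s) = w @ phi_s(s)
    s = env.reset()
    for t in range(1, T + 1):
        alpha = 0.1 / t            # actor step size (slower time scale)
        beta = 0.5 / t ** 0.6      # critic step size (faster time scale)

        # Softmax policy with linear preferences theta @ phi_sa(s, a).
        prefs = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
        prefs -= prefs.max()                       # numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        a = np.random.choice(n_actions, p=probs)
        s_next, r, done = env.step(a)

        # Critic: TD(0) update of the linear value estimate.
        v = w @ phi_s(s)
        v_next = 0.0 if done else w @ phi_s(s_next)
        delta = r + gamma * v_next - v
        w += beta * delta * phi_s(s)

        # Actor: policy-gradient step using the TD error as advantage.
        grad_log_pi = phi_sa(s, a) - sum(probs[b] * phi_sa(s, b)
                                         for b in range(n_actions))
        theta += alpha * delta * grad_log_pi

        s = env.reset() if done else s_next
    return theta, w
```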

    Reinforcement Learning for Racecar Control

    Get PDF
    This thesis investigates the use of reinforcement learning to learn to drive a racecar in the simulated environment of the Robot Automobile Racing Simulator. Real-life race driving is known to be difficult for humans, and expert human drivers use complex sequences of actions. There are a large number of variables, some of which change stochastically and all of which may affect the outcome. This makes driving a promising domain for testing and developing Machine Learning techniques that have the potential to be robust enough to work in the real world. Therefore the principles of the algorithms from this work may be applicable to a range of problems. The investigation starts by finding a suitable data structure to represent the information learnt. This is tested using supervised learning. Reinforcement learning is added and roughly tuned, and the supervised learning is then removed. A simple tabular representation is found satisfactory; this avoids difficulties with more complex methods and allows the investigation to concentrate on the essentials of learning. Various reward sources are tested and a combination of three is found to produce the best performance. Exploration of the problem space is investigated. Results show exploration is essential, but controlling how much is done is also important. The learning episodes turn out to need to be very long, so the task is treated as continuous, using discounting to limit the size of the stored values. Eligibility traces are used with success to make the learning more efficient. The tabular representation is made more compact by hashing and more accurate by using smaller buckets. This slows the learning but produces better driving. The improvement given by a rough form of generalisation indicates that replacing the tabular method with a function approximator is warranted. These results show reinforcement learning can work within the Robot Automobile Racing Simulator, and lay the foundations for building a more efficient and competitive agent.
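    As a rough illustration of the kind of agent the abstract describes (a hashed, bucketed tabular representation with discounting and eligibility traces), the sketch below implements a SARSA(λ)-style learner over a hashed discretisation. The bucket widths, table size, step sizes, and exploration rate are illustrative assumptions, not values taken from the thesis.

```python
# Sketch of a tabular agent with eligibility traces over a hashed
# discretisation of a continuous state.  All constants are
# illustrative placeholders.
import numpy as np

class HashedSarsaLambda:
    def __init__(self, n_actions, bucket_widths, table_size=2**18,
                 alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.05):
        self.n_actions = n_actions
        self.widths = bucket_widths                   # one width per state variable
        self.table_size = table_size
        self.alpha, self.gamma, self.lam, self.eps = alpha, gamma, lam, epsilon
        self.q = np.zeros((table_size, n_actions))    # hashed Q-table
        self.e = np.zeros_like(self.q)                # eligibility traces

    def _index(self, state):
        # Discretise each state variable into a bucket, then hash the
        # bucket tuple into the (smaller) table.
        buckets = tuple(int(s // w) for s, w in zip(state, self.widths))
        return hash(buckets) % self.table_size

    def act(self, state):
        if np.random.rand() < self.eps:               # epsilon-greedy exploration
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.q[self._index(state)]))

    def update(self, s, a, r, s_next, a_next, done):
        i, j = self._index(s), self._index(s_next)
        target = r if done else r + self.gamma * self.q[j, a_next]
        delta = target - self.q[i, a]
        self.e[i, a] += 1.0                           # accumulating trace
        self.q += self.alpha * delta * self.e         # credit recently visited pairs
        self.e *= self.gamma * self.lam               # decay all traces
        if done:
            self.e[:] = 0.0
```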

    Convergence and divergence in standard and averaging reinforcement learning

    No full text
    Although tabular reinforcement learning (RL) methods have been proven to converge to an optimal policy, the combination of particular conventional reinforcement learning techniques with function approximators can lead to divergence. In this paper we show why off-policy RL methods combined with linear function approximators can lead to divergence. Furthermore, we analyze two different types of updates: standard and averaging RL updates. Although averaging RL methods will not diverge, we show that they can converge to the wrong value functions. In our experiments we compare standard to averaging value iteration (VI) with CMACs, and the results show that for small values of the discount factor averaging VI works better, whereas for large values of the discount factor standard VI performs better, although it does not always converge.
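    The following is a hedged sketch of the contrast the abstract draws, for a binary-feature approximator such as a CMAC: a standard semi-gradient update can overshoot the target when many features are active, whereas an averaging-style update normalises the correction over the active features so the new value stays a convex combination of the old value and the target. The exact update rules and experiments of the paper are not reproduced; this only illustrates the distinction.

```python
# Standard vs averaging-style updates for binary features (e.g. CMAC
# tilings).  Illustrative only; not the paper's exact formulation.
import numpy as np

def standard_update(w, phi, target, alpha):
    """Semi-gradient update: w <- w + alpha * (target - V(s)) * phi(s).
    With k active binary features, V(s) changes by alpha * k * error,
    which can overshoot the target for large k."""
    delta = target - w @ phi
    return w + alpha * delta * phi

def averaging_update(w, phi, target, alpha):
    """Averaging-style update: spread the correction over the active
    features, so V(s) moves by at most alpha * error and the new value
    is a convex combination of old value and target."""
    k = phi.sum()                      # number of active binary features
    delta = target - w @ phi
    return w + (alpha / max(k, 1)) * delta * phi

# Toy check with 8 tilings all active for one state.
phi = np.ones(8)
w = np.zeros(8)
print(standard_update(w, phi, target=1.0, alpha=0.5) @ phi)   # 4.0: overshoots
print(averaging_update(w, phi, target=1.0, alpha=0.5) @ phi)  # 0.5: bounded
```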

    Function Approximation For A Production And Storage Problem Under Uncertainty

    No full text
    In this work, we present an approximate value iteration algorithm for a production and storage model with multiple production stages and a single final product, subject to random demand. We use linear function approximation schemes in subsets of the state space and represent a few key states in a look-up table form. We obtain some promising results and perform sensitivity analysis with respect to the parameters of the algorithm for the benchmark problem studied. © 2005 IEEE.
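    A minimal sketch of the hybrid value representation the abstract describes: a linear approximation per region of the state space combined with an exact look-up table for a small set of key states, swept by approximate value iteration. The region partition, feature map, and backup model are hypothetical assumptions introduced only for illustration.

```python
# Sketch of approximate value iteration with a per-region linear
# approximator plus an exact look-up table for key states.  The
# partition, features, and backup operator are placeholders supplied
# by the caller; states are assumed hashable.
import numpy as np

class HybridValueFunction:
    def __init__(self, n_regions, n_features, key_states):
        self.w = np.zeros((n_regions, n_features))   # one linear model per region
        self.table = {s: 0.0 for s in key_states}    # exact values for key states

    def value(self, state, region, phi):
        if state in self.table:
            return self.table[state]
        return self.w[region] @ phi(state)

    def update(self, state, region, phi, target, alpha=0.1):
        if state in self.table:
            self.table[state] = target               # exact backup for key states
        else:
            delta = target - self.w[region] @ phi(state)
            self.w[region] += alpha * delta * phi(state)

def approximate_value_iteration(vf, states, region_of, phi, backup, sweeps=50):
    """Repeatedly apply one-step backups V(s) <- E[cost + gamma * V(s')],
    where the expectation under the demand model is computed by the
    caller-supplied `backup(s, vf)`."""
    for _ in range(sweeps):
        for s in states:
            target = backup(s, vf)
            vf.update(s, region_of(s), phi, target)
```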