Factored temporal difference learning in the New Ties environment
Although reinforcement learning is a popular method for training an agent to make decisions based on rewards, the well-studied tabular methods are not applicable to large, realistic problems. In this paper, we experiment with a factored version of temporal difference learning, which amounts to a linear function approximation scheme utilising natural features that arise from the structure of the task. We conducted experiments in the New Ties environment, a novel platform for multi-agent simulations. We show that learning with a factored representation is effective even in large state spaces; furthermore, it outperforms tabular methods even in smaller problems, in both learning speed and stability, because of its generalisation capabilities.
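As a loose illustration of the idea, here is a minimal sketch of a TD(0) update with linear function approximation over a factored (per-variable one-hot) feature vector; the feature map, state variables and hyperparameters below are placeholder assumptions, not the paper's New Ties setup.

```python
import numpy as np

# Minimal sketch: TD(0) with linear function approximation over factored
# features. Each state variable is one-hot encoded and the encodings are
# concatenated, so the number of features grows with the sum (not the
# product) of the variable ranges. All names and values are illustrative.

def phi(state, dims):
    """Concatenate one-hot encodings of each state variable (factored features)."""
    feats = []
    for value, dim in zip(state, dims):
        one_hot = np.zeros(dim)
        one_hot[value] = 1.0
        feats.append(one_hot)
    return np.concatenate(feats)

def td0_update(w, s, r, s_next, dims, alpha=0.1, gamma=0.95):
    """One TD(0) step on the weights w of the linear value estimate V(s) = w . phi(s)."""
    x, x_next = phi(s, dims), phi(s_next, dims)
    delta = r + gamma * (w @ x_next) - (w @ x)   # TD error
    return w + alpha * delta * x                 # semi-gradient update

# Usage: a toy state with two variables taking 5 and 3 values respectively.
dims = [5, 3]
w = np.zeros(sum(dims))
w = td0_update(w, s=(2, 1), r=1.0, s_next=(3, 0), dims=dims)
```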
An Online Approach to Dynamic Channel Access and Transmission Scheduling
Making judicious channel access and transmission scheduling decisions is essential for improving performance as well as energy and spectral efficiency in multichannel wireless systems. This problem has been a subject of extensive study in the past decade, and the resulting dynamic and opportunistic channel access schemes can bring potentially significant improvement over traditional schemes. However, a common and severe limitation of these dynamic schemes is that they almost always require some form of a priori knowledge of the channel statistics. A natural remedy is a learning framework, which has also been extensively studied in the same context, but a typical learning algorithm in this literature seeks only the best static policy, with performance measured by weak regret, rather than learning a good dynamic channel access policy. There is thus a clear disconnect between what an optimal channel access policy that actively exploits temporal, spatial and spectral diversity can achieve with known channel statistics, and what a typical existing learning algorithm aims for, which is the static use of a single channel devoid of diversity gain. In this paper we bridge this gap by designing learning algorithms that track known optimal or sub-optimal dynamic channel access and transmission scheduling policies, thereby yielding performance measured by a form of strong regret: the accumulated difference between the reward returned by an optimal solution when a priori information is available and that returned by our online algorithm. We do so in the context of two specific algorithms that appeared in [1] and [2], respectively, the former for a multiuser single-channel setting and the latter for a single-user multichannel setting. In both cases we show that our algorithms achieve sub-linear regret uniform in time and outperform the standard weak-regret learning algorithms.

Comment: 10 pages, to appear in MobiHoc 201
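To make the two regret notions concrete, here is a toy sketch contrasting weak regret (measured against the best static channel in hindsight) with the strong regret described above (measured against an optimal dynamic policy); the reward traces are invented and the tracked policies from [1] and [2] are not reproduced.

```python
import numpy as np

# Illustrative only: weak regret compares against the best single channel in
# hindsight, while strong regret compares against an optimal dynamic policy
# computed with known channel statistics. Reward traces below are made up.

def weak_regret(per_channel_rewards, learner_rewards):
    """Gap to the best static (single-channel) policy in hindsight."""
    best_static = max(np.sum(r) for r in per_channel_rewards)
    return best_static - np.sum(learner_rewards)

def strong_regret(optimal_dynamic_rewards, learner_rewards):
    """Accumulated gap to the optimal dynamic policy with a priori statistics."""
    return np.sum(np.asarray(optimal_dynamic_rewards) - np.asarray(learner_rewards))

# Usage with toy traces over T = 4 slots and two channels.
channels = [np.array([1, 0, 1, 0]), np.array([0, 1, 0, 1])]   # per-channel rewards
optimal = np.array([1, 1, 1, 1])        # a dynamic policy that switches channels
learner = np.array([1, 0, 0, 1])        # rewards collected by the online learner
print(weak_regret(channels, learner), strong_regret(optimal, learner))
```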
The application of temporal difference learning in optimal diet models
An experience-based aversive learning model of foraging behaviour in uncertain environments is presented. We use Q-learning as a model-free implementation of temporal difference learning, motivated by growing evidence for its neural correlates in natural reinforcement settings. The predator has the choice of including aposematic prey in its diet or foraging on alternative food sources. We show how the predator's foraging behaviour and energy intake depend on the toxicity of the defended prey and the presence of Batesian mimics. We introduce exploration of the action space as a precondition for successful aversion formation and show how it predicts foraging behaviour in the presence of conflicting rewards, behaviour that is conditionally suboptimal in a fixed environment but allows better adaptation in changing environments.
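A rough sketch of the kind of model-free Q-learning rule described above, reduced to a single foraging state with two actions; the reward, toxicity and mimic-frequency values are illustrative assumptions rather than the model's calibrated parameters.

```python
import random

# Toy sketch: a predator repeatedly chooses between attacking aposematic prey
# and foraging on an alternative food source, updating action values with a
# model-free TD (Q-learning style) rule. Numbers below are illustrative only.

def reward(action, toxicity=-2.0, mimic_prob=0.3):
    if action == "alternative":
        return 0.5                                   # safe but modest energy intake
    # With probability mimic_prob the prey is a harmless Batesian mimic;
    # otherwise the defended prey imposes a toxin cost on top of its energy value.
    return 1.0 if random.random() < mimic_prob else 1.0 + toxicity

def learn(episodes=5000, alpha=0.1, epsilon=0.1):
    q = {"attack": 0.0, "alternative": 0.0}
    for _ in range(episodes):
        # Epsilon-greedy exploration: without it the aversion may never form,
        # mirroring the exploration precondition discussed in the abstract.
        if random.random() < epsilon:
            action = random.choice(list(q))
        else:
            action = max(q, key=q.get)
        q[action] += alpha * (reward(action) - q[action])
    return q

print(learn())   # with these numbers the learner comes to prefer "alternative"
```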
Optimization Foundations of Reinforcement Learning
Reinforcement learning (RL) has attracted rapidly increasing interest in the machine learning and artificial intelligence communities in the past decade. With tremendous success already demonstrated for Game AI, RL offers great potential for applications in more complex, real-world domains, for example in robotics, autonomous driving and even drug discovery. Although researchers have devoted a lot of engineering effort to deploying RL methods at scale, many state-of-the-art RL techniques still seem mysterious, with limited theoretical guarantees on their behaviour in practice.
In this thesis, we focus on understanding convergence guarantees for two key ideas in reinforcement learning, namely temporal difference learning and policy gradient methods, from an optimization perspective. In Chapter 2, we provide a simple and explicit finite-time analysis of temporal difference (TD) learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. Our convergence results extend seamlessly to the study of TD learning with eligibility traces, known as TD(λ), and to Q-learning for a class of high-dimensional optimal stopping problems.
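For concreteness, the following sketch shows the TD(λ) update with linear function approximation and accumulating eligibility traces, the algorithm class analyzed in Chapter 2; the features and rewards here are random placeholders rather than any problem from the thesis.

```python
import numpy as np

# Sketch of TD(lambda) with linear function approximation and accumulating
# eligibility traces. Features and rewards are random stand-ins.

def td_lambda_step(w, z, x, r, x_next, alpha=0.05, gamma=0.99, lam=0.8):
    """One update of the weights w and eligibility trace z."""
    delta = r + gamma * (w @ x_next) - (w @ x)   # TD error
    z = gamma * lam * z + x                      # accumulate the trace
    w = w + alpha * delta * z                    # trace-weighted semi-gradient step
    return w, z

# Usage with placeholder features of dimension 8.
d = 8
rng = np.random.default_rng(0)
w, z = np.zeros(d), np.zeros(d)
x = rng.normal(size=d)
for _ in range(100):
    x_next, r = rng.normal(size=d), rng.normal()
    w, z = td_lambda_step(w, z, x, r, x_next)
    x = x_next
```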
In Chapter 3, we turn our attention to policy gradient methods and present a simple and general understanding of their global convergence properties. The main challenge here is that even for simple control problems, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point of the objective. We identify structural properties -- shared by finite MDPs and several classic control problems -- which guarantee that, despite non-convexity, any stationary point of the policy gradient objective is globally optimal. In the final chapter, we extend our analysis for finite MDPs to show linear convergence guarantees for many popular variants of policy gradient methods, such as projected policy gradient, Frank-Wolfe, mirror descent and natural policy gradient.
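As a small illustration of the setting, the sketch below parameterizes a tabular softmax policy on a made-up two-state MDP and performs gradient ascent on the start-state value, with the gradient approximated by finite differences; it illustrates the non-convex objective being discussed, not an algorithm from the thesis.

```python
import numpy as np

# Made-up 2-state, 2-action MDP: P[s, a, s'] transition probabilities and
# R[s, a] rewards. Values are illustrative only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

def policy(theta):
    """Tabular softmax policy pi[s, a] from logits theta[s, a]."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def value(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    P_pi = np.einsum("sa,sap->sp", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def gradient_step(theta, lr=0.5, eps=1e-5):
    """Ascent on the start-state value; gradient by finite differences (illustrative)."""
    grad = np.zeros_like(theta)
    base = value(policy(theta))[0]
    for idx in np.ndindex(*theta.shape):
        bumped = theta.copy()
        bumped[idx] += eps
        grad[idx] = (value(policy(bumped))[0] - base) / eps
    return theta + lr * grad

theta = np.zeros((2, 2))
for _ in range(200):
    theta = gradient_step(theta)
print(value(policy(theta)))   # value of the (near-)greedy policy found
```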
ADAPTIVE STEP-SIZES FOR REINFORCEMENT LEARNING
The central theme motivating this dissertation is the desire to develop reinforcement learning algorithms that “just work” regardless of the domain in which they are applied. The largest impediment to this goal is the sensitivity of reinforcement learning algorithms to the step-size parameter used to rescale incremental updates. Adaptive step-size algorithms attempt to reduce this sensitivity, or eliminate the step-size parameter entirely, by automatically adjusting the step size throughout the learning process. Such algorithms provide an alternative to the standard “guess-and-check” approach to finding parameters, known as parameter tuning.
However, the problems with parameter tuning are currently masked by the way experiments are conducted and presented. In this dissertation we seek algorithms that perform well over a broad subset of reinforcement learning problems with minimal parameter tuning. To accomplish this we begin by addressing the limitations of current empirical methods in reinforcement learning and propose improvements with benefits far outside the area of adaptive step-sizes.
In order to study adaptive step-sizes in reinforcement learning, we show that the general form of the adaptive step-size problem is a combination of two dissociable problems: adaptive scalar step-sizes and update whitening. We then derive new parameter-free adaptive scalar step-size algorithms for the reinforcement learning algorithm Sarsa(λ) and use our improved empirical methods to conduct a thorough experimental study of step-size algorithms in reinforcement learning. Our adaptive algorithms (VES and PARL2) both eliminate the need for a tunable step-size parameter and perform at least as well as Sarsa(λ) with an optimized step-size value. We conclude by developing natural temporal difference algorithms that provide an approximate solution to the update whitening problem and improve performance over their non-natural counterparts.
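As a rough illustration of removing the tunable step size, the sketch below runs tabular Sarsa(λ) on a toy chain task with a simple count-based step size (α = 1 / visit count); this parameter-free stand-in is not VES or PARL2, whose update rules are not given in this abstract.

```python
import random
from collections import defaultdict

# Toy chain task: 5 states in a row, action 1 moves right, action 0 moves
# left (clamped at 0); reaching the last state ends the episode with reward 1.
class ChainEnv:
    actions = [0, 1]
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, self.s + (1 if a == 1 else -1))
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done

def epsilon_greedy(Q, env, s, epsilon):
    if random.random() < epsilon:
        return random.choice(env.actions)
    best = max(Q[(s, a)] for a in env.actions)
    return random.choice([a for a in env.actions if Q[(s, a)] == best])

def sarsa_lambda(env, episodes=200, gamma=0.99, lam=0.9, epsilon=0.1):
    Q = defaultdict(float)
    N = defaultdict(int)                       # visit counts drive the step size
    for _ in range(episodes):
        z = defaultdict(float)                 # eligibility traces
        s = env.reset()
        a = epsilon_greedy(Q, env, s, epsilon)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, env, s2, epsilon)
            delta = r + (0.0 if done else gamma * Q[(s2, a2)]) - Q[(s, a)]
            z[(s, a)] += 1.0
            N[(s, a)] += 1
            for key, trace in z.items():
                alpha = 1.0 / N[key]           # count-based, parameter-free step size
                Q[key] += alpha * delta * trace
                z[key] = gamma * lam * trace
            s, a = s2, a2
    return Q

Q = sarsa_lambda(ChainEnv())
```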