4,544 research outputs found

    Reinforcement Learning: A Survey

    Full text link
    This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
    Comment: See http://www.jair.org/ for any accompanying file
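    The survey's core thread, trial-and-error learning from delayed reinforcement while balancing exploration and exploitation, is captured by tabular Q-learning with an ε-greedy policy. Below is a minimal illustrative sketch, not code from the paper; the chain MDP, learning rate, and other parameters are all hypothetical.

```python
import random

# Toy chain MDP (hypothetical): states 0..4, reward only on reaching state 4.
N_STATES = 5
ACTIONS = [-1, +1]   # move left or right along the chain
EPSILON = 0.1        # probability of exploring a random action
ALPHA = 0.5          # learning rate
GAMMA = 0.9          # discount factor

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Deterministic toy dynamics; reward arrives only at the right end."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def choose_action(state):
    """Epsilon-greedy: explore with probability EPSILON, else exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        a = choose_action(s)
        s2, r = step(s, a)
        # Q-learning update: bootstrap on the best action in the next state,
        # propagating delayed reward back along the chain.
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2
```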

    Certified Reinforcement Learning with Logic Guidance

    Full text link
    This paper proposes the first model-free Reinforcement Learning (RL) framework to synthesise policies for unknown, continuous-state Markov Decision Processes (MDPs), such that a given linear temporal property is satisfied. We convert the given property into a Limit Deterministic Büchi Automaton (LDBA), namely a finite-state machine expressing the property. Exploiting the structure of the LDBA, we shape a synchronous reward function on the fly, so that an RL algorithm can synthesise a policy resulting in traces that probabilistically satisfy the linear temporal property. This probability (certificate) is also calculated in parallel with policy learning when the state space of the MDP is finite: as such, the RL algorithm produces a policy that is certified with respect to the property. Under the assumption of a finite state space, theoretical guarantees are provided on the convergence of the RL algorithm to an optimal policy maximising the above probability. We also show that our method produces "best available" control policies when the logical property cannot be satisfied. In the general case of a continuous state space, we propose a neural network architecture for RL and empirically show that the algorithm finds satisfying policies, if such policies exist. The performance of the proposed framework is evaluated via a set of numerical examples and benchmarks, where we observe an improvement of one order of magnitude in the number of iterations required for policy synthesis, compared to existing approaches where available.
    Comment: This article draws from arXiv:1801.08099, arXiv:1809.0782
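    The mechanism the abstract describes, running the MDP and the LDBA in lockstep and shaping a reward as the automaton makes accepting progress, can be sketched schematically. The two-state automaton below is a hypothetical LDBA for the property "eventually goal" (F goal), and the reward rule is a deliberate simplification of the paper's on-the-fly shaping, not its actual construction.

```python
# Hypothetical two-state LDBA for "F goal":
# (automaton_state, label) -> (next_automaton_state, is_accepting_transition)
LDBA = {
    (0, "goal"):  (1, True),
    (0, "other"): (0, False),
    (1, "goal"):  (1, True),
    (1, "other"): (1, True),
}

def shaped_reward(q, label):
    """Synchronous reward: 1 on accepting LDBA transitions, else 0."""
    q_next, accepting = LDBA[(q, label)]
    return q_next, (1.0 if accepting else 0.0)

# Inside any RL loop over the product state (mdp_state, q):
#   label = labelling_function(mdp_state)   # hypothetical helper mapping
#                                           # MDP states to atomic propositions
#   q, r_logic = shaped_reward(q, label)
#   ... feed r_logic to the learner in place of an environment reward ...
```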

    Energy-Efficient Transmission Scheduling with Strict Underflow Constraints

    Full text link
    We consider a single source transmitting data to one or more receivers/users over a shared wireless channel. Due to random fading, the wireless channel conditions vary with time and from user to user. Each user has a buffer to store received packets before they are drained. At each time step, the source determines how much power to use for transmission to each user. The source's objective is to allocate power in a manner that minimizes an expected cost measure, while satisfying strict buffer underflow constraints and a total power constraint in each slot. The expected cost measure is composed of costs associated with power consumption from transmission and packet holding costs. The primary application motivating this problem is wireless media streaming. For this application, the buffer underflow constraints prevent the user buffers from emptying, so as to maintain playout quality. In the case of a single user with linear power-rate curves, we show that a modified base-stock policy is optimal under the finite horizon, infinite horizon discounted, and infinite horizon average expected cost criteria. For a single user with piecewise-linear convex power-rate curves, we show that a finite generalized base-stock policy is optimal under all three expected cost criteria. We also present the sequences of critical numbers that complete the characterization of the optimal control laws in each of these cases when some additional technical conditions are satisfied. We then analyze the structure of the optimal policy for the case of two users. We conclude with a discussion of methods to identify implementable near-optimal policies for the most general case of M users.
    Comment: 109 pages, 11 PDF figures, template.tex is main file. We have significantly revised the paper from version 1. Additions include the case of a single receiver with piecewise-linear convex power-rate curves, the case of two receivers, and the infinite horizon average expected cost problem.
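    A base-stock policy has a simple per-slot form: top the receiver's buffer up toward a critical level whenever it falls short, subject to the power budget. The sketch below is one reading of that rule for the single-user, linear power-rate case; the function, its parameters (b_star, gain, power_cap), and the toy usage values are assumptions for illustration, not the authors' specification of the modified base-stock policy.

```python
def base_stock_transmit(buffer_level, drain, b_star, gain, power_cap):
    """Return (packets_sent, power_used) for one slot.

    Send just enough to raise the post-drain buffer to the base-stock
    level b_star, capped by the per-slot power budget. With a linear
    power-rate curve, power = packets / gain.
    """
    shortfall = max(b_star - (buffer_level - drain), 0)  # gap to b_star
    affordable = int(power_cap * gain)                   # cap from power budget
    sent = min(shortfall, affordable)
    return sent, sent / gain

# Toy usage: buffer holds 3 packets, playout drains 2 this slot,
# target level 5, channel gain 2 packets per unit power, budget 1.5.
sent, power = base_stock_transmit(buffer_level=3, drain=2,
                                  b_star=5, gain=2.0, power_cap=1.5)
```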

    Resource management in QoS-aware wireless cellular networks

    Get PDF
    Summer 2011. Includes bibliographical references.
    Emerging broadband wireless networks that support high-speed packet data with heterogeneous quality of service (QoS) requirements demand more flexible and efficient use of the scarce spectral resource. Opportunistic scheduling exploits the time-varying, location-dependent channel conditions to achieve multiuser diversity. In this work, we study two types of resource allocation problems in QoS-aware wireless cellular networks. First, we develop a rigorous framework to study opportunistic scheduling in multiuser OFDM systems. We derive optimal opportunistic scheduling policies under three common QoS/fairness constraints for multiuser OFDM systems: temporal fairness, utilitarian fairness, and minimum-performance guarantees. To implement these optimal policies efficiently, we provide a modified Hungarian algorithm and a simple suboptimal algorithm. We then propose a generalized opportunistic scheduling framework that incorporates multiple mixed QoS/fairness constraints, including both lower and upper bound constraints. Next, taking input queues and channel memory into consideration, we reformulate the transmission scheduling problem as a new class of Markov decision processes (MDPs) with fairness constraints. We investigate the throughput maximization and the delay minimization problems in this context. We study two categories of fairness constraints, namely temporal fairness and utilitarian fairness. We consider two criteria: infinite horizon expected total discounted reward and expected average reward. We derive and prove explicit dynamic programming equations for the above constrained MDPs, and characterize optimal scheduling policies based on those equations. An attractive feature of our proposed schemes is that they can easily be extended to fit different objective functions and other fairness measures. Although we only focus on uplink scheduling, the scheme is equally applicable to the downlink case. Furthermore, we develop an efficient approximation method, temporal fair rollout, to reduce the computational cost…
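    Opportunistic scheduling under a temporal-fairness constraint is commonly implemented by serving, in each slot, the user whose instantaneous rate plus a per-user offset is largest, with the offsets tuned so that each user receives its guaranteed time share. The sketch below illustrates that offset-adjusted argmax rule under a toy fading model; the rate distribution and offset values are hypothetical, and this is not the thesis' modified Hungarian or temporal fair rollout algorithm.

```python
import random

# Hypothetical per-user offsets, tuned offline so that each user's
# long-run fraction of served slots meets its time-share target.
LAMBDAS = {0: 0.0, 1: 0.3, 2: 0.1}

def schedule(rates, lambdas=LAMBDAS):
    """Serve the user with the largest offset-adjusted instantaneous rate."""
    return max(rates, key=lambda u: rates[u] + lambdas[u])

# One slot: instantaneous rates drawn from a toy fading model.
rates = {u: random.expovariate(1.0) for u in LAMBDAS}
served = schedule(rates)
```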