Uncertainty-of-Information Scheduling: A Restless Multi-armed Bandit Framework
This paper proposes using the uncertainty of information (UoI), measured by
Shannon's entropy, as a metric for information freshness. We consider a system
in which a central monitor observes multiple binary Markov processes through a
communication channel. The UoI of a Markov process corresponds to the monitor's
uncertainty about its state. At each time step, only one Markov process can be
selected to update its state to the monitor; hence there is a tradeoff among
the UoIs of the processes, which depends on the scheduling policy used to
select the process to be updated. The age of information (AoI) of a process
corresponds to the time since its last update. In general, the associated UoI
can be a non-monotonic, or even an oscillating, function of its AoI,
making the scheduling problem particularly challenging. This paper investigates
scheduling policies that aim to minimize the average sum-UoI of the processes
over the infinite time horizon. We formulate the problem as a restless
multi-armed bandit (RMAB) problem, and develop a Whittle index policy that is
near-optimal for the RMAB after proving its indexability. We further provide an
iterative algorithm to compute the Whittle index for the practical deployment
of the policy. Although this paper focuses on UoI scheduling, our results apply
to a general class of RMABs for which the UoI scheduling problem is a special
case. Specifically, this paper's Whittle index policy is valid for any RMAB in
which the bandits are binary Markov processes and the penalty is a concave
function of the belief state of the Markov process. Numerical results
demonstrate the excellent performance of the Whittle index policy for this
class of RMABs.
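To make the single-arm structure concrete, here is a minimal sketch of how such a Whittle index could be computed for one binary Markov arm with an entropy penalty. The transition probabilities, belief grid, discounting (used as a stand-in for the paper's average-cost criterion), and the exact timing of penalties are all illustrative assumptions, not the paper's formulation.

```python
import numpy as np

# Transition probabilities of the binary chain (assumed values):
# p = P(0 -> 1), q = P(1 -> 0).
p, q = 0.2, 0.3
GRID = np.linspace(0.0, 1.0, 201)   # discretized belief of being in state 1
BETA = 0.99                         # discount, a proxy for the average-cost setup

def drift(b):
    """One-step belief propagation when the arm is not updated."""
    return b * (1.0 - q) + (1.0 - b) * p

def uoi(b):
    """UoI penalty: Shannon entropy of the belief (a concave function)."""
    b = np.clip(b, 1e-12, 1.0 - 1e-12)
    return -(b * np.log2(b) + (1.0 - b) * np.log2(1.0 - b))

def solve_single_arm(subsidy, iters=3000, tol=1e-9):
    """Value iteration on the subsidized single-arm problem.

    Passive: pay the entropy penalty minus the subsidy; the belief drifts.
    Active: the state is revealed, so the belief jumps to 1 - q (saw 1)
    or p (saw 0); we adopt the convention that no penalty is paid in the
    update slot, since the uncertainty is resolved.
    """
    V = np.zeros_like(GRID)
    for _ in range(iters):
        passive = uoi(GRID) - subsidy + BETA * np.interp(drift(GRID), GRID, V)
        active = BETA * (GRID * np.interp(1.0 - q, GRID, V)
                         + (1.0 - GRID) * np.interp(p, GRID, V))
        V_new = np.minimum(passive, active)
        if np.max(np.abs(V_new - V)) < tol:
            return passive, active
        V = V_new
    return passive, active

def whittle_index(b, lo=-2.0, hi=2.0, tol=1e-4):
    """Bisect on the subsidy at which both actions tie at belief b.
    The bracket [lo, hi] is an assumption on the index range."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        passive, active = solve_single_arm(mid)
        if np.interp(b, GRID, passive) > np.interp(b, GRID, active):
            lo = mid          # passive still dominated: raise the subsidy
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A scheduler built on this would, at each slot, update the process whose current belief carries the largest index.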
Policy iteration for perfect information stochastic mean payoff games with bounded first return times is strongly polynomial
Recent results of Ye and of Hansen, Miltersen, and Zwick show that policy
iteration for one- or two-player (perfect information) zero-sum stochastic
games, restricted to instances with a fixed discount rate, is strongly
polynomial. We show that policy iteration for mean-payoff zero-sum stochastic
games is also strongly polynomial when restricted to instances with bounded
first mean return time to a given state. The proof is based on methods of
nonlinear Perron-Frobenius theory, allowing us to reduce the mean-payoff
problem to a discounted problem with state dependent discount rate. Our
analysis also shows that policy iteration remains strongly polynomial for
discounted problems in which the discount rate can be state dependent (and even
negative) at certain states, provided that the spectral radii of the
nonnegative matrices associated to all strategies are bounded from above by a
fixed constant strictly less than 1.
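For reference, the one-player discounted case underlying these results looks as follows; this is a generic sketch of Howard's policy iteration with illustrative data shapes (P an action-indexed stack of transition matrices, c the costs), not the paper's two-player setting or its state-dependent discount rates.

```python
import numpy as np

def policy_iteration(P, c, gamma):
    """Howard's policy iteration for a one-player discounted MDP.

    P: (n_actions, n_states, n_states) transition matrices,
    c: (n_actions, n_states) immediate costs, gamma: fixed discount < 1.
    """
    n_states = P.shape[1]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Evaluation: solve (I - gamma * P_pi) V = c_pi exactly.
        P_pi = P[policy, np.arange(n_states), :]
        c_pi = c[policy, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
        # Improvement: one-step greedy lookahead.
        new_policy = (c + gamma * P @ V).argmin(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```

The paper's reduction replaces the scalar gamma by a state-dependent rate; the strongly polynomial bound then rests on keeping the spectral radii of the nonnegative matrices associated to all strategies uniformly below 1.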
Illustrated review of convergence conditions of the value iteration algorithm and the rolling horizon procedure for average-cost MDPs
This paper is concerned with the links between the Value Iteration algorithm and the Rolling Horizon procedure for solving problems of stochastic optimal control under the long-run average criterion, in Markov decision processes with finite state and action spaces. We review conditions from the literature which imply the geometric convergence of Value Iteration to the optimal value. Aperiodicity is an essential prerequisite for convergence. We prove that the convergence of Value Iteration generally implies that of Rolling Horizon. We also present a modified Rolling Horizon procedure that can be applied to models without analyzing periodicity, and discuss the impact of this transformation on convergence. We illustrate the different results with numerous examples. Finally, we discuss rules for stopping Value Iteration or choosing the length of a Rolling Horizon, and provide an example that demonstrates the difficulty of the question, disproving in particular a conjectured rule proposed by Puterman.
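As a concrete companion to the review, here is a minimal sketch of the three ingredients discussed: relative value iteration for the average-cost criterion, the standard aperiodicity (lazy) transformation, and a rolling-horizon controller. The data layout and the parameter tau are illustrative assumptions.

```python
import numpy as np

def relative_value_iteration(P, c, iters=100_000, tol=1e-10):
    """Value iteration normalized at a reference state (index 0).

    Under aperiodicity/unichain-type conditions of the kind reviewed in
    the paper, the iterates converge geometrically and Th[0] approaches
    the optimal average cost.
    P: (n_actions, n_states, n_states), c: (n_actions, n_states).
    """
    h = np.zeros(P.shape[1])
    for _ in range(iters):
        Th = (c + P @ h).min(axis=0)   # one Bellman backup
        h_new = Th - Th[0]             # renormalize at the reference state
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    return Th[0], h_new                # (average cost, relative values)

def aperiodicity_transform(P, tau=0.5):
    """Lazy transformation: with probability 1 - tau, stay put. It
    preserves stationary distributions, hence average costs and optimal
    policies, while making every induced chain aperiodic."""
    return tau * P + (1.0 - tau) * np.eye(P.shape[1])

def rolling_horizon_action(P, c, state, horizon):
    """First action of an optimal finite-horizon policy from `state`."""
    V = np.zeros(P.shape[1])
    for _ in range(horizon - 1):
        V = (c + P @ V).min(axis=0)
    return int((c + P @ V)[:, state].argmin())
```

Applying aperiodicity_transform before planning plausibly mimics the modified procedure: periodicity then never has to be checked, at the price of slower mixing for small tau.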
Fast Reinforcement Learning for Energy-Efficient Wireless Communications
We consider the problem of energy-efficient point-to-point transmission of
delay-sensitive data (e.g. multimedia data) over a fading channel. Existing
research on this topic utilizes either physical-layer centric solutions, namely
power-control and adaptive modulation and coding (AMC), or system-level
solutions based on dynamic power management (DPM); however, there is currently
no rigorous and unified framework for simultaneously utilizing both
physical-layer centric and system-level techniques to achieve the minimum
possible energy consumption, under delay constraints, in the presence of
stochastic and a priori unknown traffic and channel conditions. In this report,
we propose such a framework. We formulate the stochastic optimization problem
as a Markov decision process (MDP) and solve it online using reinforcement
learning. The advantages of the proposed online method are that (i) it does not
require a priori knowledge of the traffic arrival and channel statistics to
determine the jointly optimal power-control, AMC, and DPM policies; (ii) it
exploits partial information about the system so that less information needs to
be learned than when using conventional reinforcement learning algorithms; and
(iii) it obviates the need for action exploration, which severely limits the
adaptation speed and run-time performance of conventional reinforcement
learning algorithms. Our results show that the proposed learning algorithms can
converge up to two orders of magnitude faster than a state-of-the-art learning
algorithm for physical-layer power control and up to three orders of magnitude
faster than conventional reinforcement learning algorithms.
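The exploration-free behavior comes from separating what is known from what is not. Below is a toy illustration of that post-decision-state idea on a single buffer: the holding and transmission costs and the action's effect on the buffer are known, while arrivals are only ever sampled. The cost model, Poisson arrivals, and all constants are invented for this sketch and are not the report's joint power-control/AMC/DPM formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, X = 20, 4                 # buffer capacity and max transmissions per slot
GAMMA, ALPHA = 0.98, 0.05    # discount and learning rate
V = np.zeros(B + 1)          # value table indexed by the post-decision buffer

def greedy(b):
    """Known holding cost b and energy cost x**2 plus learned future value.
    Because the minimization uses only known quantities and the learned
    table, no action exploration is needed."""
    return min(range(min(b, X) + 1),
               key=lambda x: b + x**2 + GAMMA * V[b - x])

b = 0
for _ in range(200_000):
    x = greedy(b)
    post = b - x                                # known post-decision state
    b_next = min(post + rng.poisson(1.5), B)    # exogenous arrivals, sampled
    # Back up the sampled value of the observed exogenous transition.
    x_next = greedy(b_next)
    target = b_next + x_next**2 + GAMMA * V[b_next - x_next]
    V[post] += ALPHA * (target - V[post])
    b = b_next
```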
Stochastic Shortest Path with Energy Constraints in POMDPs
We consider partially observable Markov decision processes (POMDPs) with a
set of target states and positive integer costs associated with every
transition. The traditional optimization objective (stochastic shortest path)
asks to minimize the expected total cost until the target set is reached. We
extend the traditional framework of POMDPs to model energy consumption, which
represents a hard constraint. The energy levels may increase and decrease with
transitions, and the hard constraint requires that the energy level remain
positive at every step until the target is reached. First, we present a novel
algorithm for solving POMDPs with energy levels, building on existing POMDP
solvers and using RTDP as its main method. Our second contribution is related
to policy representation. For larger POMDP instances the policies computed by
existing solvers are too large to be understandable. We present an automated
procedure based on machine learning techniques that extracts the important
decisions of the policy, allowing us to compute succinct, human-readable
policies. Finally, we show experimentally that our algorithm performs
well and computes succinct policies on a number of POMDP instances from the
literature that were naturally enhanced with energy levels.
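The policy-compression step can be pictured as supervised learning: log (situation, action) pairs from the full policy and fit a shallow decision tree. The sketch below uses scikit-learn on synthetic logs; the features (remaining energy, beliefs) and the stand-in labeling rule are assumptions for illustration, not the paper's procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Hypothetical logged executions of the full policy:
# columns = [energy_level, belief_s0, belief_s1], label = chosen action.
X = rng.random((5000, 3))
X[:, 0] *= 10                         # energy level in [0, 10)
y = np.where(X[:, 0] < 2, 0,          # stand-in rule: recharge when energy low
             np.where(X[:, 1] > 0.5, 1, 2))

# A depth-limited tree yields a succinct, human-readable rule set.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["energy", "belief_s0", "belief_s1"]))
```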
Discrete-time controlled Markov processes with average cost criterion: a survey
This work is a survey of the average cost control problem for discrete-time Markov processes. The authors have attempted to put together a comprehensive account of the considerable research on this problem over the past three decades. The exposition ranges from finite to Borel state and action spaces and covers a variety of methodologies for finding and characterizing optimal policies. The authors include a brief historical perspective of the research efforts in this area, compile a substantial yet not exhaustive bibliography, and identify several important questions that are still open to investigation.
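At the heart of the surveyed theory is the average cost optimality equation; in the finite state and action case it reads

```latex
\rho^* + h(x) \;=\; \min_{a \in A(x)} \Big[\, c(x,a) + \sum_{y \in S} p(y \mid x, a)\, h(y) \Big], \qquad x \in S,
```

where $\rho^*$ is the optimal average cost and $h$ is the relative value (bias) function; any stationary policy attaining the minimum for every $x$ is average-cost optimal. Much of the surveyed literature concerns conditions under which a solution pair $(\rho^*, h)$ exists once the state and action spaces become Borel.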