4 research outputs found
A Hidden Markov Restless Multi-armed Bandit Model for Playout Recommendation Systems
We consider a restless multi-armed bandit (RMAB) in which there are two types
of arms, say A and B. Each arm can be in one of two states, say 0 or 1.
Playing a type A arm brings it to state 0 with probability one, and not
playing it induces state transitions with arm-dependent probabilities.
Playing a type B arm brings it to state 1 with probability one, and not
playing it induces state transitions that depend on the transition
probabilities of the arm. Further, the play
of an arm generates a unit reward with a probability that depends on the state
of the arm. The belief about the state of the arm can be calculated using a
Bayesian update after every play. This RMAB has been designed for use in
recommendation systems where the user's preferences depend on the history of
recommendations. This RMAB can also be used in applications such as the
creation of playlists or the placement of advertisements. In this paper we
formulate the long-term reward maximization problem as an infinite-horizon
discounted reward problem and as an average reward problem. We analyse the
RMAB by first studying the discounted reward scenario. We show that it is
Whittle-indexable and then obtain a closed-form expression for the Whittle
index of each arm, calculated from the belief about its state and the
parameters that describe the arm. We next analyse the average reward problem
using the vanishing discount approach and derive a closed-form expression for
the Whittle index. For an RMAB to be useful in practice, we need to be able
to learn the parameters of the arms. We present an algorithm derived from the
Thompson sampling scheme that learns the parameters of the arms, and we
illustrate its performance numerically.
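As an illustration of the Bayesian belief update mentioned above, here is a minimal sketch of a Bayes filter for a single two-state hidden Markov arm. The names (pi, P, rho) and the generic filtering form are assumptions for illustration, not the paper's exact recursion; in the paper, playing a type A or type B arm additionally resets the state deterministically.

    import numpy as np

    def belief_update(pi, reward, P, rho):
        """One-step Bayes filter for a two-state hidden Markov arm after a play.

        pi     : prior belief Pr(state = 1) before the play
        reward : observed binary reward from the play (0 or 1)
        P      : 2x2 transition matrix as a numpy array, P[i, j] = Pr(i -> j)
        rho    : rho[i] = Pr(unit reward | arm played in state i)
        """
        rho = np.asarray(rho)
        like = rho if reward else 1.0 - rho                        # observation likelihoods
        post = like[1] * pi / (like[0] * (1 - pi) + like[1] * pi)  # Bayes' rule on the reward
        return post * P[1, 1] + (1 - post) * P[0, 1]               # one-step prediction

For example, belief_update(0.5, 1, np.array([[0.9, 0.1], [0.3, 0.7]]), [0.2, 0.8]) returns the updated probability that the arm is in state 1 at the next play.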
Constrained Restless Bandits for Dynamic Scheduling in Cyber-Physical Systems
This paper studies a class of constrained restless multi-armed bandits
(CRMAB). The constraints take the form of a time-varying set of actions (the
set of available arms); this variation can be either stochastic or
semi-deterministic.
Given a set of arms, a fixed number of them can be chosen to be played in each
decision interval. The play of each arm yields a state-dependent reward. The
current states of the arms are partially observable through binary feedback
signals from the arms that are played. The current availability of arms is
fully observable. The objective is to maximize the long-term cumulative
reward. The uncertainty about the future availability of arms, along with
partial state information, makes this objective challenging. Applications of
the CRMAB model abound in
the domain of cyber-physical systems. First, this optimization problem is
analyzed using Whittle's index policy. To this end, a constrained restless
single-armed bandit is studied. It is shown to admit a threshold-type optimal
policy and to be indexable. An algorithm to compute Whittle's index is
presented. An alternative solution method with lower complexity is also
presented in the form of an online rollout policy. Further, upper bounds on the value
function are derived in order to estimate the degree of sub-optimality of
various solutions. The simulation study compares the performance of Whittle's
index, online rollout, myopic, and modified Whittle's index policies.
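To make the index policy under availability constraints concrete, the following is a minimal sketch of the per-interval selection step; the function name and signature are illustrative, and computing the indices themselves is the harder problem the paper addresses.

    import numpy as np

    def select_arms(indices, available, m):
        """Constrained index-policy selection for one decision interval:
        among the currently available arms, play the m with the largest
        Whittle indices (computed at the current beliefs).

        indices   : per-arm index values, shape (num_arms,)
        available : boolean mask of arms available this interval
        m         : number of arms to play (assumed <= number available)
        """
        masked = np.where(available, indices, -np.inf)  # rule out unavailable arms
        return np.argsort(masked)[-m:][::-1]            # top-m available arms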
Sequential Decision Making with Limited Observation Capability: Application to Wireless Networks
This work studies a generalized class of restless multi-armed bandits with
hidden states that allows cumulative feedback, as opposed to the conventional
instantaneous feedback. We call them lazy restless bandits (LRBs), as the
events of decision-making are sparser than the events of state transition.
Hence, the feedback after each decision event is the cumulative effect of the
intervening state-transition events. The states of the arms are hidden from
the decision-maker, and the rewards for actions are state-dependent. The
decision-maker needs to choose one arm in each decision interval such that
the long-term cumulative reward is maximized.
As the states are hidden, the decision-maker maintains and updates its belief
about them. It is shown that LRBs admit an optimal policy that has a
threshold structure in the belief space. The Whittle-index policy for solving
the LRB problem is analyzed, and the indexability of LRBs is shown. Further,
closed-form index expressions are provided for two sets of special cases; for
more general cases, an algorithm for index computation is provided. An
extensive simulation study is presented in which the Whittle-index, modified
Whittle-index, and myopic policies are compared. A Lagrangian relaxation of
the problem provides an upper bound on the optimal value function; it is used
to assess the degree of sub-optimality of various policies.
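The "lazy" structure, in which several state transitions occur between consecutive decision epochs, can be illustrated by the belief-propagation step below; this is a minimal sketch under assumed names, not the paper's notation, and it omits the cumulative-reward likelihood that the full LRB belief update would also condition on.

    import numpy as np

    def propagate_belief(pi, P, k):
        """Propagate a two-state belief through k unobserved state
        transitions, as happens between the sparse decision epochs
        of a lazy restless bandit.

        pi : current belief Pr(state = 1)
        P  : 2x2 transition matrix as a numpy array
        k  : number of state-transition events until the next decision
        """
        b = np.array([1.0 - pi, pi])                  # belief as a row vector
        return (b @ np.linalg.matrix_power(P, k))[1]  # k-step prediction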
Simulation Based Algorithms for Markov Decision Processes and Multi-Action Restless Bandits
We consider multi-dimensional Markov decision processes and formulate a
long-term discounted reward optimization problem. Two simulation-based
algorithms, the Monte Carlo rollout policy and the parallel rollout policy,
are studied, and various properties of these policies are discussed. We next
consider a restless multi-armed bandit (RMAB) with a multi-dimensional state
space and multiple actions per arm. A standard RMAB has two actions for each
arm, whereas a multi-action RMAB has more than two actions for each arm. A
popular approach for RMABs is the Whittle-index-based heuristic policy.
Indexability is an important requirement for using an index-based policy;
based on it, an RMAB is classified as indexable or non-indexable.
Our interest is in the study of the Monte-Carlo rollout policy for both
indexable and non-indexable restless bandits. We first analyze a standard
indexable RMAB (the two-action model) and discuss an index-based policy
approach. We present an approximate index computation algorithm using the
Monte-Carlo rollout policy, whose convergence is shown using a two-timescale
stochastic approximation scheme. Later, we analyze multi-action indexable
RMABs and discuss the index-based policy approach. We also study
non-indexable RMABs, for both the standard and the multi-action model, using
the Monte-Carlo rollout policy.
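As a rough sketch of the Monte-Carlo rollout idea discussed above: estimate the value of each action by simulating trajectories that follow a fixed base policy, then act greedily on the estimates. The simulator and policy interfaces below are assumptions for illustration, not the paper's implementation.

    import numpy as np

    def mc_rollout_value(state, action, simulate, base_policy,
                         horizon=50, num_traj=100, gamma=0.95, seed=0):
        """Monte Carlo rollout estimate of the discounted value of taking
        `action` in `state` and thereafter following `base_policy`.

        simulate(state, action, rng) -> (next_state, reward) : model simulator
        base_policy(state) -> action                         : rolled-out heuristic
        """
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(num_traj):
            s, a, disc, ret = state, action, 1.0, 0.0
            for _ in range(horizon):
                s, r = simulate(s, a, rng)
                ret += disc * r
                disc *= gamma
                a = base_policy(s)  # follow the base policy thereafter
            total += ret
        return total / num_traj

The rollout policy then plays the action with the largest estimated value in the current state.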