3 research outputs found
Multi-armed Bandits with Constrained Arms and Hidden States
The problem of rested and restless multi-armed bandits with constrained
availability of arms is considered. The states of the arms evolve in a
Markovian manner, and the exact states are hidden from the decision maker.
First, some structural results on the value functions are established.
Following these results, the optimal policy turns out to be a threshold
policy. Further, indexability of the rested bandits is established and an
index formula is derived. The performance of the index policy is illustrated
and compared with the myopic policy using numerical examples.
Comment: 17 pages, 2 figures
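The threshold structure above can be sketched concretely: the decision maker tracks a belief over the hidden Markov state and plays the arm only when that belief crosses a threshold. A minimal sketch follows, where the transition matrix `P`, reward vector `r`, the threshold `tau`, and the assumption that playing reveals the state are all illustrative choices, not taken from the paper.

```python
import numpy as np

# Hypothetical two-state hidden-Markov arm.
P = np.array([[0.9, 0.1],    # illustrative state-transition matrix
              [0.3, 0.7]])
r = np.array([0.0, 1.0])     # reward when the arm is played in state 0 / 1

def belief_update(b, played, observed_state=None):
    """One-step Bayesian belief update for the hidden state.

    If the arm is played we assume (for illustration) that the state is
    revealed; otherwise the belief is simply propagated through P.
    """
    if played and observed_state is not None:
        b = np.eye(2)[observed_state]      # belief collapses to the observation
    return b @ P                           # predict the next-step belief

def threshold_policy(b, tau=0.5):
    """Play the arm iff the belief of being in the good state exceeds tau."""
    return b[1] >= tau
```

Under such a policy, the only decision-relevant statistic is the scalar belief of being in the good state, which is what makes the threshold characterization useful.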
Constrained Restless Bandits for Dynamic Scheduling in Cyber-Physical Systems
This paper studies a class of constrained restless multi-armed bandits
(CRMAB). The constraints are in the form of a time-varying set of actions (the
set of available arms). This variation can be either stochastic or semi-deterministic.
Given a set of arms, a fixed number of them can be chosen to be played in each
decision interval. Playing an arm yields a state-dependent reward. The current
states of the arms are partially observable through binary feedback signals
from the arms that are played. The current availability of arms is fully
observable. The objective is to maximize the long-term cumulative reward. The
uncertainty about the future availability of arms, along with partial state
information, makes this objective challenging. Applications of CRMAB abound in
the domain of cyber-physical systems. First, this optimization problem is
analyzed using Whittle's index policy. To this end, a constrained restless
single-armed bandit is studied. It is shown to admit a threshold-type optimal
policy and is also indexable. An algorithm to compute Whittle's index is
presented. An alternate solution method with lower complexity is also presented
in the form of an online rollout policy. Further, upper bounds on the value
function are derived in order to estimate the degree of sub-optimality of
various solutions. The simulation study compares the performance of the
Whittle index, online rollout, myopic, and modified Whittle index policies.
Comment: 14 pages, 2 figures
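The Whittle index of a single arm is the passivity subsidy at which playing and resting become equally attractive, which suggests computing it by bisection on the subsidy. The sketch below uses a deliberately crude one-step (myopic) value model in place of the paper's dynamic program; `P`, `r`, and the bracketing interval are illustrative assumptions.

```python
import numpy as np

# Hypothetical two-state arm (not the paper's model).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
r = np.array([0.0, 1.0])

def q_active(b):
    """Myopic value of playing: expected immediate reward."""
    return b @ r

def q_passive(b, W):
    """Myopic value of resting: the subsidy W for passivity."""
    return W

def whittle_index(b, lo=-1.0, hi=2.0, iters=50):
    """Bisection on W until active and passive values balance.

    The returned W is the indifference subsidy, i.e. the (myopic)
    Whittle-type index of belief b.
    """
    for _ in range(iters):
        W = 0.5 * (lo + hi)
        if q_active(b) > q_passive(b, W):
            lo = W               # subsidy too small: playing still preferred
        else:
            hi = W               # subsidy large enough: resting preferred
    return 0.5 * (lo + hi)
```

With the myopic model the index reduces to the expected immediate reward; the paper's algorithm replaces `q_active`/`q_passive` with the full constrained dynamic program, but the bisection skeleton is the same idea.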
Sequential Decision Making with Limited Observation Capability: Application to Wireless Networks
This work studies a generalized class of restless multi-armed bandits with
hidden states that allows cumulative feedback, as opposed to the conventional
instantaneous feedback. We call these lazy restless bandits (LRBs), as the
events of decision-making are sparser than the events of state transition.
Hence, the feedback after each decision event is the cumulative effect of the
intervening state
transition events. The states of arms are hidden from the decision-maker and
rewards for actions are state-dependent. The decision-maker must choose one
arm in each decision interval so that the long-term cumulative reward is
maximized.
As the states are hidden, the decision-maker maintains and updates its belief
about them. It is shown that LRBs admit an optimal policy with a threshold
structure in the belief space. The Whittle-index policy for solving the LRB
problem is analyzed, and indexability of LRBs is shown. Further, closed-form
index expressions are provided for two sets of special cases; for more general
cases, an algorithm for index computation is provided. An extensive simulation
study is presented in which the Whittle-index, modified Whittle-index, and
myopic policies are compared. A Lagrangian relaxation of the problem provides
an upper bound on the optimal value function; it is used to assess the degree
of sub-optimality of the various policies.
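The "lazy" aspect has a simple belief-space interpretation: if the hidden state makes k Markov transitions between two consecutive decision epochs, the belief must be propagated through the k-step transition matrix before the next decision. A minimal sketch, assuming a hypothetical two-state transition matrix `P`:

```python
import numpy as np

# Hypothetical transition matrix for one arm's hidden two-state chain.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

def lazy_belief_update(b, k):
    """Propagate a belief through k state transitions between decisions.

    In an LRB, decision events are sparser than state transitions, so the
    belief evolves through P^k (k >= 1) rather than a single step of P.
    """
    return b @ np.linalg.matrix_power(P, k)
```

For k = 1 this reduces to the usual one-step restless-bandit belief update, which is why LRBs generalize the conventional instantaneous-feedback setting.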