On the Whittle Index for Restless Multi-armed Hidden Markov Bandits
We consider a restless multi-armed bandit in which each arm can be in one of
two states. When an arm is sampled, the state of the arm is not available to
the sampler. Instead, a binary signal with a known randomness that depends on
the state of the arm is available. No signal is available if the arm is not
sampled. An arm-dependent reward is accrued from each sampling. In each time
step, each arm changes state according to known transition probabilities which
in turn depend on whether the arm is sampled or not sampled. Since the state of
the arm is never visible and has to be inferred from the current belief and a
possible binary signal, we call this the hidden Markov bandit. Our interest is
in a policy to select the arm(s) in each time step that maximizes the infinite
horizon discounted reward. Specifically, we seek the use of Whittle's index in
selecting the arms. We first analyze the single-armed bandit and show that in
general, it admits an approximate threshold-type optimal policy when there is a
positive reward for the 'no-sample' action. We also identify several special
cases for which the threshold policy is indeed the optimal policy. Next, we
show that such a single-armed bandit also satisfies an approximate-indexability
property. For the case when the single-armed bandit admits a threshold-type
optimal policy, we perform the calculation of the Whittle index for each arm.
Numerical examples illustrate the analytical results.
Comment: Revised version, corrected a few typos
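The belief tracking that the abstract above relies on (the arm's state is never seen, only a binary signal when the arm is sampled) can be illustrated with a minimal sketch. It is not the paper's formulation: the names `belief_update`, `rho`, `p_sampled` and `p_idle` are hypothetical, and the ordering (the signal is observed first, then the state transitions) is an assumption made here for illustration.

```python
from typing import Optional, Tuple


def belief_update(
    belief: float,
    sampled: bool,
    signal: Optional[int],
    rho: Tuple[float, float],
    p_sampled: Tuple[float, float],
    p_idle: Tuple[float, float],
) -> float:
    """Return the next-step probability that a two-state arm is in state 1.

    belief    : current P(state = 1)
    rho       : (P(signal = 1 | state 0), P(signal = 1 | state 1)), assumed known
    p_sampled : (P(next = 1 | state 0), P(next = 1 | state 1)) when sampled
    p_idle    : same pair of probabilities when the arm is not sampled
    """
    if sampled:
        if signal not in (0, 1):
            raise ValueError("a sampled arm must return a binary signal")
        # Likelihood of the observed signal under each hidden state.
        like1 = rho[1] if signal == 1 else 1.0 - rho[1]
        like0 = rho[0] if signal == 1 else 1.0 - rho[0]
        denom = belief * like1 + (1.0 - belief) * like0
        if denom == 0.0:
            raise ValueError("signal has zero probability under this belief")
        posterior = belief * like1 / denom  # Bayes rule
        p01, p11 = p_sampled
    else:
        # No signal is available, so the belief is only propagated.
        posterior = belief
        p01, p11 = p_idle
    # Propagate the (possibly corrected) belief through the transition.
    return posterior * p11 + (1.0 - posterior) * p01
```

A threshold-type policy of the kind discussed in the abstract would then compare the returned belief against a cutoff to decide whether to sample the arm; the cutoff itself is not derived here.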
Learning in Restless Multi-Armed Bandits via Adaptive Arm Sequencing Rules
We consider a class of restless multi-armed bandit (RMAB) problems with
unknown arm dynamics. At each time, a player chooses an arm out of N arms to
play, referred to as an active arm, and receives a random reward from a finite
set of reward states. The reward state of the active arm evolves according to
unknown Markovian dynamics. The reward state of passive arms (which are not
chosen to play at time t) evolves according to an arbitrary unknown random
process. The objective is an arm-selection policy that minimizes the regret,
defined as the reward loss with respect to a player that always plays the most
rewarding arm. This class of RMAB problems has been studied recently in the
context of communication networks and financial investment applications. We
develop a strategy that selects arms to be played in a consecutive manner,
dubbed the Adaptive Sequencing Rules (ASR) algorithm. The sequencing rules for
selecting arms under the ASR algorithm are adaptively updated and controlled by
the current sample reward means. By judiciously designing the adaptive
sequencing rules, we show that the ASR algorithm achieves a logarithmic regret
order with time, and a finite-sample bound on the regret is established.
Although existing methods have shown a logarithmic regret order with time in
this RMAB setting, the theoretical analysis shows a significant improvement in
the regret scaling with respect to the system parameters under ASR. Extensive
simulation results support the theoretical study and demonstrate strong
performance of the algorithm as compared to existing methods.
Comment: A short version of this paper was presented at IEEE International
Symposium on Information Theory (ISIT) 201
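The regret notion used in the abstract above (reward lost relative to a player who always plays the most rewarding arm) can be made concrete with a small sketch. It does not implement the ASR algorithm, whose sequencing rules are not given in the abstract. The function name `cumulative_regret` is hypothetical, and the sketch assumes that each arm has a fixed long-run mean reward that is known to the evaluator.

```python
from typing import Sequence


def cumulative_regret(mean_rewards: Sequence[float],
                      chosen_arms: Sequence[int]) -> float:
    """Expected reward lost versus always playing the best arm.

    mean_rewards : assumed long-run mean reward of each arm (evaluator-side
                   knowledge; the learning player does not have it)
    chosen_arms  : index of the arm played at each time step
    """
    best = max(mean_rewards)
    # Each play of a suboptimal arm costs the gap to the best arm.
    return sum(best - mean_rewards[arm] for arm in chosen_arms)


if __name__ == "__main__":
    # Three arms; the player picks arms 0, 2, 1, 2 over four steps.
    print(cumulative_regret([0.2, 0.5, 0.9], [0, 2, 1, 2]))
```

A logarithmic regret order means this quantity grows proportionally to log T over a horizon of T plays, rather than linearly in T.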
Sequential Decision Making with Limited Observation Capability: Application to Wireless Networks
This work studies a generalized class of restless multi-armed bandits with
hidden states that allows cumulative feedback, as opposed to the conventional
instantaneous feedback. We call them lazy restless bandits (LRB) as the events
of decision-making are sparser than events of state transition. Hence, feedback
after each decision event is the cumulative effect of the following state
transition events. The states of arms are hidden from the decision-maker and
rewards for actions are state dependent. The decision-maker needs to choose one
arm in each decision interval so that the long-term cumulative reward is
maximized.
As the states are hidden, the decision-maker maintains and updates its belief
about them. It is shown that LRBs admit an optimal policy with a threshold
structure in belief space. The Whittle-index policy for solving the LRB problem is
analyzed; indexability of LRBs is shown. Further, closed-form index expressions
are provided for two sets of special cases; for more general cases, an
algorithm for index computation is provided. An extensive simulation study is
presented; Whittle-index, modified Whittle-index and myopic policies are
compared. Lagrangian relaxation of the problem provides an upper bound on the
optimal value function; it is used to assess the degree of sub-optimality
of various policies.
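The index policies compared in the abstract above share one selection step: compute a per-arm index from the current belief and play the arm with the largest index. The sketch below shows only that step. The names `select_arm` and `myopic_index` are hypothetical, and the myopic index (expected immediate reward under the belief, with assumed per-state rewards `r0` and `r1`) stands in for the Whittle index, whose closed-form expressions are not reproduced in the abstract.

```python
from typing import Callable, Sequence


def myopic_index(r0: float, r1: float) -> Callable[[float], float]:
    """Index = expected immediate reward when P(state = 1) is `belief`.

    r0, r1 are assumed rewards for playing the arm in state 0 and state 1.
    """
    return lambda belief: belief * r1 + (1.0 - belief) * r0


def select_arm(beliefs: Sequence[float],
               index_fns: Sequence[Callable[[float], float]]) -> int:
    """Play the arm whose index, evaluated at its current belief, is largest."""
    if len(beliefs) != len(index_fns):
        raise ValueError("need one index function per arm")
    scores = [fn(b) for fn, b in zip(index_fns, beliefs)]
    # Ties are broken in favour of the lowest arm number.
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    fns = [myopic_index(0.0, 1.0), myopic_index(0.0, 0.5)]
    print(select_arm([0.5, 0.9], fns))
```

Swapping in a different index function per arm changes the policy without touching the selection step, which is why myopic and index-based policies can be compared on the same simulation harness.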