Active Search with a Cost for Switching Actions
Active Sequential Hypothesis Testing (ASHT) is an extension of the classical
sequential hypothesis testing problem with controls. Chernoff (Ann. Math.
Statist., 1959) proposed a policy called Procedure A and showed its asymptotic
optimality as the cost of sampling was driven to zero. In this paper we study a
further extension where we introduce costs for switching of actions. We show
that a modification of Chernoff's Procedure A, one that we call Sluggish
Procedure A, is asymptotically optimal even with switching costs. The growth
rate of the total cost, as the probability of false detection is driven to
zero, and as a switching parameter of the Sluggish Procedure A is driven down
to zero, is the same as that without switching costs.
Comment: 8 pages. Presented at the 2015 Information Theory and Applications Workshop.
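The "sluggish" switching idea can be illustrated with a toy binary-hypothesis test in which the current action is re-chosen only with a small probability eps (the switching parameter). This is a hypothetical simplification for two Bernoulli-observation hypotheses, not the paper's exact Procedure A:

```python
import math
import random

def sluggish_test(true_h, p, eps=0.2, thresh=5.0, seed=0):
    """Toy sketch of a sluggish sequential test for two hypotheses.

    p[h][a] is the Bernoulli parameter of the observation under
    hypothesis h when action a is taken.  The action is re-chosen only
    with probability eps, which discourages costly switches.  This is
    an illustrative simplification, not the paper's exact policy.
    """
    rng = random.Random(seed)
    llr = 0.0                      # log-likelihood ratio of H1 vs H0
    action = rng.randrange(2)
    switches = steps = 0
    while abs(llr) < thresh:
        if rng.random() < eps:     # sluggish switching opportunity
            ml = 1 if llr > 0 else 0   # current ML hypothesis

            def info(a):           # KL divergence between hypotheses
                x, y = p[ml][a], p[1 - ml][a]
                return (x * math.log(x / y)
                        + (1 - x) * math.log((1 - x) / (1 - y)))

            best = max(range(2), key=info)
            if best != action:
                switches += 1
                action = best
        obs = rng.random() < p[true_h][action]
        q0, q1 = p[0][action], p[1][action]
        llr += math.log((q1 if obs else 1 - q1) / (q0 if obs else 1 - q0))
        steps += 1
    return (1 if llr > 0 else 0), steps, switches
```

With well-separated parameters such as p = [[0.2, 0.8], [0.8, 0.2]], the test usually declares the true hypothesis quickly while switching only rarely; shrinking eps trades a slightly longer test for fewer switches.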
Learning to detect an oddball target with observations from an exponential family
The problem of detecting an odd arm from a set of K arms of a multi-armed
bandit, with fixed confidence, is studied in a sequential decision-making
scenario. Each arm's signal follows a distribution from a vector exponential
family. All arms have the same parameters except the odd arm. The actual
parameters of the odd and non-odd arms are unknown to the decision maker.
Further, the decision maker incurs a cost for switching from one arm to
another. This is a sequential decision making problem where the decision maker
gets only a limited view of the true state of nature at each stage, but can
control his view by choosing the arm to observe at each stage. Of interest are
policies that satisfy a given constraint on the probability of false detection.
An information-theoretic lower bound on the total cost (expected time for a
reliable decision plus total switching cost) is first identified, and a
variation on a sequential policy based on the generalised likelihood ratio
statistic is then studied. Thanks to the vector exponential family assumption,
the signal processing in this policy at each stage turns out to be very simple,
in that the associated conjugate prior enables easy updates of the posterior
distribution of the model parameters. The policy, with a suitable threshold, is
shown to satisfy the given constraint on the probability of false detection.
Further, the proposed policy is asymptotically optimal in terms of the total
cost among all policies that satisfy the constraint on the probability of false
detection.
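The easy conjugate-prior updates mentioned above are especially simple in the one-parameter case. A minimal sketch for Bernoulli-signal arms with Beta priors (a hypothetical instantiation; the paper treats general vector exponential families):

```python
def beta_update(alpha, beta, obs):
    """Conjugate posterior update for a Bernoulli observation:
    a Beta(alpha, beta) prior maps to Beta(alpha + obs, beta + 1 - obs)."""
    return alpha + obs, beta + 1 - obs

def posterior_means(observations, k):
    """Maintain independent Beta(1, 1) posteriors over k arms and
    return each arm's posterior mean after a stream of (arm, obs) pairs."""
    params = [(1.0, 1.0)] * k
    for arm, obs in observations:
        params[arm] = beta_update(*params[arm], obs)
    return [a / (a + b) for a, b in params]
```

For example, after observing rewards 1, 1 on arm 0 and 0 on arm 1, the posteriors are Beta(3, 1) and Beta(1, 2), with means 0.75 and 1/3 respectively.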
The power-series algorithm applied to cyclic polling systems
Keywords: polling systems; queueing theory; operations research.
Batched bandit problems
Motivated by practical applications, chiefly clinical trials, we study the
regret achievable for stochastic bandits under the constraint that the employed
policy must split trials into a small number of batches. We propose a simple
policy, and show that a very small number of batches gives close to minimax
optimal regret bounds. As a byproduct, we derive optimal policies with low
switching cost for stochastic bandits.
Comment: Published at http://dx.doi.org/10.1214/15-AOS1381 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
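A batched policy of the kind studied above can be sketched as a two-batch explore-then-commit scheme with unit-variance Gaussian rewards (an illustrative example, not the paper's minimax-optimal batch schedule):

```python
import random

def two_batch_etc(means, horizon, m, seed=0):
    """Two-batch explore-then-commit for unit-variance Gaussian arms.

    Batch 1 pulls each of the K arms m times; batch 2 commits to the
    empirically best arm for the rest of the horizon.  The policy
    switches arms at most K times, which illustrates the
    low-switching-cost byproduct mentioned in the abstract.
    (Hypothetical sketch.)
    """
    rng = random.Random(seed)
    k = len(means)
    est = [0.0] * k
    pulled = []                    # sequence of arms played
    for a in range(k):             # batch 1: round-robin exploration
        for _ in range(m):
            est[a] += rng.gauss(means[a], 1.0) / m
            pulled.append(a)
    best = max(range(k), key=lambda a: est[a])
    pulled.extend([best] * (horizon - k * m))    # batch 2: commit
    # pseudo-regret relative to always playing the best arm
    regret = horizon * max(means) - sum(means[a] for a in pulled)
    return best, regret
```

For instance, two_batch_etc([0.0, 1.0], horizon=1000, m=50) typically commits to arm 1, so the pseudo-regret is just the 50 exploratory pulls of the suboptimal arm.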
Regret Minimisation in Multi-Armed Bandits Using Bounded Arm Memory
In this paper, we propose a constant word (RAM model) algorithm for regret
minimisation for both finite and infinite Stochastic Multi-Armed Bandit (MAB)
instances. Most of the existing regret minimisation algorithms need to remember
the statistics of all the arms they encounter. This may become a problem for
the cases where the number of available words of memory is limited. Designing
an efficient regret minimisation algorithm that uses a constant number of words
has long been interesting to the community. Some early attempts consider the
number of arms to be infinite, and require the reward distribution of the arms
to belong to some particular family. Recently, for finitely many-armed bandits
an explore-then-commit based algorithm~\citep{Liau+PSY:2018} seems to escape
such an assumption. However, due to the underlying PAC-based elimination, their method incurs high regret. We present a conceptually simple and efficient algorithm that needs to remember statistics of at most a constant number of arms, and for any finite bandit instance it enjoys a sub-linear upper bound on regret. We extend it to achieve sub-linear
\textit{quantile-regret}~\citep{RoyChaudhuri+K:2018} and empirically verify the
efficiency of our algorithm via experiments.
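The bounded-memory idea can be illustrated with a streaming champion-challenger sketch that stores statistics for only two arms at a time (a hypothetical scheme in the spirit of the paper, not its actual algorithm):

```python
import random

def streaming_best_arm(arm_means, budget, seed=0):
    """Identify a good arm while storing only O(1) words: the current
    champion's index and two running means.  Arms are examined one at a
    time; each challenger duels the champion over `budget` Bernoulli
    pulls per side, and the empirical winner is retained.
    (Illustrative sketch, not the algorithm from the paper.)
    """
    rng = random.Random(seed)

    def pull(a):
        return 1.0 if rng.random() < arm_means[a] else 0.0

    champ = 0
    for challenger in range(1, len(arm_means)):
        champ_mean = sum(pull(champ) for _ in range(budget)) / budget
        chal_mean = sum(pull(challenger) for _ in range(budget)) / budget
        if chal_mean > champ_mean:
            champ = challenger
    return champ
```

With well-separated means and a moderate per-duel budget, the retained champion is the best arm with high probability, while the working memory never grows with the number of arms.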