16 research outputs found
Learning to detect an oddball target with observations from an exponential family
The problem of detecting an odd arm from a set of K arms of a multi-armed
bandit, with fixed confidence, is studied in a sequential decision-making
scenario. Each arm's signal follows a distribution from a vector exponential
family. All arms have the same parameters except the odd arm. The actual
parameters of the odd and non-odd arms are unknown to the decision maker.
Further, the decision maker incurs a cost for switching from one arm to
another. This is a sequential decision making problem where the decision maker
gets only a limited view of the true state of nature at each stage, but can
control his view by choosing the arm to observe at each stage. Of interest are
policies that satisfy a given constraint on the probability of false detection.
An information-theoretic lower bound on the total cost (expected time for a
reliable decision plus total switching cost) is first identified, and a
variation on a sequential policy based on the generalised likelihood ratio
statistic is then studied. Thanks to the vector exponential family assumption,
the signal processing in this policy at each stage turns out to be very simple,
in that the associated conjugate prior enables easy updates of the posterior
distribution of the model parameters. The policy, with a suitable threshold, is
shown to satisfy the given constraint on the probability of false detection.
Further, the proposed policy is asymptotically optimal in terms of the total
cost among all policies that satisfy the constraint on the probability of false
detection.
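The easy posterior updates described above come from conjugacy. As a minimal sketch (the paper works with a general vector exponential family; the Beta-Bernoulli pair below is its simplest one-dimensional instance, and the function name is ours):

```python
def update_posterior(alpha, beta, reward):
    """Conjugate Beta posterior update for a Bernoulli arm.

    The paper's setting uses a general vector exponential family; the
    Beta-Bernoulli pair is its simplest instance.  Observing a 0/1 reward
    updates Beta(alpha, beta) in closed form, which is what makes the
    per-stage signal processing in such policies cheap.
    """
    return alpha + reward, beta + (1 - reward)

# Example: start from a uniform Beta(1, 1) prior and observe three rewards.
alpha, beta = 1.0, 1.0
for r in [1, 0, 1]:
    alpha, beta = update_posterior(alpha, beta, r)
posterior_mean = alpha / (alpha + beta)   # Beta(3, 2) posterior, mean 0.6
```

Each observation costs only two additions, so the per-stage work does not grow with the number of samples collected.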
Sequential Multi-hypothesis Testing in Multi-armed Bandit Problems: An Approach for Asymptotic Optimality
We consider a multi-hypothesis testing problem involving a K-armed bandit.
Each arm's signal follows a distribution from a vector exponential family. The
actual parameters of the arms are unknown to the decision maker. The decision
maker incurs a delay cost until a decision is reached and a switching cost
whenever he switches from one arm to another. His goal is to minimise the
overall cost until a decision is reached on the true hypothesis. Of interest
are policies that satisfy a given constraint on the probability of false
detection. This is a sequential decision making problem where the decision
maker gets only a limited view of the true state of nature at each stage, but
can control his view by choosing the arm to observe at each stage. An
information-theoretic lower bound on the total cost (expected time for a
reliable decision plus total switching cost) is first identified, and a
variation on a sequential policy based on the generalised likelihood ratio
statistic is then studied. Due to the vector exponential family assumption, the
signal processing at each stage is simple; the associated conjugate prior
distribution on the unknown model parameters enables easy updates of the
posterior distribution. The proposed policy, with a suitable threshold for
stopping, is shown to satisfy the given constraint on the probability of false
detection. Under a continuous selection assumption, the policy is also shown to
be asymptotically optimal in terms of the total cost among all policies that
satisfy the constraint on the probability of false detection.
Active Anomaly Detection in Heterogeneous Processes
An active inference problem of detecting anomalies among heterogeneous
processes is considered. At each time, a subset of processes can be probed. The
objective is to design a sequential probing strategy that dynamically
determines which processes to observe at each time and when to terminate the
search so that the expected detection time is minimized under a constraint on
the probability of misclassifying any process. This problem falls into the
general setting of sequential design of experiments pioneered by Chernoff in
1959, in which a randomized strategy, referred to as the Chernoff test, was
proposed and shown to be asymptotically optimal as the error probability
approaches zero. For the problem considered in this paper, a low-complexity
deterministic test is shown to enjoy the same asymptotic optimality while
offering significantly better performance in the finite regime and faster
convergence to the optimal rate function, especially when the number of
processes is large. The computational complexity of the proposed test is also
of a significantly lower order.
Comment: This work has been accepted for publication in IEEE Transactions on Information Theory.
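A hypothetical deterministic probing loop in the spirit described above (an illustrative stand-in, not the paper's exact test; the statistic, the tie-breaking, and the stopping rule here are our simplifications):

```python
import numpy as np

def probe_loop(llr_increment, n_processes, threshold, rng=None, max_steps=10_000):
    """Hypothetical deterministic probing loop (an illustrative stand-in,
    not the paper's exact test).

    Each process keeps an accumulated log-likelihood ratio (LLR) between
    its 'anomalous' and 'normal' hypotheses.  At each time we probe the
    single process whose |LLR| is smallest, i.e. the least resolved one,
    and we stop once every |LLR| clears the threshold.

    llr_increment(i, rng) returns one LLR sample for process i.
    """
    llr = np.zeros(n_processes)
    for t in range(max_steps):
        i = int(np.argmin(np.abs(llr)))   # probe the least certain process
        llr[i] += llr_increment(i, rng)
        if np.all(np.abs(llr) >= threshold):
            return t + 1, llr             # stopping time and final statistics
    return max_steps, llr
```

With informative observations each |LLR| drifts away from zero, so the loop terminates; the sign of each final statistic gives the declared label of the corresponding process. Unlike a randomized Chernoff test, the probing choice here is a deterministic function of the collected data.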
Quickest Change Detection with Controlled Sensing
In the problem of quickest change detection, a change occurs at some unknown
time in the distribution of a sequence of random vectors that are monitored in
real time, and the goal is to detect this change as quickly as possible subject
to a certain false alarm constraint. In this work we consider this problem in
the presence of parametric uncertainty in the post-change regime and controlled
sensing. That is, the post-change distribution contains an unknown parameter,
and the distribution of each observation, before and after the change, is
affected by a control action. In this context, in addition to a stopping rule
that determines the time at which it is declared that the change has occurred,
one also needs to determine a sequential control policy, which chooses the
control action at each time based on the already collected observations. We
formulate this problem mathematically using Lorden's minimax criterion, and
assuming that there are finitely many possible actions and post-change
parameter values. We then propose a specific procedure for this problem that
employs an adaptive CuSum statistic in which (i) the estimate of the parameter
is based on a fixed number of the more recent observations, and (ii) each
action is selected to maximize the Kullback-Leibler divergence of the next
observation based on the current parameter estimate, apart from a small number
of exploration times. We show that this procedure, which we call the Windowed
Chernoff-CuSum (WCC), is first-order asymptotically optimal under Lorden's
minimax criterion, for every possible value of the unknown post-change
parameter, as the mean time to false alarm goes to infinity. We also provide
simulation results to illustrate the performance of the WCC procedure.
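The two ingredients (i) and (ii) of the WCC procedure can be sketched as follows (a simplified illustration with a finite parameter grid and the exploration times omitted; the helper names and the Gaussian example are our assumptions, not the paper's code):

```python
import numpy as np

def windowed_estimate(recent_obs, recent_actions, loglik, param_grid):
    """Step (i), sketched: estimate the post-change parameter by maximum
    likelihood over a finite grid, using only a fixed window of the most
    recent (observation, action) pairs."""
    scores = [sum(loglik(x, a, th) for x, a in zip(recent_obs, recent_actions))
              for th in param_grid]
    return param_grid[int(np.argmax(scores))]

def greedy_action(theta_hat, actions, kl):
    """Step (ii), sketched: choose the action maximizing the Kullback-Leibler
    divergence of the next observation under the current estimate theta_hat
    (the occasional exploration times are omitted here)."""
    return max(actions, key=lambda a: kl(a, theta_hat))
```

For instance, with post-change observations N(theta * gain[a], 1) and pre-change observations N(0, 1), the KL divergence of action a is (theta * gain[a])**2 / 2, so the greedy rule favours the highest-gain action once the estimate settles.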
Training a Single Bandit Arm
The stochastic multi-armed bandit problem captures the fundamental
exploration vs. exploitation tradeoff inherent in online decision-making in
uncertain settings. However, in several applications, the traditional objective
of maximizing the expected sum of rewards obtained can be inappropriate.
Motivated by the problem of optimizing job assignments to groom novice workers
with unknown trainability in labor platforms, we consider a new objective in
the classical setup. Instead of maximizing the expected total reward from a
fixed budget of pulls, we consider the vector of cumulative rewards earned from
each of the arms once the budget is exhausted, and aim to maximize the expected
value of the highest of these rewards. This corresponds to the objective of grooming a
single, highly skilled worker using a limited supply of training jobs.
For this new objective, we establish a lower bound on the worst-case regret
that any policy must incur. We design an explore-then-commit policy featuring
exploration based on finely tuned confidence bounds on the mean reward and an
adaptive stopping criterion, which adapts to the problem difficulty and comes
with a worst-case regret guarantee. Our numerical experiments demonstrate that
this policy improves
upon several natural candidate policies for this setting.
Comment: 23 pages, 1 figure, 1 table
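An explore-then-commit skeleton for this objective might look as follows (a sketch only: the paper's policy tunes the exploration length adaptively via confidence bounds, whereas this version uses a fixed exploration budget; all names are illustrative):

```python
import numpy as np

def explore_then_commit(pull, n_arms, horizon, explore_rounds):
    """Explore-then-commit skeleton for the 'train one arm' objective
    (a simplification: the exploration length is a fixed budget here
    rather than being chosen by an adaptive stopping criterion).

    Phase 1: pull every arm `explore_rounds` times.
    Phase 2: commit the remaining budget to the empirically best arm, so
    that a single arm accumulates the largest possible cumulative reward.
    """
    cum = np.zeros(n_arms)        # cumulative reward earned on each arm
    t = 0
    for i in range(n_arms):
        for _ in range(explore_rounds):
            cum[i] += pull(i)
            t += 1
    best = int(np.argmax(cum))    # equal counts, so cum ranks the means
    while t < horizon:
        cum[best] += pull(best)
        t += 1
    return best, cum
```

The committed arm is the one whose final cumulative reward the objective cares about; the exploration pulls on the other arms are a sunk cost of identifying it.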
Optimal Best Arm Identification with Fixed Confidence
We give a complete characterization of the complexity of best-arm identification in one-parameter bandit problems. We prove a new, tight lower bound on the sample complexity. We propose the `Track-and-Stop' strategy, which we prove to be asymptotically optimal. It consists of a new sampling rule (which tracks the optimal proportions of arm draws highlighted by the lower bound) and a stopping rule named after Chernoff, for which we give a new analysis.
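The tracking part of the sampling rule can be sketched in a few lines (assuming the optimal proportions from the lower bound are already computed; the computation of those proportions, the forced exploration, and the Chernoff stopping rule are all omitted):

```python
import numpy as np

def track_and_stop_sample(counts, weights):
    """One Track-and-Stop sampling step (sketch).

    `weights` are the optimal arm-draw proportions highlighted by the
    lower bound, assumed precomputed from the current empirical means.
    We pull the arm whose empirical count lags furthest behind its
    target share t * weights[a], a simple form of direct tracking.
    """
    t = counts.sum()
    return int(np.argmax(t * weights - counts))
```

For example, with counts (5, 1, 4) after t = 10 draws and target proportions (0.5, 0.3, 0.2), arm 1 is the most under-sampled relative to its target of 3 draws and would be pulled next.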