We study structured multi-armed bandits, the problem of online
decision-making under uncertainty in the presence of structural information. In
this problem, the decision-maker needs to discover the best course of action
despite observing only uncertain rewards over time. The decision-maker is aware
of certain structural information regarding the reward distributions and would
like to minimize its regret by exploiting this information, where the regret
is its performance difference against a benchmark policy that knows the best
action ahead of time. In the absence of structural information, the classical
upper confidence bound (UCB) and Thompson sampling algorithms are well known to
suffer only minimal regret. As recently pointed out, however, neither algorithm is
capable of exploiting structural information that is commonly
available in practice. We propose a novel learning algorithm, called DUSA, whose
worst-case regret matches the information-theoretic regret lower bound up to a
constant factor and which can handle a wide range of structural information. Our
algorithm DUSA solves a dual counterpart of the regret lower bound at the
empirical reward distribution and follows its suggested play. Our proposed
algorithm is the first computationally viable learning policy for structured
bandit problems with asymptotically minimal regret.
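For context, the instance-dependent lower bound referenced above takes, in the
standard Graves--Lai form, the shape sketched below; the notation (arm set
$\mathcal{A}$, suboptimality gaps $\Delta_a(\theta)$, per-arm reward
distributions $\nu_a(\cdot)$, and set of confusing instances $\Lambda(\theta)$)
is assumed here for illustration, and the precise program that DUSA dualizes is
given in the body of the paper:
\[
\liminf_{T \to \infty} \frac{R_T}{\log T} \;\ge\; c(\theta),
\qquad
c(\theta) \;=\; \min_{\eta \ge 0} \; \sum_{a \in \mathcal{A}} \eta_a \, \Delta_a(\theta)
\quad \text{s.t.} \quad
\sum_{a \in \mathcal{A}} \eta_a \, \mathrm{KL}\!\big(\nu_a(\theta), \nu_a(\lambda)\big) \;\ge\; 1
\quad \forall \lambda \in \Lambda(\theta).
\]
DUSA works with a dual counterpart of such a program, evaluated at the empirical
reward distribution rather than the unknown $\theta$, and follows the play it
suggests.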