On Regret with Multiple Best Arms
We study the regret minimization problem in the presence of multiple
best/near-optimal arms in the multi-armed bandit setting. We consider the case
where the number of arms/actions is comparable to or much larger than the time
horizon, and make no assumptions about the structure of the bandit instance.
Our goal is to design algorithms that can automatically adapt to the unknown
hardness of the problem, i.e., the number of best arms. Our setting captures
many modern applications of bandit algorithms where the action space is
enormous and the information about the underlying instance/structure is
unavailable. We first propose an adaptive algorithm that is agnostic to the
hardness level and theoretically derive its regret bound. We then prove a lower
bound for our problem setting, which indicates: (1) no algorithm can be optimal
simultaneously over all hardness levels; and (2) our algorithm achieves an
adaptive rate function that is Pareto optimal. With additional knowledge of the
expected reward of the best arm, we propose another adaptive algorithm that is
minimax optimal, up to polylog factors, over all hardness levels. Experimental
results confirm our theoretical guarantees and show advantages of our
algorithms over the previous state of the art.
Pareto Optimal Model Selection in Linear Bandits
We study a model selection problem in the linear bandit setting, where the
learner must adapt to the dimension of the optimal hypothesis class on the fly
and balance exploration and exploitation. More specifically, we assume a
sequence of nested linear hypothesis classes with dimensions $d_1 < d_2 < \dots$,
and the goal is to automatically adapt to the smallest hypothesis class
that contains the true linear model. Although previous papers provide various
guarantees for this model selection problem, the analysis therein either works
in favorable cases when one can cheaply conduct statistical testing to locate
the right hypothesis class or is based on the idea of "corralling" multiple
base algorithms, which often performs relatively poorly in practice. These works
also mainly focus on upper bounding the regret. In this paper, we first
establish a lower bound showing that, even with a fixed action set, adaptation
to the unknown intrinsic dimension $d_\star$ comes at a cost: there is no
algorithm that can achieve the regret bound $\widetilde{O}(\sqrt{d_\star T})$
simultaneously for all values of $d_\star$. We also bring new ideas, i.e.,
constructing virtual mixture-arms to effectively summarize useful information,
into the model selection problem in linear bandits. Under a mild assumption on
the action set, we design a Pareto optimal algorithm with guarantees matching
the rate in the lower bound. Experimental results confirm our theoretical
results and show advantages of our algorithm compared to prior work.