Stochastic Bandits with Context Distributions
We introduce a stochastic contextual bandit model where at each time step the
environment chooses a distribution over a context set and samples the context
from this distribution. The learner observes only the context distribution
while the exact context realization remains hidden. This allows for a broad
range of applications where the context is stochastic or when the learner needs
to predict the context. We adapt the UCB algorithm to this setting and show
that it achieves an order-optimal high-probability bound on the cumulative
regret for linear and kernelized reward functions. Our results strictly
generalize previous work in the sense that both our model and the algorithm
reduce to the standard setting when the environment chooses only Dirac delta
distributions and therefore provides the exact context to the learner. We
further analyze a variant where the learner observes the realized context after
choosing the action. Finally, we demonstrate the proposed method on synthetic
and real-world datasets.
Comment: Accepted at NeurIPS 2019
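As a concrete illustration of the approach described above, here is a minimal sketch of a LinUCB variant in which the learner replaces the realized context feature with its expectation under the announced context distribution. This is a hypothetical implementation, not the authors' code: the finite-support distribution format, the feature map `feature`, the environment interface `env_step`, and the confidence parameter `beta` are all assumptions.

```python
import numpy as np

def lin_ucb_context_distributions(T, actions, feature, env_step, dim,
                                  beta=2.0, lam=1.0):
    """LinUCB over context distributions: act on *expected* features."""
    V = lam * np.eye(dim)   # regularized design matrix
    b = np.zeros(dim)       # reward-weighted sum of played features
    for _ in range(T):
        # the environment announces a distribution mu (here: a list of
        # (context, probability) pairs) and hides the realized context
        # inside the reward callback
        mu, reward_of = env_step()
        theta_hat = np.linalg.solve(V, b)    # ridge estimate of the parameter
        best_a, best_psi, best_ucb = None, None, -np.inf
        for a in actions:
            # expected feature E_{x ~ mu}[phi(x, a)] -- the key modification
            psi = sum(p * feature(x, a) for x, p in mu)
            width = beta * np.sqrt(psi @ np.linalg.solve(V, psi))
            ucb = psi @ theta_hat + width
            if ucb > best_ucb:
                best_a, best_psi, best_ucb = a, psi, ucb
        r = reward_of(best_a)                # reward from the hidden context
        V += np.outer(best_psi, best_psi)    # update with the expected feature
        b += r * best_psi
    return np.linalg.solve(V, b)
```

Note that with Dirac delta distributions the expectation collapses to the realized feature, and the sketch reduces to standard LinUCB, mirroring the reduction to the standard setting noted in the abstract.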
Corrupted Contextual Bandits with Action Order Constraints
We consider a variant of the recently introduced contextual bandit problem with
corrupted context, which we call the contextual bandit problem with corrupted
context and action correlation, in which actions exhibit a relational structure
that can be exploited to guide the exploration of viable next decisions. Our
setting is primarily motivated by adaptive mobile health interventions and
related applications, where users may transition through different stages that
require more targeted action-selection strategies. In such settings,
maintaining user engagement is paramount to the success of an intervention, so
it is vital to provide relevant recommendations in a timely manner. The context
provided by users might not always be informative at every decision point, and
standard contextual approaches to action selection will then incur high regret.
We propose a meta-algorithm that uses a referee to dynamically combine the
policies of a contextual bandit and a multi-armed bandit, similar to previous
work, together with a simple correlation mechanism that captures
action-to-action transition probabilities, allowing for more efficient
exploration of time-correlated actions. We empirically evaluate the proposed
algorithm on a simulation where the sequence of best actions is determined by a
hidden state that evolves in a Markovian manner. We show that the proposed
meta-algorithm reduces regret in situations where the relative performance of
the two policies varies over time, such that one is strictly superior to the
other during a given period. To demonstrate the practical applicability of our
setting, we evaluate our method on several real-world datasets, showing clearly
better empirical performance than a set of simple baseline algorithms.
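To make the referee idea concrete, here is a minimal, hypothetical sketch: an exponential-weights referee arbitrates between a contextual policy and a context-free multi-armed bandit policy, while an empirical action-to-action transition matrix biases exploration toward likely next actions. The policy interfaces, the learning rate `eta`, and the mixing probability `trans_mix` are assumptions rather than the paper's specification.

```python
import numpy as np

class RefereeMetaBandit:
    """Sketch of a referee combining a contextual and a MAB policy,
    with transition-guided exploration (interfaces are assumed)."""

    def __init__(self, n_actions, eta=0.1, trans_mix=0.3, rng=None):
        self.w = np.ones(2)          # referee weights: [contextual, mab]
        self.eta = eta               # referee learning rate (assumed)
        self.trans_mix = trans_mix   # prob. of a transition-guided draw (assumed)
        # Laplace-smoothed counts of action -> next-action transitions
        self.counts = np.ones((n_actions, n_actions))
        self.prev = None
        self.rng = rng or np.random.default_rng()

    def choose(self, ctx_action, mab_action):
        p = self.w / self.w.sum()
        self.pick = self.rng.choice(2, p=p)  # which policy to follow this round
        action = ctx_action if self.pick == 0 else mab_action
        if self.prev is not None and self.rng.random() < self.trans_mix:
            # occasionally explore actions that frequently followed the
            # previous action, exploiting the action-correlation structure
            probs = self.counts[self.prev] / self.counts[self.prev].sum()
            action = int(self.rng.choice(len(probs), p=probs))
        return action

    def update(self, action, reward):
        # exponential-weights update for the policy the referee followed
        self.w[self.pick] *= np.exp(self.eta * reward)
        if self.prev is not None:
            self.counts[self.prev, action] += 1  # update transition statistics
        self.prev = action
```

In a typical loop, the caller would obtain candidate actions from its contextual and MAB policies, pass both to `choose`, play the returned action, and report the observed reward to `update`.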