Transfer Learning for Contextual Multi-armed Bandits
Motivated by a range of applications, we study in this paper the problem of
transfer learning for nonparametric contextual multi-armed bandits under the
covariate shift model, where we have data collected on source bandits before
the start of the target bandit learning. The minimax rate of convergence for
the cumulative regret is established and a novel transfer learning algorithm
that attains the minimax regret is proposed. The results quantify the
contribution of the data from the source domains for learning in the target
domain in the context of nonparametric contextual multi-armed bandits.
In view of the general impossibility of adaptation to unknown smoothness, we
develop a data-driven algorithm that achieves near-optimal statistical
guarantees (up to a logarithmic factor) while automatically adapting to the
unknown parameters over a large collection of parameter spaces under an
additional self-similarity assumption. A simulation study is carried out to
illustrate the benefits of utilizing the data from the auxiliary source domains
for learning in the target domain.
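To make the setup concrete, here is a minimal, hypothetical simulation sketch in the spirit of the abstract: source-domain bandit data recorded under a different context distribution warm-start binned nonparametric reward estimates for a UCB learner in the target domain. All distributions, bin counts and constants are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_mean(arm, x):
    # Hypothetical smooth (Lipschitz) mean rewards; not from the paper.
    return np.sin(3 * x) if arm == 0 else x

# Source data gathered before target learning starts; covariate shift means
# source contexts follow a different law than target contexts.
n_src = 1000
x_src = rng.beta(2, 5, size=n_src)
a_src = rng.integers(0, 2, size=n_src)
y_src = np.array([true_mean(a, x) for a, x in zip(a_src, x_src)])
y_src = y_src + 0.1 * rng.standard_normal(n_src)

# Nonparametric estimates on a fixed grid of bins, warm-started from source.
n_bins, n_arms = 20, 2
counts = np.zeros((n_arms, n_bins))
sums = np.zeros((n_arms, n_bins))

def bin_of(x):
    return min(int(x * n_bins), n_bins - 1)

for a, x, y in zip(a_src, x_src, y_src):
    counts[a, bin_of(x)] += 1
    sums[a, bin_of(x)] += y

# Target bandit: UCB per (arm, bin); target contexts are uniform on [0, 1].
T, regret = 2000, 0.0
for t in range(1, T + 1):
    x = rng.uniform()
    b = bin_of(x)
    n = np.maximum(counts[:, b], 1)
    ucb = sums[:, b] / n + np.sqrt(2 * np.log(t + 1) / n)
    a = int(np.argmax(ucb))
    regret += max(true_mean(0, x), true_mean(1, x)) - true_mean(a, x)
    counts[a, b] += 1
    sums[a, b] += true_mean(a, x) + 0.1 * rng.standard_normal()

print(f"cumulative regret over {T} target rounds: {regret:.1f}")
```

Dropping the warm-start loop (so all bins begin empty) makes the benefit of the source data visible as a larger cumulative regret in early rounds.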
Optimal treatment allocations in space and time for on-line control of an emerging infectious disease
A key component in controlling the spread of an epidemic is deciding where, when and to whom to apply an intervention. We develop a framework for using data to inform these decisions in real time. We formalize a treatment allocation strategy as a sequence of functions, one per treatment period, that map up-to-date information on the spread of an infectious disease to a subset of locations where treatment should be allocated. An optimal allocation strategy optimizes some cumulative outcome, e.g. the number of uninfected locations, the geographic footprint of the disease or the cost of the epidemic. Estimation of an optimal allocation strategy for an emerging infectious disease is challenging because spatial proximity induces interference between locations, the number of possible allocations is exponential in the number of locations, and because disease dynamics and intervention effectiveness are unknown at outbreak. We derive a Bayesian on-line estimator of the optimal allocation strategy that combines simulation-optimization with Thompson sampling. The estimator proposed performs favourably in simulation experiments. This work is motivated by and illustrated using data on the spread of white-nose syndrome, which is a highly fatal infectious disease devastating bat populations in North America.
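As a rough illustration of the estimator's two ingredients, the sketch below pairs Thompson sampling of an unknown transmission parameter with greedy simulation-optimization of a budgeted treatment set on a toy contact network. The network, treatment effect, priors and the crude posterior update are all assumptions for illustration, not the authors' model of white-nose syndrome.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy epidemic on a random contact graph; every constant below
# (graph density, treatment effect, priors) is an illustrative assumption.
n_loc, budget, horizon = 30, 5, 10
adj = (rng.uniform(size=(n_loc, n_loc)) < 0.1).astype(int)
adj = np.triu(adj, 1) + np.triu(adj, 1).T
infected = np.zeros(n_loc, dtype=bool); infected[0] = True
beta_true = 0.3                      # unknown per-contact transmission prob.
a_post, b_post = 1.0, 1.0            # Beta prior over beta

def one_step(beta, inf, treated):
    """One stochastic transmission step; treatment damps infection risk."""
    p = 1 - (1 - beta) ** (adj @ inf)
    p[treated] *= 0.2                # assumed treatment effect
    p[inf] = 0.0
    return inf | (rng.uniform(size=n_loc) < p)

def simulate(beta, inf, treated, steps=3, reps=20):
    """Mean simulated epidemic size under a sampled parameter beta."""
    total = 0
    for _ in range(reps):
        cur = inf.copy()
        for _ in range(steps):
            cur = one_step(beta, cur, treated)
        total += cur.sum()
    return total / reps

for t in range(horizon):
    beta_s = rng.beta(a_post, b_post)          # Thompson draw
    treated = np.zeros(n_loc, dtype=bool)
    for _ in range(budget):                    # greedy simulation-optimization
        cands = np.flatnonzero(~treated & ~infected)
        if cands.size == 0:
            break
        scores = [simulate(beta_s, infected, treated | (np.arange(n_loc) == j))
                  for j in cands]
        treated[cands[int(np.argmin(scores))]] = True
    # Observe one real transmission step; crude approximate posterior update.
    at_risk = (adj @ infected > 0) & ~infected
    new = one_step(beta_true, infected, treated) & ~infected
    a_post += new.sum()
    b_post += (at_risk & ~new).sum()
    infected |= new
    print(f"period {t}: sampled beta={beta_s:.2f}, infected={int(infected.sum())}")
```

The Thompson draw randomizes which disease model the allocation is optimized against each period, which is what lets the strategy keep exploring while the posterior over the dynamics sharpens.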
Nearest Neighbour with Bandit Feedback
In this paper we adapt the nearest neighbour rule to the contextual bandit
problem. Our algorithm handles the fully adversarial setting in which no
assumptions at all are made about the data-generation process. When combined
with a sufficiently fast data-structure for (perhaps approximate) adaptive
nearest neighbour search, such as a navigating net, our algorithm is extremely
efficient: it has a per-trial running time polylogarithmic in both the number
of trials and actions, and takes only quasi-linear space.
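The following toy sketch conveys the nearest-neighbour rule with bandit feedback: exploit by replaying the action whose nearest stored context earned the best reward, with occasional uniform exploration. It deliberately uses a linear scan where the paper would plug in a fast adaptive structure such as a navigating net; the reward model and exploration rate are assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

n_actions, T, eps = 3, 1000, 0.05
contexts, actions, rewards = [], [], []

def true_reward(a, x):
    # Hypothetical reward landscape: each action is best near one centre.
    centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.9]])
    return float(np.exp(-4.0 * np.sum((x - centers[a]) ** 2)))

total = 0.0
for t in range(T):
    x = rng.uniform(size=2)
    if not contexts or rng.uniform() < eps:
        a = int(rng.integers(n_actions))          # uniform exploration
    else:
        # Exploit: for each action, look up the reward observed at its
        # nearest stored context, and replay the most promising action.
        # A navigating net would make this lookup polylogarithmic; the
        # linear scan here is only for clarity.
        X = np.asarray(contexts)
        best_val, a = -np.inf, 0
        for cand in range(n_actions):
            idx = [i for i, past in enumerate(actions) if past == cand]
            if not idx:
                a = cand                          # untried action first
                break
            j = idx[int(np.argmin(np.sum((X[idx] - x) ** 2, axis=1)))]
            if rewards[j] > best_val:
                best_val, a = rewards[j], cand
    r = true_reward(a, x) + 0.1 * rng.standard_normal()
    contexts.append(x); actions.append(a); rewards.append(r)
    total += r

print(f"average reward: {total / T:.3f}")
```

Note that nothing here assumes a data-generating process: the rule only compares stored (context, action, reward) triples, which is what makes the fully adversarial analysis possible.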
Federated Linear Contextual Bandits with User-level Differential Privacy
This paper studies federated linear contextual bandits under the notion of
user-level differential privacy (DP). We first introduce a unified federated
bandits framework that can accommodate various definitions of DP in the
sequential decision-making setting. We then formally introduce user-level
central DP (CDP) and local DP (LDP) in the federated bandits framework, and
investigate the fundamental trade-offs between the learning regrets and the
corresponding DP guarantees in a federated linear contextual bandits model. For
CDP, we propose a federated algorithm termed ROBIN and show that it is
near-optimal in terms of the number of clients $M$ and the privacy budget
$\varepsilon$ by deriving nearly matching upper and lower regret bounds when
user-level DP is satisfied. For LDP, we obtain several lower bounds, indicating
that learning under user-level $(\varepsilon, \delta)$-LDP must suffer a regret
blow-up factor of at least $\min\{1/\varepsilon, M\}$ or
$\min\{1/\varepsilon, \sqrt{M}\}$ under different conditions.

Comment: Accepted by ICML 2023
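The generic pattern behind private federated linear bandits can be sketched as follows: clients clip their local data, add Gaussian noise to the sufficient statistics of a linear model, and the server aggregates the noisy statistics for UCB-style play. This is only a hedged illustration of that pattern, not ROBIN itself; the dimensions, noise scale and clipping bounds are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

d, n_clients, sigma_dp = 5, 10, 1.0        # invented sizes and noise scale
theta_true = rng.normal(size=d) / np.sqrt(d)

V = np.eye(d)        # server-side regularized Gram matrix
u = np.zeros(d)      # server-side reward-weighted feature sum
for _ in range(n_clients):
    # One user's entire local history (user-level DP protects all of it).
    X = rng.normal(size=(20, d))
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # clip
    y = np.clip(X @ theta_true + 0.1 * rng.standard_normal(20), -1.0, 1.0)
    # Gaussian mechanism on the sufficient statistics before sharing.
    N = rng.normal(scale=sigma_dp, size=(d, d))
    V = V + X.T @ X + (N + N.T) / np.sqrt(2)   # symmetrized matrix noise
    u = u + X.T @ y + rng.normal(scale=sigma_dp, size=d)

# Real algorithms enlarge the regularizer so V stays positive definite
# despite the injected noise; that correction is omitted here for brevity.
theta_hat = np.linalg.solve(V, u)    # private ridge estimate driving UCB play
print("estimate:", np.round(theta_hat, 2))
print("truth:   ", np.round(theta_true, 2))
```

The regret blow-up in the LDP lower bounds reflects exactly this tension: the noise needed to hide any single user's whole history inflates the uncertainty of the aggregated estimate.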
Learning how to act: making good decisions with machine learning
This thesis is about machine learning and statistical approaches
to decision making. How can we learn from data to anticipate the
consequence of, and optimally select, interventions or actions?
Problems such as deciding which medication to prescribe to
patients, who should be released on bail, and how much to charge
for insurance are ubiquitous, and have far reaching impacts on
our lives. There are two fundamental approaches to learning how
to act: reinforcement learning, in which an agent directly
intervenes in a system and learns from the outcome, and
observational causal inference, whereby we seek to infer the
outcome of an intervention from observing the system.
The goal of this thesis is to connect and unify these key
approaches. I introduce causal bandit problems: a synthesis that
combines causal graphical models, which were developed for
observational causal inference, with multi-armed bandit problems,
which are a subset of reinforcement learning problems that are
simple enough to admit formal analysis. I show that knowledge of
the causal structure allows us to transfer information learned
about the outcome of one action to predict the outcome of an
alternate action, yielding a novel form of structure between
bandit arms that cannot be exploited by existing algorithms. I
propose an algorithm for causal bandit problems and prove bounds
on the simple regret demonstrating it is close to minimax
optimal and better than algorithms that do not use the additional
causal information.
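A toy instance of the idea, under assumed numbers: when a cause X of the outcome Y is unconfounded, P(Y | do(X=x)) = P(Y | X=x), so purely observational samples update the posteriors of both interventional arms at once, a transfer between arms that a standard bandit algorithm cannot perform.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy model: binary X causes binary Y with no confounding, so
# P(Y=1 | do(X=x)) = P(Y=1 | X=x). Besides do(X=0) and do(X=1), the agent
# may simply observe the system, which reveals (X, Y) jointly.
p_x = 0.8                        # X=1 arises naturally most of the time
p_y = {0: 0.3, 1: 0.6}           # P(Y=1 | X=x)
succ = np.ones(2); fail = np.ones(2)   # Beta(1,1) posteriors per do-arm

for _ in range(300):
    # Pure observation: each sample updates one interventional posterior
    # "for free" through the causal structure.
    x = int(rng.uniform() < p_x)
    y = int(rng.uniform() < p_y[x])
    succ[x] += y
    fail[x] += 1 - y

print("estimated P(Y=1 | do(X=x)):", np.round(succ / (succ + fail), 2))
# A bandit treating the two do-arms as unrelated could not reuse these
# samples; the rare event X=0 still accrues data at rate 1 - p_x, which is
# the kind of structure between arms that causal bandit algorithms exploit.
```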
Active and Passive Causal Inference Learning
This paper serves as a starting point for machine learning researchers,
engineers and students who are interested in but not yet familiar with causal
inference. We start by laying out an important set of assumptions that are
collectively needed for causal identification, such as exchangeability,
positivity, consistency and the absence of interference. From these
assumptions, we build out a set of important causal inference techniques,
which we categorize into two buckets: active and passive approaches.
We describe and discuss randomized controlled trials and bandit-based
approaches from the active category. We then describe classical approaches,
such as matching and inverse probability weighting, in the passive category,
followed by more recent deep learning based algorithms. We close by pointing
to aspects of causal inference that this paper does not cover, such as
collider bias, and we expect the paper to provide readers with a diverse set
of starting points for further reading and research in causal inference and
discovery.
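As a concrete companion to the passive category, the sketch below applies inverse probability weighting to simulated confounded data: each unit is reweighted by the inverse of the probability of the treatment it actually received, so the weighted sample mimics a randomized trial. The data-generating process is invented, and the propensities are known by construction; in practice they would be estimated.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 20000
x = rng.uniform(size=n)                      # confounder
e = 0.2 + 0.6 * x                            # propensity P(T=1 | X)
t = (rng.uniform(size=n) < e).astype(float)
y = 2.0 * t + 3.0 * x + rng.standard_normal(n)   # true effect is 2.0

naive = y[t == 1].mean() - y[t == 0].mean()      # biased by confounding
ipw = np.mean(t * y / e - (1 - t) * y / (1 - e)) # Horvitz-Thompson IPW
print(f"naive difference in means: {naive:.2f}")
print(f"IPW estimate of the average treatment effect: {ipw:.2f}")
```

The naive contrast overstates the effect because treated units tend to have larger x; the weighting removes exactly that imbalance, which is why positivity (e strictly between 0 and 1) appears among the identification assumptions above.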
Tracking Most Significant Shifts in Nonparametric Contextual Bandits
We study nonparametric contextual bandits where Lipschitz mean reward
functions may change over time. We first establish the minimax dynamic regret
rate in this less understood setting in terms of the number of changes $L$ and
the total variation $V$, both capturing all changes in distribution over the
context space $\mathcal{X}$, and argue that state-of-the-art procedures are
suboptimal in this setting.
Next, we turn to the question of adaptivity for this setting, i.e.
achieving the minimax rate without knowledge of $L$ or $V$. Quite importantly,
we posit that the bandit problem, viewed locally at a given context $x$,
should not be affected by reward changes in other parts of the context space
$\mathcal{X}$. We therefore propose a notion of change, which we term
experienced significant shifts, that better accounts for locality, and thus
counts considerably fewer changes than $L$ and $V$. Furthermore, similar to
recent work on non-stationary MAB (Suk & Kpotufe, 2022), experienced
significant shifts only count the most significant changes in mean rewards,
e.g., severe best-arm changes relevant to observed contexts.
Our main result is to show that this more tolerant notion of change can in
fact be adapted to
Non-stationary Contextual Bandits and Universal Learning
We study the fundamental limits of learning in contextual bandits, where a
learner's rewards depend on their actions and a known context, which extends
the canonical multi-armed bandit to the case where side-information is
available. We are interested in universally consistent algorithms, which
achieve sublinear regret compared to any measurable fixed policy, without any
function class restriction. For stationary contextual bandits, when the
underlying reward mechanism is time-invariant, [Blanchard et al.] characterized
learnable context processes for which universal consistency is achievable; and
further gave algorithms ensuring universal consistency whenever this is
achievable, a property known as optimistic universal consistency. It is well
understood, however, that reward mechanisms can evolve over time, possibly
depending on the learner's actions. We show that optimistic universal learning
for non-stationary contextual bandits is impossible in general, contrary to all
previously studied settings in online learning -- including standard supervised
learning. We also give necessary and sufficient conditions for universal
learning under various non-stationarity models, including online and
adversarial reward mechanisms. In particular, the set of learnable processes
for non-stationary rewards is still extremely general -- larger than i.i.d.,
stationary or ergodic -- but in general strictly smaller than that for
supervised learning or stationary contextual bandits, shedding light on new
non-stationary phenomena.