8 research outputs found
Efficient Real-world Testing of Causal Decision Making via Bayesian Experimental Design for Contextual Optimisation
The real-world testing of decisions made using causal machine learning models
is an essential prerequisite for their successful application. We focus on
evaluating and improving contextual treatment assignment decisions: these are
personalised treatments applied to e.g. customers, each with their own
contextual information, with the aim of maximising a reward. In this paper we
introduce a model-agnostic framework for gathering data to evaluate and improve
contextual decision making through Bayesian Experimental Design. Specifically,
our method is used for the data-efficient evaluation of the regret of past
treatment assignments. Unlike approaches such as A/B testing, our method avoids
assigning treatments that are known to be highly sub-optimal, whilst engaging
in some exploration to gather pertinent information. We achieve this by
introducing an information-based design objective, which we optimise
end-to-end. Our method applies to discrete and continuous treatments. Comparing
our information-theoretic approach to baselines in several simulation studies
demonstrates the superior performance of our proposed approach.
Comment: ICML 2022 Workshop on Adaptive Experimental Design and Active
Learning in the Real World. 16 pages, 5 figures
Robust Bandit Learning with Imperfect Context
A standard assumption in contextual multi-armed bandits is that the true context
is perfectly known before arm selection. Nonetheless, in many practical
applications (e.g., cloud resource management), prior to arm selection, the
context information can only be acquired by prediction subject to errors or
adversarial modification. In this paper, we study a contextual bandit setting
in which only imperfect context is available for arm selection while the true
context is revealed at the end of each round. We propose two robust arm
selection algorithms: MaxMinUCB (Maximize Minimum UCB) which maximizes the
worst-case reward, and MinWD (Minimize Worst-case Degradation) which minimizes
the worst-case regret. Importantly, we analyze the robustness of MaxMinUCB and
MinWD by deriving both regret and reward bounds compared to an oracle that
knows the true context. Our results show that, as time goes on, MaxMinUCB and
MinWD both asymptotically perform as well as their optimal counterparts that
know the reward function. Finally, we apply MaxMinUCB and MinWD to online edge
datacenter selection, and run synthetic simulations to validate our theoretical
analysis.
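The max-min selection rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the uncertainty about the true context is represented by a small finite set of candidate contexts, and that a UCB score is available as a callable. The names `max_min_ucb`, `toy_ucb`, and `candidate_contexts` are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def max_min_ucb(
    arms: Sequence[int],
    candidate_contexts: Sequence[float],
    ucb: Callable[[int, float], float],
) -> Tuple[int, float]:
    """Return the arm whose worst-case UCB over the candidate contexts is largest."""
    best_arm, best_value = arms[0], float("-inf")
    for arm in arms:
        # Worst case: the candidate context that makes this arm look least attractive.
        worst = min(ucb(arm, context) for context in candidate_contexts)
        if worst > best_value:
            best_arm, best_value = arm, worst
    return best_arm, best_value


def toy_ucb(arm: int, context: float) -> float:
    """Hypothetical UCB scores: arm 0 is context-sensitive, arm 1 is flat."""
    if arm == 0:
        return 2.0 * context + 0.1
    return 0.6


# Predicted context 0.5, with the true context assumed to lie in {0.1, 0.5, 0.9}.
predicted_context = 0.5
candidate_contexts = [0.1, 0.5, 0.9]

# A rule that trusts the prediction picks the arm with the highest UCB at 0.5.
naive_arm = max([0, 1], key=lambda arm: toy_ucb(arm, predicted_context))

# The max-min rule guards against the least favourable candidate context.
chosen_arm, worst_case = max_min_ucb([0, 1], candidate_contexts, toy_ucb)
```

In this toy setting the two rules disagree: trusting the predicted context favours the context-sensitive arm, while the max-min rule prefers the arm whose score does not collapse under an unfavourable context.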
Dynamic Batch Learning in High-Dimensional Sparse Linear Contextual Bandits
We study the problem of dynamic batch learning in high-dimensional sparse
linear contextual bandits, where a decision maker, under a given
maximum-number-of-batch constraint and only able to observe rewards at the end
of each batch, can dynamically decide how many individuals to include in the
next batch (at the end of the current batch) and what personalized
action-selection scheme to adopt within each batch. Such batch constraints are
ubiquitous in a variety of practical contexts, including personalized product
offerings in marketing and medical treatment selection in clinical trials. We
characterize the fundamental learning limit in this problem via a regret lower
bound and provide a matching upper bound (up to log factors), thus prescribing
an optimal scheme for this problem. To the best of our knowledge, our work
provides the first inroad into a theoretical understanding of dynamic batch
learning in high-dimensional sparse linear contextual bandits. Notably, even a
special case of our result (when no batch constraint is present) yields the
first minimax optimal $\tilde{O}(\sqrt{s_0 T})$ regret bound for standard online
learning in high-dimensional linear contextual bandits (for the no-margin
case), where $s_0$ is the sparsity parameter (or an upper bound thereof) and
$T$ is the learning horizon. This result (both that $\tilde{O}(\sqrt{s_0 T})$
is achievable and that $\Omega(\sqrt{s_0 T})$ is a lower bound) appears to be
unknown in the emerging literature of high-dimensional contextual bandits.
Comment: 33 pages
Online Learning of Energy Consumption for Navigation of Electric Vehicles
Energy-efficient navigation constitutes an important challenge in electric vehicles, due to their limited battery capacity. We employ a Bayesian approach to model the energy consumption at road segments for efficient navigation. In order to learn the model parameters, we develop an online learning framework and investigate several exploration strategies such as Thompson Sampling and Upper Confidence Bound. We then extend our online learning framework to the multi-agent setting, where multiple vehicles adaptively navigate and learn the parameters of the energy model. We analyze Thompson Sampling and establish rigorous regret bounds on its performance in the single-agent and multi-agent settings, through an analysis of the algorithm under batched feedback. Finally, we demonstrate the performance of our methods via experiments on several real-world city road networks.
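As a rough illustration of Thompson Sampling over road segments, the sketch below assumes a Gaussian prior on each segment's mean energy consumption, a known observation-noise variance, and a small fixed list of candidate paths in place of a shortest-path search. These modelling choices, and all names and numbers, are assumptions for illustration rather than the paper's setup.

```python
import random


class SegmentPosterior:
    """Gaussian belief over one road segment's mean energy consumption."""

    def __init__(self, mean: float, var: float) -> None:
        self.mean, self.var = mean, var

    def sample(self, rng: random.Random) -> float:
        return rng.gauss(self.mean, self.var ** 0.5)

    def update(self, observation: float, noise_var: float) -> None:
        # Conjugate Gaussian update with known observation-noise variance.
        precision = 1.0 / self.var + 1.0 / noise_var
        self.mean = (self.mean / self.var + observation / noise_var) / precision
        self.var = 1.0 / precision


def thompson_route(paths, posteriors, rng):
    """Sample every segment once, then pick the path with the lowest sampled total."""
    sampled = {seg: post.sample(rng) for seg, post in posteriors.items()}
    return min(range(len(paths)), key=lambda i: sum(sampled[s] for s in paths[i]))


# Hypothetical network: path 0 uses segments a and b, path 1 uses segment c.
rng = random.Random(0)
true_mean = {"a": 1.0, "b": 1.0, "c": 3.0}
noise_std = 0.1
paths = [["a", "b"], ["c"]]
posteriors = {seg: SegmentPosterior(mean=2.0, var=1.0) for seg in true_mean}
counts = [0, 0]

for _ in range(200):
    choice = thompson_route(paths, posteriors, rng)
    counts[choice] += 1
    for seg in paths[choice]:
        # Observe noisy energy consumption on each traversed segment.
        observed = rng.gauss(true_mean[seg], noise_std)
        posteriors[seg].update(observed, noise_std ** 2)
```

After a short exploration phase the sampler concentrates on the path with the lower true total energy, and the posterior means of the traversed segments move towards their true values.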
Homomorphically Encrypted Linear Contextual Bandit
Contextual bandit is a general framework for online learning in sequential
decision-making problems that has found application in a large range of
domains, including recommendation systems, online advertising, clinical trials,
and many more. A critical aspect of bandit methods is that they need to
observe the contexts -- i.e., individual or group-level data -- and the rewards
in order to solve the sequential problem. The large deployment in industrial
applications has increased interest in methods that preserve the privacy of the
users. In this paper, we introduce a privacy-preserving bandit framework based
on asymmetric encryption. The bandit algorithm only observes encrypted
information (contexts and rewards) and has no ability to decrypt it. Leveraging
homomorphic encryption, we show that despite the complexity of the setting, it
is possible to learn over encrypted data. We introduce an algorithm that
achieves a $\tilde{O}(d\sqrt{T})$ regret bound in any linear contextual
bandit problem, while keeping data encrypted.