DTR Bandit: Learning to Make Response-Adaptive Decisions With Low Regret
Dynamic treatment regimes (DTRs) are personalized, adaptive, multi-stage
treatment plans that adapt treatment decisions both to an individual's initial
features and to intermediate outcomes and features at each subsequent stage,
which are affected by decisions in prior stages. Examples include personalized
first- and second-line treatments of chronic conditions like diabetes, cancer,
and depression, which adapt to patient response to first-line treatment,
disease progression, and individual characteristics. While existing literature
mostly focuses on estimating the optimal DTR from offline data such as from
sequentially randomized trials, we study the problem of developing the optimal
DTR in an online manner, where the interaction with each individual affects both
our cumulative reward and our data collection for future learning. We term this
the DTR bandit problem. We propose a novel algorithm that, by carefully
balancing exploration and exploitation, is guaranteed to achieve rate-optimal
regret when the transition and reward models are linear. We demonstrate our
algorithm and its benefits both in synthetic experiments and in a case study of
adaptive treatment of major depressive disorder using real-world data.
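The setting described above can be illustrated with a minimal sketch: two stages, two actions per stage, linear reward and transition models, and ε-greedy exploration standing in for the paper's rate-optimal algorithm. All dimensions, parameters, and the ridge estimator are illustrative assumptions, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T, eps = 3, 2000, 0.1  # feature dim, number of individuals, exploration rate

# Unknown truth: reward of action a in stage s is x @ theta[s][a] + noise.
theta = {s: {a: rng.normal(size=d) for a in (0, 1)} for s in (1, 2)}

def ridge_fit(X, y, lam=1.0):
    """Ridge least-squares estimate of a linear reward model."""
    X, y = np.asarray(X), np.asarray(y)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Online data and parameter estimates per (stage, action) pair.
data = {(s, a): ([], []) for s in (1, 2) for a in (0, 1)}
est = {(s, a): np.zeros(d) for s in (1, 2) for a in (0, 1)}

total_reward = 0.0
for t in range(T):
    x = rng.normal(size=d)                      # individual's initial features
    for s in (1, 2):
        if rng.random() < eps:
            a = int(rng.integers(2))            # explore uniformly
        else:
            a = int(x @ est[(s, 1)] > x @ est[(s, 0)])  # exploit estimates
        r = float(x @ theta[s][a] + 0.1 * rng.normal())
        total_reward += r
        Xs, ys = data[(s, a)]
        Xs.append(x.copy()); ys.append(r)
        est[(s, a)] = ridge_fit(Xs, ys)
        # Linear transition: stage-2 features depend on stage-1 decision.
        x = 0.5 * x + 0.5 * a + 0.1 * rng.normal(size=d)
```

Because each individual's intermediate features depend on the stage-1 decision, exploration at stage 1 shapes the data available for learning the stage-2 models, which is the core tension the DTR bandit problem formalizes.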
Doubly High-Dimensional Contextual Bandits: An Interpretable Model for Joint Assortment-Pricing
Key challenges in running a retail business include how to select products to
present to consumers (the assortment problem), and how to price products (the
pricing problem) to maximize revenue or profit. Instead of considering these
problems in isolation, we propose a joint approach to assortment-pricing based
on contextual bandits. Our model is doubly high-dimensional, in that both
context vectors and actions are allowed to take values in high-dimensional
spaces. In order to circumvent the curse of dimensionality, we propose a simple
yet flexible model that captures the interactions between covariates and
actions via a (near) low-rank representation matrix. The resulting class of
models is reasonably expressive while remaining interpretable through latent
factors, and includes various structured linear bandit and pricing models as
particular cases. We propose a computationally tractable procedure that
combines an exploration/exploitation protocol with an efficient low-rank matrix
estimator, and we prove bounds on its regret. Simulation results show that this
method has lower regret than state-of-the-art methods applied to various
standard bandit and pricing models. Real-world case studies on the
assortment-pricing problem, from an industry-leading instant noodles company to
an emerging beauty start-up, underscore the gains achievable using our method.
In each case, we show at least three-fold gains in revenue or profit from our
bandit method, as well as the interpretability of the latent factor models that
are learned.
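The low-rank interaction structure can be sketched with a small simulation: rewards follow a bilinear form x'Θa with a rank-2 matrix Θ, recovered here by plain least squares on vec(Θ) followed by SVD truncation. This is a crude, illustrative stand-in for the paper's efficient low-rank estimator; all dimensions and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

d1, d2, r, n = 8, 10, 2, 4000  # context dim, action dim, true rank, samples

# True rank-r interaction matrix capturing covariate-action interactions.
U, V = rng.normal(size=(d1, r)), rng.normal(size=(d2, r))
Theta = U @ V.T / np.sqrt(r)

X = rng.normal(size=(n, d1))   # high-dimensional contexts
A = rng.normal(size=(n, d2))   # high-dimensional actions (assortment/price codes)
y = np.einsum('ij,jk,ik->i', X, Theta, A) + 0.1 * rng.normal(size=n)

# Least squares on the vectorized model y_i = (x_i ⊗ a_i)' vec(Theta) ...
Z = np.einsum('ij,ik->ijk', X, A).reshape(n, d1 * d2)
vec_hat = np.linalg.lstsq(Z, y, rcond=None)[0]

# ... then hard rank truncation via SVD to exploit the low-rank structure.
Uh, s, Vh = np.linalg.svd(vec_hat.reshape(d1, d2), full_matrices=False)
Theta_hat = (Uh[:, :r] * s[:r]) @ Vh[:r]

err = np.linalg.norm(Theta_hat - Theta) / np.linalg.norm(Theta)
```

The rank-r factors recovered in `Uh[:, :r]` and `Vh[:r]` play the role of the interpretable latent factors mentioned in the abstract: each links a direction in context space to a direction in action space.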