Learning to Optimize under Non-Stationarity
We introduce algorithms that achieve state-of-the-art \emph{dynamic regret}
bounds for the non-stationary linear stochastic bandit setting, which captures
natural applications such as dynamic pricing and ads allocation in a changing
environment. We show how the difficulty posed by the non-stationarity can be
overcome by a novel marriage between stochastic and adversarial bandit
learning algorithms. Defining $d$, $B_T$, and $T$ as the problem dimension, the
\emph{variation budget}, and the total time horizon, respectively, our main
contributions are the tuned Sliding Window UCB (\texttt{SW-UCB}) algorithm with
optimal $\widetilde{O}(d^{2/3}(B_T+1)^{1/3}T^{2/3})$ dynamic regret, and the
tuning-free bandit-over-bandit (\texttt{BOB}) framework built on top of the
\texttt{SW-UCB} algorithm with best $\widetilde{O}(d^{2/3}(B_T+1)^{1/4}T^{3/4})$
dynamic regret.
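As a rough illustration of the sliding-window idea (not the paper's exact algorithm), the Python sketch below forgets all but the most recent observations before computing an optimistic index; the window length w, regularization lam, and bonus scale beta are illustrative placeholders rather than the tuned values from the paper.

\begin{verbatim}
import numpy as np

def sw_ucb_step(history, actions, w=200, lam=1.0, beta=1.0):
    """Pick an action with a sliding-window LinUCB-style rule (sketch).

    history: list of (x, r) pairs with x a numpy feature vector and r the
    observed reward; actions: candidate feature vectors for this round.
    Only the w most recent observations are kept, so data from a drifted
    environment is eventually forgotten.
    """
    recent = history[-w:]
    d = len(actions[0])
    V = lam * np.eye(d)          # regularized design matrix over the window
    b = np.zeros(d)
    for x, r in recent:
        V += np.outer(x, x)
        b += r * x
    theta_hat = np.linalg.solve(V, b)   # windowed ridge estimate
    V_inv = np.linalg.inv(V)
    def ucb(x):
        # Estimated reward plus an exploration bonus from the window's data.
        return float(x @ theta_hat + beta * np.sqrt(x @ V_inv @ x))
    return max(actions, key=ucb)
\end{verbatim}

The BOB framework, at a high level, removes the need to tune w by running an adversarial bandit over a small grid of candidate window lengths, feeding each epoch's realized reward back as that candidate's payoff.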
Efficient and Interpretable Bandit Algorithms
Motivated by the importance of explainability in modern machine learning, we
design bandit algorithms that are \emph{efficient} and \emph{interpretable}. A
bandit algorithm is interpretable if it explores with the objective of reducing
uncertainty in the unknown model parameter. To quantify the interpretability,
we introduce a novel metric of \textit{uncertainty loss}, which compares the
rate of the uncertainty reduction to the theoretical optimum. We propose CODE,
a bandit algorithm based on a \textbf{C}onstrained \textbf{O}ptimal
\textbf{DE}sign, that is interpretable and maximally reduces the uncertainty.
The key idea in CODE is to explore among all plausible actions, determined by
a statistical constraint, to achieve interpretability. We implement CODE
efficiently in both multi-armed and linear bandits and derive near-optimal
regret bounds by leveraging the optimality criteria of the approximate optimal
design. CODE can also be viewed as removing the phases in conventional phased
elimination, which makes it more practical and general. We demonstrate the
advantage of CODE through numerical experiments on both synthetic and
real-world problems. CODE outperforms other state-of-the-art interpretable
designs while matching the performance of popular but uninterpretable designs,
such as upper confidence bound algorithms.
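To make the "explore only among plausible actions" idea concrete, here is a minimal multi-armed sketch in the spirit of CODE; it substitutes a simple widest-confidence-interval rule for the paper's constrained optimal design, and the name code_style_step and parameter delta are hypothetical.

\begin{verbatim}
import numpy as np

def code_style_step(means, counts, delta=0.05):
    """Choose an arm CODE-style: restrict to plausible arms, then explore.

    means: numpy array of empirical mean rewards; counts: pulls per arm.
    Every exploratory pull has an explicit rationale: it shrinks the widest
    confidence interval among arms that could still be optimal.
    """
    counts = np.maximum(counts, 1)
    radius = np.sqrt(2.0 * np.log(1.0 / delta) / counts)  # Hoeffding-style width
    ucb, lcb = means + radius, means - radius
    plausible = np.where(ucb >= lcb.max())[0]  # CIs overlapping the leader's
    return plausible[np.argmax(radius[plausible])]
\end{verbatim}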
Risk-Aware Linear Bandits: Theory and Applications in Smart Order Routing
Motivated by practical considerations in machine learning for financial
decision-making, such as risk-aversion and large action space, we initiate the
study of risk-aware linear bandits. Specifically, we consider regret
minimization under the mean-variance measure when facing a set of actions whose
rewards can be expressed as linear functions of (initially) unknown parameters.
Driven by the variance-minimizing G-optimal design, we propose the Risk-Aware
Explore-then-Commit (RISE) algorithm and the Risk-Aware Successive Elimination
(RISE++) algorithm. Then, we rigorously analyze their regret upper bounds to
show that, by leveraging the linear structure, the algorithms can dramatically
reduce the regret when compared to existing methods. Finally, we demonstrate
the performance of the algorithms by conducting extensive numerical experiments
in a synthetic smart order routing setup. Our results show that both RISE and
RISE++ can outperform the competing methods, especially in complex
decision-making scenarios.
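A skeletal explore-then-commit loop in the spirit of RISE might look as follows; the round-robin exploration stands in for the paper's G-optimal design, and the horizon split T^{2/3}, risk-aversion weight rho, and the names rise_style_etc and pull are illustrative assumptions.

\begin{verbatim}
import numpy as np

def rise_style_etc(actions, pull, T, rho=1.0, lam=1.0):
    """Explore-then-commit under a mean-variance objective (sketch).

    actions: list of numpy feature vectors; pull(x) returns a noisy reward.
    Explore round-robin (a stand-in for the G-optimal design), fit a ridge
    estimate of the reward parameter, then commit to the action with the
    best empirical mean-variance score.
    """
    d = len(actions[0])
    n_explore = max(len(actions), int(T ** (2.0 / 3.0)))
    V, b = lam * np.eye(d), np.zeros(d)
    samples = [[] for _ in actions]
    for t in range(n_explore):
        i = t % len(actions)          # uniform exploration (simplification)
        r = pull(actions[i])
        samples[i].append(r)
        V += np.outer(actions[i], actions[i])
        b += r * actions[i]
    theta = np.linalg.solve(V, b)
    def mv_score(i):
        # Estimated mean reward penalized by the empirical reward variance.
        return float(actions[i] @ theta) - rho * float(np.var(samples[i]))
    best = max(range(len(actions)), key=mv_score)
    for _ in range(T - n_explore):
        pull(actions[best])           # commit phase
    return best
\end{verbatim}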
Learning to Price Supply Chain Contracts against a Learning Retailer
The rise of big data analytics has automated the decision-making of companies
and increased supply chain agility. In this paper, we study the supply chain
contract design problem faced by a data-driven supplier who needs to respond to
the inventory decisions of the downstream retailer. Both the supplier and the
retailer are uncertain about the market demand and need to learn about it
sequentially. The goal for the supplier is to develop data-driven pricing
policies with sublinear regret bounds under a wide range of possible retailer
inventory policies for a fixed time horizon.
To capture the dynamics induced by the retailer's learning policy, we first
make a connection to non-stationary online learning by following the notion of
variation budget. The variation budget quantifies the impact of the retailer's
learning strategy on the supplier's decision-making. We then propose dynamic
pricing policies for the supplier for both discrete and continuous demand. We
also note that our proposed pricing policy requires access only to the support
of the demand distribution and, critically, does not require the supplier to
have any prior knowledge of the retailer's learning policy or the demand
realizations. We examine several well-known data-driven policies for the
retailer, including sample average approximation, distributionally robust
optimization, and parametric approaches, and show that our pricing policies
lead to sublinear regret bounds in all these cases.
At the managerial level, we answer affirmatively that the supplier can indeed
attain a sublinear regret bound under a wide range of retailer learning
policies, even though she faces a learning retailer and an unknown demand
distribution. Our work also provides a novel perspective on data-driven
operations management, where the principal has to learn to react to the
learning policies employed by other agents in the system.
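While the paper's policies are tailored to the contract structure, the core non-stationary mechanism can be illustrated with an epoch-restarted adversarial bandit over a discrete price grid; epoch_len, eta, and the revenue oracle below are hypothetical placeholders, and rewards are assumed normalized to [0, 1].

\begin{verbatim}
import numpy as np

def epoch_exp3_pricing(prices, revenue, T, epoch_len=500, eta=0.05, seed=0):
    """Epoch-restarted EXP3 over a discrete price grid (sketch).

    revenue(p) returns the supplier's realized revenue at wholesale price p,
    which drifts as the retailer's learned inventory policy evolves.
    Restarting every epoch keeps regret sublinear when that drift has
    bounded variation.
    """
    rng = np.random.default_rng(seed)
    K = len(prices)
    total = 0.0
    for start in range(0, T, epoch_len):
        w = np.ones(K)                        # fresh weights each epoch
        for _ in range(min(epoch_len, T - start)):
            p = w / w.sum()
            i = rng.choice(K, p=p)
            r = revenue(prices[i])            # assumed to lie in [0, 1]
            total += r
            w[i] *= np.exp(eta * r / p[i])    # importance-weighted update
            w /= w.max()                      # keep weights numerically stable
    return total
\end{verbatim}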
Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism
We consider undiscounted reinforcement learning (RL) in Markov decision
processes (MDPs) under drifting non-stationarity, i.e., both the reward and
state transition distributions are allowed to evolve over time, as long as
their respective total variations, quantified by suitable metrics, do not
exceed certain variation budgets. We first develop the Sliding Window
Upper-Confidence bound for Reinforcement Learning with Confidence Widening
(SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the
variation budgets are known. In addition, we propose the
Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the
SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a
parameter-free manner, i.e., without knowing the variation budgets. Notably,
learning non-stationary MDPs via the conventional optimistic exploration
technique presents a unique challenge absent in existing (non-stationary)
bandit learning settings. We overcome the challenge by a novel confidence
widening technique that incorporates additional optimism.
Comment: To appear in the proceedings of the 37th International Conference on
Machine Learning. Shortened conference version of the journal article
(available at arXiv:1906.02922).
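The confidence-widening idea itself is small enough to sketch: enlarge the usual concentration radius of the windowed transition estimates by an additive term, so the optimistic MDP remains feasible under drift. The function name, the Hoeffding-style base radius, and the widening parameter eta below are illustrative, not the paper's exact constants.

\begin{verbatim}
import numpy as np

def widened_radius(counts, window_len, delta=0.05, eta=0.1):
    """Widened confidence radius for windowed transition estimates (sketch).

    counts: visit counts of each (state, action) pair inside the sliding
    window. The standard radius is enlarged by eta, injecting the extra
    optimism that keeps the optimistic MDP valid while transitions drift.
    """
    n = np.maximum(counts, 1)
    base = np.sqrt(2.0 * np.log(window_len / delta) / n)  # standard radius
    return base + eta                                     # extra optimism
\end{verbatim}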
- …