Maximizing Success Rate of Payment Routing using Non-stationary Bandits
This paper discusses the system architecture design and deployment of
non-stationary multi-armed bandit approaches to determine a near-optimal
payment routing policy based on the recent history of transactions. We propose
a Routing Service architecture using a novel Ray-based implementation for
scaling bandit-based payment routing to over 10,000 transactions per second
while adhering to system design requirements and the ecosystem constraints
imposed by the Payment Card Industry Data Security Standard (PCI DSS). We first evaluate
the effectiveness of multiple bandit-based payment routing algorithms on a
custom simulator to benchmark multiple non-stationary bandit approaches and
identify the best hyperparameters. We then conduct live experiments on the
payment transaction system of Dream11, a fantasy sports platform. In the live
experiments, our non-stationary bandit-based algorithm consistently improves
the transaction success rate by 0.92% compared to traditional rule-based
methods over one month.
Comment: 7 pages, 6 figures
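As a rough illustration of routing with a non-stationary bandit, the sketch below treats each payment gateway as an arm and uses a sliding-window UCB rule, so that only recent transactions influence the choice. The abstract does not say which algorithm was deployed, so the choice of sliding-window UCB, the class name `SlidingWindowUCB`, the `simulate` helper, and the window size, exploration constant, and success probabilities are all assumptions made for this example.

```python
import math
import random
from collections import deque


class SlidingWindowUCB:
    """Sliding-window UCB over payment gateways (arms). Hypothetical sketch."""

    def __init__(self, n_arms, window=500, c=1.0):
        self.n_arms = n_arms
        self.window = window          # number of recent transactions kept
        self.c = c                    # exploration constant (assumed value)
        self.history = deque()        # (arm, reward) pairs, oldest first
        self.counts = [0] * n_arms    # pulls per arm inside the window
        self.sums = [0.0] * n_arms    # successes per arm inside the window

    def select(self):
        # Try any arm that has no observation in the current window.
        for arm in range(self.n_arms):
            if self.counts[arm] == 0:
                return arm
        t = len(self.history)

        def index(arm):
            mean = self.sums[arm] / self.counts[arm]
            bonus = self.c * math.sqrt(math.log(t) / self.counts[arm])
            return mean + bonus

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.history.append((arm, reward))
        self.counts[arm] += 1
        self.sums[arm] += reward
        # Forget the oldest transaction once the window is full.
        if len(self.history) > self.window:
            old_arm, old_reward = self.history.popleft()
            self.counts[old_arm] -= 1
            self.sums[old_arm] -= old_reward


def simulate(steps=4000, seed=0):
    """Two simulated gateways whose success rates swap halfway through."""
    rng = random.Random(seed)
    bandit = SlidingWindowUCB(n_arms=2, window=300, c=0.5)
    late_picks = [0, 0]  # routing choices over the last 500 transactions
    for t in range(steps):
        probs = (0.9, 0.6) if t < steps // 2 else (0.6, 0.9)
        arm = bandit.select()
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        bandit.update(arm, reward)
        if t >= steps - 500:
            late_picks[arm] += 1
    return late_picks
```

Calling `simulate()` returns how often each simulated gateway was chosen over the final 500 transactions. Because the window discards old outcomes, the router should end up favouring the gateway that became better after the swap.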
Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits
We introduce GLR-klUCB, a novel algorithm for the piecewise i.i.d. non-stationary bandit problem with bounded rewards. This algorithm combines an efficient bandit algorithm, kl-UCB, with an efficient, parameter-free change-point detector, the Bernoulli Generalized Likelihood Ratio Test, for which we provide new theoretical guarantees of independent interest. Unlike previous non-stationary bandit algorithms using a change-point detector, GLR-klUCB does not need to be calibrated based on prior knowledge of the arms' means. We prove that this algorithm can attain a $O(\sqrt{TA\Upsilon_T\log(T)})$ regret in $T$ rounds on some "easy" instances, where $A$ is the number of arms and $\Upsilon_T$ the number of change-points, without prior knowledge of $\Upsilon_T$. In contrast with recently proposed algorithms that are agnostic to $\Upsilon_T$, we perform a numerical study showing that GLR-klUCB is also very efficient in practice, beyond easy instances.
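The change-point test named above can be sketched as follows. For a sequence of rewards in [0, 1], the Bernoulli GLR statistic compares, for every split point, the empirical means before and after the split with the overall mean using the Bernoulli Kullback-Leibler divergence, and flags a change when the largest value crosses a threshold. The function names `kl_bernoulli` and `glr_detects_change`, the default `delta`, and the simplified threshold `log(3 * n**1.5 / delta)` are assumptions for this example, not necessarily the calibration analysed in the paper.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def glr_detects_change(xs, delta=0.01):
    """Return True if the Bernoulli GLR statistic on xs crosses the threshold.

    xs: sequence of rewards in [0, 1].
    delta: target false-alarm level (assumed default).
    """
    n = len(xs)
    if n < 2:
        return False
    total = sum(xs)
    mean_all = total / n
    # Simplified threshold, assumed here for illustration only.
    threshold = math.log(3 * n ** 1.5 / delta)
    left = 0.0
    for s in range(1, n):
        left += xs[s - 1]
        mean_left = left / s
        mean_right = (total - left) / (n - s)
        stat = (s * kl_bernoulli(mean_left, mean_all)
                + (n - s) * kl_bernoulli(mean_right, mean_all))
        if stat >= threshold:
            return True
    return False
```

For instance, `glr_detects_change([0] * 100 + [1] * 100)` returns `True`, while a constant sequence such as `[1] * 200` returns `False`. In a bandit setting, one plausible use is to run such a test on each arm's rewards and reset that arm's statistics when it fires.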
Zeroth-order non-convex learning via hierarchical dual averaging
We propose a hierarchical version of dual averaging for zeroth-order online non-convex optimization, i.e., learning processes where, at each stage, the optimizer is facing an unknown non-convex loss function and only receives the incurred loss as feedback. The proposed class of policies relies on the construction of an online model that aggregates loss information as it arrives, and it consists of two principal components: (a) a regularizer adapted to the Fisher information metric (as opposed to the metric norm of the ambient space); and (b) a principled exploration of the problem's state space based on an adapted hierarchical schedule. This construction enables sharper control of the model's bias and variance, and allows us to derive tight bounds for both the learner's static and dynamic regret, i.e., the regret incurred against the best dynamic policy in hindsight over the horizon of play.