Incorporating Behavioral Constraints in Online AI Systems
AI systems that learn through reward feedback about the actions they take are
increasingly deployed in domains that have significant impact on our daily
life. However, in many cases the online rewards should not be the only guiding
criterion, as there are additional constraints and/or priorities imposed by
regulations, values, preferences, or ethical principles. We detail a novel
online agent that learns a set of behavioral constraints by observation and
uses these learned constraints as a guide when making decisions in an online
setting while still being reactive to reward feedback. To define this agent, we
propose to adopt a novel extension to the classical contextual multi-armed
bandit setting and we provide a new algorithm called Behavior Constrained
Thompson Sampling (BCTS) that allows for online learning while obeying
exogenous constraints. Our agent learns a constrained policy that implements
the observed behavioral constraints demonstrated by a teacher agent, and then
uses this constrained policy to guide the reward-based online exploration and
exploitation. We characterize the upper bound on the expected regret of the
contextual bandit algorithm that underlies our agent and provide a case study
with real-world data in two application domains. Our experiments show that the
designed agent is able to act within the set of behavior constraints without
significantly degrading its overall reward performance.
Comment: 9 pages, 6 figures
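The abstract above describes an agent that first learns constraints from a teacher and then lets them guide reward-driven Thompson sampling. The following is a minimal sketch of that general idea only, not the paper's BCTS algorithm. It assumes discrete contexts, Bernoulli outcomes, and a fixed blending weight; the class names, the `blend` parameter, and the encoding of teacher demonstrations as 0/1 outcomes are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


class BetaPosterior:
    """Beta(1, 1) prior over a Bernoulli success probability."""

    def __init__(self):
        self.alpha = 1.0
        self.beta = 1.0

    def update(self, outcome):
        # outcome is 1.0 for success and 0.0 for failure
        self.alpha += outcome
        self.beta += 1.0 - outcome

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)


class ConstraintGuidedSampler:
    """Hypothetical agent blending a teacher-derived model with online rewards."""

    def __init__(self, n_arms, blend=0.5, seed=0):
        self.n_arms = n_arms
        self.blend = blend  # assumed weight on the teacher-derived model
        self.rng = random.Random(seed)
        # One posterior per (context, arm) pair, created on first use.
        self.teacher = defaultdict(BetaPosterior)
        self.online = defaultdict(BetaPosterior)

    def observe_teacher(self, context, teacher_arm):
        # Assumed encoding: the demonstrated arm counts as a success,
        # every other arm as a failure, in this context.
        for arm in range(self.n_arms):
            outcome = 1.0 if arm == teacher_arm else 0.0
            self.teacher[(context, arm)].update(outcome)

    def choose(self, context):
        # Thompson step: draw from both posteriors and blend the draws.
        def score(arm):
            t = self.teacher[(context, arm)].sample(self.rng)
            r = self.online[(context, arm)].sample(self.rng)
            return self.blend * t + (1.0 - self.blend) * r

        return max(range(self.n_arms), key=score)

    def learn(self, context, arm, reward):
        # Online phase: only the reward model is updated.
        self.online[(context, arm)].update(reward)
```

With `blend` close to 1 the agent mostly imitates the teacher, and with `blend` close to 0 it reduces to ordinary Thompson sampling on the observed rewards.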
Maximizing Success Rate of Payment Routing using Non-stationary Bandits
This paper discusses the system architecture design and deployment of
non-stationary multi-armed bandit approaches to determine a near-optimal
payment routing policy based on the recent history of transactions. We propose
a Routing Service architecture using a novel Ray-based implementation for
optimally scaling bandit-based payment routing to over 10,000 transactions per
second while adhering to the system design requirements and the ecosystem
constraints of the Payment Card Industry Data Security Standard (PCI DSS). We first evaluate
the effectiveness of multiple bandit-based payment routing algorithms on a
custom simulator to benchmark multiple non-stationary bandit approaches and
identify the best hyperparameters. We then conducted live experiments on the
payment transaction system of the fantasy sports platform Dream11. In these
experiments, our non-stationary bandit-based algorithm consistently improved
the transaction success rate by 0.92% over traditional rule-based methods
during one month.
Comment: 7 pages, 6 figures
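The abstract above does not say which non-stationary bandit the authors deployed. As one common option, the sketch below uses discounted Thompson sampling, where old outcomes decay so that the router tracks recent gateway behavior. The class name, the `discount` value, and the gateway labels are hypothetical, and this single-process example says nothing about the Ray-based architecture the abstract mentions.

```python
import random


class DiscountedThompsonRouter:
    """Hypothetical router: Beta-Bernoulli Thompson sampling with decay."""

    def __init__(self, gateways, discount=0.99, seed=0):
        self.gateways = list(gateways)
        self.discount = discount  # assumed forgetting factor in (0, 1]
        self.rng = random.Random(seed)
        self.successes = {g: 0.0 for g in self.gateways}
        self.failures = {g: 0.0 for g in self.gateways}

    def route(self):
        # Draw a plausible success rate per gateway and pick the best draw.
        def draw(gateway):
            return self.rng.betavariate(
                1.0 + self.successes[gateway],
                1.0 + self.failures[gateway],
            )

        return max(self.gateways, key=draw)

    def record(self, gateway, succeeded):
        # Decay all counts first so that recent transactions dominate.
        for g in self.gateways:
            self.successes[g] *= self.discount
            self.failures[g] *= self.discount
        if succeeded:
            self.successes[gateway] += 1.0
        else:
            self.failures[gateway] += 1.0
```

A typical loop would call `route()` for each transaction and then `record()` with the observed outcome. A smaller `discount` forgets faster, which helps when gateway success rates shift quickly but makes the estimates noisier.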