Incorporating Behavioral Constraints in Online AI Systems
AI systems that learn through reward feedback about the actions they take are
increasingly deployed in domains that have a significant impact on our daily
lives. However, in many cases the online rewards should not be the only guiding
criterion, as there are additional constraints and/or priorities imposed by
regulations, values, preferences, or ethical principles. We detail a novel
online agent that learns a set of behavioral constraints by observation and
uses these learned constraints as a guide when making decisions in an online
setting while still being reactive to reward feedback. To define this agent, we
propose a novel extension of the classical contextual multi-armed
bandit setting and we provide a new algorithm called Behavior Constrained
Thompson Sampling (BCTS) that allows for online learning while obeying
exogenous constraints. Our agent learns a constrained policy that implements
the observed behavioral constraints demonstrated by a teacher agent, and then
uses this constrained policy to guide the reward-based online exploration and
exploitation. We characterize the upper bound on the expected regret of the
contextual bandit algorithm that underlies our agent and provide a case study
with real world data in two application domains. Our experiments show that the
designed agent is able to act within the set of behavior constraints without
significantly degrading its overall reward performance.
Comment: 9 pages, 6 figures
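To make the idea of reward-driven exploration under exogenous constraints concrete, here is a minimal sketch. It is not the BCTS algorithm from the abstract: it uses plain (non-contextual) Beta-Bernoulli Thompson sampling, and the constraint is a hand-supplied set of permitted arms rather than a policy learned from a teacher. The function name `constrained_thompson` and the parameters `true_means` and `allowed` are hypothetical, chosen only for illustration.

```python
import random


def constrained_thompson(true_means, allowed, rounds, seed=0):
    """Beta-Bernoulli Thompson sampling restricted to a fixed set of arms.

    Assumption: `allowed` is given up front. In the abstract the constraints
    are learned by observing a teacher; that step is not modeled here.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    wins = [1] * n_arms    # Beta prior alpha = 1 for every arm
    losses = [1] * n_arms  # Beta prior beta = 1 for every arm
    pulls = []
    for _ in range(rounds):
        # Sample a plausible mean for each permitted arm from its posterior.
        samples = {a: rng.betavariate(wins[a], losses[a]) for a in allowed}
        arm = max(samples, key=samples.get)
        # Simulated Bernoulli reward feedback for the chosen arm.
        reward = 1 if rng.random() < true_means[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls.append(arm)
    return pulls, wins, losses


if __name__ == "__main__":
    pulls, wins, losses = constrained_thompson([0.9, 0.2, 0.6], [1, 2], 200)
    print("pulls per arm:", [pulls.count(a) for a in range(3)])
```

In this toy setup arm 0 has the highest mean but is never pulled, because it lies outside the permitted set; the agent still adapts to reward feedback among the arms it is allowed to use.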
The Assistive Multi-Armed Bandit
Learning preferences implicit in the choices humans make is a well studied
problem in both economics and computer science. However, most work makes the
assumption that humans are acting (noisily) optimally with respect to their
preferences. Such approaches can fail when people are themselves learning about
what they want. In this work, we introduce the assistive multi-armed bandit,
where a robot assists a human playing a bandit task to maximize cumulative
reward. In this problem, the human does not know the reward function but can
learn it through the rewards received from arm pulls; the robot only observes
which arms the human pulls but not the reward associated with each pull. We
offer sufficient and necessary conditions for successfully assisting the human
in this framework. Surprisingly, better human performance in isolation does not
necessarily lead to better performance when assisted by the robot: a human
policy can do better by effectively communicating its observed rewards to the
robot. We conduct proof-of-concept experiments that support these results. We
see this work as contributing towards a theory behind algorithms for
human-robot interaction.
Comment: Accepted to HRI 201
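The claim that a human policy can communicate its observed rewards through its choices alone can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's model or experiments: the human is assumed to follow a deterministic win-stay, lose-shift rule on Bernoulli arms, and the observer sees only the sequence of pulls. The names `wsls_human` and `decode_rewards` are hypothetical.

```python
import random


def wsls_human(true_means, rounds, seed=0):
    """Simulate a win-stay, lose-shift human on Bernoulli arms (needs >= 2 arms)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    arm = 0
    pulls, rewards = [], []
    for _ in range(rounds):
        reward = 1 if rng.random() < true_means[arm] else 0
        pulls.append(arm)
        rewards.append(reward)
        if reward == 0:
            arm = (arm + 1) % n_arms  # lose: shift to the next arm
        # win: stay on the same arm
    return pulls, rewards


def decode_rewards(pulls):
    """Recover rewards from pulls alone: a repeated arm means the pull paid off."""
    return [1 if nxt == cur else 0 for cur, nxt in zip(pulls, pulls[1:])]


if __name__ == "__main__":
    pulls, rewards = wsls_human([0.3, 0.8], rounds=20)
    # Every reward except the last is recoverable from the pull sequence.
    print(decode_rewards(pulls) == rewards[:-1])
```

Because this assumed human policy maps each reward to a visible stay-or-shift decision, an observer who never sees rewards can still reconstruct all of them except the final one.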