Trading off rewards and errors in multi-armed bandits
In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm, whose performance is provably close to that of the best possible tradeoff strategy. Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.
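To make the tradeoff concrete, here is a minimal Python sketch of a forcing-style policy in the spirit described above; the sqrt(t) forcing schedule, the weight w, and the score combining empirical means with confidence widths are illustrative assumptions, not the paper's exact ForcingBalance algorithm.

    import numpy as np

    def forcing_tradeoff_policy(means, horizon, w=0.5, seed=0):
        # Sketch of a reward/error tradeoff: any arm sampled fewer than
        # sqrt(t) times is forced (keeps every estimate accurate); otherwise
        # play a score that weighs reward (empirical mean) against the
        # remaining estimation uncertainty. w is a hypothetical knob.
        rng = np.random.default_rng(seed)
        K = len(means)
        counts, sums = np.zeros(K), np.zeros(K)
        for t in range(1, horizon + 1):
            under = np.flatnonzero(counts < np.sqrt(t))
            if under.size > 0:
                arm = int(under[0])          # forced exploration
            else:
                est = sums / counts
                width = np.sqrt(np.log(t) / counts)
                arm = int(np.argmax(w * est + (1 - w) * width))
            counts[arm] += 1
            sums[arm] += rng.normal(means[arm], 1.0)
        return sums / counts, counts         # estimates and allocations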
Hedging using reinforcement learning: Contextual k-Armed Bandit versus Q-learning
The construction of replication strategies for contingent claims in the presence of risk and market friction is a key problem of financial engineering. In real markets, continuous replication, such as in the model of Black, Scholes and Merton, is not only unrealistic but also undesirable due to high transaction costs. Over the last decades, stochastic optimal-control methods have been developed to balance effective replication against losses. More recently, with the rise of artificial intelligence, temporal-difference Reinforcement Learning, in particular variations of Q-learning in conjunction with Deep Neural Networks, has attracted significant interest. From a practical point of view, however, such methods are often relatively sample inefficient, hard to train, and lack performance guarantees. This motivates the investigation of a stable benchmark algorithm for hedging. In this article, the hedging problem is viewed as an instance of a risk-averse contextual k-armed bandit problem, for which a large body of theoretical results and well-studied algorithms are available. We find that the k-armed bandit model naturally fits the formulation of hedging, providing a more accurate and more sample-efficient approach than Q-learning, and that it reduces to the Black-Scholes model in the absence of transaction costs and risks.
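As a rough illustration of the bandit view of hedging, the following Python sketch treats discretized hedge ratios as arms and scores them with a mean-variance (risk-averse) criterion; the toy price dynamics, the transaction-cost term, and the epsilon-greedy learner are assumptions for illustration, not the authors' algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    hedge_ratios = np.linspace(0.0, 1.0, 11)   # arms: candidate deltas
    lam, tc = 1.0, 0.005                       # risk aversion, cost (assumed)

    def risk_averse_reward(delta, n=256):
        # One-period hedged P&L under toy dynamics: short a call-like
        # payoff, long `delta` units of the underlying, pay a linear
        # transaction cost. Reward = mean - lam * std (mean-variance).
        ds = rng.normal(0.0, 0.02, size=n)     # underlying returns
        pnl = delta * ds - np.maximum(ds, 0.0) - tc * abs(delta)
        return pnl.mean() - lam * pnl.std()

    est, counts = np.zeros(11), np.zeros(11)
    for t in range(2000):                      # epsilon-greedy stand-in
        a = rng.integers(11) if rng.random() < 0.1 else int(np.argmax(est))
        counts[a] += 1
        est[a] += (risk_averse_reward(hedge_ratios[a]) - est[a]) / counts[a]
    print("preferred hedge ratio:", hedge_ratios[int(np.argmax(est))])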
Rewards and errors in multi-arm bandits for interactive education
Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk
We study the trade-off between expectation and tail risk for the regret distribution in the stochastic multi-armed bandit problem. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of the expected regret exactly affects the decay rate of the regret tail probability for both the worst-case and the instance-dependent scenario. A novel policy is proposed to characterize the optimal regret tail probability for any regret threshold. Concretely, for any given $\alpha$ and $\beta$, our policy achieves a worst-case expected regret of $\tilde{O}(T^{\alpha})$ (we call it $\alpha$-optimal) and an instance-dependent expected regret of $\tilde{O}(T^{\beta})$ (we call it $\beta$-consistent), while enjoying a probability of incurring an $\tilde{O}(T^{\delta})$ regret ($\delta \geq \alpha$ in the worst-case scenario and $\delta \geq \beta$ in the instance-dependent scenario) that decays exponentially in a polynomial of $T$. This decay rate is proved to be the best achievable. Moreover, we discover an intrinsic gap in the optimal tail rate under the instance-dependent scenario between the cases where the time horizon $T$ is known a priori and where it is not. Interestingly, in the worst-case scenario this gap disappears. Finally, we extend our proposed policy design to (1) a stochastic multi-armed bandit setting with non-stationary baseline rewards, and (2) a stochastic linear bandit setting. Our results reveal insights on the trade-off between regret expectation and regret tail risk for both worst-case and instance-dependent scenarios, indicating that more sub-optimality and inconsistency leave space for lighter-tailed risk of incurring a large regret, and that knowing the planning horizon in advance can make a difference in alleviating tail risks.
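The expectation/tail tradeoff can also be seen empirically; the epsilon-greedy policy and two-armed Gaussian instance in the Python sketch below are illustrative stand-ins (not the paper's proposed policy), used only to estimate a regret tail probability by simulation.

    import numpy as np

    def run_regret(horizon, eps, gap=0.1, seed=0):
        # epsilon-greedy on a two-armed Gaussian instance; returns the
        # realized (pseudo-)regret for one sample path.
        rng = np.random.default_rng(seed)
        means = np.array([0.5, 0.5 - gap])
        est, counts, regret = np.zeros(2), np.zeros(2), 0.0
        for _ in range(horizon):
            a = rng.integers(2) if rng.random() < eps else int(np.argmax(est))
            counts[a] += 1
            est[a] += (rng.normal(means[a], 1.0) - est[a]) / counts[a]
            regret += means[0] - means[a]
        return regret

    T = 2000
    samples = np.array([run_regret(T, eps=0.05, seed=s) for s in range(200)])
    # More exploration raises expected regret but can lighten the tail,
    # echoing the expectation-vs-tail-risk tradeoff studied above.
    print("mean regret:", samples.mean(),
          "| P(regret > 2*mean):", (samples > 2 * samples.mean()).mean())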