Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
We derive sublinear regret bounds for undiscounted reinforcement learning in
continuous state space. The proposed algorithm combines state aggregation with
the use of upper confidence bounds for implementing optimism in the face of
uncertainty. Beside the existence of an optimal policy which satisfies the
Poisson equation, the only assumptions made are Holder continuity of rewards
and transition probabilities
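The two ingredients this abstract combines can be illustrated with a small sketch (our own illustration, not the paper's algorithm; the bin count, confidence level, and Hoeffding-style bonus are assumptions made for the example): continuous states in [0, 1] are aggregated into equal-width intervals, and each interval's empirical mean reward receives an optimistic upper confidence bound, with unvisited intervals kept maximally optimistic.

```python
import math
import numpy as np

def optimistic_reward_estimates(states, rewards, n_bins=10, conf=0.95):
    """Illustrative sketch of state aggregation + upper confidence bounds
    (not the paper's full algorithm): aggregate continuous states in [0, 1]
    into equal-width intervals, then attach a Hoeffding-style upper
    confidence bound to each interval's empirical mean reward. Unvisited
    intervals get an infinite bound ("optimism in the face of uncertainty")."""
    counts = np.zeros(n_bins)
    sums = np.zeros(n_bins)
    for s, r in zip(states, rewards):
        b = min(int(s * n_bins), n_bins - 1)   # aggregation: state -> interval
        counts[b] += 1
        sums[b] += r
    ucb = np.full(n_bins, np.inf)              # optimistic default for unseen bins
    visited = counts > 0
    ucb[visited] = (sums[visited] / counts[visited]
                    + np.sqrt(math.log(1 / (1 - conf)) / (2 * counts[visited])))
    return ucb
```

An optimistic policy would then prefer actions leading to intervals with the highest bound, which drives exploration of rarely visited regions of the state space.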
The socioeconomic dynamics of the shifta conflict in Kenya, c. 1963-8
Using a set of oral testimonies, together with military, intelligence, and administrative reports from the 1960s, this article re-examines the shifta conflict in Kenya. The article moves away from mono-causal, nationalistic interpretations of the event, to focus instead on the underlying socioeconomic dynamics and domestic implications of the conflict. It argues that the nationalist interpretation fails to capture the diversity of participation in shifta, which was not simply made up of militant Somali nationalists, and that it fails to acknowledge the significance of an internal Kenyan conflict between a newly independent state in the process of nation building and a group of ‘dissident’ frontier communities that were seen to defy the new order. Examination of this conflict provides insights into the operation of the early postcolonial Kenyan state.
Funded by the Arts and Humanities Research Council, the Royal Historical Society, and the Martin Lynn Scholarship.
Rotting bandits are not harder than stochastic ones
In stochastic multi-armed bandits, the reward distribution of each arm is
assumed to be stationary. This assumption is often violated in practice (e.g.,
in recommendation systems), where the reward of an arm may change whenever it
is selected, i.e., the rested bandit setting. In this paper, we consider the
non-parametric rotting bandit setting, where rewards can only decrease. We
introduce the filtering on expanding window average (FEWA) algorithm that
constructs moving averages of increasing windows to identify arms that are more
likely to return high rewards when pulled once more. We prove that for an
unknown horizon $T$, and without any knowledge on the decreasing behavior of
the $K$ arms, FEWA achieves a problem-dependent regret bound of
$\widetilde{\mathcal{O}}(\log(KT))$ and a problem-independent one of
$\widetilde{\mathcal{O}}(\sqrt{KT})$. Our result substantially improves over
the algorithm of Levine et al. (2017), which suffers regret
$\widetilde{\mathcal{O}}(K^{1/3} T^{2/3})$. FEWA also matches known bounds for
the stochastic bandit setting, thus showing that the rotting bandits are not
harder. Finally, we report simulations confirming the theoretical improvements
of FEWA.
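The expanding-window idea can be sketched as follows (a simplified illustration of the filtering principle, not the paper's exact FEWA procedure; the doubling window schedule and Hoeffding-style confidence width are our assumptions for the example): for growing windows, compare the average of each arm's most recent rewards and discard arms that fall significantly below the best.

```python
import numpy as np

def fewa_filter(rewards_per_arm, delta=0.05):
    """One FEWA-style filtering pass (illustrative sketch): compare moving
    averages of the h most recent rewards of each arm over expanding windows
    h = 1, 2, 4, ..., keeping only arms whose windowed average stays within
    a confidence margin of the best. Because rewards in the rotting setting
    can only decrease, the most recent pulls are the most informative.
    `rewards_per_arm` maps arm index -> list of observed rewards."""
    active = set(rewards_per_arm)
    n_min = min(len(r) for r in rewards_per_arm.values())
    h = 1
    while h <= n_min and len(active) > 1:
        # average of the h most recent pulls of each still-active arm
        means = {a: np.mean(rewards_per_arm[a][-h:]) for a in active}
        margin = np.sqrt(2 * np.log(1 / delta) / h)   # Hoeffding-style width
        best = max(means.values())
        active = {a for a in active if means[a] >= best - 2 * margin}
        h *= 2
    return active
```

With few samples the margin is wide and every arm survives; as the window grows, clearly inferior arms are filtered out.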
Upper-Confidence Bound for Channel Selection in LPWA Networks with Retransmissions
In this paper, we propose and evaluate different learning strategies based on
Multi-Armed Bandit (MAB) algorithms. They allow Internet of Things (IoT) devices
to improve their access to the network and their autonomy, while taking into
account the impact of encountered radio collisions. To that end, several
heuristics employing Upper-Confidence Bound (UCB) algorithms are examined, to
explore the contextual information provided by the number of retransmissions.
Our results show that approaches based on UCB obtain a significant improvement
in terms of successful transmission probabilities. Furthermore, they also reveal
that a pure UCB channel access is as efficient as more sophisticated learning
strategies.
Comment: The source code (MATLAB or Octave) used for the simulations and the
figures is open-sourced under the MIT License, at Bitbucket.org/scee_ietr/ucb_smart_retran
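A pure UCB channel access of the kind the paper finds competitive can be sketched in a few lines (an illustration of standard UCB1 applied to channel selection, not the paper's retransmission heuristics; the channel success probabilities and horizon below are assumptions for the example):

```python
import math
import random

def ucb1_channel_access(n_channels, success_prob, horizon, seed=0):
    """Minimal UCB1 sketch for IoT channel selection: each round, pick the
    channel with the highest upper confidence bound on its empirical
    transmission-success rate, then observe success/failure.
    `success_prob[c]` is the hidden success probability of channel c."""
    rng = random.Random(seed)
    pulls = [0] * n_channels
    successes = [0] * n_channels
    for t in range(1, horizon + 1):
        if t <= n_channels:            # initialization: try each channel once
            c = t - 1
        else:                          # UCB1 index: mean + exploration bonus
            c = max(range(n_channels),
                    key=lambda i: successes[i] / pulls[i]
                                  + math.sqrt(2 * math.log(t) / pulls[i]))
        pulls[c] += 1
        successes[c] += rng.random() < success_prob[c]
    return pulls   # pull counts: the best channel should dominate over time
```

The exploration bonus shrinks as a channel is used more, so the device converges on the channel with the fewest collisions while still occasionally probing the others.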
Best-Arm Identification in Linear Bandits
We study the best-arm identification problem in linear bandits, where the
rewards of the arms depend linearly on an unknown parameter $\theta^*$ and the
objective is to return the arm with the largest reward. We characterize the
complexity of the problem and introduce sample allocation strategies that pull
arms to identify the best arm with a fixed confidence, while minimizing the
sample budget. In particular, we show the importance of exploiting the global
linear structure to improve the estimate of the reward of near-optimal arms. We
analyze the proposed strategies and compare their empirical performance.
Finally, as a by-product of our analysis, we point out the connection to the
$G$-optimality criterion used in optimal experimental design.
Comment: In Advances in Neural Information Processing Systems 27 (NIPS), 2014
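The linear-bandit ingredient this line of work builds on can be sketched as follows (generic regularized least squares, not the paper's allocation strategies; the arm features, parameter, and regularization constant are assumptions for the example): from the feature vectors of pulled arms and their noisy rewards, estimate the unknown parameter and predict which arm is best.

```python
import numpy as np

def ls_estimate_and_best_arm(arms, pulls, rewards, reg=1e-6):
    """Sketch of best-arm prediction in a linear bandit: rewards follow
    r = x^T theta* + noise, so we form the regularized least-squares
    estimate of theta* from the pulled arms and return the index of the
    arm with the largest predicted reward x^T theta_hat.
    `arms` lists each arm's feature vector; `pulls` lists pulled arm indices."""
    X = np.array([arms[i] for i in pulls])   # design matrix of pulled arms
    y = np.array(rewards)
    d = X.shape[1]
    A = X.T @ X + reg * np.eye(d)            # regularized Gram matrix
    theta_hat = np.linalg.solve(A, X.T @ y)  # theta_hat = A^{-1} X^T y
    preds = np.array(arms) @ theta_hat       # predicted reward of every arm
    return int(np.argmax(preds)), theta_hat
```

The point the abstract emphasizes is that the *allocation* (which arms to pull) matters: because every pull informs the shared estimate of theta*, pulling even suboptimal arms can sharpen the reward estimates of near-optimal ones.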