
    Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

    We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.
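
    As an illustration of the state-aggregation-plus-upper-confidence-bound idea described above, here is a minimal Python sketch. It is not the paper's algorithm: the bin count, confidence parameter, and the form of the exploration bonus are assumptions made only for illustration. A one-dimensional state space [0, 1] is split into equal-width intervals, and each aggregated state-action pair keeps an optimistic reward estimate.

        import numpy as np

        def aggregate(state, n_bins):
            """Map a continuous state in [0, 1] to a discrete bin index."""
            return min(int(state * n_bins), n_bins - 1)

        class AggregatedUCB:
            """Illustrative sketch: UCB over an aggregated (discretized) state space."""

            def __init__(self, n_bins, n_actions, delta=0.05):
                self.counts = np.zeros((n_bins, n_actions))
                self.reward_sums = np.zeros((n_bins, n_actions))
                self.delta = delta
                self.t = 0

            def act(self, state):
                s = aggregate(state, self.counts.shape[0])
                self.t += 1
                n = np.maximum(self.counts[s], 1)
                means = self.reward_sums[s] / n
                # Optimism in the face of uncertainty: add a confidence bonus.
                bonus = np.sqrt(np.log(max(self.t, 2) / self.delta) / (2 * n))
                return int(np.argmax(means + bonus))

            def update(self, state, action, reward):
                s = aggregate(state, self.counts.shape[0])
                self.counts[s, action] += 1
                self.reward_sums[s, action] += reward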

    The socioeconomic dynamics of the shifta conflict in Kenya, c. 1963-8

    Using a set of oral testimonies, together with military, intelligence, and administrative reports from the 1960s, this article re-examines the shifta conflict in Kenya. The article moves away from mono-causal, nationalistic interpretations of the event, to focus instead on the underlying socioeconomic dynamics and domestic implications of the conflict. It argues that the nationalist interpretation fails to capture the diversity of participation in shifta, which was not simply made up of militant Somali nationalists, and that it fails to acknowledge the significance of an internal Kenyan conflict between a newly independent state in the process of nation building, and a group of ‘dissident’ frontier communities that were seen to defy the new order. Examination of this conflict provides insights into the operation of the early postcolonial Kenyan state.
    Funding: The Arts and Humanities Research Council, The Royal Historical Society, Martin Lynn Scholarship

    Rotting bandits are not harder than stochastic ones

    In stochastic multi-armed bandits, the reward distribution of each arm is assumed to be stationary. This assumption is often violated in practice (e.g., in recommendation systems), where the reward of an arm may change whenever it is selected, i.e., the rested bandit setting. In this paper, we consider the non-parametric rotting bandit setting, where rewards can only decrease. We introduce the filtering on expanding window average (FEWA) algorithm, which constructs moving averages of increasing windows to identify arms that are more likely to return high rewards when pulled once more. We prove that for an unknown horizon $T$, and without any knowledge of the decreasing behavior of the $K$ arms, FEWA achieves a problem-dependent regret bound of $\widetilde{\mathcal{O}}(\log(KT))$ and a problem-independent one of $\widetilde{\mathcal{O}}(\sqrt{KT})$. Our result substantially improves over the algorithm of Levine et al. (2017), which suffers regret $\widetilde{\mathcal{O}}(K^{1/3}T^{2/3})$. FEWA also matches known bounds for the stochastic bandit setting, thus showing that rotting bandits are not harder. Finally, we report simulations confirming the theoretical improvements of FEWA.
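
    The filtering step behind FEWA can be sketched in a few lines of Python. The following is a hedged illustration only: the confidence-width constants, the geometric window schedule, and the tie-breaking by fewest pulls are assumptions, not the authors' exact choices. Averages over expanding windows of the most recent rewards are compared, arms that fall too far below the best window average are filtered out, and the least-pulled survivor is played.

        import numpy as np

        def fewa_select(histories, sigma=1.0, alpha=4.0):
            """histories: one list of observed rewards per arm (most recent last)."""
            untried = [k for k, rewards in enumerate(histories) if len(rewards) == 0]
            if untried:                                   # pull each arm once first
                return untried[0]
            t = sum(len(rewards) for rewards in histories)    # total pulls so far
            active = list(range(len(histories)))
            h = 1
            while len(active) > 1 and h <= min(len(histories[k]) for k in active):
                means = {k: np.mean(histories[k][-h:]) for k in active}
                width = sigma * np.sqrt(alpha * np.log(t) / h)    # confidence width
                best = max(means.values())
                # Keep arms whose window-h average is close enough to the best one.
                active = [k for k in active if means[k] >= best - 2.0 * width]
                h *= 2                                    # expand the window
            return min(active, key=lambda k: len(histories[k]))   # least-pulled survivor

        # Toy usage: three rotting arms whose mean rewards decay at different rates.
        rng = np.random.default_rng(0)
        decay = [0.000, 0.001, 0.005]
        histories = [[], [], []]
        for _ in range(3000):
            k = fewa_select(histories, sigma=0.1)
            mean_k = 1.0 - decay[k] * len(histories[k])   # reward decays with pulls
            histories[k].append(mean_k + rng.normal(scale=0.1))
        print("pulls per arm:", [len(h) for h in histories])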

    Upper-Confidence Bound for Channel Selection in LPWA Networks with Retransmissions

    In this paper, we propose and evaluate different learning strategies based on Multi-Armed Bandit (MAB) algorithms. They allow Internet of Things (IoT) devices to improve their access to the network and their autonomy, while taking into account the impact of encountered radio collisions. To that end, several heuristics employing Upper-Confidence Bound (UCB) algorithms are examined, to explore the contextual information provided by the number of retransmissions. Our results show that approaches based on UCB obtain a significant improvement in terms of successful transmission probabilities. Furthermore, they also reveal that a pure UCB channel access is as efficient as more sophisticated learning strategies.
    Comment: The source code (MATLAB or Octave) used for the simulations and the figures is open-sourced under the MIT License at Bitbucket.org/scee_ietr/ucb_smart_retran
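
    The "pure UCB channel access" baseline mentioned above can be sketched as a standard UCB1 index over channels with a binary reward (1 if the uplink transmission is acknowledged, 0 on collision). The Python below is an illustrative sketch, not the released MATLAB/Octave code; the channel count, success probabilities, and horizon are made up for the example.

        import numpy as np

        class UCB1ChannelSelector:
            """UCB1 index over channels; reward is 1 on acknowledged transmission."""

            def __init__(self, n_channels):
                self.pulls = np.zeros(n_channels)
                self.successes = np.zeros(n_channels)
                self.t = 0

            def choose(self):
                self.t += 1
                if np.any(self.pulls == 0):               # try each channel once
                    return int(np.argmin(self.pulls))
                ucb = self.successes / self.pulls + np.sqrt(2 * np.log(self.t) / self.pulls)
                return int(np.argmax(ucb))

            def update(self, channel, acked):
                self.pulls[channel] += 1
                self.successes[channel] += float(acked)

        # Toy usage: 4 channels with different (unknown) success probabilities.
        rng = np.random.default_rng(0)
        p_success = [0.2, 0.5, 0.7, 0.9]
        agent = UCB1ChannelSelector(len(p_success))
        for _ in range(10_000):
            c = agent.choose()
            agent.update(c, rng.random() < p_success[c])
        print("pulls per channel:", agent.pulls)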

    Best-Arm Identification in Linear Bandits

    We study the best-arm identification problem in linear bandits, where the rewards of the arms depend linearly on an unknown parameter $\theta^*$ and the objective is to return the arm with the largest reward. We characterize the complexity of the problem and introduce sample allocation strategies that pull arms to identify the best arm with a fixed confidence, while minimizing the sample budget. In particular, we show the importance of exploiting the global linear structure to improve the estimate of the reward of near-optimal arms. We analyze the proposed strategies and compare their empirical performance. Finally, as a by-product of our analysis, we point out the connection to the $G$-optimality criterion used in optimal experimental design.
    Comment: In Advances in Neural Information Processing Systems 27 (NIPS), 2014
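
    To make the linear structure concrete, the sketch below estimates $\theta^*$ by regularized least squares from the pulled arms and returns the empirically best arm. The round-robin allocation is a deliberate simplification: the paper's strategies (e.g. those connected to the $G$-optimality criterion) choose pulls to shrink the confidence in the directions that matter for discriminating near-optimal arms. All names and numbers in the usage example are illustrative.

        import numpy as np

        def best_arm_ls(arms, pull, budget, reg=1.0):
            """arms: (K, d) array of arm features; pull(x) returns a noisy reward."""
            K, d = arms.shape
            A = reg * np.eye(d)             # regularized design matrix
            b = np.zeros(d)
            for t in range(budget):
                x = arms[t % K]             # naive round-robin allocation (simplification)
                r = pull(x)
                A += np.outer(x, x)
                b += r * x
            theta_hat = np.linalg.solve(A, b)       # least-squares estimate of theta*
            return int(np.argmax(arms @ theta_hat))  # empirically best arm

        # Toy usage with a hidden theta_star.
        rng = np.random.default_rng(1)
        theta_star = np.array([1.0, 0.2])
        arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.5]])
        pull = lambda x: x @ theta_star + rng.normal(scale=0.1)
        print("best arm:", best_arm_ls(arms, pull, budget=300))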