89 research outputs found

    Bootstrapping Monte Carlo Tree Search with an Imperfect Heuristic

    Full text link
    We consider the problem of using a heuristic policy to improve the value approximation by the Upper Confidence Bound applied in Trees (UCT) algorithm in non-adversarial settings such as planning with large-state space Markov Decision Processes. Current improvements to UCT focus on either changing the action selection formula at the internal nodes or the rollout policy at the leaf nodes of the search tree. In this work, we propose to add an auxiliary arm to each of the internal nodes, and always use the heuristic policy to roll out simulations at the auxiliary arms. The method aims to get fast convergence to optimal values at states where the heuristic policy is optimal, while retaining similar approximation as the original UCT in other states. We show that bootstrapping with the proposed method in the new algorithm, UCT-Aux, performs better compared to the original UCT algorithm and its variants in two benchmark experiment settings. We also examine conditions under which UCT-Aux works well.Comment: 16 pages, accepted for presentation at ECML'1

    Multi-objective reinforcement learning methods for action selection : dealing with multiple objectives and non-stationarity

    Get PDF
    Multi-objective decision-making entails planning based on a model to find the best policy to solve such problems. If this model is unknown, learning through interaction provides the means to behave in the environment. Multi-objective decision-making in a multi-agent system poses many unsolved challenges. Among them, multiple objectives and non-stationarity, caused by simultaneous learners, have been addressed separately so far. In this work, algorithms that address these issues by taking strengths from different methods are proposed and applied to a route choice scenario formulated as a multi-armed bandit problem. Therefore, the focus is on action selection. In the route choice problem, drivers must select a route while aiming to minimize both their travel time and toll. The proposed algorithms take and combine important aspects from works that tackle only one issue: non-stationarity or multiple objectives, making possible to handle these problems together. The methods used from these works are a set of Upper-Confidence Bound (UCB) algorithms and the Pareto Q-learning (PQL) algorithm. The UCB-based algorithms are Pareto UCB1 (PUCB1), the discounted UCB (DUCB) and sliding window UCB (SWUCB). PUCB1 deals with multiple objectives, while DUCB and SWUCB address non-stationarity in different ways. PUCB1 was extended to include characteristics from DUCB and SWUCB. In the case of PQL, as it is a state-based method that focuses on more than one objective, a modification was made to tackle a problem focused on action selection. Results obtained from a comparison in a route choice scenario show that the proposed algorithms deal with non-stationarity and multiple objectives, while using a discount factor is the best approach. Advantages, limitations and differences of these algorithms are discussed

    Action Guidance with MCTS for Deep Reinforcement Learning

    Full text link
    Deep reinforcement learning has achieved great successes in recent years, however, one main challenge is the sample inefficiency. In this paper, we focus on how to use action guidance by means of a non-expert demonstrator to improve sample efficiency in a domain with sparse, delayed, and possibly deceptive rewards: the recently-proposed multi-agent benchmark of Pommerman. We propose a new framework where even a non-expert simulated demonstrator, e.g., planning algorithms such as Monte Carlo tree search with a small number rollouts, can be integrated within asynchronous distributed deep reinforcement learning methods. Compared to a vanilla deep RL algorithm, our proposed methods both learn faster and converge to better policies on a two-player mini version of the Pommerman game.Comment: AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE'19). arXiv admin note: substantial text overlap with arXiv:1904.05759, arXiv:1812.0004

    A Multi-Armed Bandit Model Selection for Cold-Start User Recommendation

    Get PDF
    International audienceHow can we effectively recommend items to a user about whom we have no information? This is the problem we focus on, known as the cold-start problem. In this paper, we focus on the cold user problem.In most existing works, the cold-start problem is handled through the use of many kinds of information available about the user. However, what happens if we do not have any information?Recommender systems usually keep a substantial amount of prediction models that are available for analysis. Moreover, recommendations to new users yield uncertain returns. Assuming a number of alternative prediction models is available to select items to recommend to a cold user, this paper introduces a multi-armed bandit based model selection, named PdMS.In comparison with two baselines, PdMS improves the performance as measured by the nDCG.These improvements are demonstrated on real, public datasets
    corecore