Bootstrapping Monte Carlo Tree Search with an Imperfect Heuristic
We consider the problem of using a heuristic policy to improve the value
approximation by the Upper Confidence Bound applied in Trees (UCT) algorithm in
non-adversarial settings such as planning with large-state space Markov
Decision Processes. Current improvements to UCT focus on either changing the
action selection formula at the internal nodes or the rollout policy at the
leaf nodes of the search tree. In this work, we propose to add an auxiliary arm
to each of the internal nodes, and always use the heuristic policy to roll out
simulations at the auxiliary arms. The method aims to get fast convergence to
optimal values at states where the heuristic policy is optimal, while retaining
similar approximation as the original UCT in other states. We show that
bootstrapping with the proposed method in the new algorithm, UCT-Aux, performs
better than the original UCT algorithm and its variants in two benchmark
experiment settings. We also examine conditions under which UCT-Aux works well. Comment: 16 pages, accepted for presentation at ECML'1
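For readers unfamiliar with the "action selection formula at the internal nodes" mentioned in this abstract, the following is a minimal sketch of the standard UCB1 rule that UCT applies at each internal node. It does not reproduce the auxiliary arm of UCT-Aux. The function name `ucb1_select`, the exploration constant `c`, and the data layout (per-arm visit counts and reward sums) are assumptions made for illustration.

```python
import math


def ucb1_select(counts, reward_sums, c=math.sqrt(2)):
    """Pick the arm maximizing mean reward plus an exploration bonus.

    counts[i]      -- number of times arm i was tried (hypothetical layout)
    reward_sums[i] -- total reward collected from arm i
    c              -- exploration constant (assumed value, tune per problem)
    """
    # Any arm that has never been tried is selected first.
    for i, n in enumerate(counts):
        if n == 0:
            return i

    total = sum(counts)
    scores = [
        reward_sums[i] / counts[i] + c * math.sqrt(math.log(total) / counts[i])
        for i in range(len(counts))
    ]
    # Index of the highest score; ties resolve to the lowest index.
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal visit counts the rule reduces to picking the best empirical mean, while a rarely visited arm receives a large bonus and is revisited even if its mean is lower.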
Practical Open-Loop Optimistic Planning
We consider the problem of online planning in a Markov Decision Process when
given only access to a generative model, restricted to open-loop policies
(i.e. sequences of actions) and under a budget constraint. In this setting, the
Open-Loop Optimistic Planning (OLOP) algorithm enjoys good theoretical
guarantees but is overly conservative in practice, as we show in numerical
experiments. We propose a modified version of the algorithm with tighter
upper-confidence bounds, KLOLOP, that leads to better practical performance
while retaining the sample complexity bound. Finally, we propose an efficient
implementation that significantly improves the time complexity of both
algorithms.
Noisy Optimization: Convergence with a Fixed Number of Resamplings
It is known that evolution strategies in continuous domains might not
converge in the presence of noise. It is also known that, under mild
assumptions, and using an increasing number of resamplings, one can mitigate
the effect of additive noise and recover convergence. We show new sufficient
conditions for the convergence of an evolutionary algorithm with a constant
number of resamplings; in particular, we get fast rates (log-linear
convergence) provided that the variance decreases around the optimum slightly
faster than in the so-called multiplicative noise model. Keywords: Noisy
optimization, evolutionary algorithm, theory. Comment: EvoStar (2014
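To make the notion of "resamplings" in this abstract concrete, here is a minimal sketch of evaluating a noisy objective several times at the same point and averaging the results. It is not the algorithm analyzed in the paper. The objective `noisy_sphere`, the additive Gaussian noise model, and the helper `resampled_fitness` are all assumptions chosen for illustration.

```python
import random


def noisy_sphere(x, rng, sigma=1.0):
    """Sphere function plus additive Gaussian noise (assumed noise model)."""
    return sum(xi * xi for xi in x) + rng.gauss(0.0, sigma)


def resampled_fitness(f, x, resamplings, rng):
    """Average `resamplings` independent noisy evaluations of f at x."""
    return sum(f(x, rng) for _ in range(resamplings)) / resamplings


if __name__ == "__main__":
    rng = random.Random(0)
    point = [1.0, 2.0]  # true noise-free value is 5.0
    for r in (1, 10, 1000):
        print(r, resampled_fitness(noisy_sphere, point, r, rng))
```

Averaging r independent evaluations divides the noise variance by r, which is why increasing r over time can restore convergence; the abstract concerns the harder case where r stays fixed.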
A network-based dynamical ranking system for competitive sports
From the viewpoint of networks, a ranking system for players or teams in
sports is equivalent to a centrality measure for sports networks, whereby a
directed link represents the result of a single game. Previously proposed
network-based ranking systems are derived from static networks, i.e.,
aggregation of the results of games over time. However, the score of a player
(or team) fluctuates over time. Defeating a renowned player at peak
performance is intuitively more rewarding than defeating the same player in
other periods. To account for this factor, we propose a dynamic variant of such
a network-based ranking system and apply it to professional men's tennis data.
We derive a set of linear online update equations for the score of each player.
The proposed ranking system predicts the outcomes of future games with
higher accuracy than its static counterparts. Comment: 6 figure
Warm-Start AlphaZero Self-Play Search Enhancements
Recently, AlphaZero has achieved landmark results in deep reinforcement
learning, by providing a single self-play architecture that learned three
different games at superhuman level. AlphaZero is a large and complicated
system with many parameters, and success requires much compute power and
fine-tuning. Reproducing results in other games is a challenge, and many
researchers are looking for ways to improve results while reducing
computational demands. AlphaZero's design is purely based on self-play and
makes no use of labeled expert data or domain-specific enhancements; it is
designed to learn from scratch. We propose a novel approach to deal with this
cold-start problem by employing simple search enhancements at the beginning
phase of self-play training, namely Rollout, Rapid Action Value Estimate (RAVE)
and dynamically weighted combinations of these with the neural network, and
Rolling Horizon Evolutionary Algorithms (RHEA). Our experiments indicate that
most of these enhancements improve the performance of their baseline player in
three different (small) board games, with RAVE-based variants playing
especially strongly.
A Model of Oxidative Stress Management: Moderation of Carbohydrate Metabolizing Enzymes in SOD1-Null Drosophila melanogaster
The response to oxidative stress involves numerous genes, and mutations in these genes often manifest in pleiotropic ways that presumably reflect perturbations in ROS-mediated physiology. The Drosophila melanogaster SOD1-null allele (cSODn108) is proposed to result in oxidative stress by preventing superoxide breakdown. In SOD1-null flies, oxidative stress management is thought to be reliant on the glutathione-dependent antioxidants that utilize NADPH to cycle between reduced and oxidized forms. Previous studies suggest that SOD1-null Drosophila rely on lipid catabolism for energy rather than carbohydrate metabolism. We tested these connections by comparing the activity of carbohydrate metabolizing enzymes, lipid and triglyceride concentration, and steady state NADPH:NADP+ in SOD1-null and control transgenic rescue flies. We find a negative shift in the activity of carbohydrate metabolizing enzymes in SOD1-nulls, and the NADP+-reducing enzymes were found to have significantly lower activity than the other enzymes assayed. Little evidence for the catabolism of lipids as a preferential energy source was found, as the concentrations of lipids and triglycerides were not significantly lower in SOD1-nulls compared with controls. Using a starvation assay to impact lipids and triglycerides, we found that lipids were indeed depleted in both genotypes when under starvation stress, suggesting that oxidative damage was not preventing the catabolism of lipids in SOD1-null flies. Remarkably, SOD1-nulls were also found to be relatively resistant to starvation. Age profiles of enzyme activity, triglyceride and lipid concentration indicate that the trends observed are consistent over the average lifespan of the SOD1-nulls. Based on our results, we propose a model of physiological response in which organisms under oxidative stress limit the production of ROS through the down-regulation of carbohydrate metabolism in order to moderate the products exiting the electron transport chain.
A Parallel Monte-Carlo Tree Search Algorithm
Monte-Carlo tree search is a powerful paradigm for the game of Go. We present a parallel Master-Slave algorithm for Monte-Carlo tree search. We tested the algorithm on a network of computers using various configurations: from 12,500 to 100,000 playouts, from 1 to 64 slaves, and from 1 to 16 computers. On our architecture we obtain a speedup of 14 for 16 slaves. With a single slave and five seconds per move our algorithm scores 40.5% against GNUGO; with sixteen slaves and five seconds per move it scores 70.5%. We also give the potential speedups of our algorithm for various playout times.
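The reported speedup of 14 for 16 slaves can be put in perspective with the usual definition of parallel efficiency (speedup divided by the number of workers). The sketch below only redoes that arithmetic on the figures quoted in the abstract; the helper name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear speedup achieved: speedup / workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


if __name__ == "__main__":
    # Figures quoted in the abstract: speedup of 14 with 16 slaves.
    print(parallel_efficiency(14, 16))  # 0.875
```

An efficiency of 0.875 means the 16 slaves deliver 87.5% of the throughput that perfectly linear scaling would give.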