6 research outputs found
Recherche directe de politique hors-ligne en apprentissage par renforcement Bayésien
This thesis presents research contributions in the field of Bayesian Reinforcement Learning, a subfield of Reinforcement Learning where, even though the dynamics of the system are unknown, some prior knowledge is assumed to exist in the form of a distribution over Markov decision processes.
In this thesis, two algorithms are presented: OPPS (Offline Prior-based Policy Search) and ANN-BRL (Artificial Neural Networks for Bayesian Reinforcement Learning). Their philosophy is to analyse and exploit the prior knowledge before interacting with the system(s); they differ in the nature of the model they use. The former makes use of formula-based agents introduced by Maes et al. (Maes, Wehenkel, and Ernst, 2012), while the latter relies on Artificial Neural Networks built via SAMME (Stagewise Additive Modelling using a Multi-class Exponential loss function), an AdaBoost algorithm developed by Zhu et al. (Zhu et al., 2009).
We also describe a comprehensive benchmark created to compare Bayesian Reinforcement Learning algorithms. In real-life applications, the choice of the best agent for a given task depends not only on its performance, but also on the computation time required to deploy it. This benchmark has been designed to identify the best algorithms by taking both criteria into account, and resulted in the development of an open-source library: BBRL (Benchmarking tools for Bayesian Reinforcement Learning) (https://github.com/mcastron/BBRL/wiki).
Learning for exploration/exploitation in reinforcement learning
We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution $p_M(\cdot)$. The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performance on MDPs drawn according to $p_M(\cdot)$. As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred and the optimal value functions $\hat{V}$ and $\hat{Q}$ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performance of the approach with R-max as well as with $\epsilon$-Greedy strategies, and the results are promising.
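The ingredients named in this abstract (an estimated MDP built from transition counts, its optimal value functions obtained by value iteration, and an index computed by a small formula) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the uniform fallback for unvisited state-action pairs, and the example formula "Q-hat plus a count-based bonus" are all assumptions made for the sketch.

```python
import numpy as np

def estimate_mdp(counts, reward_sums):
    """Empirical model from visit statistics (names are hypothetical).

    counts[s, a, s2]  : number of times transition (s, a, s2) occurred
    reward_sums[s, a] : total reward collected after taking a in s
    """
    n_sa = counts.sum(axis=2)                      # visits of each (s, a)
    n_states = counts.shape[0]
    # Assumption: unvisited pairs get a uniform next-state distribution.
    p_hat = np.where(n_sa[..., None] > 0,
                     counts / np.maximum(n_sa[..., None], 1),
                     1.0 / n_states)
    r_hat = reward_sums / np.maximum(n_sa, 1)      # mean observed reward
    return p_hat, r_hat, n_sa

def value_iteration(p_hat, r_hat, gamma=0.95, tol=1e-8):
    """Return (V_hat, Q_hat) of the estimated MDP."""
    q = np.zeros_like(r_hat, dtype=float)
    while True:
        v = q.max(axis=1)
        q_new = r_hat + gamma * (p_hat @ v)        # Bellman optimality backup
        if np.abs(q_new - q).max() < tol:
            return q_new.max(axis=1), q_new
        q = q_new

def index_action(state, q_hat, n_sa, bonus=1.0):
    """Pick the action maximising a hypothetical index formula:
    Q_hat(s, a) + bonus / sqrt(1 + N(s, a))."""
    index = q_hat[state] + bonus / np.sqrt(1.0 + n_sa[state])
    return int(np.argmax(index))
```

With `bonus=0` the rule is purely greedy on the estimated MDP; a larger bonus favours rarely tried actions. In the approach described above, the formula itself is what gets searched over, so this fixed formula stands in for one candidate among many.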
Apprentissage par renforcement bayésien versus recherche directe de politique hors-ligne en utilisant une distribution a priori: comparaison empirique
Peer reviewed. This article addresses the problem of sequential decision making in finite, unknown Markov decision processes (MDPs). The lack of knowledge about the MDP is modelled as a probability distribution over a set of candidate MDPs, known a priori. The performance criterion is the expected sum of discounted rewards over an infinite trajectory. Alongside the optimality criterion, the constraints on computation time are rigorously formalised. First, an "offline" phase preceding the interaction with the unknown MDP gives the agent the opportunity to exploit the prior distribution for a limited time. Then, during the interaction with the MDP, the agent must make a decision at each time step within a fixed, bounded amount of time. In this context, we compare two decision-making strategies: OPPS, a recent approach that mainly exploits the offline phase to select a policy from a set of candidate policies, and BAMCP, a recent Bayesian online planning approach.
We compare these approaches empirically in a Bayesian setting, in the sense that we evaluate their performance on a large set of problems drawn from a test distribution. To our knowledge, these are the first experimental tests of this kind in reinforcement learning. We study several scenarios by considering various distributions that can be used both as the prior distribution and as the test distribution. The results suggest that exploiting a prior distribution during an offline optimisation phase is a significant advantage for accurate prior distributions and/or when constrained to small online time budgets.
Learning exploration/exploitation strategies for single trajectory reinforcement learning
Peer reviewed. We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution $p_M(\cdot)$. The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performance on MDPs drawn according to $p_M(\cdot)$. As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred and the optimal value functions $\hat{V}$ and $\hat{Q}$ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performance of the approach with R-max as well as with $\epsilon$-Greedy strategies, and the results are promising.
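The formula search described in this abstract treats each candidate formula as a bandit arm: pulling an arm means drawing an MDP from the prior, running the corresponding strategy, and observing its return. The sketch below uses UCB1 to allocate pulls, which is a choice made here for illustration; the abstract does not say which bandit algorithm is used. The callback `sample_return` is hypothetical and is assumed to return a value rescaled to [0, 1].

```python
import math
import random

def ucb1_select(sample_return, n_arms, budget, seed=0):
    """Allocate `budget` evaluations over `n_arms` candidate formulas.

    sample_return(arm, rng) -> float in [0, 1]: one noisy evaluation of the
    strategy induced by formula `arm` (hypothetical user-supplied callback).
    Returns the arm with the best empirical mean and the pull counts.
    """
    if budget < n_arms:
        raise ValueError("budget must allow at least one pull per arm")
    rng = random.Random(seed)
    pulls = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:
            arm = t - 1                      # play every arm once first
        else:
            # UCB1 score: empirical mean plus exploration bonus
            arm = max(range(n_arms),
                      key=lambda k: totals[k] / pulls[k]
                      + math.sqrt(2.0 * math.log(t) / pulls[k]))
        totals[arm] += sample_return(arm, rng)
        pulls[arm] += 1
    means = [totals[k] / pulls[k] for k in range(n_arms)]
    best = max(range(n_arms), key=lambda k: means[k])
    return best, pulls
```

In practice `sample_return` would sample an MDP from $p_M(\cdot)$ and simulate a truncated trajectory; with an infinite-horizon discounted criterion, some truncation of the trajectory is needed to obtain a finite evaluation.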
Approximate Bayes Optimal Policy Search using Neural Networks
Peer reviewed. Bayesian Reinforcement Learning (BRL) agents aim to maximise the expected collected rewards obtained when interacting with an unknown Markov Decision Process (MDP) while using some prior knowledge. State-of-the-art BRL agents rely on frequent updates of the belief on the MDP, as new observations of the environment are made. This offers theoretical guarantees of convergence to an optimum, but is computationally intractable, even on small-scale problems. In this paper, we present a method that circumvents this issue by training a parametric policy able to recommend an action directly from raw observations. Artificial Neural Networks (ANNs) are used to represent this policy, and are trained on trajectories sampled from the prior. The trained model is then used online, and is able to act on the real MDP at a very low computational cost. Our new algorithm shows strong empirical performance on a wide range of test problems, and is robust to inaccuracies of the prior distribution.
Optimal Control of Renewable Energy Communities with Controllable Assets
The control of Renewable Energy Communities (REC) with controllable assets (e.g., batteries) can be formalised as an optimal control problem. This paper proposes a generic formulation for such a problem whereby the electricity generated by the community members is redistributed using repartition keys. These keys represent the fraction of the surplus of local electricity production (i.e., electricity generated within the community but not consumed by any community member) to be allocated to each community member. This formalisation enables us to jointly optimise the controllable assets and the repartition keys, minimising the combined total value of the electricity bills of the members. To perform this optimisation, we propose two algorithms aimed at solving an optimal open-loop control problem in a receding horizon fashion. We also propose a third, approximate algorithm which only optimises the controllable assets (as opposed to optimising both controllable assets and repartition keys). We test these algorithms on renewable energy community control problems constructed from synthetic data, inspired by a real-life REC. Our results show that the combined total value of the electricity bills of the members is greatly reduced when simultaneously optimising the controllable assets and the repartition keys (i.e., with the first two algorithms proposed). These findings strongly advocate the need for algorithms that adopt a more holistic standpoint when controlling energy systems such as renewable energy communities, jointly optimising them from both a traditional (very granular) control standpoint and a larger economic perspective.
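To make the role of repartition keys concrete, the following sketch computes a combined community bill for a single metering period under a deliberately simplified billing model. Everything here is an assumption for illustration rather than the paper's formulation: the function name, the prices, the rule that a member cannot be allocated more than its own net demand, and the convention that internal transfers cancel out at community level.

```python
def community_bill(net_demand, keys, grid_price=0.25, feed_in_price=0.04):
    """Combined bill of all members for one period (simplified model).

    net_demand[i] : consumption minus own production of member i (kWh);
                    negative values mean the member injects a surplus
    keys[i]       : fraction of the community surplus offered to member i
    Prices are hypothetical, in currency units per kWh.
    """
    if any(k < 0 for k in keys) or sum(keys) > 1.0 + 1e-9:
        raise ValueError("keys must be non-negative and sum to at most 1")
    surplus = sum(-d for d in net_demand if d < 0)
    imports = 0.0
    allocated = 0.0
    for d, k in zip(net_demand, keys):
        if d > 0:
            local = min(k * surplus, d)   # cannot absorb more than demand
            allocated += local
            imports += d - local          # remainder bought from the grid
    exports = surplus - allocated         # unallocated surplus sold to grid
    return grid_price * imports - feed_in_price * exports
```

For example, with net demands `[3, 1, -2]`, the keys `[0.5, 0.5, 0]` allocate the whole 2 kWh surplus locally, while `[0, 1, 0]` leave 1 kWh unallocated because the second member only needs 1 kWh. The second choice yields a higher combined bill, which is the kind of effect that motivates optimising the keys together with the controllable assets.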