Understanding the influence of exploration on the dynamics of policy-gradient algorithms
Policy gradients are effective reinforcement learning algorithms for solving complex control problems. To compute near-optimal policies, it is nevertheless essential in practice to ensure that the variance of the policy remains sufficiently large and that the states are visited sufficiently often during the optimization procedure. Doing so is usually referred to as exploration and is often implemented in practice by adding intrinsic exploration bonuses to the rewards in the learning objective. We propose to analyze the influence of the variance of policies on the return, and the influence of these exploration bonuses on the policy gradient optimization procedure. First, we show an equivalence between optimizing stochastic policies by policy gradient and optimizing deterministic policies by continuation (i.e., by smoothing the policy parameters during the optimization). We then argue that the variance of policies acts as a smoothing hyperparameter to avoid local extrema during the optimization. Second, we study the learning objective when intrinsic exploration bonuses are added to the rewards. We show that adding these bonuses makes it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Furthermore, computing gradient estimates with these reward bonuses leads to policy gradient algorithms with a higher probability of eventually providing an optimal policy. In light of these two effects, we discuss and illustrate empirically typical exploration strategies based on entropy bonuses.
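To make the exploration-bonus idea concrete, the following minimal sketch adds an entropy bonus to a REINFORCE-style objective. It illustrates the general technique discussed in the abstract, not the authors' implementation; the environment interface, network size and the coefficient beta are assumptions.

    import torch
    import torch.nn as nn

    # Minimal sketch of an entropy-regularized REINFORCE update, one common way
    # to implement exploration bonuses in the learning objective.  The
    # environment interface, network size and coefficient `beta` are
    # illustrative assumptions, not the authors' implementation.

    class CategoricalPolicy(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
            )

        def forward(self, obs):
            return torch.distributions.Categorical(logits=self.net(obs))

    def reinforce_update(policy, optimizer, episode, gamma=0.99, beta=0.01):
        """One gradient step on the entropy-regularized objective."""
        obs, actions, rewards = episode          # tensors from one rollout
        returns, g = [], 0.0
        for r in reversed(rewards.tolist()):     # discounted returns-to-go
            g = r + gamma * g
            returns.append(g)
        returns = torch.tensor(list(reversed(returns)))

        dist = policy(obs)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy()

        # Maximize return plus entropy bonus; minimize the negated objective.
        loss = -(log_probs * returns + beta * entropy).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()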
Behind the Myth of Exploration in Policy Gradients
Policy-gradient algorithms are effective reinforcement learning methods for
solving control problems with continuous state and action spaces. To compute
near-optimal policies, it is essential in practice to include exploration terms
in the learning objective. Although the effectiveness of these terms is usually
justified by an intrinsic need to explore environments, we propose a novel
analysis and distinguish two different implications of these techniques. First,
they make it possible to smooth the learning objective and to eliminate local
optima while preserving the global maximum. Second, they modify the gradient
estimates, increasing the probability that the stochastic parameter update
eventually provides an optimal policy. In light of these effects, we discuss
and illustrate empirically exploration strategies based on entropy bonuses,
highlighting their limitations and opening avenues for future work in the
design and analysis of such strategies.
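The smoothing effect described in these two abstracts can be illustrated numerically: the expected return of a Gaussian policy, viewed as a function of its mean, is a smoothed version of the underlying return landscape. The toy landscape and variances below are illustrative and are not taken from the paper.

    import numpy as np

    def reward(a):
        # A 1-D return landscape with a global maximum at a = 2 and a narrow
        # local maximum near a = -1 (purely illustrative, not from the paper).
        return np.exp(-(a - 2.0) ** 2) + 0.5 * np.exp(-10.0 * (a + 1.0) ** 2)

    def smoothed_return(mu, sigma, n_samples=100_000, seed=0):
        # Monte-Carlo estimate of E_{a ~ N(mu, sigma^2)}[reward(a)], i.e. the
        # expected return of a Gaussian policy as a function of its mean.
        rng = np.random.default_rng(seed)
        return reward(mu + sigma * rng.standard_normal(n_samples)).mean()

    mus = np.linspace(-3.0, 4.0, 71)
    for sigma in (0.05, 0.3, 2.0):
        values = np.array([smoothed_return(mu, sigma) for mu in mus])
        # Count interior local maxima of the smoothed landscape on the grid.
        peaks = (values[1:-1] > values[:-2]) & (values[1:-1] > values[2:])
        print(f"sigma={sigma:4.2f}  local maxima found: {int(peaks.sum())}")
    # For small sigma the narrow local maximum near a = -1 persists; for a
    # large sigma the landscape is smoothed and only the global maximum remains.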
Optimal Control of Renewable Energy Communities subject to Network Peak Fees with Model Predictive Control and Reinforcement Learning Algorithms
We propose in this paper an optimal control framework for renewable energy
communities (RECs) equipped with controllable assets. Such RECs allow their
members to exchange production surplus through an internal market. The
objective is to control their assets in order to minimise the sum of individual
electricity bills. These bills account for the electricity exchanged through
the REC and with the retailers. Typically, for large companies, another
important part of the bills is the cost related to power peaks; in our
framework, these peaks are determined from the energy exchanges with the
retailers. We compare rule-based control strategies with the following two control
algorithms. The first one is derived from model predictive control techniques,
and the second one is built with reinforcement learning techniques. We also
compare variants of these algorithms that neglect the peak power costs. Results
confirm that using policies accounting for the power peaks leads to a
significantly lower sum of electricity bills and thus better control strategies
at the cost of higher computation time. Furthermore, policies trained with
reinforcement learning approaches appear promising for real-time control of the
communities, where model predictive control policies may be computationally
expensive in practice. These findings encourage pursuing efforts toward the
development of scalable control algorithms, operating from a centralised
standpoint, for renewable energy communities equipped with controllable assets.
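For intuition on the objective being minimised, the sketch below computes an illustrative member bill with a network peak fee. The tariff structure, prices and variable names are hypothetical and only show how the peak term penalises spiky retailer imports.

    import numpy as np

    # Illustrative sketch of a member's electricity bill in a renewable energy
    # community (REC) with a network peak fee.  The tariff structure, prices
    # and variable names are hypothetical; they only show how the peak term
    # enters the objective the controllers above try to minimise.

    def member_bill(retail_import_kwh, retail_export_kwh, rec_import_kwh,
                    dt_hours=0.25, import_price=0.25, export_price=0.04,
                    rec_price=0.12, peak_fee=8.0):
        """Bill over one billing period (arrays hold per-time-step energies in kWh)."""
        energy_cost = (import_price * retail_import_kwh.sum()
                       - export_price * retail_export_kwh.sum()
                       + rec_price * rec_import_kwh.sum())
        # Peak cost: fee applied to the maximum average power (kW) drawn from
        # the retailer over any time step of the billing period.
        peak_kw = (retail_import_kwh / dt_hours).max()
        return energy_cost + peak_fee * peak_kw

    # Example: a flat profile and a spiky profile import the same total energy,
    # but the spiky one pays a much higher peak fee.
    flat = np.full(96, 0.5)                   # 0.5 kWh per 15-min step
    spiky = np.r_[np.full(95, 0.4), [10.0]]   # same total energy, one large spike
    exports = np.zeros(96)
    rec = np.zeros(96)
    print(member_bill(flat, exports, rec), member_bill(spiky, exports, rec))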
Learning optimal environments using projected stochastic gradient ascent
In this work, we propose a new methodology for jointly sizing a dynamical
system and designing its control law. First, the problem is formalized by
considering parametrized reinforcement learning environments and parametrized
policies. The objective of the optimization problem is to jointly find a
control policy and an environment over the joint hypothesis space of parameters
such that the sum of rewards gathered by the policy in this environment is
maximal. The optimization problem is then addressed by generalizing direct
policy search algorithms to an algorithm we call Direct Environment Search with
(projected stochastic) Gradient Ascent (DESGA). We illustrate the performance
of DESGA on two benchmarks. First, we consider a parametrized space of
Mass-Spring-Damper (MSD) environments and control policies. Then, we use our
algorithm for optimizing the size of the components and the operation of a
small-scale autonomous energy system, i.e. a solar off-grid microgrid, composed
of photovoltaic panels, batteries, etc. On both benchmarks, we compare the
results of the execution of DESGA with a theoretical upper bound on the
expected return. Furthermore, the performance of DESGA is compared to an
alternative algorithm. The latter performs a grid discretization of the
environment's hypothesis space and applies the REINFORCE algorithm to identify
pairs of environments and policies resulting in a high expected return. The
choice of this algorithm is also discussed and motivated. On both benchmarks,
we show that DESGA and the alternative algorithm result in a set of parameters
for which the expected return is nearly equal to its theoretical upper bound.
Nevertheless, the execution of DESGA is much less computationally costly.
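A schematic sketch of the projected stochastic gradient ascent loop underlying a DESGA-style method is given below. It assumes the caller supplies a stochastic estimator of the gradient of the expected return with respect to the joint environment-policy parameters, and it uses a box projection as one possible feasible set; both choices are illustrative assumptions.

    import numpy as np

    # Schematic sketch of projected stochastic gradient ascent over the joint
    # (environment, policy) parameter vector.  `estimate_gradient` is assumed
    # to return a stochastic estimate of the gradient of the expected return
    # (e.g., from sampled rollouts); the box projection is one possible choice.

    def projected_sga(estimate_gradient, theta0, lower, upper,
                      step_size=1e-2, n_iterations=1000, seed=0):
        rng = np.random.default_rng(seed)
        theta = np.clip(np.asarray(theta0, dtype=float), lower, upper)
        for _ in range(n_iterations):
            grad = estimate_gradient(theta, rng)    # stochastic gradient estimate
            theta = theta + step_size * grad        # ascent step
            theta = np.clip(theta, lower, upper)    # projection onto the box
        return theta

    # Toy usage: maximise a concave quadratic from noisy gradients.
    target = np.array([0.3, 1.8])
    noisy_grad = lambda th, rng: -(th - target) + 0.1 * rng.standard_normal(2)
    print(projected_sga(noisy_grad, theta0=[0.0, 0.0], lower=0.0, upper=1.0))
    # The second coordinate ends up projected onto the boundary of the box.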
Recurrent networks, hidden states and beliefs in partially observable environments
Reinforcement learning aims to learn optimal policies from interaction with environments whose dynamics are unknown. Many methods rely on the approximation of a value function to derive near-optimal policies. In partially observable environments, these functions depend on the complete sequence of observations and past actions, called the history. In this work, we show empirically that recurrent neural networks trained to approximate such value functions internally filter the posterior probability distribution of the current state given the history, called the belief. More precisely, we show that, as a recurrent neural network learns the Q-function, its hidden states become more and more correlated with the beliefs of state variables that are relevant to optimal control. This correlation is measured through their mutual information. In addition, we show that the expected return of an agent increases with the ability of its recurrent architecture to reach a high mutual information between its hidden states and the beliefs. Finally, we show that the mutual information between the hidden states and the beliefs of variables that are irrelevant for optimal control decreases through the learning process. In summary, this work shows that in its hidden states, a recurrent neural network approximating the Q-function of a partially observable environment reproduces a sufficient statistic from the history that is correlated to the relevant part of the belief for taking optimal actions.
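As a concrete example of the kind of architecture analysed here, the sketch below defines a recurrent Q-network whose GRU hidden states are the quantities compared with the beliefs. The sizes and input encoding are illustrative, not the architecture used in the paper.

    import torch
    import torch.nn as nn

    # Minimal sketch of a recurrent Q-network: a GRU processes the history of
    # (observation, previous action) pairs and a linear head maps its hidden
    # state to Q-values.  Sizes and the one-hot action encoding are
    # illustrative, not the architecture used in the paper.

    class RecurrentQNetwork(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden_size=64):
            super().__init__()
            self.gru = nn.GRU(obs_dim + n_actions, hidden_size, batch_first=True)
            self.q_head = nn.Linear(hidden_size, n_actions)

        def forward(self, observations, prev_actions_onehot, h0=None):
            """observations: (batch, time, obs_dim); previous actions one-hot encoded."""
            x = torch.cat([observations, prev_actions_onehot], dim=-1)
            hidden_states, h_n = self.gru(x, h0)    # (batch, time, hidden_size)
            q_values = self.q_head(hidden_states)   # (batch, time, n_actions)
            # `hidden_states` is what the analysis above compares, through
            # mutual information, with the beliefs of the state variables.
            return q_values, hidden_states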
Belief states of POMDPs and internal states of recurrent RL agents: an empirical analysis of their mutual information
Reinforcement learning aims to learn optimal policies from interaction with
environments whose dynamics are unknown. Many methods rely on the approximation
of a value function to derive near-optimal policies. In partially observable
environments, these functions depend on the complete sequence of observations
and past actions, called the history. In this work, we show empirically that
recurrent neural networks trained to approximate such value functions
internally filter the posterior probability distribution of the current state
given the history, called the belief. More precisely, we show that, as a
recurrent neural network learns the Q-function, its hidden states become more
and more correlated with the beliefs of state variables that are relevant to
optimal control. This correlation is measured through their mutual information.
In addition, we show that the expected return of an agent increases with the
ability of its recurrent architecture to reach a high mutual information
between its hidden states and the beliefs. Finally, we show that the mutual
information between the hidden states and the beliefs of variables that are
irrelevant for optimal control decreases through the learning process. In
summary, this work shows that in its hidden states, a recurrent neural network
approximating the Q-function of a partially observable environment reproduces a
sufficient statistic from the history that is correlated to the relevant part
of the belief for taking optimal actions.
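One standard way to quantify such a correlation is a k-nearest-neighbour mutual-information estimator, as sketched below with scikit-learn. The estimator choice, the shapes and the synthetic data are assumptions for illustration, not necessarily those used by the authors.

    import numpy as np
    from sklearn.feature_selection import mutual_info_regression

    # Sketch of one standard way to estimate the mutual information between the
    # hidden states of a trained recurrent Q-network and the belief of a state
    # variable, using scikit-learn's k-NN based estimator.  The estimator and
    # the shapes below are assumptions, not necessarily those of the paper.

    def hidden_belief_mi(hidden_states, belief_values, n_neighbors=3):
        """hidden_states: (n_samples, hidden_dim); belief_values: (n_samples,).

        Returns per-dimension MI estimates and their maximum, which can be
        tracked along training to see the correlation build up (or vanish for
        variables irrelevant to optimal control).
        """
        mi = mutual_info_regression(hidden_states, belief_values,
                                    n_neighbors=n_neighbors, random_state=0)
        return mi, mi.max()

    # Toy usage with synthetic data: one hidden unit carries the belief signal.
    rng = np.random.default_rng(0)
    belief = rng.uniform(size=2000)
    hidden = rng.standard_normal((2000, 8))
    hidden[:, 3] = belief + 0.05 * rng.standard_normal(2000)
    per_dim, best = hidden_belief_mi(hidden, belief)
    print(np.round(per_dim, 2), best)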
Informed POMDP: Leveraging Additional Information in Model-Based RL
In this work, we generalize the problem of learning through interaction in a POMDP by accounting for any additional information available at training time. First, we introduce the informed POMDP, a new learning paradigm offering a clear distinction between the training information and the execution observation. Next, we propose an objective for learning a sufficient statistic from the history for optimal control that leverages this information. We then show that this informed objective consists of learning an environment model from which we can sample latent trajectories. Finally, we show for the Dreamer algorithm that the convergence speed of the policies is sometimes greatly improved on several environments by using this informed environment model. These results and the simplicity of the proposed adaptation advocate for a systematic consideration of any additional information available when learning in a POMDP using model-based RL.
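The informed objective can be sketched schematically as follows: a recurrent statistic is computed from observations and actions only, so it remains usable at execution time, but it is trained to predict the additional information available at training time. The module names, sizes and squared-error losses below are illustrative simplifications, not the Dreamer-based model of the paper.

    import torch
    import torch.nn as nn

    # Schematic sketch of an informed objective: the recurrent statistic is
    # conditioned on observations and actions only (usable at execution time)
    # but is trained to predict the additional information `info` available at
    # training time, together with the reward.  Modules and losses are
    # illustrative simplifications, not the paper's Dreamer-based model.

    class InformedModel(nn.Module):
        def __init__(self, obs_dim, act_dim, info_dim, hidden_size=128):
            super().__init__()
            self.rnn = nn.GRU(obs_dim + act_dim, hidden_size, batch_first=True)
            self.info_head = nn.Linear(hidden_size, info_dim)   # training-only target
            self.reward_head = nn.Linear(hidden_size, 1)

        def loss(self, obs, actions, info, rewards):
            """obs, actions: (batch, time, .) inputs; info, rewards: training targets."""
            statistic, _ = self.rnn(torch.cat([obs, actions], dim=-1))
            info_loss = (self.info_head(statistic) - info).pow(2).mean()
            reward_loss = (self.reward_head(statistic).squeeze(-1) - rewards).pow(2).mean()
            return info_loss + reward_loss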
Reinforcement Learning for Joint Design and Control of Battery-PV Systems
The decentralisation and unpredictability of new renewable energy sources
require rethinking our energy system. Data-driven approaches, such as
reinforcement learning (RL), have emerged as new control strategies for
operating these systems, but they have not yet been applied to system design.
This paper aims to bridge this gap by studying the use of an RL-based method
for joint design and control of a real-world PV and battery system. The design
problem is first formulated as a mixed-integer linear programming problem
(MILP). The optimal MILP solution is then used to evaluate the performance of
an RL agent trained in a surrogate environment designed for applying an
existing data-driven algorithm. The main difference between the two models lies
in their optimization approaches: while MILP finds a solution that minimizes
the total costs for a one-year operation given the deterministic historical
data, RL is a stochastic method that searches for a strategy over one week of
data that is optimal in expectation over all weeks in the historical dataset.
Both methods were applied to a toy example using one-week data and to a case study
using one-year data. In both cases, models were found to converge to similar
control solutions, but their investment decisions differed. Overall, these
outcomes are an initial step illustrating the benefits and challenges of using
RL for the joint design and control of energy systems.
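As an illustration of the deterministic baseline, the sketch below writes a heavily simplified sizing-and-operation problem as a linear program with PuLP. The data, prices, efficiencies and capacity cost are invented, and real-world details (binaries preventing simultaneous charging and discharging, export remuneration, degradation) are omitted.

    import pulp

    # Compact sketch of a deterministic sizing-and-operation baseline, written
    # as a linear program with PuLP.  Data, prices, efficiency and the battery
    # capacity cost are illustrative, and several real-world details are
    # omitted for brevity.

    T = 24                                   # one day of hourly steps, for brevity
    load = [2.0] * T                         # kWh demand per step (illustrative)
    pv = [0.0] * 8 + [3.0] * 8 + [0.0] * 8   # kWh PV production per step
    price = [0.25] * T                       # retail import price per kWh
    battery_cost = 0.05                      # capacity cost per kWh over the horizon

    prob = pulp.LpProblem("pv_battery_sizing", pulp.LpMinimize)
    cap = pulp.LpVariable("capacity_kwh", lowBound=0)
    charge = pulp.LpVariable.dicts("charge", range(T), lowBound=0)
    discharge = pulp.LpVariable.dicts("discharge", range(T), lowBound=0)
    soc = pulp.LpVariable.dicts("soc", range(T + 1), lowBound=0)
    grid = pulp.LpVariable.dicts("grid_import", range(T), lowBound=0)

    prob += battery_cost * cap + pulp.lpSum(price[t] * grid[t] for t in range(T))
    prob += soc[0] == 0
    for t in range(T):
        prob += soc[t + 1] == soc[t] + 0.95 * charge[t] - discharge[t]
        prob += soc[t + 1] <= cap
        prob += grid[t] + pv[t] + discharge[t] >= load[t] + charge[t]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print(pulp.LpStatus[prob.status], pulp.value(cap), pulp.value(prob.objective))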