Performance, generalizability, and stability are three Reinforcement Learning
(RL) challenges that many practical applications face in combination. However,
state-of-the-art RL algorithms fall short when addressing multiple RL
objectives simultaneously, and current human-driven
design practices might not be well-suited for multi-objective RL. In this paper
we present MetaPG, an evolutionary method that discovers new RL algorithms
represented as graphs, following a multi-objective search criterion in which
different RL objectives are encoded in separate fitness scores. Our findings
show that, when using a graph-based implementation of Soft Actor-Critic (SAC)
to initialize the population, our method is able to find new algorithms that
improve upon SAC's performance and generalizability by 3% and 17%,
respectively, and reduce instability by up to 65%. In addition, we analyze the
graph structure of the best algorithms in the population and offer an
interpretation of specific elements that help trade off performance for
generalizability and vice versa. We validate our findings in three different
continuous control tasks: RWRL Cartpole, RWRL Walker, and Gym Pendulum.
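
To make the multi-objective search criterion concrete, below is a minimal Python sketch of an evolutionary loop in which each candidate algorithm graph receives a separate fitness score per objective (performance, generalizability, stability) and selection keeps the Pareto front. All names (Candidate, evaluate, mutate) and the random scoring are illustrative assumptions, not the paper's implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    graph: list            # placeholder for an RL-algorithm computation graph
    fitness: tuple = None   # (performance, generalizability, stability)

def evaluate(c: Candidate) -> tuple:
    # Stand-in for training/evaluating the candidate on benchmark tasks;
    # here each objective is just a random score in [0, 1].
    return tuple(random.random() for _ in range(3))

def dominates(a: tuple, b: tuple) -> bool:
    # a Pareto-dominates b if it is no worse in every objective
    # and strictly better in at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(pop):
    # Keep candidates not dominated by any other candidate.
    return [c for c in pop
            if not any(dominates(o.fitness, c.fitness) for o in pop if o is not c)]

def mutate(c: Candidate) -> Candidate:
    # Stand-in for a graph mutation (e.g., editing one node of the loss graph).
    return Candidate(graph=list(c.graph) + [random.randint(0, 9)])

# Initialize the population from a single seed graph (analogous to the SAC-based
# graph used to warm-start the search in the paper).
population = [Candidate(graph=[0])]
for c in population:
    c.fitness = evaluate(c)

for generation in range(20):
    children = [mutate(random.choice(population)) for _ in range(8)]
    for c in children:
        c.fitness = evaluate(c)
    # Selection: retain only non-dominated candidates across parents and children.
    population = pareto_front(population + children)

print(f"Pareto front size: {len(population)}")
for c in population:
    print(c.fitness)
```

Keeping separate fitness scores rather than a single weighted sum lets the search maintain a spectrum of trade-offs, so algorithms that favor performance and algorithms that favor generalizability can coexist on the same Pareto front.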