Reinforcement learning provides a compellingly universal approach for learning to achieve an objective, specified by a reward function, through trial-and-error interaction with an environment. While this approach is versatile enough to be applied to almost any objective in any environment, it can be prohibitively inefficient. Humans are able to learn new objectives and improve their capabilities via reinforcement learning on human timescales by building on prior knowledge and skills. However, a large proportion of the reinforcement learning literature considers the traditional problem of learning to perform a task tabula rasa. In this thesis, we aim to improve the efficiency of reinforcement learning, in terms of both sample efficiency and computational efficiency, by incorporating environment understanding and knowledge of prior behaviours via more information-dense supervised learning objectives.
In the first half of the thesis, we aim to acquire knowledge about the environment that can be leveraged for reinforcement learning. We begin by considering how to optimally combine an agent's partial observations into a unified representation of an environment. We introduce a novel approach that integrates partial information into a single representation more effectively than other self-supervised approaches. We then develop this general idea of learning environment representations into a diffusion-based approach for learning a full generative model of an environment. An agent can then perform model-based reinforcement learning by interacting with its environment model rather than the true environment, thereby reducing its dependence on the true environment and improving the sample efficiency of reinforcement learning. We demonstrate that our diffusion-based approach captures visual details more effectively than related world modelling approaches, leading to greater performance and sample efficiency. However, this does not reduce the computational cost of reinforcement learning; in fact, it increases it, due to the additional cost of environment modelling.
In the second half of the thesis, we therefore aim to reduce the computational cost of tabula-rasa reinforcement learning by incorporating imitation learning on prior behaviours, providing an initial behaviour that can be efficiently improved with model-free reinforcement learning. We begin by considering the offline-only case from proprioceptive states with a clear objective, and demonstrate that our proposed value-based approach leads to improved performance and computational efficiency over imitation learning and reinforcement learning approaches in isolation. We then extend this idea to the more general offline-to-online case from visual observations without a well-defined reward function. The proposed training procedure is analogous to that used for modern large language models, providing many exciting directions for future research. We conclude by considering future directions for generative world models and generalist agents.