Concept Learning with Energy-Based Models
Many hallmarks of human intelligence, such as generalizing from limited
experience, abstract reasoning and planning, analogical reasoning, creative
problem solving, and capacity for language, require the ability to consolidate
experience into concepts, which act as basic building blocks of understanding
and reasoning. We present a framework that defines a concept by an energy
function over events in the environment, as well as an attention mask over
entities participating in the event. Given a few demonstration events, our method uses an inference-time optimization procedure to generate events involving similar concepts or to identify the entities involved in a concept. We evaluate our framework on learning visual, quantitative, relational, and temporal concepts from demonstration events in an unsupervised manner. Our approach successfully generates and identifies concepts in a few-shot setting, and the resulting learned concepts can be reused across environments. Example videos of our results are available at sites.google.com/site/energyconceptmodel.
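To make the inference-time procedure concrete, here is a minimal sketch with a toy quadratic energy standing in for the learned network: a candidate event is optimized by gradient descent until its energy under the concept (and attention mask) is low. All names and dimensions below are illustrative assumptions, not the authors' code.

    import torch

    # Toy stand-in for the learned energy E(x, a, w): low when the attended
    # parts of event x agree with the concept vector w.
    def energy_fn(x, attention, concept):
        return ((attention * x - concept) ** 2).sum()

    # Generate an event for a concept by descending the energy at inference time.
    def generate_event(concept, attention, dim=8, steps=100, lr=0.1):
        x = torch.randn(dim, requires_grad=True)   # random initial event
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            energy_fn(x, attention, concept).backward()
            opt.step()
        return x.detach()

    concept = torch.randn(8)
    attention = torch.ones(8)            # attend to every entity dimension
    event = generate_event(concept, attention)
    print(float(energy_fn(event, attention, concept)))  # near zero

Identification works analogously: hold the demonstrated event fixed and optimize the attention mask instead.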
Emergence of Grounded Compositional Language in Multi-Agent Populations
By capturing statistical patterns in large corpora, machine learning has
enabled significant advances in natural language processing, including in
machine translation, question answering, and sentiment analysis. However, for
agents to intelligently interact with humans, simply capturing the statistical
patterns is insufficient. In this paper we investigate if, and how, grounded
compositional language can emerge as a means to achieve goals in multi-agent
populations. Towards this end, we propose a multi-agent learning environment and learning methods that bring about the emergence of a basic compositional language. This language is represented as streams of abstract discrete symbols uttered by agents over time, but nonetheless has a coherent structure with a defined vocabulary and syntax. We also observe the emergence of non-verbal communication, such as pointing and guiding, when language communication is unavailable.
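One practical ingredient in training such a system end to end is emitting discrete symbols while keeping the policy differentiable; the straight-through Gumbel-softmax below is one common way to do this. This is a hedged sketch, not the paper's code; the vocabulary size and the linear speaker network are assumptions.

    import torch
    import torch.nn.functional as F

    vocab_size, obs_dim = 16, 10
    speaker = torch.nn.Linear(obs_dim, vocab_size)  # observation -> symbol logits

    obs = torch.randn(4, obs_dim)                   # a batch of agent observations
    logits = speaker(obs)
    # hard=True emits one-hot symbols in the forward pass, while gradients
    # flow through the soft relaxation in the backward pass.
    symbols = F.gumbel_softmax(logits, tau=1.0, hard=True)
    print(symbols.argmax(dim=-1))                   # indices of the uttered symbols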
Interpretable and Pedagogical Examples
Teachers intentionally pick the most informative examples to show their
students. However, if the teacher and student are neural networks, the examples
that the teacher network learns to give, although effective at teaching the
student, are typically uninterpretable. We show that training the student and
teacher iteratively, rather than jointly, can produce interpretable teaching
strategies. We evaluate interpretability by (1) measuring the similarity of the
teacher's emergent strategies to intuitive strategies in each domain and (2)
conducting human experiments to evaluate how effective the teacher's strategies
are at teaching humans. We show that the teacher network learns to select or
generate interpretable, pedagogical examples to teach rule-based,
probabilistic, boolean, and hierarchical concepts.
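As a toy illustration of the kind of intuitive strategy the paper compares against, consider teaching a 1-D threshold concept: the pedagogically natural move is to show the tightest bracketing positive/negative pair, from which the student infers the boundary. Everything below (the example pool, both strategies) is an illustrative assumption, not the paper's setup.

    import numpy as np

    pool = np.linspace(0.0, 1.0, 21)     # candidate examples the teacher may show

    def teacher_pick(threshold):
        # Intuitive pedagogical strategy: the tightest pair bracketing the boundary.
        neg = pool[pool < threshold].max()
        pos = pool[pool >= threshold].min()
        return np.array([neg, pos]), np.array([0, 1])

    def student_guess(examples, labels):
        # Student infers the boundary as the midpoint of the bracketing pair.
        return (examples[labels == 0].max() + examples[labels == 1].min()) / 2

    examples, labels = teacher_pick(0.37)
    print(student_guess(examples, labels))  # 0.375, close to the true threshold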
Prediction and Control with Temporal Segment Models
We introduce a method for learning the dynamics of complex nonlinear systems
based on deep generative models over temporal segments of states and actions.
Unlike dynamics models that operate over individual discrete timesteps, we
learn the distribution over future state trajectories conditioned on past
state, past action, and planned future action trajectories, as well as a latent
prior over action trajectories. Our approach is based on convolutional
autoregressive models and variational autoencoders. It makes stable and
accurate predictions over long horizons for complex, stochastic systems,
effectively expressing uncertainty and modeling the effects of collisions,
sensory noise, and action delays. The learned dynamics model and action prior
can be used for end-to-end, fully differentiable trajectory optimization and
model-based policy optimization, which we use to evaluate the performance and
sample efficiency of our method.
Comment: camera-ready version, ICML 2017
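A minimal sketch of the segment-level latent-variable idea, with simple linear encoder/decoder stand-ins for the paper's convolutional autoregressive VAE: the model is trained on, and samples, whole future state segments conditioned on the past segment and the planned future actions. All dimensions are illustrative.

    import torch
    import torch.nn as nn

    H, S, A = 5, 3, 2                     # segment length, state dim, action dim
    ctx_dim = H * (S + A) + H * A         # past states+actions, planned future actions
    seg_dim = H * S                       # future state segment (flattened)
    z_dim = 8

    enc = nn.Linear(seg_dim + ctx_dim, 2 * z_dim)   # -> (mu, log_var)
    dec = nn.Linear(z_dim + ctx_dim, seg_dim)

    def elbo_loss(future_seg, context):
        stats = enc(torch.cat([future_seg, context]))
        mu, log_var = stats[:z_dim], stats[z_dim:]
        z = mu + torch.exp(0.5 * log_var) * torch.randn(z_dim)  # reparameterize
        recon = dec(torch.cat([z, context]))
        kl = 0.5 * (mu ** 2 + log_var.exp() - 1 - log_var).sum()
        return ((recon - future_seg) ** 2).sum() + kl

    def predict_segment(context):
        z = torch.randn(z_dim)            # sample the latent prior
        return dec(torch.cat([z, context])).view(H, S)

    context, future = torch.randn(ctx_dim), torch.randn(seg_dim)
    print(elbo_loss(future, context), predict_segment(context).shape)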
Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control
We propose a plan online and learn offline (POLO) framework for the setting
where an agent, with an internal model, needs to continually act and learn in
the world. Our work builds on the synergistic relationship between local
model-based control, global value function learning, and exploration. We study
how local trajectory optimization can cope with approximation errors in the
value function, and can stabilize and accelerate value function learning.
Conversely, we also study how approximate value functions can help reduce the
planning horizon and allow for better policies beyond local solutions. Finally,
we also demonstrate how trajectory optimization can be used to perform
temporally coordinated exploration in conjunction with estimating uncertainty
in value function approximation. This exploration is critical for fast and
stable learning of the value function. Combining these components enables solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.
Comment: The first two authors contributed equally. Accepted at ICLR 2019. Supplementary videos available at: https://sites.google.com/view/polo-mp
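A minimal sketch of the core loop, under toy assumptions for the dynamics, reward, and value function: random-shooting trajectory optimization whose short-horizon returns are bootstrapped with V at the end of the lookahead, so the learned value function effectively shortens the required planning horizon.

    import numpy as np

    def dynamics(s, a): return s + 0.1 * a           # toy internal model
    def reward(s, a):   return -float(s @ s)         # drive the state to the origin
    def value(s):       return -5.0 * float(s @ s)   # stand-in learned V(s)

    def plan(s0, horizon=10, n_candidates=64, rng=np.random.default_rng(0)):
        best_action, best_score = None, -np.inf
        for _ in range(n_candidates):
            actions = rng.normal(size=(horizon, s0.shape[0]))
            s, score = s0.copy(), 0.0
            for a in actions:              # short-horizon rollout in the model
                score += reward(s, a)
                s = dynamics(s, a)
            score += value(s)              # bootstrap beyond the horizon
            if score > best_score:
                best_action, best_score = actions[0], score
        return best_action                 # MPC: execute only the first action

    print(plan(np.array([1.0, -1.0])))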
Generative Temporal Difference Learning for Infinite-Horizon Prediction
We introduce the γ-model, a predictive model of environment dynamics with an infinite probabilistic horizon. Replacing standard single-step models with γ-models leads to generalizations of the procedures central to model-based control, including the model rollout and model-based value estimation. The γ-model, trained with a generative reinterpretation of temporal difference learning, is a natural continuous analogue of the successor representation and a hybrid between model-free and model-based mechanisms. Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of task reward. We instantiate the γ-model as both a generative adversarial network and a normalizing flow, discuss how its training reflects an inescapable tradeoff between training-time and testing-time compounding errors, and empirically investigate its utility for prediction and control.
Comment: NeurIPS 2020. Project page at: https://gammamodels.github.io
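The generative TD reinterpretation can be sketched as follows (a toy stand-in, not the paper's training code): a target sample for the γ-model is the observed next state with probability 1 - γ, and a bootstrapped sample from the current model at that next state with probability γ, mirroring the mixture structure of the TD backup.

    import numpy as np

    gamma = 0.95
    rng = np.random.default_rng(0)

    def model_sample(state):
        # Stand-in for drawing a sample from the current gamma-model at `state`.
        return state + rng.normal(scale=0.1, size=state.shape)

    def td_target_sample(next_state):
        if rng.random() < 1.0 - gamma:
            return next_state              # single-step term of the backup
        return model_sample(next_state)    # bootstrapped long-horizon term

    print(td_target_sample(np.zeros(3)))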
A Game Theoretic Framework for Model Based Reinforcement Learning
Model-based reinforcement learning (MBRL) has recently gained immense
interest due to its potential for sample efficiency and ability to incorporate
off-policy data. However, designing stable and efficient MBRL algorithms using
rich function approximators has remained challenging. To help expose the
practical challenges in MBRL and simplify algorithm design from the lens of
abstraction, we develop a new framework that casts MBRL as a game between: (1)
a policy player, which attempts to maximize rewards under the learned model;
(2) a model player, which attempts to fit the real-world data collected by the
policy player. For algorithm development, we construct a Stackelberg game
between the two players, and show that it can be solved with approximate
bi-level optimization. This gives rise to two natural families of algorithms
for MBRL based on which player is chosen as the leader in the Stackelberg game.
Together, they encapsulate, unify, and generalize many previous MBRL
algorithms. Furthermore, our framework is consistent with and provides a clear
basis for heuristics known to be important in practice from prior works.
Finally, through experiments we validate that our proposed algorithms are
highly sample efficient, match the asymptotic performance of model-free policy
gradient, and scale gracefully to high-dimensional tasks like dexterous hand
manipulation.
Comment: Project webpage: https://sites.google.com/view/mbrl-gam
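The approximate bi-level optimization can be sketched with a two-timescale toy (the quadratic scalar objectives are illustrative assumptions): the follower is near-fully optimized against the current leader in an inner loop, then the leader takes one slower gradient step.

    # Leader = policy player, follower = model player (model-as-leader is the
    # symmetric variant). Toy scalar objectives stand in for the real losses.
    def follower_grad(theta, phi): return 2.0 * (phi - theta)   # d/dphi (phi - theta)^2
    def leader_grad(theta, phi):   return -2.0 * (theta - 1.0) - 0.2 * (theta - phi)

    theta, phi = 0.0, 0.0
    for _ in range(200):
        for _ in range(20):                      # inner loop: follower best-responds
            phi -= 0.1 * follower_grad(theta, phi)
        theta += 0.01 * leader_grad(theta, phi)  # outer loop: slow leader ascent
    print(theta, phi)                            # theta -> 1.0, phi tracks theta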
One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control
Reinforcement learning is typically concerned with learning control policies
tailored to a particular agent. We investigate whether there exists a single
global policy that can generalize to control a wide variety of agent
morphologies -- ones in which even the dimensionality of the state and action spaces changes. We propose to express this global policy as a collection of identical modular neural networks, dubbed Shared Modular Policies (SMP), that
correspond to each of the agent's actuators. Every module is only responsible
for controlling its corresponding actuator and receives information from only
its local sensors. In addition, messages are passed between modules,
propagating information between distant modules. We show that a single modular policy can successfully generate locomotion behaviors for several planar agents with different skeletal structures, such as monopod hoppers, quadrupeds, and bipeds, and can generalize to variants not seen during training -- a process that would normally require training and manual hyperparameter tuning for each morphology. We observe that drastically diverse locomotion styles across morphologies, as well as centralized coordination, emerge via message passing between decentralized modules, purely from the reinforcement learning objective.
Comment: Accepted at ICML 2020. Videos and code at https://huangwl18.github.io/modular-rl/
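A minimal sketch of the shared-module idea, with the sizes, chain topology, and number of message rounds as illustrative assumptions: a single weight-shared network is applied at every actuator, consuming that actuator's local sensor input plus a neighbor's message and emitting a torque plus an outgoing message.

    import torch
    import torch.nn as nn

    obs_dim, msg_dim = 4, 8
    module = nn.Linear(obs_dim + msg_dim, 1 + msg_dim)  # shared by all actuators

    def act(local_obs):                    # local_obs: (n_joints, obs_dim)
        n = local_obs.shape[0]
        msgs = torch.zeros(n, msg_dim)     # incoming messages start at zero
        for _ in range(2):                 # a few message-passing rounds
            out = module(torch.cat([local_obs, msgs], dim=-1))
            torques, new_msgs = out[:, :1], out[:, 1:]
            # Each joint receives its chain neighbor's message (shift by one).
            msgs = torch.roll(new_msgs, shifts=1, dims=0)
        return torques.squeeze(-1)

    print(act(torch.randn(5, obs_dim)))    # one torque per joint, for any n_joints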
Model Based Planning with Energy Based Models
Model-based planning holds great promise for improving both sample efficiency
and generalization in reinforcement learning (RL). We show that energy-based
models (EBMs) are a promising class of models to use for model-based planning.
EBMs naturally support inference of intermediate states given start and goal
state distributions. We provide an online algorithm to train EBMs while
interacting with the environment, and show that EBMs allow for significantly
better online learning than corresponding feed-forward networks. We further
show that EBMs support maximum entropy state inference and are able to generate
diverse state space plans. We show that inference purely in state space -
without planning actions - allows for better generalization to previously
unseen obstacles in the environment and prevents the planner from exploiting
the dynamics model by applying uncharacteristic action sequences. Finally, we
show that online EBM training naturally leads to intentionally planned state
exploration, which performs significantly better than random exploration.
Comment: CoRL 2019
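State-space planning with an EBM can be sketched as direct optimization of the intermediate states of a plan, with start and goal held fixed; the smoothness energy below is a toy stand-in for a learned energy over consecutive state pairs.

    import torch

    def traj_energy(states):
        # Toy stand-in EBM: penalize large jumps between consecutive states.
        return ((states[1:] - states[:-1]) ** 2).sum()

    def plan_states(start, goal, n_mid=8, steps=200, lr=0.05):
        mid = torch.randn(n_mid, start.shape[0], requires_grad=True)
        opt = torch.optim.Adam([mid], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            traj_energy(torch.cat([start[None], mid, goal[None]])).backward()
            opt.step()
        return torch.cat([start[None], mid.detach(), goal[None]])

    # Intermediate states settle into a low-energy path from start to goal.
    print(plan_states(torch.zeros(2), torch.ones(2)))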
Emergent Complexity via Multi-Agent Competition
Reinforcement learning algorithms can train agents that solve problems in
complex, interesting environments. Normally, the complexity of the trained
agent is closely related to the complexity of the environment. This suggests
that a highly capable agent requires a complex environment for training. In
this paper, we point out that agents trained with self-play in a competitive multi-agent environment can produce behaviors that are far more complex than the
environment itself. We also point out that such environments come with a
natural curriculum, because for any skill level, an environment full of agents
of this level will have the right level of difficulty. This work introduces
several competitive multi-agent environments where agents compete in a 3D world
with simulated physics. The trained agents learn a wide variety of complex and
interesting skills, even though the environments themselves are relatively
simple. The skills include behaviors such as running, blocking, ducking,
tackling, fooling opponents, kicking, and defending using both arms and legs. A
highlight of the learned behaviors can be found here: https://goo.gl/eR7fbX
Comment: Published as a conference paper at ICLR 2018
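The natural-curriculum property can be sketched as opponent sampling from past checkpoints of the learner itself, so opponent skill grows in lockstep with the agent. The snapshot interval and the stubbed training step below are illustrative assumptions.

    import random

    def train_episode(agent, opponent):
        pass  # stand-in: run one competitive episode and update `agent`

    checkpoints = ["agent_v0"]                # frozen past versions of the agent
    agent = "agent_v0"
    for step in range(1, 1001):
        opponent = random.choice(checkpoints)     # opponent at roughly our level
        train_episode(agent, opponent)
        if step % 100 == 0:
            checkpoints.append(f"agent_v{step}")  # snapshot the current agent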