11 research outputs found

    DiCE: The Infinitely Differentiable Monte-Carlo Estimator

    The score function estimator is widely used for estimating gradients of stochastic objectives in stochastic computation graphs (SCG), e.g., in reinforcement learning and meta-learning. While deriving the first-order gradient estimators by differentiating a surrogate loss (SL) objective is computationally and conceptually simple, using the same approach for higher-order derivatives is more challenging. Firstly, analytically deriving and implementing such estimators is laborious and not compliant with automatic differentiation. Secondly, repeatedly applying SL to construct new objectives for each order derivative involves increasingly cumbersome graph manipulations. Lastly, to match the first-order gradient under differentiation, SL treats part of the cost as a fixed sample, which we show leads to missing and wrong terms for estimators of higher-order derivatives. To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct estimators of derivatives of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations. We verify the correctness of DiCE both through a proof and numerical evaluation of the DiCE derivative estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning. Our code is available at https://www.github.com/alshedivat/lola
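The "missing terms" point in this abstract can be illustrated with a toy case. The sketch below is not the authors' code. It assumes a single Bernoulli(theta) variable and a cost that does not depend on theta (the cost values are hypothetical), and it computes expectations exactly by enumeration instead of sampling. The full second-order estimator is what one would get by differentiating twice an objective of the form exp(log p - stop_gradient(log p)) * f, which is how DiCE is usually described; that form is recalled from the paper, not stated in this listing. The naive variant differentiates a surrogate log p * f twice with the cost held fixed.

```python
# Illustrative sketch only: exact enumeration over x ~ Bernoulli(theta).
# True objective J(theta) = theta * f(1) + (1 - theta) * f(0), so
# dJ/dtheta = f(1) - f(0) and d2J/dtheta2 = 0.

def prob(x, theta):
    """Probability of outcome x under Bernoulli(theta)."""
    return theta if x == 1 else 1.0 - theta

def log_prob_derivs(x, theta):
    """First and second derivatives of log p(x; theta) w.r.t. theta."""
    if x == 1:
        return 1.0 / theta, -1.0 / theta ** 2
    return -1.0 / (1.0 - theta), -1.0 / (1.0 - theta) ** 2

def expected_estimators(cost, theta):
    """Exact expectations of three estimators, by summing over x."""
    first = second_full = second_naive = 0.0
    for x in (0, 1):
        p, f = prob(x, theta), cost(x)
        d1, d2 = log_prob_derivs(x, theta)
        first += p * f * d1                    # score function estimator
        second_full += p * f * (d1 * d1 + d2)  # keeps the (d log p)^2 term
        second_naive += p * f * d2             # cost treated as a fixed sample
    return first, second_full, second_naive

def cost(x):
    """Hypothetical cost values chosen for illustration."""
    return 3.0 if x == 1 else 1.0

theta = 0.3
first, second_full, second_naive = expected_estimators(cost, theta)
print(first, second_full, second_naive)
```

In this toy case the first-order estimator and the full second-order estimator match the true derivatives (2 and 0), while the naive second-order value is far from 0 because the squared score term is dropped.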

    DiCE: The infinitely differentiable Monte Carlo estimator

    The score function estimator is widely used for estimating gradients of stochastic objectives in stochastic computation graphs (SCG), e.g., in reinforcement learning and meta-learning. While deriving the first order gradient estimators by differentiating a surrogate loss (SL) objective is computationally and conceptually simple, using the same approach for higher order derivatives is more challenging. Firstly, analytically deriving and implementing such estimators is laborious and not compliant with automatic differentiation. Secondly, repeatedly applying SL to construct new objectives for each order derivative involves increasingly cumbersome graph manipulations. Lastly, to match the first order gradient under differentiation, SL treats part of the cost as a fixed sample, which we show leads to missing and wrong terms for estimators of higher order derivatives. To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct estimators of derivatives of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations. We verify the correctness of DiCE both through a proof and numerical evaluation of the DiCE derivative estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning. Our code is available at https://goo.gl/xkkGxN

    A baseline for any order gradient estimation in stochastic computation graphs

    By enabling correct differentiation in Stochastic Computation Graphs (SCGs), the infinitely differentiable Monte-Carlo estimator (DiCE) can generate correct estimates for the higher order gradients that arise in, e.g., multi-agent reinforcement learning and meta-learning. However, the baseline term in DiCE that serves as a control variate for reducing variance applies only to first order gradient estimation, limiting the utility of higher order gradient estimates. To improve the sample efficiency of DiCE, we propose a new baseline term for higher order gradient estimation. This term may be easily included in the objective, and produces unbiased variance-reduced estimators under (automatic) differentiation, without affecting the estimate of the objective itself or of the first order gradient estimate. It reuses the same baseline function (e.g., the state-value function in reinforcement learning) already used for the first order baseline. We provide theoretical analysis and numerical evaluations of this new baseline, which demonstrate that it can dramatically reduce the variance of DiCE’s second order gradient estimators and also show empirically that it reduces the variance of third and fourth order gradients. This computational tool can be easily used to estimate higher order gradients with unprecedented efficiency and simplicity wherever automatic differentiation is utilised, and it has the potential to unlock applications of higher order gradients in reinforcement learning and meta-learning.

    Learning with opponent-learning awareness

    Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but can also be extended to hierarchical reinforcement learning, generative adversarial networks and decentralised optimization. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes an additional term that accounts for the impact of one agent’s policy on the anticipated parameter update of the other agents. Preliminary results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoners’ dilemma (IPD), while independent learning does not. In this domain, LOLA also receives higher payouts compared to a naive learner, and is robust against exploitation by higher order gradient-based methods. Applied to infinitely repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round robin tournament we show that LOLA agents can successfully shape the learning of a range of multi-agent learning algorithms from the literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently calculated using an extension of the likelihood ratio policy gradient estimator, making the method suitable for model-free reinforcement learning. This method thus scales to large parameter and input spaces and nonlinear function approximators. We also apply LOLA to a grid world task with an embedded social dilemma using deep recurrent policies and opponent modelling. Again, by explicitly considering the learning of the other agent, LOLA agents learn to cooperate out of self-interest.

    Fully memristive neural networks for pattern classification with unsupervised learning

    This paper was accepted for publication in the journal Nature Electronics and the definitive published version is available at https://doi.org/10.1038/s41928-018-0023-2. Neuromorphic computers comprised of artificial neurons and synapses could provide a more efficient approach to implementing neural network algorithms than traditional hardware. Recently, artificial neurons based on memristors have been developed, but with limited bio-realistic dynamics and no direct interaction with the artificial synapses in an integrated network. Here we show that a diffusive memristor based on silver nanoparticles in a dielectric film can be used to create an artificial neuron with stochastic leaky integrate-and-fire dynamics and tunable integration time, which is determined by silver migration alone or its interaction with circuit capacitance. We integrate these neurons with nonvolatile memristive synapses to build fully memristive artificial neural networks. With these integrated networks, we experimentally demonstrate unsupervised synaptic weight updating and pattern classification.