    Learning with Opponent-Learning Awareness

    Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also extends to hierarchical RL, generative adversarial networks and decentralised optimisation. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes a term that accounts for the impact of one agent's policy on the anticipated parameter update of the other agents. Results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoners' dilemma (IPD), while independent learning does not. In this domain, LOLA also receives higher payouts than a naive learner and is robust against exploitation by higher-order gradient-based methods. Applied to repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round-robin tournament we show that LOLA agents successfully shape the learning of a range of multi-agent learning algorithms from the literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently calculated using an extension of the policy gradient estimator, making the method suitable for model-free RL. The method thus scales to large parameter and input spaces and nonlinear function approximators. We apply LOLA to a grid-world task with an embedded social dilemma, using recurrent policies and opponent modelling. By explicitly considering the learning of the other agent, LOLA agents learn to cooperate out of self-interest. The code is at github.com/alshedivat/lola.
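
    To make the shaping term concrete, here is a sketch of the LOLA update in its exact-gradient form, with notation assumed for illustration (V^i is agent i's expected return, theta_i its policy parameters, delta and eta the two step sizes); the model-free variant mentioned above replaces these exact gradients with policy-gradient estimates:

        % Naive learner: follow only your own gradient.
        \theta_1 \leftarrow \theta_1 + \delta \,\nabla_{\theta_1} V^1(\theta_1, \theta_2)

        % LOLA (first order): additionally differentiate through the opponent's
        % anticipated step \Delta\theta_2 = \eta \,\nabla_{\theta_2} V^2(\theta_1, \theta_2).
        \theta_1 \leftarrow \theta_1
            + \delta \,\nabla_{\theta_1} V^1(\theta_1, \theta_2)
            + \delta\eta \,\big(\nabla_{\theta_2} V^1(\theta_1, \theta_2)\big)^{\top}
              \nabla_{\theta_1}\nabla_{\theta_2} V^2(\theta_1, \theta_2)

    The extra term credits agent 1 for parameter changes that steer the opponent's next update in a direction favourable to agent 1's return, which is what drives the emergence of tit-for-tat in the IPD.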

    DiCE: The Infinitely Differentiable Monte-Carlo Estimator

    The score function estimator is widely used for estimating gradients of stochastic objectives in stochastic computation graphs (SCGs), e.g., in reinforcement learning and meta-learning. While deriving first-order gradient estimators by differentiating a surrogate loss (SL) objective is computationally and conceptually simple, using the same approach for higher-order derivatives is more challenging. Firstly, analytically deriving and implementing such estimators is laborious and does not comply with automatic differentiation. Secondly, repeatedly applying SL to construct new objectives for each order of derivative involves increasingly cumbersome graph manipulations. Lastly, to match the first-order gradient under differentiation, SL treats part of the cost as a fixed sample, which we show leads to missing and wrong terms for estimators of higher-order derivatives. To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct estimators of derivatives of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations. We verify the correctness of DiCE both through a proof and through numerical evaluation of the DiCE derivative estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning. Our code is available at https://www.github.com/alshedivat/lola.
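
    As a concrete illustration of that single objective, below is a minimal sketch of the MagicBox operator at the heart of DiCE, written with PyTorch. The function names and the assumption that all costs share one set of stochastic dependencies are simplifications for illustration (the full objective tracks per-cost dependencies and adds a baseline term); see the linked repository for the authors' implementation.

        import torch

        def magic_box(log_probs):
            # log_probs: sum of log-probabilities of the stochastic nodes a cost
            # depends on. Forward value is exp(0) = 1, so the objective's value is
            # unchanged, but repeated differentiation produces the score-function
            # terms of every order, because log_probs keeps its dependence on the
            # parameters while the detached copy does not.
            return torch.exp(log_probs - log_probs.detach())

        def dice_objective(log_probs, costs):
            # Weight each cost by the MagicBox of its stochastic dependencies and
            # sum; this single objective can be differentiated any number of times.
            return (magic_box(log_probs) * costs).sum()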

    Gut microbiome and antibiotic resistance effects during travelers' diarrhea treatment and prevention

    The travelers' gut microbiome is potentially assaulted by acute and chronic perturbations (e.g., diarrhea, antibiotic use, and different environments). Prior studies of the impact of travel and travelers' diarrhea (TD) on the microbiome have not directly compared antibiotic regimens, and studies of different antibiotic regimens have not considered travelers' microbiomes. Addressing this gap is important because the use of antibiotics to treat or prevent TD, even in moderate to severe cases or in regions with a high infectious disease burden, is controversial owing to concerns about unintended consequences for the gut microbiome and the emergence of antimicrobial resistance (AMR). Our study addresses this gap by evaluating the impact of defined antibiotic regimens (single-dose treatment or daily prophylaxis) on the gut microbiomes and resistomes of deployed servicemembers, using samples collected during clinical trials. Our findings indicate that the antibiotic treatment regimens studied generally do not lead to adverse effects on the gut microbiome and resistome, and they identify the relative risks associated with prophylaxis. These results can be used to inform therapeutic guidelines for the prevention and treatment of TD and represent progress toward using microbiome information in personalized medical care.

    Using informative behavior to increase engagement while learning from human reward

    In this work, we address a relatively unexplored aspect of designing agents that learn from human reward. We investigate how an agent's non-task behavior can affect a human trainer's training and the agent's learning. We use the TAMER framework, which facilitates the training of agents by human-generated reward signals, i.e., judgements of the quality of the agent's actions, as the foundation for our investigation. Then, starting from the premise that the interaction between the agent and the trainer should be bi-directional, we propose two new training interfaces to increase a human trainer's active involvement in the training process and thereby improve the agent's task performance. One provides information on the agent's uncertainty, a metric computed from data coverage; the other on its performance. Our results from a 51-subject user study show that these interfaces can induce the trainers to train longer and give more feedback. The agent's performance, however, increases only in response to the addition of performance-oriented information, not to the sharing of uncertainty levels. These results suggest that the organizational maxim about human behavior, "you get what you measure" (sharing metrics with people causes them to focus on optimizing those metrics while de-emphasizing other objectives), also applies to the training of agents. Using principal component analysis, we show how trainers in the two conditions train agents differently. In addition, by simulating the influence of the agent's uncertainty-informative behavior on a human's training behavior, we show that trainers could be distracted by the agent sharing its uncertainty levels about its actions, giving poor feedback for the sake of reducing the agent's uncertainty without improving the agent's performance.
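
    For readers unfamiliar with TAMER, the following is a simplified, hypothetical Python sketch of the framework's basic shape as described above: the agent regresses a model of the human trainer's reward signal and acts greedily with respect to it. The class and function names, the linear model, and the myopic action selection are assumptions for illustration, not the study's implementation, and the uncertainty and performance interfaces evaluated in the paper are not shown.

        import numpy as np

        class HumanRewardModel:
            """Hypothetical linear model H_hat(s, a) of the trainer's reward."""

            def __init__(self, n_features, lr=0.01):
                self.w = np.zeros(n_features)
                self.lr = lr

            def predict(self, features):
                return float(self.w @ features)

            def update(self, features, human_reward):
                # Incremental least-squares step toward the trainer's signal.
                error = human_reward - self.predict(features)
                self.w += self.lr * error * features

        def choose_action(model, candidate_features):
            # Act myopically: pick the action with the highest predicted human reward.
            scores = [model.predict(f) for f in candidate_features]
            return int(np.argmax(scores))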

    Measurement of the Dipion Mass Spectrum in X(3872) -> J/Psi Pi+ Pi- Decays

    We measure the dipion mass spectrum in X(3872) -> J/Psi Pi+ Pi- decays using 360 pb^-1 of pbar-p collisions at 1.96 TeV collected with the CDF II detector. The spectrum is fit with predictions for odd C-parity (3S1, 1P1, and 3DJ) charmonia decaying to J/Psi Pi+ Pi-, as well as even C-parity states in which the pions are from Rho0 decay. The latter case also encompasses exotic interpretations, such as a D0-D*0Bar molecule. Only the 3S1 and J/Psi Rho hypotheses are compatible with our data. Since 3S1 is untenable on other grounds, decay via J/Psi Rho is favored, which implies C=+1 for the X(3872). Models for different J/Psi-Rho angular momenta L are considered. Flexibility in the models, especially the introduction of Rho-Omega interference, enables good descriptions of our data for both L=0 and 1.

    Search for Pair Production of Scalar Top Quarks Decaying to a tau Lepton and a b Quark in ppbar Collisions at sqrt{s}=1.96 TeV

    We search for pair production of supersymmetric top quarks (~t_1), followed by R-parity violating decay ~t_1 -> tau b with a branching ratio beta, using 322 pb^-1 of ppbar collisions at sqrt{s}=1.96 TeV collected by the CDF II detector at Fermilab. Two candidate events pass our final selection criteria, consistent with the standard model expectation. We set upper limits on the cross section sigma(~t_1 ~tbar_1)*beta^2 as a function of the stop mass m(~t_1). Assuming beta=1, we set a 95% confidence level limit m(~t_1)>153 GeV/c^2. The limits are also applicable to the case of a third generation scalar leptoquark (LQ_3) decaying LQ_3 -> tau b.

    Search for Higgs Boson Decaying to b-bbar and Produced in Association with W Bosons in p-pbar Collisions at sqrt{s}=1.96 TeV

    We present a search for Higgs bosons decaying into b-bbar and produced in association with W bosons in p-pbar collisions at sqrt{s}=1.96 TeV. This search uses 320 pb^-1 of the dataset accumulated by the upgraded Collider Detector at Fermilab. Events are selected that have a high-transverse-momentum electron or muon, missing transverse energy, and two jets, one of which is consistent with the hadronization of a b quark. Both the number of events and the dijet mass distribution are consistent with standard model background expectations, and we set 95% confidence level upper limits on the production cross section times branching ratio for the Higgs boson or any new particle with similar decay kinematics. These upper limits range from 10 pb for mH=110 GeV/c^2 to 3 pb for mH=150 GeV/c^2.

    Search for Second-Generation Scalar Leptoquarks in ppbar Collisions at sqrt{s}=1.96 TeV

    Results of a search for pair production of second-generation scalar leptoquarks in ppbar collisions at sqrt{s}=1.96 TeV are reported. The data analyzed were collected by the CDF detector during the 2002-2003 Tevatron Run II and correspond to an integrated luminosity of 198 pb^-1. Leptoquarks (LQ) are sought through their decay into (charged) leptons and quarks, with final-state signatures of either two muons plus jets, or one muon plus large missing transverse energy plus jets. We observe no evidence for LQ pair production and derive 95% C.L. upper limits on the LQ pair-production cross section, as well as lower limits on the LQ mass, as a function of beta, where beta is the branching fraction for LQ -> mu q.