Learning with Opponent-Learning Awareness
Multi-agent settings are quickly gathering importance in machine learning.
This includes a plethora of recent work on deep multi-agent reinforcement
learning, but can also be extended to hierarchical RL, generative adversarial
networks and decentralised optimisation. In all these settings the presence of
multiple learning agents renders the training problem non-stationary and often
leads to unstable training or undesired final results. We present Learning with
Opponent-Learning Awareness (LOLA), a method in which each agent shapes the
anticipated learning of the other agents in the environment. The LOLA learning
rule includes a term that accounts for the impact of one agent's policy on the
anticipated parameter update of the other agents. Results show that the
encounter of two LOLA agents leads to the emergence of tit-for-tat and
therefore cooperation in the iterated prisoners' dilemma (IPD), while independent
learning does not. In this domain, LOLA also receives higher payouts compared
to a naive learner, and is robust against exploitation by higher order
gradient-based methods. Applied to repeated matching pennies, LOLA agents
converge to the Nash equilibrium. In a round-robin tournament, we show that LOLA
agents successfully shape the learning of a range of multi-agent learning
algorithms from the literature, resulting in the highest average returns on the
IPD. We also show that the LOLA update rule can be efficiently calculated using
an extension of the policy gradient estimator, making the method suitable for
model-free RL. The method thus scales to large parameter and input spaces and
nonlinear function approximators. We apply LOLA to a grid world task with an
embedded social dilemma using recurrent policies and opponent modelling. By
explicitly considering the learning of the other agent, LOLA agents learn to
cooperate out of self-interest. The code is at github.com/alshedivat/lola
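
To make the shaping term concrete, below is a minimal numpy sketch of the lookahead idea on the iterated prisoners' dilemma with memory-one policies and exact value functions, loosely in the spirit of the paper's exact-gradient experiments. All function names, hyperparameters, and the use of finite differences are illustrative assumptions; the published method instead uses a first-order Taylor correction and a policy-gradient estimator, and the authors' implementation is in the repository linked above.

# Illustrative sketch only: exact-value IPD with memory-one policies.
import numpy as np

GAMMA = 0.96
R1 = np.array([-1.0, -3.0, 0.0, -2.0])   # player 1 payoffs in states CC, CD, DC, DD
R2 = np.array([-1.0, 0.0, -3.0, -2.0])   # player 2 payoffs (mirrored)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def values(theta1, theta2):
    """Exact discounted returns for memory-one IPD policies.

    theta[0] is the cooperation logit on the first move; theta[1:] are the
    cooperation logits after observing the joint actions CC, CD, DC, DD.
    """
    p, q = sigmoid(theta1), sigmoid(theta2)
    q_cond = q[[1, 3, 2, 4]]      # player 2 sees each state with its own action first
    pc, qc = p[1:], q_cond
    s0 = np.array([p[0] * q[0], p[0] * (1 - q[0]),
                   (1 - p[0]) * q[0], (1 - p[0]) * (1 - q[0])])
    P = np.stack([pc * qc, pc * (1 - qc), (1 - pc) * qc, (1 - pc) * (1 - qc)], axis=1)
    visit = s0 @ np.linalg.inv(np.eye(4) - GAMMA * P)   # discounted state visitation
    return visit @ R1, visit @ R2

def grad(f, x, eps=1e-4):
    """Central finite-difference gradient of a scalar function of a vector."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def lola_step(theta1, theta2, lr=0.3, opp_lr=0.3):
    """Ascend V1 evaluated after the opponent's anticipated naive update."""
    def shaped(t1):
        # The opponent's one-step update depends on our parameters, so
        # differentiating through it yields LOLA's opponent-shaping term.
        d2 = opp_lr * grad(lambda t2: values(t1, t2)[1], theta2)
        return values(t1, theta2 + d2)[0]
    return theta1 + lr * grad(shaped, theta1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta1, theta2 = rng.normal(0, 0.1, 5), rng.normal(0, 0.1, 5)
    for _ in range(300):
        theta1, theta2 = lola_step(theta1, theta2), lola_step(theta2, theta1)
    print("P(cooperate | start, CC, CD, DC, DD):")
    print("agent 1:", np.round(sigmoid(theta1), 2))
    print("agent 2:", np.round(sigmoid(theta2), 2))

In the paper's exact-gradient experiments this kind of setup is what produces tit-for-tat-like, reciprocating policies, whereas replacing lola_step with plain gradient ascent on values(theta1, theta2)[0] corresponds to the naive learner that defects.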
DiCE: The Infinitely Differentiable Monte-Carlo Estimator
The score function estimator is widely used for estimating gradients of
stochastic objectives in stochastic computation graphs (SCGs), e.g., in
reinforcement learning and meta-learning. While deriving the first-order
gradient estimators by differentiating a surrogate loss (SL) objective is
computationally and conceptually simple, using the same approach for
higher-order derivatives is more challenging. Firstly, analytically deriving
and implementing such estimators is laborious and not compliant with automatic
differentiation. Secondly, repeatedly applying SL to construct new objectives
for each order of derivative involves increasingly cumbersome graph manipulations.
Lastly, to match the first-order gradient under differentiation, SL treats part
of the cost as a fixed sample, which we show leads to missing and wrong terms
for estimators of higher-order derivatives. To address all these shortcomings
in a unified way, we introduce DiCE, which provides a single objective that can
be differentiated repeatedly, generating correct estimators of derivatives of
any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for
performing the requisite graph manipulations. We verify the correctness of DiCE
both through a proof and numerical evaluation of the DiCE derivative estimates.
We also use DiCE to propose and evaluate a novel approach for multi-agent
learning. Our code is available at https://www.github.com/alshedivat/lola
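
As a concrete illustration, here is a small PyTorch sketch of the DiCE "MagicBox" operator, which the paper defines as exp(x - stop_gradient(x)): it evaluates to 1 in the forward pass but reintroduces the score-function terms under repeated differentiation. The Bernoulli toy objective, sample size, and variable names below are illustrative assumptions and are not taken from the authors' code.

# Illustrative sketch only: DiCE MagicBox on a toy Bernoulli objective.
import torch

def magic_box(logp):
    # Forward value is exactly 1; differentiating any number of times
    # reintroduces d(logp)/d(theta) terms via the exponential.
    return torch.exp(logp - logp.detach())

theta = torch.tensor(0.3, requires_grad=True)
p = torch.sigmoid(theta)                              # P(x = 1)

# Draw samples and treat them as fixed data.
x = torch.bernoulli(p.detach() * torch.ones(200_000))

# DiCE surrogate for E[x]: each cost node is multiplied by the MagicBox of the
# log-probabilities of the stochastic nodes it depends on.
logp = x * torch.log(p) + (1 - x) * torch.log(1 - p)
objective = (magic_box(logp) * x).mean()

g1, = torch.autograd.grad(objective, theta, create_graph=True)  # 1st-order estimate
g2, = torch.autograd.grad(g1, theta)                            # 2nd-order estimate

# Analytic references: E[x] = sigmoid(theta), so the true derivatives are
# p*(1-p) and p*(1-p)*(1-2p); the estimates above should be close.
print(float(g1), float(p * (1 - p)))
print(float(g2), float(p * (1 - p) * (1 - 2 * p)))

The same objective can be differentiated a third time or more without any further graph surgery, which is the property the abstract refers to; a surrogate-loss construction would require building a new objective at each order.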
Gut microbiome and antibiotic resistance effects during travelers' diarrhea treatment and prevention
The travelers' gut microbiome is potentially assaulted by acute and chronic perturbations (e.g., diarrhea, antibiotic use, and different environments). Prior studies of the impact of travel and travelers' diarrhea (TD) on the microbiome have not directly compared antibiotic regimens, and studies of different antibiotic regimens have not considered travelers' microbiomes. Addressing this gap is important because the use of antibiotics to treat or prevent TD, even in moderate to severe cases or in regions with a high infectious disease burden, is controversial given concerns about unintended consequences for the gut microbiome and the emergence of antimicrobial resistance (AMR). Our study addresses this by evaluating the impact of defined antibiotic regimens (single-dose treatment or daily prophylaxis) on the gut microbiomes and resistomes of deployed servicemembers, using samples collected during clinical trials. Our findings indicate that the antibiotic treatment regimens that were studied generally do not lead to adverse effects on the gut microbiome and resistome, and they identify the relative risks associated with prophylaxis. These results can be used to inform therapeutic guidelines for the prevention and treatment of TD and to make progress toward using microbiome information in personalized medical care.
Using informative behavior to increase engagement while learning from human reward
In this work, we address a relatively unexplored aspect of designing agents that learn from human reward. We investigate how an agent's non-task behavior can affect a human trainer's training and agent learning. We use the TAMER framework, which facilitates the training of agents by human-generated reward signals, i.e., judgements of the quality of the agent's actions, as the foundation for our investigation. Then, starting from the premise that the interaction between the agent and the trainer should be bi-directional, we propose two new training interfaces to increase a human trainer's active involvement in the training process and thereby improve the agent's task performance. One provides information on the agent's uncertainty, a metric calculated from data coverage; the other on its performance. Our results from a 51-subject user study show that these interfaces can induce the trainers to train longer and give more feedback. The agent's performance, however, increases only in response to the addition of performance-oriented information, not to the sharing of uncertainty levels. These results suggest that the organizational maxim about human behavior, "you get what you measure" (i.e., sharing metrics with people causes them to focus on optimizing those metrics while de-emphasizing other objectives), also applies to the training of agents. Using principal component analysis, we show how trainers in the two conditions train agents differently. In addition, by simulating the influence of the agent's uncertainty-informative behavior on a human's training behavior, we show that trainers could be distracted by the agent sharing its uncertainty levels about its actions, giving poor feedback for the sake of reducing the agent's uncertainty without improving the agent's performance.
Measurement of the Dipion Mass Spectrum in X(3872) -> J/Psi Pi+ Pi- Decays
We measure the dipion mass spectrum in X(3872)--> J/Psi Pi+ Pi- decays using
360 pb^-1 of pbar-p collisions at 1.96 TeV collected with the CDF II detector.
The spectrum is fit with predictions for odd C-parity (3S1, 1P1, and 3DJ)
charmonia decaying to J/Psi Pi+ Pi-, as well as even C-parity states in which
the pions are from Rho0 decay. The latter case also encompasses exotic
interpretations, such as a D0-D*0Bar molecule. Only the 3S1 and J/Psi Rho
hypotheses are compatible with our data. Since 3S1 is untenable on other
grounds, decay via J/Psi Rho is favored, which implies C=+1 for the X(3872).
Models for different J/Psi-Rho angular momenta L are considered. Flexibility in
the models, especially the introduction of Rho-Omega interference, enables good
descriptions of our data for both L=0 and 1. Comment: 7 pages, 4 figures -- Submitted to Phys. Rev. Lett.
Search for Pair Production of Scalar Top Quarks Decaying to a tau Lepton and a b Quark in ppbar Collisions at sqrt{s}=1.96 TeV
We search for pair production of supersymmetric top quarks (~t_1), followed
by R-parity violating decay ~t_1 -> tau b with a branching ratio beta, using
322 pb^-1 of ppbar collisions at sqrt{s}=1.96 TeV collected by the CDF II
detector at Fermilab. Two candidate events pass our final selection criteria,
consistent with the standard model expectation. We set upper limits on the
cross section sigma(~t_1 ~tbar_1)*beta^2 as a function of the stop mass
m(~t_1). Assuming beta=1, we set a 95% confidence level limit m(~t_1)>153
GeV/c^2. The limits are also applicable to the case of a third generation
scalar leptoquark (LQ_3) decaying via LQ_3 -> tau b. Comment: 7 pages, 2 eps figures
Search for Higgs Boson Decaying to b-bbar and Produced in Association with W Bosons in p-pbar Collisions at sqrt{s}=1.96 TeV
We present a search for Higgs bosons decaying into b-bbar and produced in
association with W bosons in p-pbar collisions at sqrt{s}=1.96 TeV. This search
uses 320 pb-1 of the dataset accumulated by the upgraded Collider Detector at
Fermilab. Events are selected that have a high-transverse momentum electron or
muon, missing transverse energy, and two jets, one of which is consistent with
the hadronization of a b quark. Both the number of events and the dijet mass
distribution are consistent with standard model background expectations, and we
set 95% confidence level upper limits on the production cross section times
branching ratio for the Higgs boson or any new particle with similar decay
kinematics. These upper limits range from 10 pb for mH=110 GeV/c^2 to 3 pb for
mH=150 GeV/c^2. Comment: 7 pages, 3 figures; updated title to published version
Search for Second-Generation Scalar Leptoquarks in ppbar Collisions at sqrt{s}=1.96 TeV
Results on a search for pair production of second-generation scalar leptoquarks in ppbar collisions at sqrt{s}=1.96 TeV are reported. The data analyzed were collected by the CDF detector during the 2002-2003 Tevatron Run II and correspond to an integrated luminosity of 198 pb^-1. Leptoquarks (LQ) are sought through their decay into (charged) leptons and quarks, with final state signatures represented by two muons and jets, and by one muon, large missing transverse energy, and jets. We observe no evidence for LQ production and derive 95% C.L. upper limits on the production cross sections as well as lower limits on the LQ mass as a function of beta, where beta is the branching fraction for LQ -> mu q. Comment: 9 pages (3 author list) 5 figures