Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Author: AG Barto
C Watkins
D Silver
LJ Lin
R Bellman
RJ Williams
VR Konda
WR Thompson
Publication date: 12/06/2019

Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi. Comment: Accepted at the European Conference on Machine Learning 2019 (ECML
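The phrase "slowly imitating the average greedy policy of the critics" can be illustrated with a minimal tabular sketch. This is not the authors' implementation (see the linked repository for that). It assumes a small discrete state space, Q-tables stored as nested lists, and a hypothetical imitation rate `rate`. The function names are illustrative only.

```python
def average_greedy_policy(critics):
    """Average the greedy (one-hot) policies of several critics.

    critics: list of Q-tables; each Q-table is a list over states of
    lists of action values (an assumed tabular layout).
    """
    n_states = len(critics[0])
    n_actions = len(critics[0][0])
    avg = [[0.0] * n_actions for _ in range(n_states)]
    for q in critics:
        for s in range(n_states):
            # Greedy action of this critic in state s.
            best = max(range(n_actions), key=lambda a: q[s][a])
            avg[s][best] += 1.0 / len(critics)
    return avg


def actor_step(pi, critics, rate=0.05):
    """Move the actor a small step toward the average greedy policy.

    pi: list over states of action-probability lists.
    rate: hypothetical imitation rate in (0, 1]; small values give the
    "slow" imitation described in the abstract.
    """
    target = average_greedy_policy(critics)
    return [
        [(1.0 - rate) * p + rate * t for p, t in zip(pi_s, t_s)]
        for pi_s, t_s in zip(pi, target)
    ]
```

Because each update is a convex combination of two probability distributions, every row of the actor stays a valid distribution. States where the critics disagree keep a spread-out policy, which is one way to read the "state-specific exploration" mentioned above.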

arXiv.org e-Print Archive

VU Research Portal

Crossref

When Does Reward Maximization Lead to Matching Law?

Author: A Soltani
B Alsop
DP Bertsekas
DR Shanks
GM Heyman
GS Corrado
J Mazur
JC Houk
JE Mazur
LP Sugrue
LT DeCarlo
M Davison
P Dayan
P Marbach
RJ Herrnstein
RS Sutton
SC Tanaka
Tim Bussey
Tomoki Fukai
VR Konda
W Schultz
WJ Vaughan
WM Baum
Y Loewenstein
Y Sakai
Yutaka Sakai
Publication venue: Public Library of Science
Publication date: 01/01/2008

What kinds of strategies subjects follow in various behavioral circumstances has been a central issue in decision making. In particular, which behavioral strategy, maximizing or matching, is more fundamental to animals' decision behavior has been a matter of debate. Here, we prove that any algorithm to achieve the stationary condition for maximizing the average reward should lead to matching when it ignores the dependence of the expected outcome on the subject's past choices. We may term this strategy of partial reward maximization “matching strategy”. Then, this strategy is applied to the case where the subject's decision system updates the information for making a decision. Such information includes the subject's past actions or sensory stimuli, and the internal storage of this information is often called “state variables”. We demonstrate that the matching strategy provides an easy way to maximize reward when combined with the exploration of the state variables that correctly represent the crucial information for reward maximization. Our results reveal for the first time how a strategy to achieve matching behavior is beneficial to reward maximization, providing novel insight into the relationship between maximizing and matching.
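For readers unfamiliar with the term, "matching" in the usual sense of Herrnstein's matching law means that the fraction of choices allocated to each option equals the fraction of total reward obtained from that option. The abstract does not give a formula, so the check below is a small sketch under that standard definition. The function name and the example counts are hypothetical and are not data from the paper.

```python
def matching_gap(choices, rewards):
    """Return choice fraction minus reward fraction for each option.

    choices: number of times each option was chosen.
    rewards: total reward obtained from each option.
    Under exact matching, every entry of the result is zero.
    """
    total_choices = sum(choices)
    total_rewards = sum(rewards)
    return [
        c / total_choices - r / total_rewards
        for c, r in zip(choices, rewards)
    ]


# Hypothetical counts: option A chosen 30 times for 15 rewards,
# option B chosen 10 times for 5 rewards.
gaps = matching_gap([30, 10], [15, 5])
```

In this made-up example both options yield the same reward per choice (0.5), so the choice fractions (0.75, 0.25) equal the reward fractions and the gaps vanish.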

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Effect of sonic versus ultrasonic activation on aqueous solution penetration in root canal dentin.

Author: Balandrano Pinal F
Capar ID Ozcan E, Arslan H, Ertas H, Aydinbelge HA
Castelo-Baz P Varela-Patiño P, Cantatore G, Domínguez-Perez A, Ruíz-Piñón M, Miguéns-Vila R, Martín-Biedma B
Comparison of irrigant penetration up to working length and into simulated lateral canals using various irrigating techniques
De Castro PH Pereira JV Jr, Sponchiado EC, Marques AA, Garcia Lda F
Generali L Cavani F, Serena V, Pettenati C, Righi E, Bertoldi C
Ghorbanzadeh A Aminsobhani M, Sohrabi K, Chiniforush N, Ghafari S, Shamshiri AR, Noroozi N
Gokturk H Ozkocak I, Buyukgebiz F, Demir O
Gu LS Kim JR, Ling J, Choi KK, Pashley DH, Tay FR
Helvacıoğlu Kıvanç B Deniz Arısu H, Yanar NO, Silah HM, İnam R, Görgül G
Kanumuru PK Sooraparaju SG, Konda KR, Nujella SK, Reddy BK, Penigalapati SR
Khalap ND Kokate S, Hegde V
Kumar VR Bahuguna N, Manan R
Lloyd A Navarrete G, Marchesan MA, Clement D
Macedo R Verhaagen B, Rivas DF, Versluis M, Wesselink P, van der Sluis L
Mancini M Cerroni L, Iorio L, Armellin E, Conte G, Cianconi L
Merino A Estevez R, de Gregorio C, Cohenca N
Mitchell RP Baumgartner JC, Sedgley CM
Nakamura VC Pinheiro ET, Prado LC, Silveira AC, Carvalho APL, Mayer MPA, Gavini G
Neuhaus KW Liebi M, Stauffacher S, Eick S, Lussi A
Peters OA Schönenberger K, Laib A
Plotino G Cortese T, Grande NM, Leonardi DP, Di Giorgio G, Testarelli L, Gambarini G
Pérez VI Rodríguez PA, Echeverri D
Shahi S Yavari HR, Rahimi S, Eskandarinezhad M, Shakouei S, Unchi M
Sáinz-Pardo M Estevez R, Pablo ÓV, Rossi-Fedele G, Cisneros R
Torabinejad M Khademi AA, Babagoli J, Cho Y, Johnson WB, Bozhilov K, Kim J, Shabahang S
Vadhana S Latha J, Velmurugan N
Ávila S Rosas G, García Salmones JA, Rosas Bernal N, Llamosas Hernández E
Publication venue: 'Facultad de Odontologia, Universidad de Concepcion'

Crossref

Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Author: AG Barto
C Watkins
D Silver
LJ Lin
R Bellman
RJ Williams
VR Konda
WR Thompson
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020