
    The Nature of Belief-Directed Exploratory Choice in Human Decision-Making

    In non-stationary environments, there is a conflict between exploiting currently favored options and gaining information by exploring lesser-known options that in the past have proven less rewarding. Optimal decision-making in such tasks requires considering future states of the environment (i.e., planning) and properly updating beliefs about the state of the environment after observing outcomes associated with choices. Optimal belief-updating is reflective in that beliefs can change without directly observing environmental change. For example, after 10 s elapse, one might correctly believe that a traffic light last observed to be red is now more likely to be green. To understand human decision-making when rewards associated with choice options change over time, we develop a variant of the classic “bandit” task that is both rich enough to encompass relevant phenomena and sufficiently tractable to allow for ideal actor analysis of sequential choice behavior. We evaluate whether people update beliefs about the state of the environment in a reflexive manner (i.e., only in response to observed changes in reward structure) or a reflective manner. In contrast to purely “random” accounts of exploratory behavior, model-based analyses of the subjects’ choices and latencies indicate that people are reflective belief updaters. However, unlike the Ideal Actor model, our analyses indicate that people’s choice behavior does not reflect consideration of future environmental states. Thus, although people update beliefs in a reflective manner consistent with the Ideal Actor, they do not engage in optimal long-term planning, but instead myopically choose on every trial the option believed to have the highest immediate payoff.
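
    As a rough illustration of the distinction between reflexive and reflective belief updating in a non-stationary two-armed bandit (a minimal sketch, not the task or model used in the paper; the hazard rate, reward probabilities, and function names are assumptions):

```python
HAZARD = 0.1              # assumed per-trial probability that the better arm switches
P_GOOD, P_BAD = 0.8, 0.2  # assumed reward probabilities of the better/worse arm

def reflective_update(b_arm0_good, chosen_arm, reward):
    """Reflective update: beliefs first drift through the known switching
    dynamics (they can change without any new observation), then are
    conditioned on the observed outcome. A reflexive updater would skip
    the drift step and change beliefs only after observations."""
    # 1. Propagate belief through the environment dynamics.
    b = b_arm0_good * (1 - HAZARD) + (1 - b_arm0_good) * HAZARD
    # 2. Bayesian update on the reward obtained from the chosen arm.
    p_if_arm0_good = P_GOOD if chosen_arm == 0 else P_BAD
    p_if_arm1_good = P_BAD if chosen_arm == 0 else P_GOOD
    like0 = p_if_arm0_good if reward else 1 - p_if_arm0_good
    like1 = p_if_arm1_good if reward else 1 - p_if_arm1_good
    return like0 * b / (like0 * b + like1 * (1 - b))

def myopic_choice(b_arm0_good):
    """Greedy choice of the arm with the higher believed immediate payoff,
    i.e. no planning over future environmental states."""
    ev0 = b_arm0_good * P_GOOD + (1 - b_arm0_good) * P_BAD
    ev1 = (1 - b_arm0_good) * P_GOOD + b_arm0_good * P_BAD
    return 0 if ev0 >= ev1 else 1
```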

    Models of human preference for learning reward functions

    The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.
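
    As a minimal sketch of the two preference models being contrasted (the logistic link, function names, and the assumption of known optimal value functions are illustrative, not the paper's implementation):

```python
import math

def partial_return(segment, reward_fn):
    """Standard assumption: preferences are informed by the sum of rewards
    along the segment (a list of (state, action) pairs)."""
    return sum(reward_fn(s, a) for s, a in segment)

def negated_regret(segment, q_star, v_star):
    """Regret-based alternative: score a segment by how close each step is
    to optimal decision-making, here via summed optimal advantages
    Q*(s, a) - V*(s), so lower regret gives a higher score."""
    return sum(q_star(s, a) - v_star(s) for s, a in segment)

def preference_prob(score_a, score_b, temperature=1.0):
    """Logistic (Boltzmann) preference model: P(segment A preferred over B)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b) / temperature))
```

    Under either model the same logistic link is used; only the segment statistic fed into it differs.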

    Contrastive Preference Learning: Learning from Human Feedback without RL

    Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically, RLHF algorithms operate in two phases: first, use human preferences to learn a reward function, and second, align the model by optimizing the learned reward via reinforcement learning (RL). This paradigm assumes that human preferences are distributed according to reward, but recent work suggests that they instead follow the regret under the user's optimal policy. Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase. Because of these optimization challenges, contemporary RLHF methods restrict themselves to contextual bandit settings (e.g., as in large language models) or limit observation dimensionality (e.g., state-based robotics). We overcome these limitations by introducing a new family of algorithms for optimizing behavior from human feedback using the regret-based model of human preferences. Using the principle of maximum entropy, we derive Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions, circumventing the need for RL. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs. This enables CPL to elegantly scale to high-dimensional and sequential RLHF problems while being simpler than prior methods. Code is released at https://github.com/jhejna/cpl.
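
    A hedged sketch of a CPL-style objective (the temperature `alpha`, tensor shapes, and the omission of any regularization are assumptions; see the released code for the actual implementation):

```python
import torch
import torch.nn.functional as F

def cpl_loss(logp_preferred, logp_rejected, alpha=0.1):
    """Contrastive preference loss over a batch of segment pairs.

    logp_preferred, logp_rejected: tensors of shape (batch, T) holding the
    policy's log pi(a_t | s_t) along the preferred and rejected segments.
    Each segment is scored by alpha * sum_t log pi(a_t | s_t), and a
    logistic contrastive objective pushes the preferred score above the
    rejected one; no reward model and no RL step are involved.
    """
    score_pref = alpha * logp_preferred.sum(dim=-1)
    score_rej = alpha * logp_rejected.sum(dim=-1)
    return -F.logsigmoid(score_pref - score_rej).mean()
```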

    Using informative behavior to increase engagement while learning from human reward

    In this work, we address a relatively unexplored aspect of designing agents that learn from human reward. We investigate how an agent’s non-task behavior can affect a human trainer’s training and agent learning. We use the TAMER framework, which facilitates the training of agents by human-generated reward signals, i.e., judgements of the quality of the agent’s actions, as the foundation for our investigation. Then, starting from the premise that the interaction between the agent and the trainer should be bi-directional, we propose two new training interfaces to increase a human trainer’s active involvement in the training process and thereby improve the agent’s task performance. One provides information on the agent’s uncertainty, a metric calculated from data coverage; the other provides information on its performance. Our results from a 51-subject user study show that these interfaces can induce the trainers to train longer and give more feedback. The agent’s performance, however, increases only in response to the addition of performance-oriented information, not by sharing uncertainty levels. These results suggest that the organizational maxim about human behavior, “you get what you measure” (i.e., sharing metrics with people causes them to focus on optimizing those metrics while de-emphasizing other objectives), also applies to the training of agents. Using principal component analysis, we show how trainers in the two conditions train agents differently. In addition, by simulating the influence of the agent’s uncertainty-informative behavior on a human’s training behavior, we show that trainers could be distracted by the agent sharing its uncertainty levels about its actions, giving poor feedback for the sake of reducing the agent’s uncertainty without improving the agent’s performance.
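
    A minimal sketch of learning from human reward in the spirit of TAMER (the linear model, feature representation, and learning rate are illustrative assumptions, not the framework's actual implementation):

```python
import numpy as np

class HumanRewardLearner:
    """Fits a model of the human trainer's reward signal H(s, a) online and
    acts greedily with respect to it (a sketch, not the original TAMER code)."""

    def __init__(self, n_features, n_actions, lr=0.05):
        self.weights = np.zeros((n_actions, n_features))  # one row per action
        self.lr = lr

    def predict(self, features, action):
        # Predicted human reward for taking `action` given the state features.
        return float(self.weights[action] @ features)

    def act(self, features):
        # Choose the action the human is predicted to reward most highly.
        return int(np.argmax(self.weights @ features))

    def update(self, features, action, human_reward):
        # Move the prediction for the taken action toward the trainer's signal.
        error = human_reward - self.predict(features, action)
        self.weights[action] += self.lr * error * features
```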

    Characterization of the near-Earth Asteroid 2002 NY40

    In August 2002, the near-Earth asteroid 2002 NY40 made its closest approach to the Earth. This provided an opportunity to study a near-Earth asteroid with a variety of instruments. Several of the telescopes at the Maui Space Surveillance System were trained on the asteroid and collected adaptive optics images, photometry, and spectroscopy. Analysis of the imagery reveals that the asteroid is triangular in shape with significant self-shadowing. The photometry reveals a 20-hour period, and the spectroscopy shows that the asteroid is a Q-type.
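
    As an illustration of how a period of roughly 20 hours could be recovered from unevenly sampled photometry (a sketch with a synthetic light curve; the sampling, amplitude, and noise level are placeholders, not the actual observations):

```python
import numpy as np
from astropy.timeseries import LombScargle

rng = np.random.default_rng(0)

# Synthetic, unevenly sampled light curve with a 20-hour periodic signal.
t_hours = np.sort(rng.uniform(0.0, 120.0, 300))
mags = 0.3 * np.sin(2 * np.pi * t_hours / 20.0) + rng.normal(0.0, 0.02, t_hours.size)

# The Lomb-Scargle periodogram handles uneven sampling directly.
frequency, power = LombScargle(t_hours, mags).autopower(
    minimum_frequency=1 / 60.0, maximum_frequency=1 / 2.0)
best_period_hours = 1.0 / frequency[np.argmax(power)]
print(f"Best-fit period: {best_period_hours:.1f} h")
```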

    Detection of Murine Leukemia Virus or Mouse DNA in Commercial RT-PCR Reagents and Human DNAs

    The xenotropic murine leukemia virus (MLV)-related viruses (XMRV) have been reported in persons with prostate cancer, chronic fatigue syndrome (CFS), and, less frequently, in blood donors. Polytropic MLVs have also been described in persons with CFS and in blood donors. However, many studies have failed to confirm these findings, raising the possibility of contamination as a source of the positive results. One PCR reagent, Platinum Taq polymerase (pol), has been reported to contain mouse DNA that produces false-positive MLV PCR results. We report here the finding that a large number of PCR reagents have low levels of MLV sequences. We found that recombinant reverse-transcriptase (RT) enzymes from six companies, derived from either MLV or avian myeloblastosis virus, contained MLV pol DNA sequences but not gag or mouse DNA sequences. Sequence and phylogenetic analysis showed high relatedness to Moloney MLV, suggesting residual contamination with an RT-containing plasmid. In addition, we identified contamination with mouse DNA and a variety of MLV sequences in commercially available human DNAs from leukocytes, brain tissues, and cell lines. These results identify new sources of MLV contamination and highlight the importance of careful pre-screening of commercial specimens and diagnostic reagents to avoid false-positive MLV PCR results.

    Monte Carlo simulation of ultrafast processes in photoexcited semiconductors: Coherent and incoherent dynamics

    The ultrafast dynamics of photoexcited carriers in a semiconductor is investigated by using a Monte Carlo simulation. In addition to a “conventional” Monte Carlo simulation, the coherence of the external light field and the resulting coherence in the carrier system are fully taken into account. This allows us to treat the correct time dependence of the generation process, showing a time-dependent linewidth associated with recombination from states off resonance due to stimulated emission. The subsequent dephasing of the carriers due to scattering processes is analyzed. In addition, the simulation contains the carrier-carrier interaction in the Hartree-Fock approximation, giving rise to a band-gap renormalization and excitonic effects that cannot be treated in a conventional Monte Carlo simulation where polarization effects are neglected. Thus the approach presents a unified numerical method for the investigation of phenomena occurring close to the band gap and those typical for the energy relaxation of hot carriers.
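
    A toy sketch of the incoherent, “conventional” part of such an ensemble Monte Carlo simulation, with free flights interrupted by stochastic, thermalizing scattering events (the coherent generation, dephasing, and Hartree-Fock terms described above are not reproduced; all rates and material parameters are placeholder values):

```python
import numpy as np

N_CARRIERS = 10_000
N_STEPS = 2_000
DT = 1e-15                  # time step [s]
GAMMA = 5e12                # assumed total scattering rate [1/s]
M_EFF = 0.067 * 9.109e-31   # effective mass [kg], GaAs-like value
KB_T = 1.381e-23 * 300.0    # thermal energy at 300 K [J]

rng = np.random.default_rng(0)
# Photoexcited initial distribution: a narrow Gaussian in momentum (assumption).
p = rng.normal(0.0, 2e-26, size=(N_CARRIERS, 3))

for _ in range(N_STEPS):
    # Carriers that scatter during this step (Poisson process with rate GAMMA).
    scatters = rng.random(N_CARRIERS) < GAMMA * DT
    # Isotropic, thermalizing scattering: redraw momenta from a 300 K Maxwellian.
    p[scatters] = rng.normal(0.0, np.sqrt(M_EFF * KB_T), size=(scatters.sum(), 3))

mean_energy_eV = (p**2).sum(axis=1).mean() / (2 * M_EFF) / 1.602e-19
print(f"Mean kinetic energy after {N_STEPS * DT * 1e12:.1f} ps: {mean_energy_eV:.3f} eV")
```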

    Evaluation of the Performance of Information Theory-Based Methods and Cross-Correlation to Estimate the Functional Connectivity in Cortical Networks

    Functional connectivity of in vitro neuronal networks was estimated by applying different statistical algorithms to data collected by Micro-Electrode Arrays (MEAs). First, we tested these “connectivity methods” on neuronal network models at an increasing level of complexity and evaluated their performance in terms of ROC (Receiver Operating Characteristic) curves and the PPC (Positive Precision Curve), a newly defined complementary method developed specifically for the identification of functional links. Then, the algorithms that better estimated the actual connectivity of the network models were used to extract functional connectivity from cultured cortical networks coupled to MEAs. Among the proposed approaches, Transfer Entropy and Joint-Entropy showed the best results, suggesting these methods as good candidates for extracting functional links in actual neuronal networks from multi-site recordings.
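
    A minimal sketch of the simplest estimator in this family: pairwise cross-correlation peaks between binned spike trains used as directed connectivity scores, evaluated against a known ground-truth adjacency matrix with ROC analysis (the binning, lag range, and evaluation details are illustrative assumptions, not the paper's pipeline):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cross_correlation_scores(binned_spikes, max_lag=10):
    """Directed connectivity scores from the peak of the cross-correlogram.

    binned_spikes: array of shape (n_neurons, n_bins) of spike counts.
    Entry (i, j) is the maximum correlation of neuron i's activity with
    neuron j's activity at positive lags (i leading j).
    """
    n_neurons, n_bins = binned_spikes.shape
    z = (binned_spikes - binned_spikes.mean(axis=1, keepdims=True)) / (
        binned_spikes.std(axis=1, keepdims=True) + 1e-12)
    scores = np.zeros((n_neurons, n_neurons))
    for i in range(n_neurons):
        for j in range(n_neurons):
            if i == j:
                continue
            cc = [np.dot(z[i, :n_bins - lag], z[j, lag:]) / (n_bins - lag)
                  for lag in range(1, max_lag + 1)]
            scores[i, j] = max(cc)
    return scores

def connectivity_auc(scores, true_adjacency):
    """ROC AUC over off-diagonal entries against a binary ground-truth matrix
    (only applicable to simulated networks where the true wiring is known)."""
    mask = ~np.eye(true_adjacency.shape[0], dtype=bool)
    return roc_auc_score(true_adjacency[mask].astype(int), scores[mask])
```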