The Nature of Belief-Directed Exploratory Choice in Human Decision-Making
In non-stationary environments, there is a conflict between exploiting currently favored options and gaining information by exploring lesser-known options that in the past have proven less rewarding. Optimal decision-making in such tasks requires considering future states of the environment (i.e., planning) and properly updating beliefs about the state of the environment after observing outcomes associated with choices. Optimal belief-updating is reflective in that beliefs can change without directly observing environmental change. For example, after 10 s elapse, one might correctly believe that a traffic light last observed to be red is now more likely to be green. To understand human decision-making when rewards associated with choice options change over time, we develop a variant of the classic "bandit" task that is both rich enough to encompass relevant phenomena and sufficiently tractable to allow for ideal actor analysis of sequential choice behavior. We evaluate whether people update beliefs about the state of the environment in a reflexive (i.e., only in response to observed changes in reward structure) or reflective manner. In contrast to purely "random" accounts of exploratory behavior, model-based analyses of the subjects' choices and latencies indicate that people are reflective belief updaters. However, unlike the Ideal Actor model, our analyses indicate that people's choice behavior does not reflect consideration of future environmental states. Thus, although people update beliefs in a reflective manner consistent with the Ideal Actor, they do not engage in optimal long-term planning, but instead myopically choose on every trial the option believed to have the highest immediate payoff.
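To make the reflexive/reflective distinction concrete, here is a minimal Python sketch, not the paper's actual task: a Bayesian learner in a two-armed bandit whose better arm silently switches identity. The hazard rate and reward probabilities are illustrative assumptions. The diffuse step is what makes the updater reflective: belief relaxes toward uncertainty even when nothing new is observed, just as confidence that a traffic light is still red decays with elapsed time.

import random

HAZARD = 0.05        # assumed per-trial probability that the arms swap roles
P_GOOD, P_BAD = 0.8, 0.2  # assumed reward probabilities (illustrative)

def diffuse(b):
    # Reflective step: with no new observation, belief drifts toward 0.5
    return b * (1 - HAZARD) + (1 - b) * HAZARD

def bayes(b, arm, reward):
    # Posterior P(arm 0 is currently the good arm) after pulling `arm`
    p_if_0_good = P_GOOD if arm == 0 else P_BAD
    p_if_1_good = P_BAD if arm == 0 else P_GOOD
    l0 = p_if_0_good if reward else 1 - p_if_0_good
    l1 = p_if_1_good if reward else 1 - p_if_1_good
    return l0 * b / (l0 * b + l1 * (1 - b))

good_arm = 0   # hidden environment state
belief = 0.5   # P(arm 0 is currently the good arm)
for t in range(1000):
    if random.random() < HAZARD:
        good_arm = 1 - good_arm          # environment switches silently
    belief = diffuse(belief)             # reflective updater anticipates this
    arm = 0 if belief >= 0.5 else 1      # myopic: highest immediate payoff
    p = P_GOOD if arm == good_arm else P_BAD
    reward = int(random.random() < p)
    belief = bayes(belief, arm, reward)

A reflexive updater would skip the diffuse step entirely, revising its belief only when an observed outcome contradicts it.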
Models of human preference for learning reward functions
The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to better human-aligned policies. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open-sourced our experimental code, the human preferences dataset we gathered, and our training and preference-elicitation interfaces for gathering such a dataset.
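As a concrete illustration of the difference between the two preference models, here is a hedged Python sketch; the function names and toy values are ours, not the paper's. Both models score segments and feed the score difference through a logistic choice rule, but the partial-return model scores by summed reward while the regret-based model scores by summed optimal advantage (zero for optimal actions, negative otherwise).

import math

def partial_return(segment):
    # Partial-return model: a segment's score is its summed reward.
    # A segment is a list of (state, action, reward) transitions.
    return sum(r for (_, _, r) in segment)

def neg_regret(segment, v_star, q_star):
    # Regret-based model: score by summed optimal advantage Q*(s,a) - V*(s).
    return sum(q_star[(s, a)] - v_star[s] for (s, a, _) in segment)

def pref_prob(score1, score2, beta=1.0):
    # Logistic choice rule: P(segment 1 is preferred over segment 2).
    return 1.0 / (1.0 + math.exp(-beta * (score1 - score2)))

# Toy example where the two models disagree:
v_star = {"s": 1.0}
q_star = {("s", "good"): 1.0, ("s", "bad"): 0.0}
seg_good = [("s", "good", 0.0)]  # optimal action, zero immediate reward
seg_bad = [("s", "bad", 0.5)]    # suboptimal action, higher immediate reward
print(pref_prob(partial_return(seg_good), partial_return(seg_bad)))  # < 0.5: favors seg_bad
print(pref_prob(neg_regret(seg_good, v_star, q_star),
                neg_regret(seg_bad, v_star, q_star)))                # > 0.5: favors seg_good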
Contrastive Preference Learning: Learning from Human Feedback without RL
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically, RLHF algorithms operate in two phases: first, use human preferences to learn a reward function, and second, align the model by optimizing the learned reward via reinforcement learning (RL). This paradigm assumes that human preferences are distributed according to reward, but recent work suggests that they instead follow the regret under the user's optimal policy. Thus, learning a reward function from feedback is not only based on a flawed assumption about human preferences, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase. Because of these optimization challenges, contemporary RLHF methods restrict themselves to contextual bandit settings (e.g., as in large language models) or limit observation dimensionality (e.g., state-based robotics). We overcome these limitations by introducing a new family of algorithms for optimizing behavior from human feedback using the regret-based model of human preferences. Using the principle of maximum entropy, we derive Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions, circumventing the need for RL. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs. This enables CPL to elegantly scale to high-dimensional and sequential RLHF problems while being simpler than prior methods. Code is available at https://github.com/jhejna/cpl.
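The following Python sketch shows the shape of a CPL-style objective. It is a minimal illustration under our reading of the abstract, not the released implementation (see the repository above for the real code): discounted, temperature-scaled sums of policy log-likelihoods stand in for optimal advantages, and a preference becomes a simple contrastive cross-entropy between the two segments, with no reward model or RL phase.

import torch

def cpl_loss(logp_chosen, logp_rejected, alpha=0.1, gamma=1.0):
    """logp_*: (batch, T) tensors of log pi(a_t | s_t) along each segment."""
    T = logp_chosen.shape[1]
    discount = gamma ** torch.arange(T, dtype=logp_chosen.dtype)
    s_plus = alpha * (discount * logp_chosen).sum(dim=1)     # preferred segment score
    s_minus = alpha * (discount * logp_rejected).sum(dim=1)  # rejected segment score
    # Cross-entropy of P[sigma+ > sigma-] = sigmoid(s_plus - s_minus)
    return -torch.nn.functional.logsigmoid(s_plus - s_minus).mean()

# Toy usage with random stand-in "log-probabilities":
logp_c = torch.log(torch.rand(4, 10))
logp_r = torch.log(torch.rand(4, 10))
print(cpl_loss(logp_c, logp_r))

Because the loss depends only on the policy's log-likelihoods of logged segments, it can be minimized with plain supervised gradient descent on off-policy data, which is what lets the method sidestep policy gradients and bootstrapping.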
Power to the People: The Role of Humans in Interactive Machine Learning
Systems that can learn interactively from their end-users are quickly becoming widespread. Until recently, this progress has been fueled mostly by advances in machine learning; however, more and more researchers are realizing the importance of studying users of these systems. In this article we promote this approach and demonstrate how it can result in better user experiences and more effective learning systems. We present a number of case studies that demonstrate how interactivity results in a tight coupling between the system and the user, exemplify ways in which some existing systems fail to account for the user, and explore new ways for learning systems to interact with their users. After giving a glimpse of the progress that has been made thus far, we discuss some of the challenges we face in moving the field forward.
Using informative behavior to increase engagement while learning from human reward
In this work, we address a relatively unexplored aspect of designing agents that learn from human reward. We investigate how an agent's non-task behavior can affect a human trainer's training and agent learning. We use the TAMER framework, which facilitates the training of agents by human-generated reward signals, i.e., judgements of the quality of the agent's actions, as the foundation for our investigation. Then, starting from the premise that the interaction between the agent and the trainer should be bi-directional, we propose two new training interfaces to increase a human trainer's active involvement in the training process and thereby improve the agent's task performance. One provides information on the agent's uncertainty, a metric calculated from data coverage; the other, on its performance. Our results from a 51-subject user study show that these interfaces can induce the trainers to train longer and give more feedback. The agent's performance, however, increases only in response to the addition of performance-oriented information, not from sharing uncertainty levels. These results suggest that the organizational maxim about human behavior, "you get what you measure" (i.e., sharing metrics with people causes them to focus on optimizing those metrics while de-emphasizing other objectives), also applies to the training of agents. Using principal component analysis, we show how trainers in the two conditions train agents differently. In addition, by simulating the influence of the agent's uncertainty-informative behavior on a human's training behavior, we show that trainers could be distracted by the agent sharing its uncertainty levels about its actions, giving poor feedback for the sake of reducing the agent's uncertainty without improving the agent's performance.
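For readers unfamiliar with TAMER, the following Python sketch shows the kind of loop it implies. It is illustrative only: the tabular state space, the action set, and the coverage-based uncertainty measure are our stand-ins, not the study's implementation. The agent regresses a model of the human's reward signal and acts greedily on it, and a coverage statistic like the one below is the sort of uncertainty the first interface could display.

import random
from collections import defaultdict

H = defaultdict(float)     # learned estimate of the human's reward for (s, a)
counts = defaultdict(int)  # visit counts, reused as a crude data-coverage proxy
ACTIONS = ["left", "right"]  # hypothetical action set

def act(state, epsilon=0.1):
    # Mostly greedy with respect to the learned human-reward model
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: H[(state, a)])

def update(state, action, human_reward, lr=0.2):
    # Supervised step toward the trainer's judgement of this action
    H[(state, action)] += lr * (human_reward - H[(state, action)])
    counts[(state, action)] += 1

def uncertainty(state):
    # Coverage-style uncertainty: rarely tried actions -> high uncertainty,
    # the quantity an uncertainty-sharing interface could surface
    return 1.0 / (1 + min(counts[(state, a)] for a in ACTIONS))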
Characterization of the near-Earth Asteroid 2002 NY40
In August 2002, the near-Earth asteroid 2002 NY40 made its closest approach to the Earth. This provided an opportunity to study a near-Earth asteroid with a variety of instruments. Several of the telescopes at the Maui Space Surveillance System were trained on the asteroid and collected adaptive optics images, photometry, and spectroscopy. Analysis of the imagery reveals that the asteroid is triangular in shape with significant self-shadowing. The photometry reveals a 20-hour period, and the spectroscopy shows that the asteroid is a Q-type.
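As a generic illustration of how a rotation period can be extracted from such photometry (not the study's actual pipeline; the synthetic lightcurve below is made up), a Lomb-Scargle periodogram handles the irregular sampling typical of asteroid observations:

import numpy as np
from astropy.timeseries import LombScargle

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 5, 300))   # irregular observation times, days
true_period = 20 / 24                 # a 20-hour period, in days
mag = 0.3 * np.sin(2 * np.pi * t / true_period) + 0.05 * rng.normal(size=t.size)

freq, power = LombScargle(t, mag).autopower()
best_period_hours = 24 / freq[np.argmax(power)]
print(f"best-fit period: {best_period_hours:.1f} h")
# Caveat: asteroid lightcurves are often double-peaked, so the true rotation
# period can be twice the strongest periodogram period.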
Detection of Murine Leukemia Virus or Mouse DNA in Commercial RT-PCR Reagents and Human DNAs
The xenotropic murine leukemia virus (MLV)-related viruses (XMRV) have been reported in persons with prostate cancer, chronic fatigue syndrome (CFS), and less frequently in blood donors. Polytropic MLVs have also been described in persons with CFS and blood donors. However, many studies have failed to confirm these findings, raising the possibility of contamination as a source of the positive results. One PCR reagent, Platinum Taq polymerase (pol), has been reported to contain mouse DNA that produces false-positive MLV PCR results. We report here the finding of a large number of PCR reagents that have low levels of MLV sequences. We found that recombinant reverse-transcriptase (RT) enzymes from six companies, derived from either MLV or avian myeloblastosis virus, contained MLV pol DNA sequences but not gag or mouse DNA sequences. Sequence and phylogenetic analysis showed high relatedness to Moloney MLV, suggesting residual contamination with an RT-containing plasmid. In addition, we identified contamination with mouse DNA and a variety of MLV sequences in commercially available human DNAs from leukocytes, brain tissues, and cell lines. These results identify new sources of MLV contamination and highlight the importance of careful pre-screening of commercial specimens and diagnostic reagents to avoid false-positive MLV PCR results.
Monte Carlo simulation of ultrafast processes in photoexcited semiconductors: Coherent and incoherent dynamics
The ultrafast dynamics of photoexcited carriers in a semiconductor is investigated by using a Monte Carlo simulation. In addition to a "conventional" Monte Carlo simulation, the coherence of the external light field and the resulting coherence in the carrier system are fully taken into account. This allows us to treat the correct time dependence of the generation process, showing a time-dependent linewidth associated with recombination from off-resonance states due to stimulated emission. The subsequent dephasing of the carriers due to scattering processes is analyzed. In addition, the simulation contains the carrier-carrier interaction in the Hartree-Fock approximation, giving rise to a band-gap renormalization and excitonic effects which cannot be treated in a conventional Monte Carlo simulation where polarization effects are neglected. Thus the approach presents a unified numerical method for the investigation of phenomena occurring close to the band gap and those typical for the energy relaxation of hot carriers.
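For orientation, a "conventional" ensemble Monte Carlo step of the kind the paper extends looks roughly like the Python sketch below; all rates and parameters are illustrative placeholders. Carriers free-fly, then scatter after exponentially distributed waiting times. The coherent generation and Hartree-Fock polarization terms described above are precisely what this bare scheme omits.

import numpy as np

rng = np.random.default_rng(1)
N = 10_000       # simulated carriers
GAMMA = 1e13     # assumed constant total scattering rate, 1/s
DT = 1e-14       # 10 fs time step

k = rng.normal(0.0, 5e8, size=(N, 3))         # carrier wave vectors, 1/m
t_next = rng.exponential(1 / GAMMA, size=N)   # time until next scattering

for _ in range(100):
    scatter = t_next < DT
    n = int(scatter.sum())
    # Isotropic, elastic scattering: randomize direction, keep |k|
    norms = np.linalg.norm(k[scatter], axis=1, keepdims=True)
    new_dir = rng.normal(size=(n, 3))
    new_dir /= np.linalg.norm(new_dir, axis=1, keepdims=True)
    k[scatter] = norms * new_dir
    t_next[scatter] = rng.exponential(1 / GAMMA, size=n)
    t_next[~scatter] -= DT
    # Sketch simplification: at most one scattering event per time step.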
Evaluation of the Performance of Information Theory-Based Methods and Cross-Correlation to Estimate the Functional Connectivity in Cortical Networks
Functional connectivity of in vitro neuronal networks was estimated by applying different statistical algorithms to data collected by Micro-Electrode Arrays (MEAs). First, we tested these "connectivity methods" on neuronal network models of increasing complexity and evaluated their performance in terms of ROC (Receiver Operating Characteristic) and PPC (Positive Precision Curve), a newly defined complementary method developed specifically for the identification of functional links. Then, the algorithms that best estimated the actual connectivity of the network models were used to extract functional connectivity from cultured cortical networks coupled to MEAs. Among the proposed approaches, Transfer Entropy and Joint-Entropy showed the best results, suggesting these methods as good candidates for extracting functional links in actual neuronal networks from multi-site recordings.
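For concreteness, here is an illustrative Python sketch of the cross-correlation branch of such an analysis; the synthetic spike trains and the planted link are ours, not the paper's data. Every channel pair is scored by the peak of its cross-correlogram, and sweeping the detection threshold over those scores is what traces out the ROC curve.

import numpy as np

rng = np.random.default_rng(2)
n_channels, n_bins = 8, 5000
spikes = (rng.random((n_channels, n_bins)) < 0.02).astype(float)
spikes[3] = np.roll(spikes[0], 2)   # plant one directed link: 0 -> 3, 2-bin lag

max_lag = 10
score = np.zeros((n_channels, n_channels))
for i in range(n_channels):
    for j in range(n_channels):
        if i == j:
            continue
        # Peak of the normalized cross-correlogram over candidate lags
        cc = [np.corrcoef(spikes[i, :n_bins - lag], spikes[j, lag:])[0, 1]
              for lag in range(1, max_lag + 1)]
        score[i, j] = max(cc)

links = score > 0.5                 # one threshold; sweeping it yields the ROC
print(np.argwhere(links))           # expect [[0 3]]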