
    The Nature of Belief-Directed Exploratory Choice in Human Decision-Making

    In non-stationary environments, there is a conflict between exploiting currently favored options and gaining information by exploring lesser-known options that in the past have proven less rewarding. Optimal decision-making in such tasks requires considering future states of the environment (i.e., planning) and properly updating beliefs about the state of the environment after observing outcomes associated with choices. Optimal belief-updating is reflective in that beliefs can change without directly observing environmental change. For example, after 10 s elapse, one might correctly believe that a traffic light last observed to be red is now more likely to be green. To understand human decision-making when rewards associated with choice options change over time, we develop a variant of the classic “bandit” task that is both rich enough to encompass relevant phenomena and sufficiently tractable to allow for ideal actor analysis of sequential choice behavior. We evaluate whether people update beliefs about the state of the environment in a reflexive manner (i.e., only in response to observed changes in reward structure) or a reflective manner. In contrast to purely “random” accounts of exploratory behavior, model-based analyses of the subjects’ choices and latencies indicate that people are reflective belief updaters. However, unlike the Ideal Actor model, our analyses indicate that people’s choice behavior does not reflect consideration of future environmental states. Thus, although people update beliefs in a reflective manner consistent with the Ideal Actor, they do not engage in optimal long-term planning, but instead myopically choose on every trial the option believed to have the highest immediate payoff.
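
    As a rough illustration of the distinction between reflexive and reflective belief updating in a non-stationary two-armed bandit (a minimal sketch, not the task or model used in the paper; the hazard rate, reward probabilities, and function names are assumptions):

```python
HAZARD = 0.1              # assumed per-trial probability that the better arm switches
P_GOOD, P_BAD = 0.8, 0.2  # assumed reward probabilities of the better/worse arm

def reflective_update(b_arm0_good, chosen_arm, reward):
    """Reflective update: beliefs first drift through the known switching
    dynamics (they can change without any new observation), then are
    conditioned on the observed outcome. A reflexive updater would skip
    the drift step and change beliefs only after observations."""
    # 1. Propagate belief through the environment dynamics.
    b = b_arm0_good * (1 - HAZARD) + (1 - b_arm0_good) * HAZARD
    # 2. Bayesian update on the reward obtained from the chosen arm.
    p_if_arm0_good = P_GOOD if chosen_arm == 0 else P_BAD
    p_if_arm1_good = P_BAD if chosen_arm == 0 else P_GOOD
    like0 = p_if_arm0_good if reward else 1 - p_if_arm0_good
    like1 = p_if_arm1_good if reward else 1 - p_if_arm1_good
    return like0 * b / (like0 * b + like1 * (1 - b))

def myopic_choice(b_arm0_good):
    """Greedy choice of the arm with the higher believed immediate payoff,
    i.e. no planning over future environmental states."""
    ev0 = b_arm0_good * P_GOOD + (1 - b_arm0_good) * P_BAD
    ev1 = (1 - b_arm0_good) * P_GOOD + b_arm0_good * P_BAD
    return 0 if ev0 >= ev1 else 1
```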

    Models of human preference for learning reward functions

    The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.
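
    As a minimal sketch of the two preference models being contrasted (the logistic link, function names, and the assumption of known optimal value functions are illustrative, not the paper's implementation):

```python
import math

def partial_return(segment, reward_fn):
    """Standard assumption: preferences are informed by the sum of rewards
    along the segment (a list of (state, action) pairs)."""
    return sum(reward_fn(s, a) for s, a in segment)

def negated_regret(segment, q_star, v_star):
    """Regret-based alternative: score a segment by how close each step is
    to optimal decision-making, here via summed optimal advantages
    Q*(s, a) - V*(s), so lower regret gives a higher score."""
    return sum(q_star(s, a) - v_star(s) for s, a in segment)

def preference_prob(score_a, score_b, temperature=1.0):
    """Logistic (Boltzmann) preference model: P(segment A preferred over B)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b) / temperature))
```

    Under either model the same logistic link is used; only the segment statistic fed into it differs.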

    Contrastive Preference Learning: Learning from Human Feedback without RL

    Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically, RLHF algorithms operate in two phases: first, use human preferences to learn a reward function, and second, align the model by optimizing the learned reward via reinforcement learning (RL). This paradigm assumes that human preferences are distributed according to reward, but recent work suggests that they instead follow the regret under the user's optimal policy. Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase. Because of these optimization challenges, contemporary RLHF methods restrict themselves to contextual bandit settings (e.g., as in large language models) or limit observation dimensionality (e.g., state-based robotics). We overcome these limitations by introducing a new family of algorithms for optimizing behavior from human feedback using the regret-based model of human preferences. Using the principle of maximum entropy, we derive Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions, circumventing the need for RL. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs. This enables CPL to elegantly scale to high-dimensional and sequential RLHF problems while being simpler than prior methods. Code is released at https://github.com/jhejna/cpl.
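
    A hedged sketch of a CPL-style objective (the temperature `alpha`, tensor shapes, and the omission of any regularization are assumptions; see the released code for the actual implementation):

```python
import torch
import torch.nn.functional as F

def cpl_loss(logp_preferred, logp_rejected, alpha=0.1):
    """Contrastive preference loss over a batch of segment pairs.

    logp_preferred, logp_rejected: tensors of shape (batch, T) holding the
    policy's log pi(a_t | s_t) along the preferred and rejected segments.
    Each segment is scored by alpha * sum_t log pi(a_t | s_t), and a
    logistic contrastive objective pushes the preferred score above the
    rejected one; no reward model and no RL step are involved.
    """
    score_pref = alpha * logp_preferred.sum(dim=-1)
    score_rej = alpha * logp_rejected.sum(dim=-1)
    return -F.logsigmoid(score_pref - score_rej).mean()
```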

    Using informative behavior to increase engagement while learning from human reward

    In this work, we address a relatively unexplored aspect of designing agents that learn from human reward. We investigate how an agent’s non-task behavior can affect a human trainer’s training and agent learning. We use the TAMER framework, which facilitates the training of agents by human-generated reward signals, i.e., judgements of the quality of the agent’s actions, as the foundation for our investigation. Then, starting from the premise that the interaction between the agent and the trainer should be bi-directional, we propose two new training interfaces to increase a human trainer’s active involvement in the training process and thereby improve the agent’s task performance. One provides information on the agent’s uncertainty, a metric calculated from data coverage; the other provides information on its performance. Our results from a 51-subject user study show that these interfaces can induce the trainers to train longer and give more feedback. The agent’s performance, however, increases only in response to the addition of performance-oriented information, not by sharing uncertainty levels. These results suggest that the organizational maxim about human behavior, “you get what you measure” (i.e., sharing metrics with people causes them to focus on optimizing those metrics while de-emphasizing other objectives), also applies to the training of agents. Using principal component analysis, we show how trainers in the two conditions train agents differently. In addition, by simulating the influence of the agent’s uncertainty-informative behavior on a human’s training behavior, we show that trainers could be distracted by the agent sharing its uncertainty levels about its actions, giving poor feedback for the sake of reducing the agent’s uncertainty without improving the agent’s performance.
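
    A minimal sketch of learning from human reward in the spirit of TAMER (the linear model, feature representation, and learning rate are illustrative assumptions, not the framework's actual implementation):

```python
import numpy as np

class HumanRewardLearner:
    """Fits a model of the human trainer's reward signal H(s, a) online and
    acts greedily with respect to it (a sketch, not the original TAMER code)."""

    def __init__(self, n_features, n_actions, lr=0.05):
        self.weights = np.zeros((n_actions, n_features))  # one row per action
        self.lr = lr

    def predict(self, features, action):
        # Predicted human reward for taking `action` given the state features.
        return float(self.weights[action] @ features)

    def act(self, features):
        # Choose the action the human is predicted to reward most highly.
        return int(np.argmax(self.weights @ features))

    def update(self, features, action, human_reward):
        # Move the prediction for the taken action toward the trainer's signal.
        error = human_reward - self.predict(features, action)
        self.weights[action] += self.lr * error * features
```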

    Characterization of the near-Earth Asteroid 2002 NY40

    In August 2002, the near-Earth asteroid 2002 NY40 made its closest approach to the Earth. This provided an opportunity to study a near-Earth asteroid with a variety of instruments. Several of the telescopes at the Maui Space Surveillance System were trained on the asteroid and collected adaptive optics images, photometry, and spectroscopy. Analysis of the imagery reveals that the asteroid is triangular in shape with significant self-shadowing. The photometry reveals a 20-hour period, and the spectroscopy shows that the asteroid is a Q-type.
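
    As an illustration of how a period of roughly 20 hours could be recovered from unevenly sampled photometry (a sketch with a synthetic light curve; the sampling, amplitude, and noise level are placeholders, not the actual observations):

```python
import numpy as np
from astropy.timeseries import LombScargle

rng = np.random.default_rng(0)

# Synthetic, unevenly sampled light curve with a 20-hour periodic signal.
t_hours = np.sort(rng.uniform(0.0, 120.0, 300))
mags = 0.3 * np.sin(2 * np.pi * t_hours / 20.0) + rng.normal(0.0, 0.02, t_hours.size)

# The Lomb-Scargle periodogram handles uneven sampling directly.
frequency, power = LombScargle(t_hours, mags).autopower(
    minimum_frequency=1 / 60.0, maximum_frequency=1 / 2.0)
best_period_hours = 1.0 / frequency[np.argmax(power)]
print(f"Best-fit period: {best_period_hours:.1f} h")
```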

    Detection of Murine Leukemia Virus or Mouse DNA in Commercial RT-PCR Reagents and Human DNAs

    The xenotropic murine leukemia virus (MLV)-related viruses (XMRV) have been reported in persons with prostate cancer, chronic fatigue syndrome (CFS), and, less frequently, in blood donors. Polytropic MLVs have also been described in persons with CFS and in blood donors. However, many studies have failed to confirm these findings, raising the possibility of contamination as a source of the positive results. One PCR reagent, Platinum Taq polymerase (pol), has been reported to contain mouse DNA that produces false-positive MLV PCR results. We report here the finding that a large number of PCR reagents have low levels of MLV sequences. We found that recombinant reverse-transcriptase (RT) enzymes from six companies, derived from either MLV or avian myeloblastosis virus, contained MLV pol DNA sequences but not gag or mouse DNA sequences. Sequence and phylogenetic analysis showed high relatedness to Moloney MLV, suggesting residual contamination with an RT-containing plasmid. In addition, we identified contamination with mouse DNA and a variety of MLV sequences in commercially available human DNAs from leukocytes, brain tissues, and cell lines. These results identify new sources of MLV contamination and highlight the importance of careful pre-screening of commercial specimens and diagnostic reagents to avoid false-positive MLV PCR results.

    Monte Carlo simulation of ultrafast processes in photoexcited semiconductors: Coherent and incoherent dynamics

    The ultrafast dynamics of photoexcited carriers in a semiconductor is investigated by using a Monte Carlo simulation. In addition to a “conventional” Monte Carlo simulation, the coherence of the external light field and the resulting coherence in the carrier system are fully taken into account. This allows us to treat the correct time dependence of the generation process, showing a time-dependent linewidth associated with recombination from states off resonance due to stimulated emission. The subsequent dephasing of the carriers due to scattering processes is analyzed. In addition, the simulation contains the carrier-carrier interaction in the Hartree-Fock approximation, giving rise to a band-gap renormalization and excitonic effects that cannot be treated in a conventional Monte Carlo simulation where polarization effects are neglected. Thus the approach presents a unified numerical method for the investigation of phenomena occurring close to the band gap and those typical for the energy relaxation of hot carriers.
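
    A toy sketch of the incoherent, “conventional” part of such an ensemble Monte Carlo simulation, with free flights interrupted by stochastic, thermalizing scattering events (the coherent generation, dephasing, and Hartree-Fock terms described above are not reproduced; all rates and material parameters are placeholder values):

```python
import numpy as np

N_CARRIERS = 10_000
N_STEPS = 2_000
DT = 1e-15                  # time step [s]
GAMMA = 5e12                # assumed total scattering rate [1/s]
M_EFF = 0.067 * 9.109e-31   # effective mass [kg], GaAs-like value
KB_T = 1.381e-23 * 300.0    # thermal energy at 300 K [J]

rng = np.random.default_rng(0)
# Photoexcited initial distribution: a narrow Gaussian in momentum (assumption).
p = rng.normal(0.0, 2e-26, size=(N_CARRIERS, 3))

for _ in range(N_STEPS):
    # Carriers that scatter during this step (Poisson process with rate GAMMA).
    scatters = rng.random(N_CARRIERS) < GAMMA * DT
    # Isotropic, thermalizing scattering: redraw momenta from a 300 K Maxwellian.
    p[scatters] = rng.normal(0.0, np.sqrt(M_EFF * KB_T), size=(scatters.sum(), 3))

mean_energy_eV = (p**2).sum(axis=1).mean() / (2 * M_EFF) / 1.602e-19
print(f"Mean kinetic energy after {N_STEPS * DT * 1e12:.1f} ps: {mean_energy_eV:.3f} eV")
```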

    Evaluation of the Performance of Information Theory-Based Methods and Cross-Correlation to Estimate the Functional Connectivity in Cortical Networks

    Functional connectivity of in vitro neuronal networks was estimated by applying different statistical algorithms to data collected by Micro-Electrode Arrays (MEAs). First, we tested these “connectivity methods” on neuronal network models at an increasing level of complexity and evaluated their performance in terms of ROC (Receiver Operating Characteristic) curves and the PPC (Positive Precision Curve), a newly defined complementary method developed specifically for the identification of functional links. Then, the algorithms that better estimated the actual connectivity of the network models were used to extract functional connectivity from cultured cortical networks coupled to MEAs. Among the proposed approaches, Transfer Entropy and Joint-Entropy showed the best results, suggesting these methods as good candidates for extracting functional links in actual neuronal networks from multi-site recordings.
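
    A minimal sketch of the simplest estimator in this family: pairwise cross-correlation peaks between binned spike trains used as directed connectivity scores, evaluated against a known ground-truth adjacency matrix with ROC analysis (the binning, lag range, and evaluation details are illustrative assumptions, not the paper's pipeline):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cross_correlation_scores(binned_spikes, max_lag=10):
    """Directed connectivity scores from the peak of the cross-correlogram.

    binned_spikes: array of shape (n_neurons, n_bins) of spike counts.
    Entry (i, j) is the maximum correlation of neuron i's activity with
    neuron j's activity at positive lags (i leading j).
    """
    n_neurons, n_bins = binned_spikes.shape
    z = (binned_spikes - binned_spikes.mean(axis=1, keepdims=True)) / (
        binned_spikes.std(axis=1, keepdims=True) + 1e-12)
    scores = np.zeros((n_neurons, n_neurons))
    for i in range(n_neurons):
        for j in range(n_neurons):
            if i == j:
                continue
            cc = [np.dot(z[i, :n_bins - lag], z[j, lag:]) / (n_bins - lag)
                  for lag in range(1, max_lag + 1)]
            scores[i, j] = max(cc)
    return scores

def connectivity_auc(scores, true_adjacency):
    """ROC AUC over off-diagonal entries against a binary ground-truth matrix
    (only applicable to simulated networks where the true wiring is known)."""
    mask = ~np.eye(true_adjacency.shape[0], dtype=bool)
    return roc_auc_score(true_adjacency[mask].astype(int), scores[mask])
```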