
    Deep reinforcement learning for workload balance and due date control in wafer fabs

    Semiconductor wafer fabrication facilities (wafer fabs) often prioritize two operational objectives: work-in-process (WIP) and due date. WIP-oriented and due date-oriented dispatching rules are two commonly used methods to achieve workload balance and on-time delivery, respectively. However, achieving both objectives simultaneously often requires sophisticated heuristics. In this paper, we propose a novel approach using deep-Q-network reinforcement learning (DRL) for dispatching in wafer fabs. The DRL approach differs from traditional dispatching methods by using dispatch agents at work-centers to observe state changes in the wafer fab. The agents train their deep-Q-networks by taking the states as inputs, allowing them to select the most appropriate dispatch action. Additionally, the reward function is integrated with workload and due date information on both local and global levels. Compared to the traditional WIP- and due date-oriented rules, as well as heuristic-based rules from the literature, the DRL approach is able to produce better global performance with regard to workload balance and on-time delivery.
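    The abstract describes per-work-center agents whose deep Q-networks map an observed state to a choice among dispatching rules, with a reward that mixes workload-balance and due-date terms at local and global levels. The sketch below illustrates that structure only; the state size, the two candidate rules, and the reward weights are assumptions for illustration, not details taken from the paper.

        import random
        import torch
        import torch.nn as nn

        STATE_DIM = 6   # assumed features: local WIP, queue slack, global WIP spread, ...
        N_ACTIONS = 2   # assumed candidate rules: 0 = WIP-balancing, 1 = earliest due date

        class DispatchDQN(nn.Module):
            """Q-network mapping a work-center state to one Q-value per dispatch rule."""
            def __init__(self):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(STATE_DIM, 32), nn.ReLU(),
                    nn.Linear(32, N_ACTIONS),
                )

            def forward(self, state):
                return self.net(state)

        def reward(local_wip_imbalance, global_wip_imbalance,
                   local_tardiness, global_tardiness, w=(0.25, 0.25, 0.25, 0.25)):
            # Reward integrates workload and due-date information at both local and
            # global levels, as in the abstract; equal weights are an assumption.
            return -(w[0] * local_wip_imbalance + w[1] * global_wip_imbalance
                     + w[2] * local_tardiness + w[3] * global_tardiness)

        def select_action(q_net, state, epsilon=0.1):
            # Epsilon-greedy choice over the candidate dispatching rules.
            if random.random() < epsilon:
                return random.randrange(N_ACTIONS)
            with torch.no_grad():
                return int(q_net(state).argmax())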

    A general theory of intertemporal decision-making and the perception of time

    Animals and humans make decisions based on their expected outcomes. Since relevant outcomes are often delayed, perceiving delays and choosing between earlier versus later rewards (intertemporal decision-making) is an essential component of animal behavior. The myriad observations made in experiments studying intertemporal decision-making and time perception have not yet been rationalized within a single theory. Here we present a theory, Training-Integrated Maximized Estimation of Reinforcement Rate (TIMERR), that explains a wide variety of behavioral observations made in intertemporal decision-making and the perception of time. Our theory postulates that animals make intertemporal choices to optimize expected reward rates over a limited temporal window; this window includes a past integration interval (over which experienced reward rate is estimated) and the expected delay to future reward. Using this theory, we derive a mathematical expression for the subjective representation of time. A unique contribution of our work is the finding that the past integration interval directly determines the steepness of temporal discounting and the nonlinearity of time perception. In so doing, our theory provides a single framework for understanding both intertemporal decision-making and time perception. Comment: 37 pages, 4 main figures, 3 supplementary figures
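    The window described (a past integration interval plus the delay to the prospective reward) implies a simple reward-rate comparison. The sketch below is a hedged reading of that rule: past_rate is the reward rate experienced over the past interval T, and an option offering reward r after delay t is valued by the total reward rate over the combined window; the variable names are assumptions, not the paper's notation.

        def timerr_value(r, t, past_rate, T):
            # Reward rate over the limited window: reward accrued over the past
            # interval T (at rate past_rate) plus the candidate reward r,
            # divided by the total time T + t.
            return (past_rate * T + r) / (T + t)

        # Choosing between a smaller-sooner and a larger-later reward then amounts to
        # comparing timerr_value(r1, t1, past_rate, T) with timerr_value(r2, t2, past_rate, T);
        # a shorter past interval T makes temporal discounting steeper, as the abstract notes.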

    Neurobehavioral Measurements of Natural and Opioid Reward Value

    In the last decade, (non)prescription opioid abuse, opioid use disorder (OUD) diagnoses, and opioid-related overdoses have risen and represent a significant public health concern. One method of understanding OUD is as a disorder of choice that involves choosing opioid rewards at the expense of other nondrug rewards. The characterization of OUD as a disorder of choice is important because it implicates decision-making processes as therapeutic targets, such as the valuation of opioid rewards. However, reward-value measurement and interpretation are traditionally different in substance abuse research compared to related fields such as economics, animal behavior, and neuroeconomics, and may be less effective for understanding how opioid rewards are valued. The present research therefore used choice procedures in line with behavioral/neuroeconomic studies to determine whether drug-associated decision-making could be predicted from economic choice theories. In Experiment 1, rats completed an isomorphic food-food probabilistic choice task with dynamic, unpredictable changes in reward probability that required constant updating of reward values. After initial training, the reward magnitude of one choice subsequently increased from one to two to three pellets. Additionally, rats were split between Signaled and Unsignaled groups to understand how cues modulate reward value. After each choice, the Unsignaled group received distinct choice-dependent cues that were uninformative of the choice outcome. The Signaled group also received uninformative cues on one option, but the alternative choice produced reward-predictive cues that informed the trial outcome as a win or loss. Choice data were analyzed at a molar level using matching equations and at a molecular level using reinforcement learning (RL) models to determine how probability, reward magnitude, and reward-associated cues affected choice. Experiment 2 used an allomorphic drug-versus-food procedure in which the food reward for one option was replaced by a self-administered remifentanil (REMI) infusion at doses of 1, 3, and 10 μg/kg. Finally, Experiment 3 assessed the potential for both REMI and food reward value to be commonly scaled within the brain by examining changes in nucleus accumbens (NAc) oxygen (O2) dynamics. Results showed that increasing reward probability, increasing reward magnitude, and the presence of reward-associated cues all independently increased the propensity of choosing the associated choice alternative, including REMI drug choices. Additionally, both molar matching and molecular RL models successfully parameterized rats’ decision dynamics. O2 dynamics were generally commensurate with the idea of a common value signal for REMI and food, with changes in O2 signaling scaling with the reward magnitude of REMI rewards. Finally, RL model-derived reward prediction errors significantly correlated with peak O2 activity at reward delivery, suggesting a possible neurological mechanism of value updating. Results are discussed in terms of their implications for current conceptualizations of substance use disorders, including a potential need to change the discourse surrounding how substance use disorders are modeled experimentally. Overall, the present research provides evidence that a choice model of substance use disorders may be a viable alternative to the disease model and could facilitate future treatment options centered around economic principles.
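    The molecular-level analysis mentioned above fits reinforcement learning models to trial-by-trial choices. A minimal sketch of that model class (a learning-rate update of option values plus a softmax choice rule) is shown below; the parameter values are illustrative, not estimates from the dissertation.

        import numpy as np

        def q_update(q, choice, reward, alpha=0.2):
            # Move the chosen option's value toward the obtained reward by the
            # reward prediction error, scaled by learning rate alpha.
            rpe = reward - q[choice]
            q[choice] += alpha * rpe
            return q, rpe

        def choice_probs(q, beta=3.0):
            # Softmax choice rule: larger beta means choices track values more tightly.
            ex = np.exp(beta * (q - q.max()))
            return ex / ex.sum()

        # Example: two options, one rewarded choice of option 0 updates its value
        # and shifts subsequent choice probabilities toward it.
        q = np.zeros(2)
        q, rpe = q_update(q, choice=0, reward=1.0)
        print(choice_probs(q), rpe)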

    A Local Circuit Model of Learned Striatal and Dopamine Cell Responses under Probabilistic Schedules of Reward

    Before choosing, it helps to know both the expected value signaled by a predictive cue and the associated uncertainty that the reward will be forthcoming. Recently, Fiorillo et al. (2003) found that the dopamine (DA) neurons of the SNc exhibit sustained responses related to the uncertainty that a cue will be followed by reward, in addition to phasic responses related to reward prediction errors (RPEs). This suggests that cue-dependent anticipations of the timing, magnitude, and uncertainty of rewards are learned and reflected in components of the DA signals broadcast by SNc neurons. What is the minimal local circuit model that can explain such multifaceted reward-related learning? A new computational model shows how learned uncertainty responses emerge robustly on single trials along with phasic RPE responses, such that both types of DA responses exhibit the empirically observed dependence on conditional probability, expected value of reward, and time since onset of the reward-predicting cue. The model includes three major pathways for computing: immediate expected values of cues, timed predictions of reward magnitudes (and RPEs), and the uncertainty associated with these predictions. The first two model pathways refine those previously modeled by Brown et al. (1999). A third, newly modeled, pathway is formed by medium spiny projection neurons (MSPNs) of the matrix compartment of the striatum, whose axons co-release GABA and a neuropeptide, substance P, both at synapses with GABAergic neurons in the SNr and with the dendrites (in SNr) of DA neurons whose somas are in ventral SNc. Co-release enables efficient computation of sustained DA uncertainty responses that are a non-monotonic function of the conditional probability that a reward will follow the cue. The new model's incorporation of a striatal microcircuit allowed it to reveal that variability in striatal cholinergic transmission can explain observed differences between monkeys in the amplitude of the non-monotonic uncertainty function. Involvement of matrix MSPNs and striatal cholinergic transmission implies a relation between uncertainty in the cue-reward contingency and action-selection functions of the basal ganglia. The model synthesizes anatomical, electrophysiological, and behavioral data regarding the midbrain DA system in a novel way, by relating the ability to compute uncertainty, in parallel with other aspects of reward contingencies, to the unique distribution of SP inputs in ventral SN. National Science Foundation (SBE-354378); Higher Educational Council of Turkey; Canakkale Onsekiz Mart University of Turkey
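    The sustained uncertainty response described above is a non-monotonic function of the conditional reward probability, peaking at intermediate probabilities. The snippet below only illustrates that qualitative shape using the variance of a Bernoulli reward; it is not the model's striatal circuit computation.

        def bernoulli_reward_uncertainty(p):
            # Variance of a reward delivered with probability p: zero at p = 0 and
            # p = 1, maximal at p = 0.5, matching the non-monotonic profile the
            # sustained DA response follows.
            return p * (1.0 - p)

        for p in (0.0, 0.25, 0.5, 0.75, 1.0):
            print(p, bernoulli_reward_uncertainty(p))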

    Automating Staged Rollout with Reinforcement Learning

    Staged rollout is a strategy of incrementally releasing software updates to portions of the user population in order to accelerate defect discovery without incurring catastrophic outcomes such as system-wide outages. Some past studies have examined how to quantify and automate staged rollout, but they stop short of simultaneously considering multiple product or process metrics explicitly. This paper demonstrates the potential to automate staged rollout with multi-objective reinforcement learning in order to dynamically balance stakeholder needs such as time to deliver new features and downtime incurred by failures due to latent defects.
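    The multi-objective balance described above can be made concrete with a scalarized reward that penalizes both slow feature delivery and failure-induced downtime. The sketch below is an illustrative assumption about how such a reward could be composed; the paper's actual formulation may keep the objectives separate rather than weighting them linearly.

        def rollout_reward(delivery_delay, failure_downtime, w_delay=0.5, w_downtime=0.5):
            # Scalarized multi-objective reward: penalize time to deliver new features
            # and downtime incurred by latent defects; the linear weights are assumed.
            return -(w_delay * delivery_delay + w_downtime * failure_downtime)

        # An agent choosing how large a user fraction to expose at each stage would be
        # trained to maximize the cumulative sum of this reward over a release cycle.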