87 research outputs found

    Guarded Policy Optimization with Imperfect Online Demonstrations

    The Teacher-Student Framework (TSF) is a reinforcement learning setting where a teacher agent guards the training of a student agent by intervening and providing online demonstrations. When assumed optimal, the teacher policy has the perfect timing and capability to intervene in the learning process of the student agent, providing a safety guarantee and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that the proposed TS2C algorithm attains efficient exploration and a substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy in terms of higher accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C. Comment: Accepted at ICLR 2023 (top 25%)

    State Regularized Policy Optimization on Data with Dynamics Shift

    In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics. A majority of current methods address this issue by training context encoders to identify environment parameters. Data with dynamics shift are separated according to their environment parameters to train the corresponding policy. However, these methods can be sample inefficient as data are used ad hoc, and policies trained for one dynamics cannot benefit from data collected in all other environments with different dynamics. In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions. We exploit this property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. This distribution is used to regularize the policy trained in a new environment, leading to the SRPO (State Regularized Policy Optimization) algorithm. To conduct theoretical analyses, the intuition of similar environment structures is characterized by the notion of homomorphous MDPs. We then demonstrate a lower-bound performance guarantee on policies regularized by the stationary state distribution. In practice, SRPO can be an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO can make several context-based algorithms far more data efficient and significantly improve their overall performance. Comment: Preprint. Under Review

    PrefRec: Recommender Systems with Human Preferences for Reinforcing Long-term User Engagement

    Current advances in recommender systems have been remarkably successful in optimizing immediate engagement. However, long-term user engagement, a more desirable performance metric, remains difficult to improve. Meanwhile, recent reinforcement learning (RL) algorithms have shown their effectiveness in a variety of long-term goal optimization tasks. For this reason, RL is widely considered a promising framework for optimizing long-term user engagement in recommendation. Though promising, the application of RL relies heavily on well-designed rewards, and designing rewards related to long-term user engagement is quite difficult. To mitigate the problem, we propose a novel paradigm, recommender systems with human preferences (or Preference-based Recommender systems, PrefRec), which allows RL recommender systems to learn from preferences about users' historical behaviors rather than explicitly defined rewards. Such preferences are easily accessible through techniques such as crowdsourcing, as they do not require any expert knowledge. With PrefRec, we can fully exploit the advantages of RL in optimizing long-term goals, while avoiding complex reward engineering. PrefRec uses the preferences to automatically train a reward function in an end-to-end manner. The reward function is then used to generate learning signals to train the recommendation policy. Furthermore, we design an effective optimization method for PrefRec, which uses an additional value function, expectile regression and reward model pre-training to improve the performance. We conduct experiments on a variety of long-term user engagement optimization tasks. The results show that PrefRec significantly outperforms previous state-of-the-art methods in all the tasks.

    Could a Kilonova Kill: a Threat Assessment

    Binary neutron star mergers (BNS) produce high-energy emissions from several physically different sources, including a gamma-ray burst (GRB) and its afterglow, a kilonova, and, at late times, a remnant many parsecs in size. Ionizing radiation from these sources can be dangerous for life on Earth-like planets when located too close. Work to date has explored the substantial danger posed by the GRB to on-axis observers: here we focus instead on the potential threats posed to nearby off-axis observers. Our analysis is based largely on observations of the GW 170817/GRB 170817A multi-messenger event, as well as theoretical predictions. For baseline kilonova parameters, we find that the X-ray emission from the afterglow may be lethal out to ∼5 pc and the off-axis gamma-ray emission may threaten a range out to ∼4 pc, whereas the greatest threat comes years after the explosion, from the cosmic rays accelerated by the kilonova blast, which can be lethal out to distances up to ∼11 pc. The distances quoted here are typical, but the values have significant uncertainties and depend on the viewing angle, ejected mass, and explosion energy in ways we quantify. Assessing the overall threat to Earth-like planets, kilonovae have a similar kill distance to supernovae, but are far less common. However, our results rely on the scant available kilonova data, and multi-messenger observations will clarify the danger posed by such events. Comment: 21 pages, 5 figures. Comments welcome

    Thallium-208: a beacon of in situ neutron capture nucleosynthesis

    We demonstrate that the well-known 2.6 MeV gamma-ray emission line from thallium-208 could serve as a real-time indicator of astrophysical heavy element production, with both rapid (r) and intermediate (i) neutron capture processes capable of its synthesis. We consider the r process in a Galactic neutron star merger and show Tl-208 to be detectable from ~12 hours to ~10 days, and again ~1-20 years post-event. Detection of Tl-208 represents the only identified prospect for a direct signal of lead production (implying gold synthesis), arguing for the importance of future MeV telescope missions which aim to detect Galactic events but may also be able to reach some nearby galaxies in the Local Group. Comment: accepted to PR

    Proposed Lunar Measurements of r-Process Radioisotopes to Distinguish Origin of Deep-sea 244Pu

    244Pu has recently been discovered in deep-sea deposits spanning the past 10 Myr, a period that includes two 60Fe pulses from nearby supernovae. 244Pu is among the heaviest r-process products, and we consider whether it was created in the supernovae, which is disfavored by nucleosynthesis simulations, or in an earlier kilonova event that seeded 244Pu in the nearby interstellar medium that was subsequently swept up by the supernova debris. We discuss how these possibilities can be probed by measuring 244Pu and other r-process radioisotopes such as 129I and 182Hf, both in lunar regolith samples returned to Earth by missions such as Chang'e and Artemis, and in deep-sea deposits. Comment: Extensive rewrite of v1 with added emphasis of lunar sample return missions, including Artemis and Chang'e. 11 pages, 4 figures, 2 tables

    Transcriptome analysis reveals salt-stress-regulated biological processes and key pathways in roots of cotton (Gossypium hirsutum L.)

    High salinity is one of the main factors limiting cotton growth and productivity. The genes that regulate salt stress in TM-1 upland cotton were monitored using microarray and real-time PCR (RT-PCR) with samples taken from roots. Microarray analysis showed that 1503 probe sets were up-regulated and 1490 probe sets were down-regulated in plants exposed for 3 h to 100 mM NaCl, and RT-PCR analysis validated 42 relevant/related genes. The distribution of enriched gene ontology terms showed that important processes such as the response to water stress and pathways of hormone metabolism and signal transduction were induced by the NaCl treatment. Some key regulatory gene families involved in abiotic and biotic sources of stress, such as WRKY, ERF, and JAZ, were differentially expressed. Our transcriptome analysis might provide some useful insights into salt-mediated signal transduction pathways in cotton and offer a number of candidate genes as potential markers of tolerance to salt stress.

    Gene Expression Profiles Deciphering Rice Phenotypic Variation between Nipponbare (Japonica) and 93-11 (Indica) during Oxidative Stress

    Rice is a very important food staple that feeds more than half the world's population. Two major Asian cultivated rice (Oryza sativa L.) subspecies, japonica and indica, show significant phenotypic variation in their stress responses. However, the molecular mechanisms underlying this phenotypic variation are still largely unknown. A common link among different stresses is that they produce an oxidative burst and result in an increase of reactive oxygen species (ROS). In this study, methyl viologen (MV) as a ROS agent was applied to investigate the rice oxidative stress response. We observed that 93-11 (indica) seedlings exhibited leaf senescence with severe lesions under MV treatment compared to Nipponbare (japonica). Whole-genome microarray experiments were conducted, and 1,062 probe sets were identified with gene expression level polymorphisms between the two rice cultivars in addition to differential expression under MV treatment, which were assigned as Core Intersectional Probesets (CIPs). These CIPs were analyzed by gene ontology (GO) and highlighted with enriched GO terms related to toxin and oxidative stress responses as well as other responses. These GO term-enriched genes of the CIPs include glutathione S-transferases (GSTs), P450, plant defense genes, and secondary metabolism related genes such as chalcone synthase (CHS). Further insertion/deletion (InDel) and regulatory element analyses for these identified CIPs suggested that there may be some eQTL hotspots related to oxidative stress in the rice genome, such as GST genes encoded on chromosome 10. In addition, we identified a group of marker genes individuating the japonica and indica subspecies. In summary, we developed a new strategy combining biological experiments and data mining to study the possible molecular mechanism of phenotypic variation during oxidative stress between Nipponbare and 93-11. This study will aid in the analysis of the molecular basis of quantitative traits.