Language Model Alignment with Elastic Reset
Finetuning language models with reinforcement learning (RL), e.g. from human
feedback (HF), is a prominent method for alignment. But optimizing against a
reward model can improve on reward while degrading performance in other areas,
a phenomenon known as reward hacking, alignment tax, or language drift. First,
we argue that commonly-used test metrics are insufficient and instead measure
how different algorithms trade off between reward and drift. The standard method
modifies the reward with a Kullback-Leibler (KL) penalty between the online and
initial model. We propose Elastic Reset, a new algorithm that achieves higher
reward with less drift without explicitly modifying the training objective. We
periodically reset the online model to an exponential moving average (EMA) of
itself, then reset the EMA model to the initial model. Through the use of an
EMA, our model recovers quickly after resets and achieves higher reward with
less drift in the same number of steps. We demonstrate that fine-tuning
language models with Elastic Reset leads to state-of-the-art performance on a
small scale pivot-translation benchmark, outperforms all baselines in a
medium-scale RLHF-like IMDB mock sentiment task and leads to a more performant
and more aligned technical QA chatbot with LLaMA-7B. Code available at
github.com/mnoukhov/elastic-reset.
Comment: Published at NeurIPS 202
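The reset schedule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see their repository for that). It assumes the model parameters are a flat list of floats, and the names `ema_update`, `elastic_reset_train`, `step_fn`, `reset_every`, and `decay` are hypothetical, chosen only for this illustration.

```python
from typing import Callable, List, Tuple

Params = List[float]


def ema_update(ema: Params, online: Params, decay: float) -> Params:
    """Move each EMA weight a fraction (1 - decay) toward the online weight."""
    return [decay * e + (1.0 - decay) * o for e, o in zip(ema, online)]


def elastic_reset_train(
    init: Params,
    step_fn: Callable[[Params], Params],  # hypothetical: one RL update on the online model
    num_steps: int,
    reset_every: int,
    decay: float = 0.9,
) -> Tuple[Params, Params]:
    """Toy loop: train, track an EMA, and periodically reset both models."""
    online = list(init)
    ema = list(init)
    for step in range(1, num_steps + 1):
        online = step_fn(online)               # ordinary training step
        ema = ema_update(ema, online, decay)   # EMA follows the online model
        if step % reset_every == 0:
            online = list(ema)                 # reset online model to its EMA
            ema = list(init)                   # reset EMA to the initial model
    return online, ema


# Toy usage: the "training step" simply adds 1.0 to every weight.
online, ema = elastic_reset_train(
    init=[0.0],
    step_fn=lambda p: [w + 1.0 for w in p],
    num_steps=2,
    reset_every=2,
    decay=0.5,
)
```

In this toy run the online weight ends at the EMA value rather than at the fully trained value, and the EMA returns to the initial weight. That is the pull back toward the initial model which the abstract describes, achieved without changing the training objective.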
INTRACELLULAR TARGETS OF SPHINGOSINE-1-PHOSPHATE
The bioactive lipid mediator sphingosine-1-phosphate (S1P) has emerged as a key regulator of a variety of important physiological functions, including cell growth, cell survival, cell motility, angiogenesis, lymphocyte trafficking, and mast cell function. S1P is formed by two different sphingosine kinases (SphKs) and binds to a family of 5 differentially expressed G-protein-coupled receptors (S1PRs). The majority of research to date has focused on the activation of these receptors, but there is compelling evidence to suggest that S1P exerts intracellular functions independent of S1PRs. However, no bona fide intracellular targets of S1P have been identified. In my dissertation, I have identified a novel intracellular binding protein for S1P. This finding has important implications for the pleiotropic actions of S1P
Integration of the End Cap TEC+ of the CMS Silicon Strip Tracker
The silicon strip tracker of the CMS experiment was completed and inserted into the CMS detector in late 2007. The largest sub-system of the tracker is its end cap system, comprising two large end caps (TEC) each containing 3200 silicon strip modules. To ease construction, the end caps feature a modular design: groups of about 20 silicon modules are placed on sub-assemblies called petals and these self-contained elements are then mounted into the TEC support structures. Each end cap consists of 144 petals, and the insertion of these petals into the end cap structure is referred to as TEC integration. The two end caps were integrated independently in Aachen (TEC+) and at CERN (TEC--). This note deals with the integration of TEC+, describing procedures for end cap integration and for quality control during testing of integrated sections of the end cap and presenting results from the testing
Over-communicate no more: Situated RL agents learn concise communication protocols
While it is known that communication facilitates cooperation in multi-agent
settings, it is unclear how to design artificial agents that can learn to
effectively and efficiently communicate with each other. Much research on
communication emergence uses reinforcement learning (RL) and explores
unsituated communication in one-step referential tasks -- the tasks are not
temporally interactive and lack time pressures typically present in natural
communication. In these settings, agents may successfully learn to communicate,
but they do not learn to exchange information concisely -- they tend towards
over-communication and an inefficient encoding. Here, we explore situated
communication in a multi-step task, where the acting agent has to forgo an
environmental action to communicate. Thus, we impose an opportunity cost on
communication and mimic the real-world pressure of passing time. We compare
communication emergence under this pressure against learning to communicate
with a cost on articulation effort, implemented as a per-message penalty (fixed
and progressively increasing). We find that while all tested pressures can
disincentivise over-communication, situated communication does it most
effectively and, unlike the cost on effort, does not negatively impact
emergence. Implementing an opportunity cost on communication in a temporally
extended environment is a step towards embodiment, and might be a pre-condition
for incentivising efficient, human-like communication
Reaction of Pertechnetate in Highly Alkaline Solution: Synthesis and Characterization of the Nitridotrioxotechnetate Ba[TcO3N]
Estimates of sea surface height and near-surface alongshore coastal currents from combinations of altimeters and tide gauges
Present methods used to retrieve altimeter data do not provide reliable estimates of
sea surface height (SSH) in the nearshore region, resulting in a measurement gap of
25–50 km next to the coast. In the present work, gridded SSH fields produced by
Archiving, Validation, and Interpretation of Satellite Oceanographic data (AVISO) in the
offshore region are combined with coastal tide gauge time series of SSH to improve
estimation in that gap along the west coast of the United States in the northern California
Current System between 40° and 45°N and 123.8° and 126°W. To assess the increase
in skill provided by this procedure, the geostrophic alongshore currents, calculated from
the new SSH fields in the gap region, are compared to three in situ, nearshore current
measurements, resulting in correlation coefficients of 0.73–0.83 and standard deviations
of the differences of 11.6–12.6 cm/s, substantially improved from the AVISO-only results.
When the Ekman current components are estimated and added to the geostrophic
currents, comparisons to the 10 m deep acoustic Doppler current profiler velocities are
only slightly improved. The Ekman components make a more significant contribution
when compared to HF radar surface current measurements, providing correlations of
0.94 and standard deviations of the differences of 6.4–9.5 cm/s. These results represent a
dramatic improvement in the quality of the SSH fields and estimated alongshore
currents when additional, realistic SSH data from the coastal region are added.
Here we use coastal tide gauges to provide the additional SSH data but also discuss more
general approaches for altimeter SSH retrievals in coastal regions where tide
gauge data are not available
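As a rough illustration of how an alongshore geostrophic current follows from a cross-shore SSH difference, the sketch below applies the standard geostrophic balance v = (g / f) * d(eta)/dx, with x positive eastward and v positive northward. It is not the procedure used in the paper. The function names, the two-point finite difference, and the example numbers are assumptions made only for this illustration.

```python
import math

G = 9.81           # gravitational acceleration, m/s^2
OMEGA = 7.2921e-5  # Earth's rotation rate, rad/s


def coriolis(lat_deg: float) -> float:
    """Coriolis parameter f = 2 * Omega * sin(latitude), in 1/s."""
    return 2.0 * OMEGA * math.sin(math.radians(lat_deg))


def geostrophic_alongshore(
    ssh_offshore_m: float,
    ssh_coast_m: float,
    distance_m: float,
    lat_deg: float,
) -> float:
    """Northward geostrophic velocity (m/s) from two SSH values.

    Assumes a north-south coastline with the coastal point lying
    `distance_m` east of the offshore point, as on the US west coast.
    """
    slope = (ssh_coast_m - ssh_offshore_m) / distance_m  # d(eta)/dx, eastward
    return G / coriolis(lat_deg) * slope


# Hypothetical numbers: SSH 5 cm higher at the coast than 50 km offshore, at 45 N.
v = geostrophic_alongshore(0.00, 0.05, 50_000.0, 45.0)
```

With these made-up inputs the result is a northward current of a little under 0.1 m/s. A coastal SSH lower than the offshore value would give a southward current instead.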
Search for Neutral Heavy Leptons Produced in Z Decays
Weak isosinglet Neutral Heavy Leptons have been searched for using data collected by the DELPHI detector corresponding to hadronic Z decays at LEP1. Four separate searches have been performed, for short-lived production giving monojet or acollinear jet topologies, and for long-lived giving detectable secondary vertices or calorimeter clusters. No indication of the existence of these particles has been found, leading to an upper limit for the branching ratio Z of about at 95% confidence level for masses between 3.5 and 50 GeV/. Outside this range the limit weakens rapidly with the mass. The results are also interpreted in terms of limits for the single production of excited neutrinos
