
44 research outputs found

Comment on Multigraviton Scattering in the Matrix Model

Author: Banks
Becker
Dine
Fabbrichesi
Joshua P. Gray
Okawa
Paban
Robert Echols
Seiberg
Sen
Susskind
Taylor
Publication venue: 'Elsevier BV'
Publication date: 01/01/1998

We show by explicit calculation that the matrix model effective action does not contain the term $v_{12}^2 v_{23}^2 v_{13}^2/{R^7 r^7}$ in the limit $R \gg r$, contradicting a result reported recently.

Comment: LaTeX, 10 pages. Note added and minor correction

arXiv.org e-Print Archive

CiteSeerX

Crossref

CERN Document Server

Perceiving individuals and groups: Expectancies, dispositional inferences, and causal attributions.

Author: David L. Hamilton
Jeffrey W. Sherman
Joshua Susskind
Kristin Maurer
Vinita Thakkar
Publication venue: 'American Psychological Association (APA)'
Publication date: 01/01/2002

Crossref

Generating Facial Expressions with Deep Belief Nets

Author: Adam K. Anderson
Geoffrey E. Hinton
Javier R. Movellan
Joshua M. Susskind
Publication venue: 'IntechOpen'
Publication date: 01/05/2008

IntechOpen

Crossref

Vanishing Gradients in Reinforcement Finetuning of Language Models

Author: Arwen Bradley
Etai Littwin
Preetum Nakkiran
Noam Razin
Omid Saremi
Joshua Susskind
Vimal Thilak
Hattie Zhou
Publication date: 31/10/2023

Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which entails maximizing a (possibly learned) reward function using policy gradient algorithms. This work highlights a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. Through experiments on an RFT benchmark and controlled environments, as well as a theoretical analysis, we then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial supervised finetuning (SFT) phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts. Overall, our results emphasize that being mindful of inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for successful execution of RFT.
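
The link between small reward standard deviation and a vanishing expected gradient can be illustrated with a minimal sketch. It assumes a toy single-step softmax policy over three candidate outputs with hand-picked rewards, which is not the autoregressive setting or the proof of the paper; the function names and numbers are hypothetical. In this toy case the gradient of the expected reward with respect to logit j is p_j (r_j - V), so its norm cannot exceed the reward standard deviation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_mean_and_std(logits, rewards):
    """Expected reward and reward standard deviation under the policy."""
    probs = softmax(logits)
    mean = sum(p * r for p, r in zip(probs, rewards))
    var = sum(p * (r - mean) ** 2 for p, r in zip(probs, rewards))
    return mean, math.sqrt(var)


def expected_reward_grad_norm(logits, rewards):
    """Euclidean norm of d(expected reward)/d(logits) for a softmax policy.

    For a softmax policy the j-th component is p_j * (r_j - mean).
    """
    probs = softmax(logits)
    mean = sum(p * r for p, r in zip(probs, rewards))
    grad = [p * (r - mean) for p, r in zip(probs, rewards)]
    return math.sqrt(sum(g * g for g in grad))


# Hypothetical toy setup: only the third output earns reward.
rewards = [0.0, 0.0, 1.0]
policies = {
    "peaked on a zero-reward output": [10.0, 0.0, 0.0],
    "uniform": [0.0, 0.0, 0.0],
}

for name, logits in policies.items():
    mean, std = reward_mean_and_std(logits, rewards)
    norm = expected_reward_grad_norm(logits, rewards)
    print(f"{name}: mean={mean:.6f} std={std:.6f} grad_norm={norm:.6f}")
```

The peaked policy has an expected reward far from the optimum of 1, yet its reward standard deviation and gradient norm are both close to zero, which is the failure mode the abstract describes.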

arXiv.org e-Print Archive