11 research outputs found

Language Models (Mostly) Know What They Know

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

Comment: 23+17 pages; refs added, typos fixed

arXiv.org e-Print Archive

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.

arXiv.org e-Print Archive

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Author: Askell Amanda
Bakhtin Anton
Chen Carol
Clark Jack
Durmus Esin
Ganguli Deep
Hatfield-Dodds Zac
Hernandez Danny
Joseph Nicholas
Kaplan Jared
Liao Thomas I.
Lovitt Liane
McCandlish Sam
Nguyen Karina
Schiefer Nicholas
Sikder Orowa
Tamkin Alex
Thamkul Janel
Publication date: 28/06/2023

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com

arXiv.org e-Print Archive

Outlier Blindness: A Neurobiological Foundation for Neglect of Financial Risk

Author: Aldo Rustichini
Alireza Soltani
Allen Parducci
Alok Kumar
Andrew Lo
B Benoit
B Simon
Botond Koszegi
Bryan Kelly
C Nicholas
Camillo Padoa
Cary Frydman
Catherine Donnelly
Christoph Ungemach
Christopher Summerfield
Colin Bredenberg
Dan Ariely
Daniel Kahneman
David Mclean
Deep Ganguli
Elise Payzan-Lenestour
Eugene F Fama
Franz Faul
Guy Aridor
Hang Zhang
Ing-Haw
J Arthur
Jakub Steiner
K Avinash
Lars A Lochstoer
Leon L Thurstone
Leon Tremblay
Levon Barseghyan
Matteo Carandini
Matthew Rabin
Mel W Khaw
Michael Woodford
N Nassim
Nicholas Barberis
Nicola Gennaioli
P Hyman
Paul Glimcher
Pedro Bordalo
Philippe N Tobler
R Ahna
Rafael Polania
Rahul Bhui
Ralph Hertwig
Roland Benabou
Stefano Giglio
Xavier Gabaix
Xue-Xin Wei
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020

Crossref

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

Author: Abid Abubakar
Agarwal Akshat
Agha Omar
Alabi Jesujoba
Ali Tariq
Alipoormolabashi Pegah
Aminnaseri Moin
Anand Sajant
Andreassen Anders
Arakawa Riku
Argueta Cedrick
Arnaud Melody
Asaadi Shima
Ashcraft Courtney
Askell Amanda
Bahri Yasaman
Bai Yuntao
Baitemirova Medina Orduna
Balis John U.
Banjade Rabin
Bansal Mohit
Baral Chitta
Barnes Elizabeth
Barnes Richard
Baturan Marco
Belinkov Yonatan
Berant Jonathan
Betz Gregor
Bevilacqua Michele
Biderman Stella
Bischoff Sebastian
Bogar Hayden
Bojanowski Bartłomiej
Bosma Maarten
Bosscher Jelle
Boudeman Joseph
Bowman Samuel R.
Brown Adam R.
Burden John
Buzan Dilyar
Cain Mike
Callison-Burch Chris
Cameron Nicholas
Casares Pablo Antonio Moreno
Casey Sean
Chang Ernie
Chang Peter
Chang Trenton
Chen Angelica
Chen Danqi
Chen Derek
Chen Qinlang
Chen Yifu
Chi Ethan A.
Chi Nathan
Chi Ryan
Chiafullo Kristen
Choi Yejin
Chollet Francois
Chu Eric
Chua Joyce
Cohen Michael
Colón Luis Oliveros
Constant Noah
Contreras-Ochando Lidia
Cubuk Ekin Dogus
Dai Andrew
Datta Debajyoti
Debnath
Deckers Niklas
Dehaene Stanislas
Delgado Ramón Risco
Demberg Vera
Desbordes Théo
Dhole Kaustubh D.
Diao Cameron
Dillavou Sam
Divic Stefan
Dohan David
Doiron Nick
Donoway Elizabeth
Doshi Parth
Dour Cameron
Drakard David
Dsouza Amanda
Dugan Liam
Dyer Ethan
Eckersley Peter
Efrat Avia
Ekmekci Berk
Elbaghdadi Omar
Emelin Denis
Engel Jesse
Erdem Aykut
Erdem Erkut
Ermon Stefano
Evans Owain
Farooqi Maheen
Faruqui Manaal
Fedus William
Fiedel Noah
Fisac Jaime Fernández
Fisch Adam
Frank Robert
Freeman Daniel
Frohberg Jörg
Fung Pascale
Gabriel Raefer
Galijasevic Hana
Ganguli Deep
Gao Leo
Garbacea Cristina
Garg Rhythm
Garrette Dan
Garriga-Alonso Adrià
Gehrmann Sebastian
Geissinger Jack
Gerstenberg Tobias
Geva Mor
Ghazarian Sarik
Gheini Mozhdeh
Gholamidavoodi Arash
Ghosh Sayan
Gilboa Dar
Gimpel Kevin
Giulianelli Mario
González Daniel Moseguí
Gopalakrishnan Karthik
Gottardi Anna
Gruetter Samuel
Gu Michael
Gu Shixiang Shane
Gupta Aditya
Gupta Animesh
Gur-Ari Guy
Habacker Rahel
Hagen Matthias
Hagerman Eleanor
Hajishirzi Hannaneh
Hamdan Shadi
Han Sanghyun
Hao Yiding
Happé Francesca
Hashimoto Tatsu
Hatwar Sriharsha
He Luheng
Hedayatnia Behnam
Hendrycks Dan
Hernandez Danny
Hernandez-Orallo Jose
Herrick Austin
Hilton Jacob
Hoeve Maartje ter
Hou Yu
Hou Yufang
Howald Blake
Htut Phu Mon
Hupkes Dieuwke
Hussain Aman
Hwang Pinyu
Ignatyeva Katerina
Inden Benjamin
Ippolito Daphne
Ivanitskiy Michael
Iyer Anantharaman S.
Iyer Niveditha S.
Jacobs Rowan
Jaimovitch-López Gonzalo
Jerzak Ethan
Jiang Angela
Jones Joseph
Jumelet Jaap
Jurgens David
Kale Mihir
Kanclerz Kamil
Kaplan Jared
Karakaş Ayla
Kernion Jackson
Keskar Nitish Shirish
Khashabi Daniel
Khot Tushar
Kilman Dan
Kim Ethan
Kim Hannah
Kim Jeremy
Kiritchenko Svetlana
Kirubarajan Arun
Kleyko Denis
Kluska Agnieszka
Kocoń Jan
Kocurek Alexander W.
Koppel James
Kornev Timofei
Krakover Neta Gur-Ari
Krauth Karl
Kruszewski Germán
Kwatra Sanjeev
La Andrew
Lakretz Yair
Lam Emma
Lam Lucas
Lampinen Andrew
Leavitt Matthew L.
LeBras Ronan
Lee Dong-Ho
Lee Jaehoon
Lee Nayeon
Lee Ryan
Lee Soo-Hwan
Levy Daniel
Levy Omer
Lewis Martha
Lewkowycz Aitor
Li Tao
Liang Paul Pu
Liang Percy
Liao Peiyuan
Lin Bill Yuchen
Lin Stephanie
Linzen Tal
Liu Rosanne
Livescu Karen
Loe Bao Sheng
Lyu Qing
Madotto Andrea
Makini Sneha Priscilla
Manning Christopher D.
Manyasi Eunice Engefu
Marelli Marco
Mariani Giorgio
Markert Katja
Marsh Jennifer
Martínez-Plumed Fernando
Maru Marco
Mathewson Kory
Mazeika Mantas
McDonell Kyle
McElrath Melvin
Mehta Harsh
Mei Qiaozhu
Melo Gerard de
Melzi Simone
Menezes Arul
Meng Chenlin
Metz Luke
Miller John
Millière Raphaël
Misherghi Summer
Mishra Gaurav
Mishra Swaroop
Misra Diganta
Misra Vedant
Miłkowski Piotr
Mohammad Saif M.
Mollo Dimitri Coelho
Morency Louis-Philippe
Moschella Luca
Muennighoff Niklas
Mukund Varma T
Mullokandov Asher
Nangia Nikita
Neeraj Trishala
Neyshabur Behnam
Ng Ian
Nie Allen
Nkinyili Tiberius
Noble Isaac
Noble Lucy
Norelli Antonio
Novak Roman
Novikova Jekaterina
Nyamai Victoria
Oli Priti
Omondi Kevin
Pachchigar Shubh
Padmakumar Vishakh
Parascandolo Giambattista
Parrish Alicia
Patil Piyush
Pavlick Ellie
Peng Nanyun
Perszyk Danielle
Pezeshkpour Pouya
Phan Thomas
Phang Jason
Piantadosi Steven T.
Potthast Martin
Potts Christopher
Power Alethea
Prabhu Vinay Uday
Prasad Stephen
Qin Lianhui
Quintana Maria Jose Ramírez
Radom Jarema
Raffel Colin
Rahane Ameet
Ramasesh Vinay
Ramirez Cindy
Ramírez César Ferri
Rao Abhishek
Rashkin Hannah
Rastogi Abhinav
Rathkopf Charles
Raunak Vikas
Ray Alex
Raymaekers Robbe
Reddy Siva
Ren Xiang
Reynolds Laria
Richardson Kyle
Rivera Clara E.
Roberts B. Ryan
Roberts Nicholas
Rodola Emanuele
Rong Frieda
Roth Dan
Rothschild Theodore
Rous Sarah A.
Rozen Jos
Rudolph Rachel Etta
Rule Joshua S.
Sabharwal Ashish
Sadeghi Sepideh
Safaya Ali
Salakhutdinov Ruslan
Santilli Andrea
Santoro Adam
Sap Maarten
Saunders William
Saurous Rif A.
Schick Timo
Schmidt Ludwig
Schoenholz Samuel S.
Schubert Mátyás
Schuster Sebastian
Schuster Tal
Schütze Hinrich
Segal Elad
Seid Zachary
Shaham Uri
Shakeri Siamak
Shen Xudong
Shevlin Henry
Shi Sherry
Shieber Stuart M.
Shkaruta Ksenia
Shleifer Sam
Shoeb Abu Awal Md
Shridhar Kumar
Shultz Tyler
Shutova Ekaterina
Shyamolima
Siar Fatemeh
Sikand Rohan
Sileo Damien
Simon James B.
Singh Chandan
Singh Shikhar
Siro Clemencia
Sitelew Roman
Slone Ambrose
Sohl-Dickstein Jascha
Song Jiaming
Song Yangqiu
Srikumar Vivek
Srivastava Aarohi
Srivastava Shashank
Starritt Michael
Stein Benno
Stinson Catherine
Stovall Ryan
Strube Michael
Stuhlmüller Andreas
Suzgun Mirac
Swędrowski Michał
Taal Jeroen
Tabassum Arfa
Tam Derek
Tang Eric
Tang Jillian
Tazarv Ali
Teehan Ryan
Telleen-Lawton Timothy
Tenenbaum Joshua B.
Thompson Jana
Thormeyer Simon
Tiwari Mo
Tolkiehn Marie
Tong Xiaoyu
Torene Spencer
Toshniwal Shubham
Tunduny Titus
Upadhyay Shyam
Venkatesh Anu
Vicol Paul
Voigt Christian
Vossen Wout
Vuong Anh
Waites Chris
Wang Gloria
Wang Tianle
Wang Zijian
Wang Zijie J.
Wang Zirui
Warstadt Alex
Waweru Joan
Wei Jason
Wen Nuan
Winata Genta Indra
Wiseman Sam
Wong Hugh Mee
Wu Chiyu
Wu Te-Lin
Wu Xinyi
Wu Ziyi
Xia Fanyue
Xiang Alice
Xu Jiacheng
Xu Mimee
Yaghoobzadeh Yadollah
Yakura Hiromu
Yang Diyi
Yang Rylan
Yang Yichi
Yasunaga Michihiro
Yee Michael A.
Yosinski Jason
Yu Tao
Yuret Deniz
Zhang Hongming
Zhang Li
Zhang Oliver
Zhang Rui
Zhang William
Zhao Xinran
Zhao Zhuoye
Zheltonozhskii Evgenii
Zheng James
Zhou Sharon
Zoph Barret
Zou Andy
Zou James
Özyurt Batuhan
Şenel Lütfi Kerem
Publication date: 09/06/2022

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.