Human Feedback is not Gold Standard
Human feedback has become the de facto standard for evaluating the
performance of Large Language Models, and is increasingly being used as a
training objective. However, it is not clear which properties of a generated
output this single `preference' score captures. We hypothesise that preference
scores are subjective and open to undesirable biases. We critically analyse the
use of human feedback for both training and evaluation, to verify whether it
fully captures a range of crucial error criteria. We find that while preference
scores have fairly good coverage, they under-represent important aspects like
factuality. We further hypothesise that both preference scores and error
annotation may be affected by confounders, and leverage instruction-tuned
models to generate outputs that vary along two possible confounding dimensions:
assertiveness and complexity. We find that the assertiveness of an output skews
the perceived rate of factuality errors, indicating that human annotations are
not a fully reliable evaluation metric or training objective. Finally, we offer
preliminary evidence that using human feedback as a training objective
disproportionately increases the assertiveness of model outputs. We encourage
future work to carefully consider whether preference scores are well aligned
with the desired objective.
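The variation along the two confounding dimensions can be pictured as prompting an instruction-tuned model with contrasting style instructions. Below is a minimal, hypothetical sketch of that idea; the prompt wording, the dimension values, and the `query_model` helper are illustrative assumptions, not the authors' actual setup.

```python
# Hypothetical sketch: generate answer variants that differ only in
# assertiveness and complexity, so their effect on human judgements can be
# compared. `query_model` stands in for any instruction-tuned LLM API.

ASSERTIVENESS = {
    "hedged": "Answer cautiously, hedging any claim you are not sure about.",
    "assertive": "Answer confidently and assertively, without hedging.",
}
COMPLEXITY = {
    "simple": "Use short sentences and plain, everyday vocabulary.",
    "complex": "Use longer sentences and technical, formal vocabulary.",
}

def generate_variants(question, query_model):
    """Return one answer per (assertiveness, complexity) combination."""
    variants = {}
    for a_name, a_instr in ASSERTIVENESS.items():
        for c_name, c_instr in COMPLEXITY.items():
            prompt = f"{a_instr} {c_instr}\n\nQuestion: {question}\nAnswer:"
            variants[(a_name, c_name)] = query_model(prompt)
    return variants
```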
Interpretation of Natural Language Rules in Conversational Machine Reading
Most work in machine reading focuses on question answering problems where the
answer is directly expressed in the text to read. However, many real-world
question answering problems require the reading of text not because it contains
the literal answer, but because it contains a recipe to derive an answer
together with the reader's background knowledge. One example is the task of
interpreting regulations to answer "Can I...?" or "Do I have to...?" questions
such as "I am working in Canada. Do I have to carry on paying UK National
Insurance?" after reading a UK government website about this topic. This task
requires both the interpretation of rules and the application of background
knowledge. It is further complicated because, in practice, most
questions are underspecified, and a human assistant will regularly have to ask
clarification questions such as "How long have you been working abroad?" when
the answer cannot be directly derived from the question and text. In this
paper, we formalise this task and develop a crowd-sourcing strategy to collect
32k task instances based on real-world rules and crowd-generated questions and
scenarios. We analyse the challenges of this task and assess its difficulty by
evaluating the performance of rule-based and machine-learning baselines. We
observe promising results when no background knowledge is necessary, and
substantial room for improvement whenever background knowledge is needed.
Comment: EMNLP 2018
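To make the task format concrete, here is an illustrative sketch of what a single task instance might contain, based only on the description above; the class and field names are assumptions, not the released dataset's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RuleInterpretationInstance:
    """One conversational machine-reading example: a rule text, the user's
    scenario and question, the dialogue so far, and the expected response,
    which is either a final Yes/No answer or a clarification question."""
    rule_text: str                   # excerpt from a real-world regulation
    scenario: str                    # background the user gives about themselves
    question: str                    # the "Can I...?" / "Do I have to...?" question
    history: list[tuple[str, str]]   # prior (clarification question, user answer) turns
    answer: Optional[str]            # "Yes" / "No" when decidable, else None
    followup: Optional[str]          # clarification question when underspecified

example = RuleInterpretationInstance(
    rule_text="You may need to pay UK National Insurance while working abroad...",
    scenario="I am working in Canada.",
    question="Do I have to carry on paying UK National Insurance?",
    history=[],
    answer=None,
    followup="How long have you been working abroad?",
)
```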
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are “fantastic” and some not. We analyse this phenomenon in detail, establishing that: it is present across model sizes (even for the largest current models), it is not related to a specific subset of samples, and that a given good permutation for one model is not transferable to another. While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, we use the generative nature of language models to construct an artificial development set and, based on entropy statistics of the candidate permutations on this set, we identify performant prompts. Our method yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.
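The permutation-selection idea can be sketched as follows. This simplifies the paper's entropy statistics to a single global-entropy score, and both `predict_label` (a wrapper around the language model) and the construction of the artificial probing set are assumed rather than taken from the paper's code.

```python
import itertools
import math
from collections import Counter

def global_entropy(perm, probe_inputs, predict_label):
    """Entropy of the predicted-label distribution over the artificial probing
    set when the few-shot examples are presented in order `perm`; higher
    entropy means the prompt is less biased toward a single label."""
    preds = [predict_label(perm, x) for x in probe_inputs]
    counts = Counter(preds)
    total = len(preds)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_prompt_orders(train_examples, probe_inputs, predict_label, k=4):
    """Rank every permutation of the few-shot examples by global entropy on
    the probing set and keep the k best orderings. Enumerating all
    permutations is only feasible for a handful of training samples."""
    candidates = list(itertools.permutations(train_examples))
    return sorted(
        candidates,
        key=lambda p: global_entropy(p, probe_inputs, predict_label),
        reverse=True,
    )[:k]
```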
Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation
Despite recent progress, state-of-the-art question answering models remain
vulnerable to a variety of adversarial attacks. While dynamic adversarial data
collection, in which a human annotator tries to write examples that fool a
model-in-the-loop, can improve model robustness, this process is expensive,
which limits the scale of the collected data. In this work, we are the first to
use synthetic adversarial data generation to make question answering models
more robust to human adversaries. We develop a data generation pipeline that
selects source passages, identifies candidate answers, generates questions,
then finally filters or re-labels them to improve quality. Using this approach,
we amplify a smaller human-written adversarial dataset to a much larger set of
synthetic question-answer pairs. By incorporating our synthetic data, we
improve the state-of-the-art on the AdversarialQA dataset by 3.7F1 and improve
model generalisation on nine of the twelve MRQA datasets. We further conduct a
novel human-in-the-loop evaluation to show that our models are considerably
more robust to new human-written adversarial examples: crowdworkers can fool
our model only 8.8% of the time on average, compared to 17.6% for a model
trained without synthetic data.
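The four pipeline stages named in the abstract can be sketched as a simple composition of callables; the function names and record fields below are illustrative assumptions, not the authors' released code.

```python
def synthetic_adversarial_pipeline(corpus, select_passages, extract_answers,
                                   generate_question, qa_filter):
    """Sketch of the four-stage pipeline described above: passage selection,
    candidate-answer identification, question generation, and
    filtering/re-labelling. Each stage is supplied as a callable."""
    synthetic_examples = []
    for passage in select_passages(corpus):
        for answer in extract_answers(passage):
            question = generate_question(passage, answer)
            keep, final_answer = qa_filter(passage, question, answer)
            if keep:
                synthetic_examples.append(
                    {"context": passage, "question": question, "answer": final_answer}
                )
    return synthetic_examples
```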
Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension
Innovations in annotation methodology have been a catalyst for Reading
Comprehension (RC) datasets and models. One recent trend to challenge current
RC models is to involve a model in the annotation process: humans create
questions adversarially, such that the model fails to answer them correctly. In
this work we investigate this annotation methodology and apply it in three
different settings, collecting a total of 36,000 samples with progressively
stronger models in the annotation loop. This allows us to explore questions
such as the reproducibility of the adversarial effect, transfer from data
collected with varying model-in-the-loop strengths, and generalisation to data
collected without a model. We find that training on adversarially collected
samples leads to strong generalisation to non-adversarially collected datasets,
yet with progressive performance deterioration with increasingly stronger
models-in-the-loop. Furthermore, we find that stronger models can still learn
from datasets collected with substantially weaker models-in-the-loop. When
trained on data collected with a BiDAF model in the loop, RoBERTa achieves
39.9F1 on questions that it cannot answer when trained on SQuAD - only
marginally lower than when trained on data collected using RoBERTa itself
(41.0F1).
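A minimal sketch of the model-in-the-loop acceptance check implied by this setup: a crowdworker's question is kept as adversarial only when the model's answer overlaps poorly with the annotator's answer. The word-overlap F1 is standard, but the 0.4 threshold and the `model(passage, question)` interface are assumptions for illustration.

```python
from collections import Counter

def word_f1(pred, gold):
    """Token-overlap F1 between the model's answer and the annotator's answer."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def accept_adversarial_example(passage, question, human_answer, model, threshold=0.4):
    """Keep the question only if the model-in-the-loop fails on it, i.e. its
    answer has low word overlap with the human answer (threshold is assumed)."""
    model_answer = model(passage, question)
    return word_f1(model_answer, human_answer) < threshold
```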
Dynabench: Rethinking Benchmarking in NLP
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.
Observing the Evolution of the Universe
How did the universe evolve? The fine angular scale (l>1000) temperature and
polarization anisotropies in the CMB are a Rosetta stone for understanding the
evolution of the universe. Through detailed measurements one may address
everything from the physics of the birth of the universe to the history of star
formation and the process by which galaxies formed. One may in addition track
the evolution of the dark energy and discover the net neutrino mass.
We are at the dawn of a new era in which hundreds of square degrees of sky
can be mapped with arcminute resolution and sensitivities measured in
microKelvin. Acquiring these data requires the use of special purpose
telescopes such as the Atacama Cosmology Telescope (ACT), located in Chile, and
the South Pole Telescope (SPT). These new telescopes are outfitted with a new
generation of custom mm-wave kilo-pixel arrays. Additional instruments are in
the planning stages.
Comment: Science White Paper submitted to the US Astro2010 Decadal Survey.
Full list of 177 authors available at http://cmbpol.uchicago.edu
DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and
meetings prior, in this report we outline the relevance of community engagement
and infrastructure development for the creation of next-generation public
datasets that will advance machine learning science. We chart a path forward as
a collective effort to sustain the creation and maintenance of these datasets
and methods towards positive scientific, societal and business impact.
Comment: This editorial report accompanies the inaugural Data-centric Machine Learning Research (DMLR) Workshop that took place at ICML 2023 (https://dmlr.ai).