Human Feedback is not Gold Standard
Human feedback has become the de facto standard for evaluating the
performance of Large Language Models, and is increasingly being used as a
training objective. However, it is not clear which properties of a generated
output this single `preference' score captures. We hypothesise that preference
scores are subjective and open to undesirable biases. We critically analyse the
use of human feedback for both training and evaluation, to verify whether it
fully captures a range of crucial error criteria. We find that while preference
scores have fairly good coverage, they under-represent important aspects like
factuality. We further hypothesise that both preference scores and error
annotation may be affected by confounders, and leverage instruction-tuned
models to generate outputs that vary along two possible confounding dimensions:
assertiveness and complexity. We find that the assertiveness of an output skews
the perceived rate of factuality errors, indicating that human annotations are
not a fully reliable evaluation metric or training objective. Finally, we offer
preliminary evidence that using human feedback as a training objective
disproportionately increases the assertiveness of model outputs. We encourage
future work to carefully consider whether preference scores are well aligned
with the desired objective.
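The variation along the two confounding dimensions can be pictured as prompting an instruction-tuned model with contrasting style instructions. Below is a minimal, hypothetical sketch of that idea; the prompt wording, the dimension values, and the `query_model` helper are illustrative assumptions, not the authors' actual setup.

```python
# Hypothetical sketch: generate answer variants that differ only in
# assertiveness and complexity, so their effect on human judgements can be
# compared. `query_model` stands in for any instruction-tuned LLM API.

ASSERTIVENESS = {
    "hedged": "Answer cautiously, hedging any claim you are not sure about.",
    "assertive": "Answer confidently and assertively, without hedging.",
}
COMPLEXITY = {
    "simple": "Use short sentences and plain, everyday vocabulary.",
    "complex": "Use longer sentences and technical, formal vocabulary.",
}

def generate_variants(question, query_model):
    """Return one answer per (assertiveness, complexity) combination."""
    variants = {}
    for a_name, a_instr in ASSERTIVENESS.items():
        for c_name, c_instr in COMPLEXITY.items():
            prompt = f"{a_instr} {c_instr}\n\nQuestion: {question}\nAnswer:"
            variants[(a_name, c_name)] = query_model(prompt)
    return variants
```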
Interpretation of Natural Language Rules in Conversational Machine Reading
Most work in machine reading focuses on question answering problems where the
answer is directly expressed in the text to read. However, many real-world
question answering problems require the reading of text not because it contains
the literal answer, but because it contains a recipe to derive an answer
together with the reader's background knowledge. One example is the task of
interpreting regulations to answer "Can I...?" or "Do I have to...?" questions
such as "I am working in Canada. Do I have to carry on paying UK National
Insurance?" after reading a UK government website about this topic. This task
requires both the interpretation of rules and the application of background
knowledge. It is further complicated because, in practice, most
questions are underspecified, and a human assistant will regularly have to ask
clarification questions such as "How long have you been working abroad?" when
the answer cannot be directly derived from the question and text. In this
paper, we formalise this task and develop a crowd-sourcing strategy to collect
32k task instances based on real-world rules and crowd-generated questions and
scenarios. We analyse the challenges of this task and assess its difficulty by
evaluating the performance of rule-based and machine-learning baselines. We
observe promising results when no background knowledge is necessary, and
substantial room for improvement whenever background knowledge is needed.
Comment: EMNLP 2018
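To make the task format concrete, here is an illustrative sketch of what a single task instance might contain, based only on the description above; the class and field names are assumptions, not the released dataset's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RuleInterpretationInstance:
    """One conversational machine-reading example: a rule text, the user's
    scenario and question, the dialogue so far, and the expected response,
    which is either a final Yes/No answer or a clarification question."""
    rule_text: str                   # excerpt from a real-world regulation
    scenario: str                    # background the user gives about themselves
    question: str                    # the "Can I...?" / "Do I have to...?" question
    history: list[tuple[str, str]]   # prior (clarification question, user answer) turns
    answer: Optional[str]            # "Yes" / "No" when decidable, else None
    followup: Optional[str]          # clarification question when underspecified

example = RuleInterpretationInstance(
    rule_text="You may need to pay UK National Insurance while working abroad...",
    scenario="I am working in Canada.",
    question="Do I have to carry on paying UK National Insurance?",
    history=[],
    answer=None,
    followup="How long have you been working abroad?",
)
```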
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are “fantastic” and some not. We analyse this phenomenon in detail, establishing that: it is present across model sizes (even for the largest current models), it is not related to a specific subset of samples, and that a given good permutation for one model is not transferable to another. While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, we use the generative nature of language models to construct an artificial development set and, based on entropy statistics of the candidate permutations on this set, we identify performant prompts. Our method yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.
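The permutation-selection idea can be sketched as follows. This simplifies the paper's entropy statistics to a single global-entropy score, and both `predict_label` (a wrapper around the language model) and the construction of the artificial probing set are assumed rather than taken from the paper's code.

```python
import itertools
import math
from collections import Counter

def global_entropy(perm, probe_inputs, predict_label):
    """Entropy of the predicted-label distribution over the artificial probing
    set when the few-shot examples are presented in order `perm`; higher
    entropy means the prompt is less biased toward a single label."""
    preds = [predict_label(perm, x) for x in probe_inputs]
    counts = Counter(preds)
    total = len(preds)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_prompt_orders(train_examples, probe_inputs, predict_label, k=4):
    """Rank every permutation of the few-shot examples by global entropy on
    the probing set and keep the k best orderings. Enumerating all
    permutations is only feasible for a handful of training samples."""
    candidates = list(itertools.permutations(train_examples))
    return sorted(
        candidates,
        key=lambda p: global_entropy(p, probe_inputs, predict_label),
        reverse=True,
    )[:k]
```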
Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation
Despite recent progress, state-of-the-art question answering models remain
vulnerable to a variety of adversarial attacks. While dynamic adversarial data
collection, in which a human annotator tries to write examples that fool a
model-in-the-loop, can improve model robustness, this process is expensive,
which limits the scale of the collected data. In this work, we are the first to
use synthetic adversarial data generation to make question answering models
more robust to human adversaries. We develop a data generation pipeline that
selects source passages, identifies candidate answers, generates questions,
then finally filters or re-labels them to improve quality. Using this approach,
we amplify a smaller human-written adversarial dataset to a much larger set of
synthetic question-answer pairs. By incorporating our synthetic data, we
improve the state-of-the-art on the AdversarialQA dataset by 3.7F1 and improve
model generalisation on nine of the twelve MRQA datasets. We further conduct a
novel human-in-the-loop evaluation to show that our models are considerably
more robust to new human-written adversarial examples: crowdworkers can fool
our model only 8.8% of the time on average, compared to 17.6% for a model
trained without synthetic data.
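The four pipeline stages named in the abstract can be sketched as a simple composition of callables; the function names and record fields below are illustrative assumptions, not the authors' released code.

```python
def synthetic_adversarial_pipeline(corpus, select_passages, extract_answers,
                                   generate_question, qa_filter):
    """Sketch of the four-stage pipeline described above: passage selection,
    candidate-answer identification, question generation, and
    filtering/re-labelling. Each stage is supplied as a callable."""
    synthetic_examples = []
    for passage in select_passages(corpus):
        for answer in extract_answers(passage):
            question = generate_question(passage, answer)
            keep, final_answer = qa_filter(passage, question, answer)
            if keep:
                synthetic_examples.append(
                    {"context": passage, "question": question, "answer": final_answer}
                )
    return synthetic_examples
```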
Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension
Innovations in annotation methodology have been a catalyst for Reading
Comprehension (RC) datasets and models. One recent trend to challenge current
RC models is to involve a model in the annotation process: humans create
questions adversarially, such that the model fails to answer them correctly. In
this work we investigate this annotation methodology and apply it in three
different settings, collecting a total of 36,000 samples with progressively
stronger models in the annotation loop. This allows us to explore questions
such as the reproducibility of the adversarial effect, transfer from data
collected with varying model-in-the-loop strengths, and generalisation to data
collected without a model. We find that training on adversarially collected
samples leads to strong generalisation to non-adversarially collected datasets,
yet with progressive performance deterioration with increasingly stronger
models-in-the-loop. Furthermore, we find that stronger models can still learn
from datasets collected with substantially weaker models-in-the-loop. When
trained on data collected with a BiDAF model in the loop, RoBERTa achieves
39.9F1 on questions that it cannot answer when trained on SQuAD - only
marginally lower than when trained on data collected using RoBERTa itself
(41.0F1).
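A minimal sketch of the model-in-the-loop acceptance check implied by this setup: a crowdworker's question is kept as adversarial only when the model's answer overlaps poorly with the annotator's answer. The word-overlap F1 is standard, but the 0.4 threshold and the `model(passage, question)` interface are assumptions for illustration.

```python
from collections import Counter

def word_f1(pred, gold):
    """Token-overlap F1 between the model's answer and the annotator's answer."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def accept_adversarial_example(passage, question, human_answer, model, threshold=0.4):
    """Keep the question only if the model-in-the-loop fails on it, i.e. its
    answer has low word overlap with the human answer (threshold is assumed)."""
    model_answer = model(passage, question)
    return word_f1(model_answer, human_answer) < threshold
```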
Dynabench: Rethinking Benchmarking in NLP
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.
Observing the Evolution of the Universe
How did the universe evolve? The fine angular scale (l>1000) temperature and
polarization anisotropies in the CMB are a Rosetta stone for understanding the
evolution of the universe. Through detailed measurements one may address
everything from the physics of the birth of the universe to the history of star
formation and the process by which galaxies formed. One may in addition track
the evolution of the dark energy and discover the net neutrino mass.
We are at the dawn of a new era in which hundreds of square degrees of sky
can be mapped with arcminute resolution and sensitivities measured in
microKelvin. Acquiring these data requires the use of special purpose
telescopes such as the Atacama Cosmology Telescope (ACT), located in Chile, and
the South Pole Telescope (SPT). These new telescopes are outfitted with a new
generation of custom mm-wave kilo-pixel arrays. Additional instruments are in
the planning stages.
Comment: Science White Paper submitted to the US Astro2010 Decadal Survey.
Full list of 177 authors available at http://cmbpol.uchicago.edu
DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and
meetings prior, in this report we outline the relevance of community engagement
and infrastructure development for the creation of next-generation public
datasets that will advance machine learning science. We chart a path forward as
a collective effort to sustain the creation and maintenance of these datasets
and methods towards positive scientific, societal and business impact.
Comment: This editorial report accompanies the inaugural Data-centric Machine Learning Research (DMLR) Workshop that took place at ICML 2023 (https://dmlr.ai).