LMentry: A Language Model Benchmark of Elementary Language Tasks
As the performance of large language models rapidly improves, benchmarks are
getting larger and more complex as well. We present LMentry, a benchmark that
avoids this "arms race" by focusing on a compact set of tasks that are trivial
to humans, e.g. writing a sentence containing a specific word, identifying
which words in a list belong to a specific category, or choosing which of two
words is longer. LMentry is specifically designed to provide quick and
interpretable insights into the capabilities and robustness of large language
models. Our experiments reveal a wide variety of failure cases that, while
immediately obvious to humans, pose a considerable challenge for large language
models, including OpenAI's latest 175B-parameter instruction-tuned model,
TextDavinci002. LMentry complements contemporary evaluation approaches of large
language models, providing a quick, automatic, and easy-to-run "unit test",
without resorting to large benchmark suites of complex tasks.
Comment: 24 pages, 2 figures
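To make the "unit test" framing above concrete, the sketch below shows how automatic checks for two of the example tasks might look. It is an illustrative approximation only: the function names and the whole-word regex are assumptions, not LMentry's actual task templates or scoring rules.

```python
import re

def check_sentence_contains_word(model_output: str, word: str) -> bool:
    """Illustrative check for the 'write a sentence containing <word>' task:
    the output must contain the target word as a whole word."""
    return re.search(rf"\b{re.escape(word)}\b", model_output, re.IGNORECASE) is not None

def check_longer_word(model_output: str, word_a: str, word_b: str) -> bool:
    """Illustrative check for the 'which of two words is longer' task:
    the output must name the longer word (ties not handled here)."""
    longer = word_a if len(word_a) > len(word_b) else word_b
    return longer.lower() in model_output.lower()

# Toy usage: score a few canned model outputs.
print(check_sentence_contains_word("The cat sat on the windowsill.", "windowsill"))  # True
print(check_longer_word("The longer word is 'elephant'.", "elephant", "cat"))        # True
```

Checks of this kind are what keep the benchmark quick to run and easy to interpret: each task has an unambiguous pass/fail criterion.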
How Optimal is Greedy Decoding for Extractive Question Answering?
Fine-tuned language models use greedy decoding to answer reading
comprehension questions with relative success. However, this approach does not
ensure that the answer is a span in the given passage, nor does it guarantee
that it is the most probable one. Does greedy decoding actually perform worse
than an algorithm that does adhere to these properties? To study the
performance and optimality of greedy decoding, we present exact-extract, a
decoding algorithm that efficiently finds the most probable answer span in the
context. We compare the performance of T5 with both decoding algorithms on
zero-shot and few-shot extractive question answering. When no training examples
are available, exact-extract significantly outperforms greedy decoding.
However, greedy decoding quickly converges towards the performance of
exact-extract with the introduction of a few training examples, becoming more
extractive and increasingly likely to generate the most probable span as the
training set grows. We also show that self-supervised training can bias the
model towards extractive behavior, increasing performance in the zero-shot
setting without resorting to annotated examples. Overall, our results suggest
that pretrained language models are so good at adapting to extractive question
answering that it is often enough to fine-tune on a small training set for the
greedy algorithm to emulate the optimal decoding strategy.
Comment: AKBC 2022, 12 pages, 3 figures
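The paper's exact-extract algorithm finds the most probable span in the context efficiently; the sketch below conveys only the underlying objective in the most naive way, scoring every candidate span with a seq2seq model and returning the argmax. The checkpoint name, prompt format, and span-length cap are placeholder assumptions, and the brute-force loop is far less efficient than the authors' algorithm.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder checkpoint and prompt format, assumed for illustration only.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

def span_log_prob(question: str, context: str, span: str) -> float:
    """Total log-probability of generating `span` given question and context."""
    inputs = tokenizer(f"question: {question} context: {context}", return_tensors="pt")
    labels = tokenizer(span, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=labels)
    # out.loss is the mean cross-entropy per label token; undo the mean and
    # negate it to recover the span's total log-probability.
    return -out.loss.item() * labels.shape[1]

def brute_force_extract(question: str, context: str, max_words: int = 10) -> str:
    """Return the context span (up to `max_words` words) with the highest score.
    Enumerates spans naively, one forward pass each -- purely illustrative."""
    words = context.split()
    best_span, best_score = "", float("-inf")
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_words, len(words)) + 1):
            span = " ".join(words[i:j])
            score = span_log_prob(question, context, span)
            if score > best_score:
                best_span, best_score = span, score
    return best_span

context = "Marie Curie was born in Warsaw and later moved to Paris."
print(brute_force_extract("Where was Marie Curie born?", context))
```

Because every candidate comes from the passage, the returned answer is extractive by construction, which is exactly the property greedy decoding does not guarantee.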
ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding
We introduce ZeroSCROLLS, a zero-shot benchmark for natural language
understanding over long texts, which contains only test and small validation
sets, without training data. We adapt six tasks from the SCROLLS benchmark, and
add four new datasets, including two novel information-fusing tasks, such as
aggregating the percentage of positive reviews. Using ZeroSCROLLS, we conduct a
comprehensive evaluation of both open-source and closed large language models,
finding that Claude outperforms ChatGPT, and that GPT-4 achieves the highest
average score. However, there is still room for improvement on multiple open
challenges in ZeroSCROLLS, such as aggregation tasks, where models struggle to
pass the naive baseline. As the state of the art is a moving target, we invite
researchers to evaluate their ideas on the live ZeroSCROLLS leaderboard.
Comment: Findings of EMNLP 2023
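To give a sense of what an aggregation task demands, the toy example below computes the kind of gold answer behind the positive-review percentage task. The field names and rounding are assumptions for illustration and do not reflect ZeroSCROLLS's actual data schema or metric; a model evaluated zero-shot sees only the review texts and must both judge each review's sentiment and aggregate the counts itself.

```python
# Toy illustration of the aggregation target, with assumed field names.
reviews = [
    {"text": "Great build quality, would buy again.", "sentiment": "positive"},
    {"text": "Stopped working after a week.", "sentiment": "negative"},
    {"text": "Does exactly what it promises.", "sentiment": "positive"},
    {"text": "Shipping was slow but the product is fine.", "sentiment": "positive"},
]

positive = sum(r["sentiment"] == "positive" for r in reviews)
gold_percentage = round(100 * positive / len(reviews))
print(f"{gold_percentage}% of the reviews are positive")  # 75%
```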
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Language models demonstrate both quantitative improvement and new qualitative
capabilities with increasing scale. Despite their potentially transformative
impact, these new capabilities are as yet poorly characterized. In order to
inform future research, prepare for disruptive new model capabilities, and
ameliorate socially harmful effects, it is vital that we understand the present
and near-future capabilities and limitations of language models. To address
this challenge, we introduce the Beyond the Imitation Game benchmark
(BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442
authors across 132 institutions. Task topics are diverse, drawing problems from
linguistics, childhood development, math, common-sense reasoning, biology,
physics, social bias, software development, and beyond. BIG-bench focuses on
tasks that are believed to be beyond the capabilities of current language
models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense
transformer architectures, and Switch-style sparse transformers on BIG-bench,
across model sizes spanning millions to hundreds of billions of parameters. In
addition, a team of human expert raters performed all tasks in order to provide
a strong baseline. Findings include: model performance and calibration both
improve with scale, but are poor in absolute terms (and when compared with
rater performance); performance is remarkably similar across model classes,
though with benefits from sparsity; tasks that improve gradually and
predictably commonly involve a large knowledge or memorization component,
whereas tasks that exhibit "breakthrough" behavior at a critical scale often
involve multiple steps or components, or brittle metrics; social bias typically
increases with scale in settings with ambiguous context, but this can be
improved with prompting.
Comment: 27 pages, 17 figures + references and appendices, repo:
https://github.com/google/BIG-bench
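As a rough sketch of how one might run a model over a BIG-bench-style task, the code below scores exact-match accuracy on a simplified task dictionary. The "examples"/"input"/"target" layout and the `generate` callable are assumptions made for this sketch; BIG-bench's own task API supports programmatic tasks, multiple-choice targets, and richer metrics that are omitted here.

```python
from typing import Callable, Dict, List

def evaluate_task(task: Dict, generate: Callable[[str], str]) -> float:
    """Exact-match accuracy on a simplified task dictionary.

    Assumed layout (a rough approximation of a BIG-bench JSON task):
        {"examples": [{"input": "...", "target": "..."}, ...]}
    """
    examples: List[Dict] = task["examples"]
    hits = sum(
        generate(ex["input"]).strip().lower() == ex["target"].strip().lower()
        for ex in examples
    )
    return hits / len(examples)

# Toy usage with an inline task and a trivial stand-in "model".
toy_task = {
    "examples": [
        {"input": "What is 2 + 2?", "target": "4"},
        {"input": "What is the capital of France?", "target": "Paris"},
    ]
}
always_four = lambda prompt: "4"  # placeholder model that ignores its input
print(evaluate_task(toy_task, always_four))  # 0.5
```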