An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering
To produce a domain-agnostic question answering model for the Machine Reading
Question Answering (MRQA) 2019 Shared Task, we investigate the relative
benefits of large pre-trained language models, various data sampling
strategies, as well as query and context paraphrases generated by
back-translation. We find a simple negative sampling technique to be
particularly effective, even though it is typically used for datasets that
include unanswerable questions, such as SQuAD 2.0. When applied in conjunction
with per-domain sampling, our XLNet (Yang et al., 2019)-based submission
achieved the second-best Exact Match and F1 scores in the MRQA leaderboard
competition.
Comment: Accepted at the 2nd Workshop on Machine Reading for Question Answering.
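The abstract does not spell out the negative sampling scheme; below is a minimal sketch of one plausible SQuAD 2.0-style variant, in which questions are paired with mismatched contexts to form unanswerable negatives. The field names and sampling ratio are hypothetical, not the paper's exact recipe.

```python
import random

def add_negative_samples(examples, neg_ratio=0.1, seed=0):
    """Create unanswerable negatives by pairing questions with
    mismatched contexts (a hypothetical scheme for illustration)."""
    rng = random.Random(seed)
    negatives = []
    for ex in rng.sample(examples, int(len(examples) * neg_ratio)):
        other = rng.choice(examples)
        if other["context"] == ex["context"]:
            continue  # skip accidental same-context pairings
        negatives.append({
            "question": ex["question"],
            "context": other["context"],
            "answers": [],            # no gold span: the model should abstain
            "is_impossible": True,
        })
    return examples + negatives
```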
The Foundation Model Transparency Index
Foundation models have rapidly permeated society, catalyzing a wave of
generative AI applications spanning enterprise and consumer-facing contexts.
While the societal impact of foundation models is growing, transparency is on
the decline, mirroring the opacity that has plagued past digital technologies
(e.g. social media). Reversing this trend is essential: transparency is a vital
precondition for public accountability, scientific innovation, and effective
governance. To assess the transparency of the foundation model ecosystem and
help improve transparency over time, we introduce the Foundation Model
Transparency Index. The Foundation Model Transparency Index specifies 100
fine-grained indicators that comprehensively codify transparency for foundation
models, spanning the upstream resources used to build a foundation model (e.g.
data, labor, compute), details about the model itself (e.g. size, capabilities,
risks), and the downstream use (e.g. distribution channels, usage policies,
affected geographies). We score 10 major foundation model developers (e.g.
OpenAI, Google, Meta) against the 100 indicators to assess their transparency.
To facilitate and standardize assessment, we score developers in relation to
their practices for their flagship foundation model (e.g. GPT-4 for OpenAI,
PaLM 2 for Google, Llama 2 for Meta). We present 10 top-level findings about
the foundation model ecosystem: for example, no developer currently discloses
significant information about the downstream impact of its flagship model, such
as the number of users, affected market sectors, or how users can seek redress
for harm. Overall, the Foundation Model Transparency Index establishes the
level of transparency today to drive progress on foundation model governance
via industry standards and regulatory intervention.
Comment: Authored by the Center for Research on Foundation Models (CRFM) at
the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
Project page: https://crfm.stanford.edu/fmt
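At its simplest, aggregating the 100 indicators into a developer score reduces to a satisfied-indicator count; a minimal sketch, assuming binary judgments (the actual Index groups indicators into upstream, model, and downstream domains):

```python
def transparency_score(indicator_results):
    """Turn binary indicator judgments into a 0-100 score.

    `indicator_results` maps indicator name -> bool (satisfied or not).
    A simplified aggregation for illustration only, not the Index's
    published methodology.
    """
    satisfied = sum(indicator_results.values())
    return 100.0 * satisfied / len(indicator_results)

# Example: a developer satisfying 54 of the 100 indicators scores 54.0.
```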
OctoPack: Instruction Tuning Code Large Language Models
Finetuning large language models (LLMs) on instructions leads to vast
performance improvements on natural language tasks. We apply instruction tuning
using code, leveraging the natural structure of Git commits, which pair code
changes with human instructions. We compile CommitPack: 4 terabytes of Git
commits across 350 programming languages. We benchmark CommitPack against other
natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B
parameter StarCoder model, and achieve state-of-the-art performance among
models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2%
pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark
to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis)
across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models,
OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among
all permissive models, demonstrating CommitPack's benefits in generalizing to a
wider set of languages and natural coding tasks. Code, models and data are
freely available at https://github.com/bigcode-project/octopack.
Comment: 57 pages (9 main), 39 figures, 16 tables.
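The core data construction, pairing a commit message with the code before and after the change, can be sketched as below. The field names are illustrative rather than CommitPack's released schema, and a single-file commit is assumed:

```python
def commit_to_example(commit_message, code_before, code_after):
    """Convert one Git commit into an instruction-tuning triple.

    Assumes a single-file commit; the field names are illustrative,
    not the exact CommitPack schema.
    """
    return {
        "instruction": commit_message,  # the human-written change description
        "input": code_before,           # file contents before the commit
        "output": code_after,           # file contents after the commit
    }
```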
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Pretraining is the preliminary and fundamental step in developing capable
language models (LMs). Despite this, pretraining data design is critically
under-documented and often guided by empirically unsupported intuitions. To
address this, we pretrain 28 1.5B-parameter decoder-only models, training on
data curated (1) at different times, (2) with varying toxicity and quality
filters, and (3) with different domain compositions. First, we quantify the
effect of pretraining data age. A temporal shift between evaluation data and
pretraining data leads to performance degradation, which is not overcome by
finetuning. Second, we explore the effect of quality and toxicity filters,
showing a trade-off between performance on standard benchmarks and risk of
toxic generations. Our findings indicate there does not exist a
one-size-fits-all solution to filtering training data. We also find that the
effects of different types of filtering are not predictable from text domain
characteristics. Lastly, we empirically validate that the inclusion of
heterogeneous data sources, such as books and the web, is broadly beneficial and
warrants greater prioritization. These findings constitute the largest set of
experiments to validate, quantify, and expose many undocumented intuitions
about text pretraining, which we hope will help support more informed
data-centric decisions in LM development.
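The quality and toxicity filters studied here amount to thresholding scorer outputs over candidate documents; a minimal sketch, where the scoring functions and thresholds are placeholders (the paper's finding is precisely that no single setting fits all downstream tasks):

```python
def filter_pretraining_docs(docs, quality_fn, toxicity_fn,
                            min_quality=0.5, max_toxicity=0.2):
    """Keep documents passing both a quality and a toxicity threshold.

    `quality_fn` and `toxicity_fn` stand in for classifier scorers;
    the thresholds are illustrative, not values from the paper.
    """
    return [
        doc for doc in docs
        if quality_fn(doc) >= min_quality and toxicity_fn(doc) <= max_toxicity
    ]
```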
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
We study the design decisions of publicly available instruction tuning
methods, and break down the development of Flan 2022 (Chung et al., 2022).
Through careful ablation studies on the Flan Collection of tasks and methods,
we tease apart the effect of design decisions which enable Flan-T5 to
outperform prior work by 3-17%+ across evaluation settings. We find task
balancing and enrichment techniques are overlooked but critical to effective
instruction tuning, and in particular, training with mixed prompt settings
(zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+)
performance in all settings. In further experiments, we show that Flan-T5
converges higher and faster than T5 on single downstream tasks while requiring
less finetuning, motivating instruction-tuned models as more computationally efficient
starting checkpoints for new tasks. Finally, to accelerate research on
instruction tuning, we make the Flan 2022 collection of datasets, templates,
and methods publicly available at
https://github.com/google-research/FLAN/tree/main/flan/v2
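The mixed-prompt-settings finding can be illustrated with a small sampler that renders each training example as a zero-shot, few-shot, or chain-of-thought prompt with some probability; the weights below are placeholders, not the Flan 2022 mixture proportions:

```python
import random

def mix_prompt_settings(task_examples, weights=(0.5, 0.3, 0.2), seed=0):
    """Assign each example a prompt setting at random.

    A minimal sketch of training with mixed prompt settings; the
    weights are illustrative, not the Flan 2022 values.
    """
    rng = random.Random(seed)
    settings = ["zero_shot", "few_shot", "chain_of_thought"]
    return [
        {"example": ex,
         "prompt_setting": rng.choices(settings, weights=weights, k=1)[0]}
        for ex in task_examples
    ]
```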