6 research outputs found
Emergent and Predictable Memorization in Large Language Models
Memorization, or the tendency of large language models (LLMs) to output
entire sequences from their training data verbatim, is a key concern for safely
deploying language models. In particular, it is vital to minimize a model's
memorization of sensitive datapoints, such as those containing personally
identifiable information (PII). The prevalence of such undesirable memorization
can pose issues for model trainers, and may even require discarding an
otherwise functional model. We therefore seek to predict which sequences will
be memorized before a large model is fully trained, by extrapolating the
memorization behavior of lower-compute trial runs. We measure memorization in
the Pythia model suite and find that intermediate checkpoints are better
predictors of a model's memorization behavior than smaller fully-trained
models. We additionally report novel findings on the distribution of
memorization scores across models and data.
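As an illustration of the measurement involved, here is a minimal sketch of a
verbatim-memorization check against a small Pythia checkpoint, assuming the
EleutherAI releases on the Hugging Face Hub. The 32-token prompt/continuation
split and the fraction-of-matched-tokens score reflect a common setup, but the
evaluation details are simplified here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Intermediate training checkpoints are published as revisions, e.g.
# AutoModelForCausalLM.from_pretrained(MODEL, revision="step36000").
MODEL = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def memorization_score(token_ids, k=32):
    """Fraction of the k continuation tokens reproduced by greedy decoding
    when the model is shown the first k tokens of a training sequence."""
    prompt = torch.tensor([token_ids[:k]])
    target = token_ids[k:2 * k]
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=k, do_sample=False)
    generated = out[0, k:].tolist()  # generate() returns prompt + new tokens
    return sum(g == t for g, t in zip(generated, target)) / k

# Usage on any passage of 64+ tokens drawn from the training corpus:
# ids = tokenizer(passage)["input_ids"]; a score of 1.0 indicates full
# verbatim memorization of the continuation.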
GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
Noting the urgent need for fast and user-friendly qualitative analysis of the
large-scale textual corpora of modern NLP, we propose turning to mature and
well-tested methods from the domain of Information Retrieval (IR) - a research
field with a long history of tackling TB-scale document collections. We discuss
how Pyserini - a widely used toolkit for reproducible IR research - can be
integrated with the Hugging Face ecosystem of open-source AI libraries and
artifacts. We leverage the existing functionalities of both platforms while
proposing novel features that further facilitate their integration. Our goal is
to give NLP researchers tools that let them develop retrieval-based
instrumentation for their data analytics needs with ease and agility. We
include a Jupyter Notebook-based walkthrough of the core interoperability
features, available on GitHub at https://github.com/huggingface/gaia. We then
demonstrate how these ideas can be operationalized to create a powerful tool
for qualitative data analysis in NLP. We present GAIA Search - a search engine
built following the principles laid out above, giving access to four popular
large-scale text collections. GAIA serves a dual purpose: it illustrates the
potential of the methodologies we discuss, and it stands alone as a qualitative
analysis tool that NLP researchers can use to understand datasets prior to
training on them. GAIA is hosted live on Hugging Face Spaces -
https://huggingface.co/spaces/spacerini/gaia
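To make the interoperability concrete, here is a hedged sketch of the round
trip described above: export a Hugging Face dataset into Pyserini's JSON
collection format, build a Lucene index with Pyserini's CLI, and query it with
BM25. The dataset name, paths, and parameters are illustrative, not those used
by GAIA.

import json
import os
import subprocess
from datasets import load_dataset
from pyserini.search.lucene import LuceneSearcher

# 1) Export any HF text dataset to Pyserini's {"id", "contents"} JSONL format.
ds = load_dataset("imdb", split="train")  # illustrative corpus
os.makedirs("corpus", exist_ok=True)
with open("corpus/docs.jsonl", "w") as f:
    for i, row in enumerate(ds):
        f.write(json.dumps({"id": str(i), "contents": row["text"]}) + "\n")

# 2) Build a Lucene index with Pyserini's indexing CLI (requires Java).
subprocess.run([
    "python", "-m", "pyserini.index.lucene",
    "--collection", "JsonCollection",
    "--input", "corpus",
    "--index", "indexes/demo",
    "--generator", "DefaultLuceneDocumentGenerator",
    "--threads", "4",
    "--storeRaw",
], check=True)

# 3) Query with BM25 and inspect the top hits.
searcher = LuceneSearcher("indexes/demo")
for hit in searcher.search("a query about the corpus", k=5):
    print(hit.docid, hit.score)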
BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
The BLOOM model is a large publicly available multilingual language model,
but its pretraining was limited to 46 languages. To extend the benefits of
BLOOM to other languages without incurring prohibitively large costs, it is
desirable to adapt BLOOM to new languages not seen during pretraining. In this
work, we apply existing language adaptation strategies to BLOOM and benchmark
its zero-shot prompting performance on eight new languages in a
resource-constrained setting. We find language adaptation to be effective at
improving zero-shot performance in new languages. Surprisingly, we find that
adapter-based finetuning is more effective than continued pretraining for large
models. In addition, we discover that prompting performance is not
significantly affected by language specifics, such as the writing system; it is
primarily determined by the size of the language adaptation data. We also add
new languages to BLOOMZ, a multitask finetuned version of BLOOM capable of
following task instructions zero-shot. We find that including a new language in
the multitask finetuning mixture is the most effective way to teach BLOOMZ a
new language. We conclude that, with sufficient training data, language
adaptation can generalize well to diverse languages. Our code is available at
https://github.com/bigscience-workshop/multilingual-modeling.
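As one concrete instance of adapter-based language adaptation, the sketch below
freezes a small BLOOM checkpoint and trains only lightweight adapter weights on
monolingual text in the new language, using LoRA via the peft library as a
representative adapter method. The checkpoint size, hyperparameters, and the
placeholder corpus file are illustrative rather than the paper's exact setup.

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Freeze the base model; train only small adapter matrices injected into the
# fused attention projection ("query_key_value" in BLOOM).
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["query_key_value"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Monolingual text in the new language (placeholder file name).
data = load_dataset("text", data_files={"train": "new_language.txt"})["train"]
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bloom-adapted",
                           per_device_train_batch_size=4, num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()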
Crosslingual Generalization through Multitask Finetuning
Multitask prompted finetuning (MTF) has been shown to help large language
models generalize to new tasks in a zero-shot setting, but so far explorations
of MTF have focused on English data and models. We apply MTF to the pretrained
multilingual BLOOM and mT5 model families to produce finetuned variants called
BLOOMZ and mT0. We find that finetuning large multilingual language models on
English tasks with English prompts enables task generalization to non-English
languages that appear only in the pretraining corpus. Finetuning on
multilingual tasks with English prompts further improves performance on English
and non-English tasks, leading to various state-of-the-art zero-shot results.
We also investigate finetuning on multilingual tasks with prompts that have
been machine-translated from English to match the language of each dataset. We
find that training on these machine-translated prompts leads to better
performance on human-written prompts in the respective languages. Surprisingly,
we find models are capable of zero-shot generalization to tasks in languages
they have never intentionally seen. We conjecture that the models are learning
higher-level capabilities that are both task- and language-agnostic. In
addition, we introduce xP3, a composite of supervised datasets in 46 languages
with English and machine-translated prompts. Our code, datasets, and models are
freely available at https://github.com/bigscience-workshop/xmtf.
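A minimal sketch of the resulting zero-shot behavior, using one of the released
BLOOMZ checkpoints; the prompt is illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Instruction-style prompt in plain English; BLOOMZ was multitask-finetuned to
# follow such instructions zero-shot, including across languages.
prompt = "Translate to French: The model generalizes across languages."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))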
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to perform new tasks based on a
few demonstrations or natural language instructions. While these capabilities
have led to widespread adoption, most LLMs are developed by resource-rich
organizations and are frequently kept from the public. As a step towards
democratizing this powerful technology, we present BLOOM, a 176B-parameter
open-access language model designed and built through a collaboration of
hundreds of researchers. BLOOM is a decoder-only Transformer language model
trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46
natural and 13 programming languages (59 in total). We find that BLOOM achieves
competitive performance on a wide variety of benchmarks, with stronger results
after undergoing multitask prompted finetuning. To facilitate future research
and applications using LLMs, we publicly release our models and code under the
Responsible AI License.
StarCoder: may the source be with you!
The BigCode community, an open-scientific collaboration working on the
responsible development of Large Language Models for Code (Code LLMs),
introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context
length, infilling capabilities and fast large-batch inference enabled by
multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced
from The Stack, a large collection of permissively licensed GitHub repositories
with inspection tools and an opt-out process. We fine-tuned StarCoderBase on
35B Python tokens, resulting in StarCoder. We perform the most comprehensive
evaluation of Code LLMs to date and show that StarCoderBase outperforms every
open Code LLM that supports multiple programming languages and matches or
outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder
outperforms every model that is fine-tuned on Python, can be prompted to
achieve 40% pass@1 on HumanEval, and still retains its performance on other
programming languages. We take several important steps towards a safe
open-access model release, including an improved PII redaction pipeline and a
novel attribution tracing tool, and make the StarCoder models publicly
available under a more commercially viable version of the Open Responsible AI
Model license.
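For reference, the pass@1 figure quoted above is computed with the standard
unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021):
draw n samples per problem, count the c that pass the unit tests, and estimate
pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A short
implementation:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples passes), given
    n total samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k).
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples per problem, 80 passing: pass@1 = 80/200 = 0.40
print(pass_at_k(200, 80, 1))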