19 research outputs found

Scaling Data-Constrained Language Models

Author: Barak Boaz
Muennighoff Niklas
Piktus Aleksandra
Pyysalo Sampo
Raffel Colin
Rush Alexander M.
Scao Teven Le
Tazi Nouamane
Wolf Thomas
Publication date: 30/05/2023

The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations. Comment: 47 pages (9 main), 37 figures, 13 tables

arXiv.org e-Print Archive

What Language Model to Train if You Have One Million GPU Hours?

Author: Bari M Saiful
Bekman Stas
Beltagy Iz
Biderman Stella
Elsahar Hady
Hesslow Daniel
Launay Julien
Muennighoff Niklas
Phang Jason
Press Ofir
Raffel Colin
Sanh Victor
Saulnier Lucile
Scao Teven Le
Shen Sheng
Sutawika Lintang
Tae Jaesung
Wang Thomas
Yong Zheng Xin
Publication date: 07/11/2022

The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameters models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone. In the process of building BLOOM--the Big Science Large Open-science Open-access Multilingual language model--our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience. Comment: Findings of EMNLP 202

arXiv.org e-Print Archive

Crosslingual Generalization through Multitask Finetuning

Author: Aji Alham Fikri
Albanie Samuel
Almubarak Khalid
Alyafeai Zaid
Bari M Saiful
Biderman Stella
Muennighoff Niklas
Radev Dragomir
Raff Edward
Raffel Colin
Roberts Adam
Scao Teven Le
Schoelkopf Hailey
Shen Sheng
Sutawika Lintang
Tang Xiangru
Wang Thomas
Webson Albert
Yong Zheng-Xin
Publication date: 29/05/2023

Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are freely available at https://github.com/bigscience-workshop/xmtf. Comment: 9 main pages (119 with appendix), 16 figures and 11 tables

arXiv.org e-Print Archive

Mistral 7B

Author: Bamford Chris
Bressand Florian
Casas Diego de las
Chaplot Devendra Singh
Jiang Albert Q.
Lachaux Marie-Anne
Lacroix Timothée
Lample Guillaume
Lavaud Lélio Renard
Lavril Thibaut
Lengyel Gianna
Mensch Arthur
Sablayrolles Alexandre
Saulnier Lucile
Sayed William El
Scao Teven Le
Stock Pierre
Wang Thomas
Publication date: 10/10/2023

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Comment: Models and code are available at https://mistral.ai/news/announcing-mistral-7b

arXiv.org e-Print Archive

FinGPT: Large Generative Models for a Small Language

Author: Antao Samuel
Eskelinen Anni
Ginter Filip
Heinonen Jyrki
Kanerva Jenna
Komulainen Ville
Kupari Hanna-Mari
Laippala Veronika
Luoma Jouni
Luukkonen Risto
Merioksa Mikko
Muennighoff Niklas
Piktus Aleksandra
Pyysalo Sampo
Sairanen Samuli
Scao Teven Le
Suominen Osma
Tazi Nouamane
Vahtola Aija
Wang Thomas
Wolf Thomas
Publication date: 03/11/2023

Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI. For model evaluation, we introduce FIN-bench, a version of BIG-bench with Finnish tasks. We also assess other model qualities such as toxicity and bias. Our models and tools are openly available at https://turkunlp.org/gpt3-finnish. Comment: 17 pages (10 main), 7 figures, 5 tables

arXiv.org e-Print Archive

Multitask Prompted Training Enables Zero-Shot Task Generalization

Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models’ pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pre-trained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero, and all prompts are available at https://github.com/bigscience-workshop/promptsource.

INRIA a CCSD electronic archive server