14 research outputs found
Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models
The high cost of full-parameter fine-tuning (FFT) of Large Language Models
(LLMs) has led to a series of parameter-efficient fine-tuning (PEFT) methods.
However, it remains unclear which methods provide the best cost-performance
trade-off at different model scales. We introduce Astraios, a suite of 28
instruction-tuned OctoCoder models using 7 tuning methods and 4 model sizes up
to 16 billion parameters. Through investigations across 5 tasks and 8 different
datasets encompassing both code comprehension and code generation, we find
that FFT generally leads to the best downstream performance across all scales,
and that PEFT methods differ significantly in efficacy depending on model
scale. LoRA usually offers the most favorable trade-off between cost and
performance. Further investigation into the effects of these methods on both
model robustness and code security reveals that larger models tend to
demonstrate reduced robustness and weaker security. Finally, we explore the
relationships among updated parameters, cross-entropy loss, and task
performance. We find that the tuning effectiveness observed in small models
generalizes well to larger models, and that the validation loss during
instruction tuning can be a reliable indicator of overall downstream performance.
Comment: 25 pages (12 main), 19 figures, 8 tables
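To make the cost-performance trade-off concrete, the sketch below shows how a LoRA adapter can be attached to a code LLM with the Hugging Face peft library. This is a minimal illustration, not the Astraios training setup; the base checkpoint, target module names, and rank are illustrative assumptions.

```python
# Minimal LoRA sketch (illustrative, not the Astraios configuration).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed small base model purely to keep the example lightweight.
base = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")

lora_cfg = LoraConfig(
    r=16,                                  # low-rank adapter dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

Because only the adapter weights receive gradients, fine-tuning memory and storage costs drop sharply compared with full-parameter fine-tuning, which is the trade-off the paper quantifies across model scales.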
OctoPack: Instruction Tuning Code Large Language Models
Finetuning large language models (LLMs) on instructions leads to vast
performance improvements on natural language tasks. We apply instruction tuning
using code, leveraging the natural structure of Git commits, which pair code
changes with human instructions. We compile CommitPack: 4 terabytes of Git
commits across 350 programming languages. We benchmark CommitPack against other
natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B
parameter StarCoder model, and achieve state-of-the-art performance among
models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2%
pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark
to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis)
across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models,
OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among
all permissive models, demonstrating CommitPack's benefits in generalizing to a
wider set of languages and natural coding tasks. Code, models, and data are
freely available at https://github.com/bigcode-project/octopack.
Comment: 57 pages (9 main), 39 figures, 16 tables
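As a rough illustration of how Git commits pair code changes with human instructions, the sketch below turns a single commit into an instruction-tuning example. The field names and helper are illustrative assumptions, not the actual CommitPack schema or pipeline.

```python
# Illustrative sketch: a commit message acts as the instruction, the pre-change
# file as the input, and the post-change file as the target output.
def commit_to_example(message: str, old_code: str, new_code: str) -> dict:
    return {
        "instruction": message.strip(),  # e.g. "Handle empty list in average()"
        "input": old_code,               # code before the commit
        "output": new_code,              # code after the commit
    }

example = commit_to_example(
    "Handle empty list in average()",
    "def average(xs):\n    return sum(xs) / len(xs)\n",
    "def average(xs):\n    return sum(xs) / len(xs) if xs else 0.0\n",
)
print(example["instruction"])
```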
The BigCode Project Governance Card
This document serves as an overview of the different mechanisms and areas of
governance in the BigCode project. It aims to support transparency by providing
relevant information about choices that were made during the project to the
broader public, and to serve as an example of intentional governance of an open
research project that future endeavors can leverage to shape their own
approach. The first section, Project Structure, covers the project
organization, its stated goals and values, its internal decision processes, and
its funding and resources. The second section, Data and Model Governance,
covers decisions relating to the questions of data subject consent, privacy,
and model release.
Comment: 12 pages; related papers: arXiv:2305.06161, arXiv:2301.03988, and
arXiv:2211.15533v1; learn more at https://www.bigcode-project.org
The Stack: 3 TB of permissively licensed source code
Large Language Models (LLMs) play an ever-increasing role in the field of
Artificial Intelligence (AI), not only for natural language processing but also
for code understanding and generation. To stimulate open and responsible
research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting
of permissively licensed source code in 30 programming languages. We describe
how we collect the full dataset, construct a permissively licensed subset,
present a data governance plan, discuss limitations, and show promising results
on text2code benchmarks by training 350M-parameter decoders on different Python
subsets. We find that (1) near-deduplicating the data significantly boosts
performance across all experiments, and (2) it is possible to match previously
reported HumanEval and MBPP performance using only permissively licensed data.
We make the dataset available at https://hf.co/BigCode, provide a tool called
"Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers
to search The Stack for copies of their code, and provide a process for code to
be removed from the dataset by following the instructions at
https://www.bigcode-project.org/docs/about/the-stack/
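As an illustration of the near-deduplication step mentioned above, the sketch below uses MinHash LSH (via the datasketch library) to drop files that are near-copies of ones already kept. The shingle size and similarity threshold are illustrative assumptions, not the values used to build The Stack.

```python
# Minimal near-deduplication sketch with MinHash LSH (illustrative parameters).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, k: int = 5) -> MinHash:
    """Hash overlapping k-token shingles of a file into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - k + 1, 1)):
        m.update(" ".join(tokens[i:i + k]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # assumed similarity threshold
files = {
    "a.py": "def add(a, b): return a + b",
    "b.py": "def add(a, b):  return a + b",   # whitespace-only variant of a.py
}
kept = []
for name, code in files.items():
    sig = minhash(code)
    if not lsh.query(sig):      # no near-duplicate seen so far -> keep the file
        lsh.insert(name, sig)
        kept.append(name)
print(kept)  # the near-copy is dropped
```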
Zephyr: Direct Distillation of LM Alignment
We aim to produce a smaller language model that is aligned to user intent.
Previous research has shown that applying distilled supervised fine-tuning
(dSFT) on larger models significantly improves task accuracy; however, these
models are unaligned, i.e. they do not respond well to natural prompts. To
distill this property, we experiment with the use of preference data from AI
Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model,
we apply distilled direct preference optimization (dDPO) to learn a chat model
with significantly improved intent alignment. The approach requires only a few
hours of training without any additional sampling during fine-tuning. The final
result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B
parameter models, and requires no human annotation. In particular, results on
MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access
RLHF-based model. Code, models, data, and tutorials for the system are
available at https://github.com/huggingface/alignment-handbook
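For readers unfamiliar with dDPO, the sketch below shows the standard DPO objective it builds on, written in plain PyTorch. It is a minimal illustration of the loss, not the Zephyr training code; the beta value is an illustrative default.

```python
# Minimal DPO loss sketch: push the policy to prefer the chosen response over
# the rejected one relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of policy vs. reference for preferred and dispreferred answers.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Maximize the gap between the two margins, scaled by beta.
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```

In dDPO the preference labels come from a teacher model ranking sampled outputs (AI Feedback) rather than from human annotators, which is why no human annotation is required.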
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
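As a usage note, the open-access BLOOM checkpoints can be queried with the transformers library as sketched below; the 560M variant is used here only to keep the example runnable on a single machine, since the full 176B model requires multi-GPU inference.

```python
# Minimal sketch of loading and prompting a small BLOOM checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"   # smaller sibling of the 176B model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Translate to French: The weather is nice today."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
```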
Radiometric Characterization of a Water-Based Conical Blackbody Calibration Target for Millimeter-Wave Remote Sensing
In this paper, we present the design and radiometric characterization of a water-based conical blackbody calibration target to be applied as a precise reference source in laboratories and for the accurate calibration of ground-based microwave instruments. The concept of aqueous blackbody targets circumvents the conflict between the electromagnetic and thermal properties of traditional calibration target designs. The temperature-controllable target consists of a conical low-loss plastic shell, which has been manufactured using a stereolithography three-dimensional printer to define the water-air interface. An advanced exponential profile of the shell is used to compensate the comparatively high refractive index of water and to obtain a high emissivity. Simulations and active measurements of the coherent backscattering S11 have been performed, verifying its excellent electromagnetic properties. Furthermore, radiometric measurements at 110 GHz demonstrate the outstanding thermal performance at various target temperatures between 10 and 60 °C. The differences between the brightness temperature and the water temperature are measured to be less than 40 mK and are therefore significantly smaller in comparison to traditional calibration targets. In addition, the measured baseline ripples in the spectra caused by the water-based calibration target are small compared to established blackbody concepts.
Index Terms: blackbody targets, calibration, electromagnetic design, microwave absorbers, microwave radiometry, remote sensing, three-dimensional (3-D) printing
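To see why a high emissivity keeps the brightness temperature so close to the controlled water temperature, the sketch below evaluates the standard Rayleigh-Jeans mixing relation T_B = e * T_target + (1 - e) * T_ambient. The emissivity and temperature values are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope radiometry sketch (illustrative numbers only).
def brightness_temperature(emissivity: float, t_target_k: float, t_ambient_k: float) -> float:
    """Rayleigh-Jeans mixing of emitted and reflected ambient radiation."""
    return emissivity * t_target_k + (1.0 - emissivity) * t_ambient_k

t_target = 300.0    # water temperature in kelvin (illustrative)
t_ambient = 295.0   # reflected ambient temperature in kelvin (illustrative)
for e in (0.999, 0.99999):
    tb = brightness_temperature(e, t_target, t_ambient)
    print(f"emissivity={e}: T_B - T_target = {(tb - t_target) * 1e3:.2f} mK")
```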
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.