11 research outputs found
The Stack: 3 TB of permissively licensed source code
Large Language Models (LLMs) play an ever-increasing role in the field of
Artificial Intelligence (AI), not only for natural language processing but also
for code understanding and generation. To stimulate open and responsible
research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting
of permissively licensed source code in 30 programming languages. We describe
how we collect the full dataset, construct a permissively licensed subset,
present a data governance plan, discuss limitations, and show promising results
on text2code benchmarks by training 350M-parameter decoders on different Python
subsets. We find that (1) near-deduplicating the data significantly boosts
performance across all experiments, and (2) it is possible to match previously
reported HumanEval and MBPP performance using only permissively licensed data.
We make the dataset available at https://hf.co/BigCode, provide a tool called
"Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers
to search The Stack for copies of their code, and provide a process for code to
be removed from the dataset by following the instructions at
https://www.bigcode-project.org/docs/about/the-stack/
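The abstract reports that near-deduplicating the data significantly boosts downstream performance. A common way to implement near-deduplication at scale is MinHash over token shingles, which approximates Jaccard similarity between files; the sketch below is an illustrative, simplified version of that general technique, not the paper's exact pipeline (function names and parameters here are hypothetical).

```python
# Illustrative MinHash near-duplicate check (simplified; hypothetical names,
# not The Stack's actual deduplication code).
import hashlib


def shingles(text: str, n: int = 5) -> set:
    """Break a document into overlapping n-token shingles."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}


def minhash_signature(sh: set, num_hashes: int = 64) -> list:
    """One minimum hash value per seed; equal coordinates estimate Jaccard overlap."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(a: str, b: str) -> float:
    """Fraction of matching signature coordinates approximates Jaccard similarity."""
    sig_a = minhash_signature(shingles(a))
    sig_b = minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

In a full pipeline, signatures would be bucketed with locality-sensitive hashing so that candidate pairs are found without comparing every pair of files; files whose estimated similarity exceeds a threshold are then treated as near-duplicates and all but one copy is dropped.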
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License
Open sourcing AI: intellectual property at the service of platform leadership
Artificial Intelligence (AI) is one of the most strategic technologies of our century. Consequently, tech companies are adopting intellectual property strategies to protect their investment in the field, which encompasses copyright, patents, and trade secrets. While the number of AI-related patent applications is increasing, the number of open-source AI projects sponsored by major AI patent holders is also on the rise. This article explores the commercial and policy strategic reasons behind the growing adoption of open-source licensing in the AI space. More precisely, it assesses how IP rights are articulated around “openness” as a competitive factor in ecosystem competition, and how some players are using open-source licensing successfully to attract a critical mass of users and build an ecosystem around their AI platforms. Moreover, this article integrates the debate on the protectability of AI features by IP rights to assess the potential implications for open-source. Finally, it analyses the most used open-source licenses in AI projects and highlights existing and future challenges from an IP and contractual law perspective
Report on Blockchain for Societies
The Amsterdam Law Forum (ALF) has released in this edition a Special Collection titled Report on Blockchain for Societies, edited by Dr Thibault Schrepel, Associate Professor at the Vrije Universiteit Amsterdam (VU), and Kirill Ryabtsev, PhD student in blockchain and antitrust. The Report is composed of several contributions written by prominent scholars and practitioners in the area of blockchain technology; it discusses the application of blockchain governance to various legal fields such as data protection, regulation of the luxury goods industry, real estate, health and vaccines, and laws on nationality
Anales del Instituto Español de Edafología, Ecología y Fisiología Vegetal Tomo 6
[Volume 1] Mariano Claver Aliod / Contribution to the study of the silicic-humic soils of the Sierra de Guadarrama.-- Enrique Gutiérrez Ríos and Lorenzo Hernando / Bentonite deposits in Spanish Morocco.-- A. Hoyos de Castro and F. González García / Identification and properties of a Spanish kaolin.-- A. Hoyos de Castro and J. M. Ahumada Buesa / Note on pottery materials.-- F. Pino and J. Acosta Rodríguez / Note on the determination of iron(II) in silicates and rocks.-- Isidoro Asensio Amor / Comparative study of methods of granulometric analysis of soils.-- Luis Cavanillas Rodríguez / Studies of plant transpiration (experiments in lysimeters with maize crops).-- José Mª Rodríguez de la Borbolla y Alcalá / The influence of chlorine on plants.-- J. A. Jiménez Salas / Soil mechanics, a new branch of Edaphology (II).-- Charles Thom / Control of the soil microbial population.-- Published books.-- Review
[Volume 2] Manuel Carlos Alvarez Querol / Variables influencing the silica/alumina molecular ratio in Spanish granitic soils.-- Ángel Hoyos de Castro / Contribution to the study of Spanish silicic soils.-- Arturo Caballero López / Physiological studies related to phytohormones in Sternbergia lutea Gawl. et Ker.-- Charles Thom / The Penicillia. Molds men meet every day.-- José Mª Sierra de la Guerra / Edaphology or Geonomy?.-- Review
[Volume 3] Vicente Aleixandre Ferrandis / Characterization of some Spanish clays by base exchange and dehydration curves.-- José Mª Albareda Herrera and Cruz Rodríguez Muñoz / Ordering and rheo-anisotropy phenomena in clays.-- Fernando Burriel Martí and Valentín Hernando Fernández / Phosphorus in Spanish soils: I. Contribution to the colorimetric determination of phosphorus.-- Florencio Bustinza Lachiondo and Arturo Caballero López / On the use of a water-soluble excipient in phytohormone application techniques.-- Ernesto Vieitez Cortizo and José L. Blanco / Relations between the genetic condition of maize and the biometric characteristics of its pollen (preliminary work).-- José Mª Albareda Herrera and Vicente Aleixandre Ferrandis / On additivity in the dehydrations of mixtures of clay minerals.-- Published books.-- Review. Peer reviewed
StarCoder: may the source be with you!
The BigCode community, an open-scientific collaboration working on the
responsible development of Large Language Models for Code (Code LLMs),
introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context
length, infilling capabilities and fast large-batch inference enabled by
multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced
from The Stack, a large collection of permissively licensed GitHub repositories
with inspection tools and an opt-out process. We fine-tuned StarCoderBase on
35B Python tokens, resulting in the creation of StarCoder. We perform the most
comprehensive evaluation of Code LLMs to date and show that StarCoderBase
outperforms every open Code LLM that supports multiple programming languages
and matches or outperforms the OpenAI code-cushman-001 model. Furthermore,
StarCoder outperforms every model that is fine-tuned on Python, can be prompted
to achieve 40% pass@1 on HumanEval, and still retains its performance on other
programming languages. We take several important steps towards a safe
open-access model release, including an improved PII redaction pipeline and a
novel attribution tracing tool, and make the StarCoder models publicly
available under a more commercially viable version of the Open Responsible AI
Model license
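The pass@1 figure cited above comes from the pass@k family of metrics common in code-generation evaluation. A standard way to compute it is the unbiased estimator from the Codex line of work: given n sampled completions for a problem of which c pass the unit tests, the probability that at least one of k random draws passes is 1 - C(n-c, k)/C(n, k). The snippet below is an illustrative sketch of that formula, not code from the StarCoder evaluation itself.

```python
# Unbiased pass@k estimator (illustrative sketch of the standard formula,
# not the StarCoder paper's evaluation code).
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c passing completions.

    Returns the probability that at least one of k completions drawn
    without replacement from the n samples passes the unit tests.
    """
    if n - c < k:
        # Too few failing samples to fill k draws: a passing one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples per problem of which 5 pass, pass@1 is 0.5; averaging this per-problem estimate over a benchmark such as HumanEval yields the reported score.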