11 research outputs found

    The Stack: 3 TB of permissively licensed source code

    Full text link
    Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Full text link
    Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License

    Open sourcing AI: intellectual property at the service of platform leadership

    No full text
    Artificial Intelligence (AI) is one of the most strategic technologies of our century. Consequently, tech companies are adopting intellectual property strategies to protect their investment in the field, which encompasses copyright, patents, and trade secrets. While the number of AI-related patent applications is increasing, the number of open-source AI projects sponsored by major AI patent holders is also on the rise. This article explores the commercial and policy strategic reasons behind the growing adoption of open-source licensing in the AI space. More precisely, it assesses how IP rights are articulated around “openness” as a competitive factor in ecosystem competition, and how some players are using open-source licensing successfully to attract a critical mass of users and build an ecosystem around their AI platforms. Moreover, this article integrates the debate on the protectability of AI features by IP rights to assess the potential implications for open-source. Finally, it analyses the most used open-source licenses in AI projects and highlights existing and future challenges from an IP and contractual law perspective

    Report on Blockchain for Societies

    No full text
    The Amsterdam Law Forum (ALF) has released in this edition a Special Collection titled Report on Blockchain for Societies, edited by Dr Thibault Schrepel, Associate Professor at the Vrije Universiteit Amsterdam (VU), and Kirill Ryabtsev, PhD student in blockchain and antitrust. The Report comprises several contributions written by prominent scholars and practitioners in the area of blockchain technology. It discusses the application of blockchain governance to various legal fields such as data protection, regulation of the luxury goods industry, real estate, health and vaccines, and laws on nationality.

    Anales del Instituto Español de Edafología, Ecología y Fisiología Vegetal Tomo 6

    No full text
    [Volume 1] Mariano Claver Aliod / Contribution to the study of the siliceous-humic soils of the Sierra de Guadarrama.-- Enrique Gutiérrez Ríos and Lorenzo Hernando / Bentonite deposits in Spanish Morocco.-- A. Hoyos de Castro and F. González García / Identification and properties of a Spanish kaolin.-- A. Hoyos de Castro and J. M. Ahumada Buesa / Note on pottery materials.-- F. Pino and J. Acosta Rodríguez / Note on the determination of iron (II) in silicates and rocks.-- Isidoro Asensio Amor / Comparative study of methods for the particle-size analysis of soils.-- Luis Cavanillas Rodríguez / Studies of plant transpiration (lysimeter experiments with maize crops).-- José Mª Rodríguez de la Borbolla y Alcalá / The influence of chlorine on plants.-- J. A. Jiménez Salas / Soil mechanics, a new branch of edaphology (II).-- Charles Thom / Control of the microbial population of the soil.-- Books published.-- Review.-- [Volume 2] Manuel Carlos Alvarez Querol / Variables influencing the silica/alumina molecular ratio in Spanish granitic soils.-- Ángel Hoyos de Castro / Contribution to the study of Spanish siliceous soils.-- Arturo Caballero López / Physiological studies related to phytohormones in Sternbergia lutea Gawl. et Ker.-- Charles Thom / The Penicillia. Molds men meet everyday (Los Penicillium, mohos que vemos todos los días).-- José Mª Sierra de la Guerra / Edaphology or geonomy?.-- Review.-- [Volume 3] Vicente Aleixandre Ferrandis / Characterization of some Spanish clays by base exchange and dehydration curves.-- José Mª Albareda Herrera and Cruz Rodríguez Muñoz / Ordering phenomena and rheoanisotropy of clays.-- Fernando Burriel Martí and Valentín Hernando Fernández / Phosphorus in Spanish soils: I. Contribution to the colorimetric determination of phosphorus.-- Florencio Bustinza Lachiondo and Arturo Caballero López / On the use of a water-soluble excipient in phytohormone application techniques.-- Ernesto Vieitez Cortizo and José L. Blanco / Relationships between the genetic condition of maize and the biometric characteristics of its pollen (preliminary work).-- José Mª Albareda Herrera and Vicente Aleixandre Ferrandis / On additivity in the dehydration of mixtures of clay minerals.-- Books published.-- Review.-- Peer reviewed

    StarCoder: may the source be with you!

    Full text link
    The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
