2,035 research outputs found

    Getting the most out of your tokenizer for pre-training and domain adaptation

    Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize tokenization. Moreover, the tokenizer is generally kept unchanged when fine-tuning a base model. In this paper, we show that the size, pre-tokenization regular expression, and training data of a tokenizer can significantly impact the model's generation speed, effective context size, memory usage, and downstream performance. We train specialized Byte-Pair Encoding code tokenizers, conduct extensive ablations on the impact of tokenizer design on the performance of LLMs for code generation tasks such as HumanEval and MBPP, and provide recommendations for selecting tokenizer hyper-parameters and for switching the tokenizer in a pre-trained LLM. We perform our experiments on models trained from scratch and from pre-trained models, verifying their applicability to a wide range of use-cases. We find that when fine-tuning on more than 50 billion tokens, we can specialize the tokenizer of a pre-trained LLM to obtain large gains in generation speed and effective context size.

    Construction de tarifs de biomasse pour l'évaluation de la disponibilité ligneuse en zone de savanes au Nord-Cameroun.

    The establishment of community forests, the management of woody resources at local and regional scales, and the organization of the fuelwood supply to urban areas in North Cameroon cannot proceed without knowing how much wood is available in the areas concerned. Coupling forest inventory results with biomass tables makes it possible to estimate this availability. The construction of fresh-biomass tables is presented for four of the species (Anogeissus leiocarpus, Acacia senegal, Acacia hockii, Acacia gerrardii) mainly harvested in the Kaélé region. (Author's abstract)

    Fiches techniques des arbres utiles aux paysans du Nord Cameroun. Caractéristiques de l'arbre, ce qu'en font les paysans et ce qu'ils pourraient en faire

    This booklet on trees that are useful to North Cameroonian farmers has been designed as a resource for development practitioners. It gathers, in data sheets, the species most used by farmers, based on field studies and our field knowledge, without judging what is or is not good for farmers or what they can or cannot handle. The aim is firstly to build up a range of options that development agents could propose to farmers. These data sheets on useful fruits aim to help farmers, nurserymen and agroforestry agents to discuss the best trees to promote in the different areas of the terroir, and the best management to apply.

    Verifying the Mathematical Library of an UAV Autopilot with Frama-C

    Ensuring the safety of critical systems is crucial and is often attained by extensive testing of the system. Formal methods are now commonly accepted as powerful tools to obtain guarantees on such systems, even if it is generally not possible to formally prove the safety and correctness of the whole system. This paper presents ongoing work on the formal verification of the Paparazzi UAV autopilot using the Frama-C verification platform. We focus on a Paparazzi mathematical library providing different UAV state representations and associated conversion functions, and manage to prove the absence of runtime errors in the library as well as some interesting functional properties of floating-point conversion functions.

    Formal Verification for Autopilot - Preliminary state of the art

    This document is a preliminary state of the art for the formal verification of the autopilot of an Unmanned Air Vehicle (UAV). We first present UAV autopilots, and more specifically the Paparazzi autopilot developed at ENAC, which will be our case study. We then present which properties could be verified and on which representation of the autopilot (source code, model). A more complete state of the art of current formal methods is then detailed, focusing on deductive methods, abstract interpretation, model checking and proof assistants. Finally, some immediate perspectives for the thesis are proposed.

    Effect of a fungal chitosan preparation on Brettanomyces bruxellensis, a wine contaminant

    AIMS: To investigate the action mechanisms of a specific chitosan preparation of fungal origin on Brettanomyces bruxellensis. METHODS AND RESULTS: Different approaches in a wine-model synthetic medium were carried out: optical and electronic microscopy, flow cytometry, ATP flow measurements and zeta potential characterization. The inactivation effect was confirmed. Moreover, chitosan of fungal origin induced both physical and biological effects on B. bruxellensis cells. The physical effect led to aggregation of cells with chitosan, likely due to charge interactions. At the same time, a biological effect induced a leakage of ATP and thus a viability loss of B. bruxellensis cells. CONCLUSIONS: The antimicrobial action mode of chitosan against B. bruxellensis is not a simple mechanism but the result of several mechanisms acting together. SIGNIFICANCE AND IMPACT OF THE STUDY: Brettanomyces bruxellensis, a yeast responsible for the production of undesirable aromatic compounds (volatile phenols), is a permanent threat to wine quality. Today, different means are implemented to fight against B. bruxellensis, but they are not always sufficient. Chitosan of fungal origin is introduced as a new tool to control B. bruxellensis in winemaking and has been little studied before for this application.

    Genome analysis of the necrotrophic fungal pathogens Sclerotinia sclerotiorum and Botrytis cinerea

    Sclerotinia sclerotiorum and Botrytis cinerea are closely related necrotrophic plant pathogenic fungi notable for their wide host ranges and environmental persistence. These attributes have made these species models for understanding the complexity of necrotrophic, broad host-range pathogenicity. Despite their similarities, the two species differ in mating behaviour and the ability to produce asexual spores. We have sequenced the genomes of one strain of S. sclerotiorum and two strains of B. cinerea. The comparative analysis of these genomes relative to one another and to other sequenced fungal genomes is provided here. Their 38–39 Mb genomes include 11,860–14,270 predicted genes, which share 83% amino acid identity on average between the two species. We have mapped the S. sclerotiorum assembly to 16 chromosomes and found large-scale co-linearity with the B. cinerea genomes. Seven percent of the S. sclerotiorum genome comprises transposable elements compared t