Getting the most out of your tokenizer for pre-training and domain adaptation
Tokenization is an understudied and often neglected component of modern LLMs.
Most published works use a single tokenizer for all experiments, often borrowed
from another model, without performing ablations or analysis to optimize
tokenization. Moreover, the tokenizer is generally kept unchanged when
fine-tuning a base model. In this paper, we show that the size,
pre-tokenization regular expression, and training data of a tokenizer can
significantly impact the model's generation speed, effective context size,
memory usage, and downstream performance. We train specialized Byte-Pair
Encoding code tokenizers, and conduct extensive ablations on the impact of
tokenizer design on the performance of LLMs for code generation tasks such as
HumanEval and MBPP, and provide recommendations for tokenizer hyper-parameter
selection and switching the tokenizer in a pre-trained LLM. We perform our
experiments on models trained from scratch and from pre-trained models,
verifying their applicability to a wide range of use-cases. We find that when
fine-tuning on more than 50 billion tokens, we can specialize the tokenizer of
a pre-trained LLM to obtain large gains in generation speed and effective
context size.
Construction of biomass tables for assessing wood availability in the savanna zone of North Cameroon (original title: Construction de tarifs de biomasse pour l'évaluation de la disponibilité ligneuse en zone de savanes au Nord-Cameroun).
The establishment of community forests, the management of woody resources at local and regional scales, and the organization of fuelwood supply to urban areas in North Cameroon cannot proceed without knowing how much wood is available in the areas concerned. Coupling forest inventory results with biomass tables makes it possible to estimate this availability. The construction of fresh-biomass tables is presented for four of the species (Anogeissus leiocarpus, Acacia senegal, Acacia hockii, Acacia gerrardii) mainly harvested in the Kaélé region. (Author's abstract)
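Technical data sheets on trees useful to farmers in North Cameroon: characteristics of each tree, what farmers do with it, and what they could do with it (original title: Fiches techniques des arbres utiles aux paysans du Nord Cameroun. Caractéristiques de l'arbre, ce qu'en font les paysans et ce qu'ils pourraient en faire)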
This booklet on trees useful to farmers in North Cameroon is designed as a resource for development practitioners. It gathers, in data sheets, the species most used by farmers, based on field studies and our field knowledge, without prejudging what is or is not good for farmers or what they can or cannot handle. The intent is first to build up a range of options that development agents could propose to farmers. These data sheets on useful fruits aim to help farmers, nurserymen and agroforestry agents discuss the best trees to promote in the different areas of the terroir, and the best management to apply.
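Fast and accurate evaluation of LoRa communication performance based on the Marcum function (original title: Evaluation rapide et précise des performances d'une communication LoRa basée sur la fonction de Marcum)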
Verifying the Mathematical Library of an UAV Autopilot with Frama-C
Ensuring the safety of critical systems is crucial and is often attained by extensive testing of the system. Formal methods are now commonly accepted as powerful tools to obtain guarantees on such systems, even if it is generally not possible to formally prove the safety and correctness of the whole system. This paper presents ongoing work on the formal verification of the Paparazzi UAV autopilot using the Frama-C verification platform. We focus on a Paparazzi mathematical library providing different UAV state representations and associated conversion functions, and manage to prove the absence of runtime errors in the library and some interesting functional properties of floating-point conversion functions.
Formal Verification for Autopilot - Preliminary state of the art
This document is a preliminary state of the art for the formal verification of the autopilot of an Unmanned Air Vehicle (UAV). We first present UAV autopilots, and more specifically the Paparazzi autopilot developed at ENAC, which will be our case study. We then present which properties could be verified and on which representation of the autopilot (source code, model). A more complete state of the art of current formal methods is then detailed, focusing on deductive methods, abstract interpretation, model checking and proof assistants. Finally, some immediate perspectives for the thesis are proposed.
Effect of a fungal chitosan preparation on Brettanomyces bruxellensis, a wine contaminant
AIMS: To investigate the action mechanisms of a specific chitosan preparation of fungal origin on Brettanomyces bruxellensis. METHODS AND RESULTS: Different approaches were carried out in a wine-model synthetic medium: optical and electron microscopy, flow cytometry, ATP flow measurements and zeta potential characterization. The inactivation effect was confirmed. Moreover, chitosan of fungal origin induced both physical and biological effects on B. bruxellensis cells. The physical effect led to aggregation of cells with chitosan, likely due to charge interactions. At the same time, a biological effect induced a leakage of ATP and thus a viability loss of B. bruxellensis cells. CONCLUSIONS: The antimicrobial mode of action of chitosan against B. bruxellensis is not a simple mechanism but the result of several mechanisms acting together. SIGNIFICANCE AND IMPACT OF THE STUDY: Brettanomyces bruxellensis, a yeast responsible for the production of undesirable aromatic compounds (volatile phenols), is a permanent threat to wine quality. Today, different means are implemented to fight B. bruxellensis, but they are not always sufficient. Chitosan of fungal origin is introduced as a new tool to control B. bruxellensis in winemaking and has been little studied for this application.
Genome analysis of the necrotrophic fungal pathogens Sclerotinia sclerotiorum and Botrytis cinerea
Sclerotinia sclerotiorum and Botrytis cinerea are closely related necrotrophic plant pathogenic fungi notable for their wide host ranges and environmental persistence. These attributes have made these species models for understanding the complexity of necrotrophic, broad host-range pathogenicity. Despite their similarities, the two species differ in mating behaviour and the ability to produce asexual spores. We have sequenced the genomes of one strain of S. sclerotiorum and two strains of B. cinerea. The comparative analysis of these genomes relative to one another and to other sequenced fungal genomes is provided here. Their 38–39 Mb genomes include 11,860–14,270 predicted genes, which share 83% amino acid identity on average between the two species. We have mapped the S. sclerotiorum assembly to 16 chromosomes and found large-scale co-linearity with the B. cinerea genomes. Seven percent of the S. sclerotiorum genome comprises transposable elements compared t