151 research outputs found

    Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

    12 pages, 6 figures, 2 tables. International audience. The need for large raw corpora has increased dramatically in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. While there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still automatic web crawling. In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable for pre-training large generative language models, as well as hopefully for other applications in Natural Language Processing and Digital Humanities.

    Semi-automatic staging area for high-quality structured data extraction from scientific literature

    In this study, we propose a staging area for ingesting new superconductors' experimental data into SuperCon, where the data is machine-collected from scientific articles. Our objective is to enhance the efficiency of updating SuperCon while maintaining or enhancing the data quality. We present a semi-automatic staging area driven by a workflow combining automatic and manual processes on the extracted database. An automatic anomaly detection process pre-screens the collected data. Users can then manually correct any errors through a user interface tailored to simplify data verification against the original PDF documents. Additionally, when a record is corrected, its raw data is collected and used as training data to improve machine learning models. Evaluation experiments demonstrate that our staging area significantly improves curation quality. We compare the interface with the traditional manual approach of reading PDF documents and recording information in an Excel document. Using the interface boosts precision and recall by 6% and 50%, respectively, for an average increase of 40% in F1-score. Comment: 5 tables, 9 figures, 31 pages
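The abstract above reports curation quality as precision, recall, and F1-score. As a reminder of how the three relate, here is a minimal sketch; the function name and the counts in the usage note are illustrative assumptions, not figures or code from the study.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from record-level counts.

    tp: records correctly extracted or curated (true positives)
    fp: records recorded but wrong (false positives)
    fn: correct records that were missed (false negatives)
    The counts are hypothetical inputs, not data from the paper.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, `precision_recall_f1(8, 2, 8)` gives a precision of 0.8 and a recall of 0.5. Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why a large gain in recall can move F1 much more than a small gain in precision.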

    From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

    8 pages, 2 figures, 4 tables. International audience. Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are at the same time more complex to process and more scarce in the corpora available, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries). We present the FreEMmax corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEMmax. We evaluate the usefulness of D'AlemBERT by fine-tuning it on a part-of-speech tagging task, outperforming previous work on the test set. Importantly, we find evidence for the transfer learning capacity of the language model, since its performance on lesser-resourced time periods appears to have been boosted by the more resourced ones. We release D'AlemBERT and the open-sourced subpart of the FreEMmax corpus.

    Análisis petrográficos de rocas silíceas en el centro-este de la provincia de San Luis

    In this paper we present the first results obtained in the characterization of the Regional Base of Lithic Resources in the upper and middle basins of the Quinto River (east-central San Luis Province). Fieldwork was planned from the perspectives of distributional archaeology and geoarchaeology, with the aim of locating potential sources of lithic raw material and those used by local groups. At the macroregional level, quartz is the main lithic raw material recorded at archaeological sites. However, previous archaeological studies mention the use of siliceous rocks of superior quality for knapping. In this regard, only three sources of this type have been systematically studied in the province. The petrographic analysis made it possible to identify three new potential sources of siliceous rocks, identified microscopically as chalcedonies.
    Affiliation: Borgo, Mariangeles. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Ramos, Gabriel Alejandro. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Heider, Guillermo. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Chiesa, Jorge Orlando. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Ortiz Suarez, Ariel Emilio. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Curtoni, Rafael Pedro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Tandil. Investigaciones Arqueológicas y Paleontológicas del Cuaternario Pampeano. Universidad Nacional del Centro de la Provincia de Buenos Aires. Investigaciones Arqueológicas y Paleontológicas del Cuaternario Pampeano; Argentina
    Affiliation: Gil, Raul Andres. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis. Instituto de Química de San Luis. Universidad Nacional de San Luis. Facultad de Química, Bioquímica y Farmacia. Instituto de Química de San Luis; Argentina

    Tokenizer Choice For LLM Training: Negligible or Crucial?

    The recent success of LLMs has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes, and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance, as well as training and inference costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than English-only tokenizers. While English-only tokenizers have been applied to the training of multilingual LLMs, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
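The abstract above names fertility and parity as common tokenizer metrics without defining them. A minimal sketch under stated assumptions: fertility is taken here as the average number of tokens per whitespace-separated word, and parity as the ratio of total token counts over parallel texts in two languages (values near 1.0 mean both languages are tokenized at similar cost). The function names and the toy tokenizer are hypothetical, and the paper's exact formulations may differ.

```python
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], List[str]]


def fertility(texts: Sequence[str], tokenize: Tokenizer) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def parity(texts_a: Sequence[str], texts_b: Sequence[str], tokenize: Tokenizer) -> float:
    """Ratio of token counts for parallel texts (language A over language B)."""
    if len(texts_a) != len(texts_b):
        raise ValueError("parallel corpora must have the same number of texts")
    tokens_a = sum(len(tokenize(text)) for text in texts_a)
    tokens_b = sum(len(tokenize(text)) for text in texts_b)
    if tokens_b == 0:
        raise ValueError("reference texts produced no tokens")
    return tokens_a / tokens_b


def toy_tokenize(text: str) -> List[str]:
    """Toy stand-in for a subword tokenizer: splits each word into 2-character chunks."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens
```

Any callable that maps a string to a list of tokens can replace `toy_tokenize`, so the same two functions can be pointed at a real tokenizer's encode method wrapped to return tokens.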

    Geoarchaeological studies of sources and lithic quarries in the Pampean Sierras and adjacent plains

    Lithic materials are among the most abundant items in the archaeological record of the Central Ranges and their adjacent plains. Studies of lithic artefacts use different spatial scales, as well as different field and laboratory methodologies. However, research programs oriented to the detection of lithic sources and archaeological quarries are less developed in this region than in other regions of Argentina. This paper presents the initial lines of a macroregional-scale project specifically oriented to their study. The results achieved so far have allowed the identification of numerous quarries and lithic sources in the provinces of San Luis, Córdoba, La Rioja, and Catamarca. Within this framework, we propose an occurrence model for siliceous rocks. On the one hand, the model helps explain the genesis of the identified rocks; on the other, it is the first wide-scale predictive model for central Argentina.
    Affiliation: Heider, Guillermo. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Ortiz Suarez, Ariel. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Rivero, Diego Eduardo. Centro de Estudios Históricos "Profesor Carlos S. A. Segreti". Instituto de Estudios Históricos - Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Instituto de Estudios Históricos; Argentina
    Affiliation: Baldo, Edgardo Gaspar Agustin. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Centro de Investigaciones en Ciencias de la Tierra. Universidad Nacional de Córdoba. Facultad de Ciencias Exactas Físicas y Naturales. Centro de Investigaciones en Ciencias de la Tierra; Argentina
    Affiliation: Pastor, Sebastián. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro de Investigaciones y Transferencia de Catamarca. Universidad Nacional de Catamarca. Centro de Investigaciones y Transferencia de Catamarca; Argentina
    Affiliation: Ramos, Gabriel. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Borgo, Mariangeles. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Gil, Raul Andres. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis. Instituto de Química de San Luis. Universidad Nacional de San Luis. Facultad de Química, Bioquímica y Farmacia. Instituto de Química de San Luis; Argentina
    Affiliation: Chiesa, Jorge. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Costa, Carlos. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Recalde, Maria Andrea. Centro de Estudios Históricos "Profesor Carlos S. A. Segreti". Instituto de Estudios Históricos - Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Instituto de Estudios Históricos; Argentina
    Affiliation: Curtoni, Rafael Pedro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Tandil. Investigaciones Arqueológicas y Paleontológicas del Cuaternario Pampeano. Universidad Nacional del Centro de la Provincia de Buenos Aires. Investigaciones Arqueológicas y Paleontológicas del Cuaternario Pampeano; Argentina
    Affiliation: Capriolo, Ana Julieta. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Muñoz, Lucas. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina

    Tocilizumab in refractory Caucasian Takayasu's arteritis: a multicenter study of 54 patients and literature review

    Objective: To assess the efficacy and safety of tocilizumab (TCZ) in Caucasian patients with refractory Takayasu's arteritis (TAK) in clinical practice. Methods: A multicenter study of Caucasian patients with refractory TAK who received TCZ. The outcome variables were remission, glucocorticoid-sparing effect, improvement in imaging techniques, and adverse events. A comparative study between patients who received TCZ as monotherapy (TCZMONO) and combined with conventional disease modifying anti-rheumatic drugs (cDMARDs) (TCZCOMBO) was performed. Results: The study comprised 54 patients (46 women/8 men) with a median [interquartile range (IQR)] age of 42.0 (32.5-50.5) years. TCZ was started after a median (IQR) of 12.0 (3.0-31.5) months since TAK diagnosis. Remission was achieved in 12/54 (22.2%), 19/49 (38.8%), 23/44 (52.3%), and 27/36 (75%) patients at 1, 3, 6, and 12 months, respectively. The prednisone dose was reduced from 30.0 (12.5-50.0) mg/day to 5.0 (0.0-5.6) mg/day at 12 months. An improvement in imaging findings was reported in 28 (73.7%) patients after a median (IQR) of 9.0 (6.0-14.0) months. Twenty-three (42.6%) patients were on TCZMONO and 31 (57.4%) on TCZCOMBO: MTX (n = 28), cyclosporine A (n = 2), azathioprine (n = 1). Patients on TCZCOMBO were younger [38.0 (27.0-46.0) versus 45.0 (38.0-57.0) years; difference (diff) [95% confidence interval (CI)] = -7.0 (-17.9, -0.56)], with a trend to longer TAK duration [21.0 (6.0-38.0) versus 6.0 (1.0-23.0) months; diff (95% CI) = 15 (-8.9, 35.5)] and higher C-reactive protein [2.4 (0.7-5.6) versus 1.3 (0.3-3.3) mg/dl; diff (95% CI) = 1.1 (-0.26, 2.99)]. Despite these differences, similar outcomes were observed in both groups (log rank p = 0.862). Relevant adverse events were reported in six (11.1%) patients, but only three developed severe events that required TCZ withdrawal. Conclusion: TCZ in monotherapy, or combined with cDMARDs, is effective and safe in patients with refractory TAK of Caucasian origin. Funding: This work was partially supported by RETICS Programs, RD08/0075 (RIER), RD12/0009/0013 and RD16/0012 from "Instituto de Salud Carlos III" (ISCIII) (Spain).

    Polymorphisms in DNA-repair genes in a cohort of prostate cancer patients from different areas in Spain: heterogeneity between populations as a confounding factor in association studies

    Background: Differences in the distribution of genotypes between individuals of the same ethnicity are an important confounding factor commonly undervalued in typical association studies conducted in radiogenomics. Objective: To evaluate the genotypic distribution of SNPs in a wide set of Spanish prostate cancer patients in order to determine the homogeneity of the population and to disclose potential bias. Design, Setting, and Participants: A total of 601 prostate cancer patients from Andalusia, the Basque Country, the Canary Islands and Catalonia were genotyped for 10 SNPs located in 6 different genes associated with DNA repair: XRCC1 (rs25487, rs25489, rs1799782), ERCC2 (rs13181), ERCC1 (rs11615), LIG4 (rs1805388, rs1805386), ATM (rs17503908, rs1800057) and P53 (rs1042522). SNP genotyping was performed in a Biotrove OpenArray NT Cycler. Outcome Measurements and Statistical Analysis: Comparisons of genotypic and allelic frequencies among populations, as well as haplotype analyses, were carried out using the web-based environment SNPator. Principal component analysis was performed using the SnpMatrix and XSnpMatrix classes and methods implemented as an R package. Unsupervised hierarchical clustering of SNPs was performed using MultiExperiment Viewer. Results and Limitations: We observed that the genotype distribution of 4 out of 10 SNPs was statistically different among the studied populations, with the greatest differences between Andalusia and Catalonia. These observations were confirmed in cluster analysis, principal component analysis and in the differential distribution of haplotypes among the populations. Because tumor characteristics have not been taken into account, it is possible that some polymorphisms may influence tumor characteristics in the same way that they may pose a risk factor for other disease characteristics. Conclusion: Differences in the distribution of genotypes within different populations of the same ethnicity could be an important confounding factor responsible for the lack of validation of SNPs associated with radiation-induced toxicity, especially when extensive meta-analyses with subjects from different countries are carried out.