166 research outputs found

    Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

    12 pages, 6 figures, 2 tables. International audience. The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling. In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable for pre-training large generative language models, as well as for other applications in Natural Language Processing and Digital Humanities.

    Semi-automatic staging area for high-quality structured data extraction from scientific literature

    In this study, we propose a staging area for ingesting new superconductors' experimental data in SuperCon that is machine-collected from scientific articles. Our objective is to enhance the efficiency of updating SuperCon while maintaining or enhancing the data quality. We present a semi-automatic staging area driven by a workflow combining automatic and manual processes on the extracted database. An automatic anomaly detection process pre-screens the collected data. Users can then manually correct any errors through a user interface tailored to simplify data verification against the original PDF documents. Additionally, when a record is corrected, its raw data is collected and used as training data to improve the machine learning models. Evaluation experiments demonstrate that our staging area significantly improves curation quality. We compare the interface with the traditional manual approach of reading PDF documents and recording information in an Excel document. Using the interface boosts precision and recall by 6% and 50%, respectively, for an average increase of 40% in F1-score. Comment: 5 tables, 9 figures, 31 pages

    From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

    8 pages, 2 figures, 4 tables. International audience. Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are at the same time more complex to process and scarcer in the corpora available, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries). We present the FreEMmax corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEMmax. We evaluate the usefulness of D'AlemBERT by fine-tuning it on a part-of-speech tagging task, outperforming previous work on the test set. Importantly, we find evidence for the transfer learning capacity of the language model, since its performance on lesser-resourced time periods appears to have been boosted by the more resourced ones. We release D'AlemBERT and the open-sourced subpart of the FreEMmax corpus.

    Análisis petrográficos de rocas silíceas en el centro-este de la provincia de San Luis

    In this paper we present the first results obtained in the characterization of the Regional Base of Lithic Resources in the upper and middle basins of the Quinto River (east-central San Luis Province). The fieldwork was planned from the perspective of distributional archaeology and geoarchaeology, with the aim of locating potential sources of lithic raw material and those used by local groups. At the macroregional level, quartz is the main lithic raw material recorded at archaeological sites. However, earlier archaeological studies mention the use of siliceous rocks of superior quality for knapping. In this sense, only three sources of this type have been systematically studied in the province. The petrographic analysis made it possible to identify three new potential sources of siliceous rocks, identified microscopically as chalcedonies.
    Affiliation: Borgo, Mariangeles. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Ramos, Gabriel Alejandro. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Heider, Guillermo. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Chiesa, Jorge Orlando. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Ortiz Suarez, Ariel Emilio. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Curtoni, Rafael Pedro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Tandil. Investigaciones Arqueológicas y Paleontológicas del Cuaternario Pampeano. Universidad Nacional del Centro de la Provincia de Buenos Aires. Investigaciones Arqueológicas y Paleontológicas del Cuaternario Pampeano; Argentina
    Affiliation: Gil, Raul Andres. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis. Instituto de Química de San Luis. Universidad Nacional de San Luis. Facultad de Química, Bioquímica y Farmacia. Instituto de Química de San Luis; Argentina

    Clinical characteristics and respiratory care in hospitalized vaccinated SARS-CoV-2 patients

    Background: The main objective of the present study was to analyze both clinical characteristics and evolution during hospitalization of a cohort of patients admitted for COVID-19 pneumonia who were not vaccinated, or with a complete or incomplete vaccination schedule. Methods: This COVID-19 specialized single-center cohort study of 1888 COVID-19 patients hospitalized at the “Enfermera Isabel Zendal” Emergencies Hospital (HEEIZ), Madrid (Spain) was performed between July 1 and September 30, 2021. It compared the results of 1327 hospitalized unvaccinated patients to 209 hospitalized fully vaccinated and 352 hospitalized partially vaccinated patients. The four different COVID-19 vaccines authorized in Spain during the time-period studied were: BNT162b2 (Pfizer); ChAdOx1 nCoV-19 (AstraZeneca); mRNA-1273 (Moderna); Ad26.COV2.S (Janssen). Findings: Hospitalized patients’ median age was 41 years (IQR 33–50) for the unvaccinated and 61 years (IQR 53–67) for the fully vaccinated ones. The main comorbidities were obesity, hypertension and diabetes mellitus. 20% (266) of unvaccinated patients required noninvasive respiratory care, as did 14% (51) of partially and 14% (30) of fully vaccinated patients; 6% (78) of the unvaccinated patients also needed invasive respiratory care, as did 5% (16) of partially and 5% (11) of fully vaccinated patients. Interpretation: Fully vaccinated patients were 84% (95% CI: 82–86%) less likely to be admitted to hospital, and protection rose for those aged <50 years. Once hospitalized, vaccinated patients displayed more protection against requiring respiratory care than unvaccinated ones, despite being older and having more comorbidities. No differences appeared for the four studied COVID-19 vaccines and complying with vaccination recommendations proved relevant. Funding: The research was funded by the “Plan Propio de Investigación” Program of the Castilla-La Mancha University / European Regional Development Fund (2021-GRIN-31,039).

    Tokenizer Choice For LLM Training: Negligible or Crucial?

    The recent success of LLMs has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes, and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance as well as training and inference costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of downstream performance, which makes them a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a threefold increase in vocabulary size compared with English. While English-only tokenizers have been applied to the training of multilingual LLMs, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
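
    The abstract above names fertility and parity without defining them. As a rough illustration only, the sketch below assumes the common definitions: fertility as the average number of tokens per whitespace-separated word, and parity as the ratio of token counts over parallel texts in two languages. The function names and the fixed-width `toy_tokenize` are hypothetical stand-ins, not the tokenizers or evaluation code used in the study.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def fertility(texts: List[str], tokenize: Tokenizer) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("no words to measure")
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_tokens / n_words


def parity(texts_a: List[str], texts_b: List[str], tokenize: Tokenizer) -> float:
    """Token-count ratio over parallel texts; 1.0 means both sides cost the same."""
    tokens_a = sum(len(tokenize(t)) for t in texts_a)
    tokens_b = sum(len(tokenize(t)) for t in texts_b)
    if tokens_b == 0:
        raise ValueError("reference side produced no tokens")
    return tokens_a / tokens_b


def toy_tokenize(text: str, piece_len: int = 4) -> List[str]:
    """Hypothetical tokenizer: cuts every word into chunks of at most piece_len characters."""
    return [
        word[i:i + piece_len]
        for word in text.split()
        for i in range(0, len(word), piece_len)
    ]


if __name__ == "__main__":
    print(fertility(["abcd efghij"], toy_tokenize))
    print(parity(["abcd"], ["abcdefgh"], toy_tokenize))
```

    A real evaluation would pass in a trained subword tokenizer and a held-out corpus per language; the toy tokenizer only keeps the sketch self-contained.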

    Geoarchaeological studies of sources and lithic quarries in the Pampean Sierras and adjacent plains

    Lithic materials are among the most abundant items in the archaeological record of the Central Ranges and their adjacent plains. Studies of lithic artefacts use different spatial scales, as well as different field and laboratory methodologies. However, research programs oriented to the detection of lithic sources and archaeological quarries are less developed in this region than in other regions of Argentina. This paper presents the initial lines of a macroregional-scale project specifically oriented to their study. The results achieved so far have allowed the identification of numerous quarries and lithic sources in San Luis, Córdoba, La Rioja, and Catamarca provinces. Within this framework, we propose an occurrence model for siliceous rocks. On the one hand, it helps explain the genesis of the identified rocks; on the other, it is the first wide-scale predictive model for central Argentina.
    Affiliation: Heider, Guillermo. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Ortiz Suarez, Ariel. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Rivero, Diego Eduardo. Centro de Estudios Históricos "Profesor Carlos S. A. Segreti". Instituto de Estudios Históricos - Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Instituto de Estudios Históricos; Argentina
    Affiliation: Baldo, Edgardo Gaspar Agustin. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Centro de Investigaciones en Ciencias de la Tierra. Universidad Nacional de Córdoba. Facultad de Ciencias Exactas Físicas y Naturales. Centro de Investigaciones en Ciencias de la Tierra; Argentina
    Affiliation: Pastor, Sebastián. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro de Investigaciones y Transferencia de Catamarca. Universidad Nacional de Catamarca. Centro de Investigaciones y Transferencia de Catamarca; Argentina
    Affiliation: Ramos, Gabriel. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Borgo, Mariangeles. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Gil, Raul Andres. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis. Instituto de Química de San Luis. Universidad Nacional de San Luis. Facultad de Química, Bioquímica y Farmacia. Instituto de Química de San Luis; Argentina
    Affiliation: Chiesa, Jorge. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Costa, Carlos. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina
    Affiliation: Recalde, Maria Andrea. Centro de Estudios Históricos "Profesor Carlos S. A. Segreti". Instituto de Estudios Históricos - Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Instituto de Estudios Históricos; Argentina
    Affiliation: Curtoni, Rafael Pedro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Tandil. Investigaciones Arqueológicas y Paleontológicas del Cuaternario Pampeano. Universidad Nacional del Centro de la Provincia de Buenos Aires. Investigaciones Arqueológicas y Paleontológicas del Cuaternario Pampeano; Argentina
    Affiliation: Capriolo, Ana Julieta. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
    Affiliation: Muñoz, Lucas. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Geología; Argentina

    Tocilizumab in refractory Caucasian Takayasu's arteritis: a multicenter study of 54 patients and literature review

    Objective: To assess the efficacy and safety of tocilizumab (TCZ) in Caucasian patients with refractory Takayasu's arteritis (TAK) in clinical practice. Methods: A multicenter study of Caucasian patients with refractory TAK who received TCZ. The outcome variables were remission, glucocorticoid-sparing effect, improvement in imaging findings, and adverse events. A comparative study between patients who received TCZ as monotherapy (TCZMONO) and combined with conventional disease modifying anti-rheumatic drugs (cDMARDs) (TCZCOMBO) was performed. Results: The study comprised 54 patients (46 women/8 men) with a median [interquartile range (IQR)] age of 42.0 (32.5-50.5) years. TCZ was started after a median (IQR) of 12.0 (3.0-31.5) months since TAK diagnosis. Remission was achieved in 12/54 (22.2%), 19/49 (38.8%), 23/44 (52.3%), and 27/36 (75%) patients at 1, 3, 6, and 12 months, respectively. The prednisone dose was reduced from 30.0 (12.5-50.0) mg/day to 5.0 (0.0-5.6) mg/day at 12 months. An improvement in imaging findings was reported in 28 (73.7%) patients after a median (IQR) of 9.0 (6.0-14.0) months. Twenty-three (42.6%) patients were on TCZMONO and 31 (57.4%) on TCZCOMBO: MTX (n = 28), cyclosporine A (n = 2), azathioprine (n = 1). Patients on TCZCOMBO were younger [38.0 (27.0-46.0) versus 45.0 (38.0-57.0)] years; difference (diff) [95% confidence interval (CI)] = -7.0 (-17.9, -0.56), with a trend to longer TAK duration [21.0 (6.0-38.0) versus 6.0 (1.0-23.0)] months; diff (95% CI) = 15 (-8.9, 35.5), and higher C-reactive protein [2.4 (0.7-5.6) versus 1.3 (0.3-3.3)] mg/dl; diff (95% CI) = 1.1 (-0.26, 2.99). Despite these differences, similar outcomes were observed in both groups (log rank p = 0.862). Relevant adverse events were reported in six (11.1%) patients, but only three developed severe events that required TCZ withdrawal. Conclusion: TCZ in monotherapy, or combined with cDMARDs, is effective and safe in patients with refractory TAK of Caucasian origin. Funding: This work was partially supported by RETICS Programs, RD08/0075 (RIER), RD12/0009/0013 and RD16/0012 from “Instituto de Salud Carlos III” (ISCIII) (Spain).