49 research outputs found

    BERToldo, the Historical BERT for Italian

    Recent work in historical language processing has shown that transformer-based models can be successfully created using historical corpora, and that using them to analyse and classify data from the past can be beneficial compared to standard transformer models. This has led to the creation of BERT-like models for different languages, trained on digital repositories from the past. In this work we introduce the Italian version of historical BERT, which we call BERToldo. We evaluate the model on the task of PoS tagging Dante Alighieri’s works, considering not only the tagger performance but also the model size and the time needed to train it. We also address the problem of duplicated data, which is rather common for languages with limited availability of historical corpora. We show that deduplication reduces training time without affecting performance. The model and its smaller versions are all made available to the research community.

    FBK-DH at SemEval-2020 Task 12: Using Multi-channel BERT for Multilingual Offensive Language Detection

    In this paper we present our submission to sub-task A at SemEval 2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval2). For Danish, Turkish, Arabic and Greek, we develop an architecture based on transfer learning and relying on a two-channel BERT model, in which the English BERT and the multilingual one are combined after creating a machine-translated parallel corpus for each language in the task. For English, instead, we adopt a more standard, single-channel approach. We find that, in a multilingual scenario where some languages have little training data, using parallel BERT models with machine-translated data can give systems more stability, especially when dealing with noisy data. The fact that machine translation of social media data may not be perfect does not hurt the overall classification performance.

    Towards Personalised Simplification based on L2 Learners' Native Language

    We present an approach to improve the selection of complex words for automatic text simplification, addressing the need to take L2 learners' native language into account during simplification. In particular, we develop a methodology that automatically identifies ‘difficult’ terms (i.e. false friends) for L2 learners in order to simplify them. We evaluate not only the quality of the detected false friends but also the impact of this methodology on text simplification compared with a standard frequency-based approach.

    Sampling variables and their thresholds for the precise estimation of wild felid population density with camera traps and spatial capture–recapture methods

    1. Robust monitoring, providing information on population status, is fundamental for successful conservation planning. However, this can be hard to achieve for species that are elusive and occur at low densities, such as felids. These are often keystones of functioning ecosystems and are threatened by habitat loss and human persecution. 2. When elusive species can be individually identified by visible characteristics, for example via camera-trapping, observations of individuals can be used in combination with capture–recapture methods to calculate demographic parameters such as population density. In this context, spatial capture–recapture (SCR) outperforms conventional non-spatial methods, but the precision of results is inherently related to the sampling design, which should therefore be optimised. 3. We focussed on territorial felids in different habitats and investigated how the sampling designs implemented in the field affected the precision of population density estimates. We examined 137 studies that combined camera trapping and SCR methods for density estimation. From these, we collected spatiotemporal parameters of their sampling designs, monitoring results, such as the number of individuals captured and the number of recaptures, as well as SCR detection parameters. We applied generalised linear mixed-effects models and tree-based regression methods to investigate the influence of variables on the precision of population density estimates and provide numerical thresholds. 4. Our analysis shows that the number of individuals, recapture frequency, and capture probability play the most crucial roles. Surveys yielding over 20 captured individuals that were recaptured on average at least once obtain the most precise population density estimates. 5. Based on our findings, we provide practical guidelines for future SCR studies that apply to all territorial felids. Furthermore, we present a standardised reporting protocol for study transparency and comparability. Our results will improve reporting and reproducibility of SCR studies and aid in setting up optimised sampling designs.

    Using Semantic Linking to Understand Persons' Networks Extracted from Text

    In this work, we describe a methodology to interpret large persons' networks extracted from text by classifying cliques using the DBpedia ontology. The approach relies on a combination of NLP, Semantic Web technologies, and network analysis. The classification methodology, which starts from single nodes and then generalizes to cliques, is effective in terms of performance and can also handle nodes that are not linked to Wikipedia. The gold standard manually developed for evaluation shows that groups of co-occurring entities share, in most cases, a category that can be automatically assigned. This holds for both languages considered in this study. The outcome of this work may be of interest to enhance the readability of large networks and to provide an additional semantic layer on top of cliques. This would greatly help humanities scholars when dealing with large amounts of textual data that need to be interpreted or categorized. Furthermore, it represents an unsupervised approach to automatically extend DBpedia starting from a corpus.

    Innovative In Situ and Ex Situ Conservation Strategies of the Madonie Fir Abies nebrodensis

    Abies nebrodensis (Lojac.) Mattei is an endemic species of the north-west of Sicily located in an 84 ha area in the Madonie Regional Park. The current population is limited to 30 relic adult trees and a fluctuating number of juveniles of natural regeneration. The species is defined as “Critically Endangered” in the Italian list of threatened plants and is classified as CR-D in the 2000 IUCN Red List of Threatened Species. This article reports the key actions undertaken by the LIFE4FIR project aimed at preserving A. nebrodensis, and the results obtained so far in three years of activity. OpenArrays SNPs genotyping revealed a high rate of inbreeding in the natural population and that the adult trees are genetically related. Controlled cross-pollination was consequently performed to increase the genetic variability of the progeny. Outbred offspring are currently being grown in the nursery. Reforestation has been planned by using 4000 selected outbred seedlings in 10 areas within Madonie Park to create re-diffusion cores. Support and protection of the relic population have been implemented through regular phytosanitary surveys, as well as new fencing and video surveillance systems against grazing and wild herbivores. A seedbank and cryobank for the long-term germplasm conservation have been established. Funding: European Union LIFE18/NAT/IT/00016

    Gli strumenti informatici. Sviluppo e risultati.

    To recognise the linguistic features of interest in a corpus of almost three thousand essays, and to annotate them consistently, several software tools had to be developed. These tools fall into two types. On the one hand, a number of text-analysis modules were developed that automatically recognise features or extract partial information useful for recognising features manually. On the other hand, an online platform was adapted to the project to support multi-level linguistic annotation, with several annotators working in parallel on different portions of the essay corpus.

    Findings from the Hackathon on Understanding Euroscepticism Through the Lens of Textual Data

    We present an overview and the results of a shared-task hackathon that took place as part of a research seminar bringing together a variety of experts and young researchers from the fields of political science, natural language processing and computational social science. The task looked at ways to develop novel methods for political text scaling to better quantify political party positions on European integration and Euroscepticism from the transcripts of speeches from three legislative terms of the European Parliament.

    The wide-field, multiplexed, spectroscopic facility WEAVE: survey design, overview, and simulated implementation

    WEAVE, the new wide-field, massively multiplexed spectroscopic survey facility for the William Herschel Telescope, will see first light in late 2022. WEAVE comprises a new 2-degree field-of-view prime-focus corrector system, a nearly 1000-multiplex fibre positioner, 20 individually deployable 'mini' integral field units (IFUs), and a single large IFU. These fibre systems feed a dual-beam spectrograph covering the wavelength range 366-959 nm at R ∼ 5000, or two shorter ranges at R ∼ 20,000. After summarising the design and implementation of WEAVE and its data systems, we present the organisation, science drivers and design of a five- to seven-year programme of eight individual surveys to: (i) study our Galaxy's origins by completing Gaia's phase-space information, providing metallicities to its limiting magnitude for ∼ 3 million stars and detailed abundances for ∼ 1.5 million brighter field and open-cluster stars; (ii) survey ∼ 0.4 million Galactic-plane OBA stars, young stellar objects and nearby gas to understand the evolution of young stars and their environments; (iii) perform an extensive spectral survey of white dwarfs; (iv) survey ∼ 400 neutral-hydrogen-selected galaxies with the IFUs; (v) study properties and kinematics of stellar populations and ionised gas in z [...] 1 million spectra of LOFAR-selected radio sources; (viii) trace structures using intergalactic/circumgalactic gas at z > 2. Finally, we describe the WEAVE Operational Rehearsals using the WEAVE Simulator. Funding for the WEAVE facility has been provided by UKRI STFC, the University of Oxford, NOVA, NWO, Instituto de Astrofísica de Canarias (IAC), the Isaac Newton Group partners (STFC, NWO, and Spain, led by the IAC), INAF, CNRS-INSU, the Observatoire de Paris, Région Île-de-France, CONACYT through INAOE, Konkoly Observatory (CSFK), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Lund University, the Leibniz Institute for Astrophysics Potsdam (AIP), the Swedish Research Council, the European Commission, and the University of Pennsylvania.