52 research outputs found

    Tint, the Swiss-Army Tool for Natural Language Processing in Italian

    In this paper we present the latest version of Tint, an open-source, fast and extendable Natural Language Processing suite for Italian based on Stanford CoreNLP. The new release includes a set of text processing components for fine-grained linguistic analysis, from tokenization to relation extraction, including part-of-speech tagging, morphological analysis, lemmatization, multi-word expression recognition, dependency parsing, named-entity recognition, keyword extraction, and much more. Tint is written in Java and freely distributed under the GPL license. Although some modules do not perform at a state-of-the-art level, Tint reaches very good accuracy in all of them and can easily be used out of the box.
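    The CoreNLP-style architecture behind Tint can be pictured as a chain of annotators that each enrich a shared annotation object. The sketch below is purely illustrative: the stage names, the toy tagger, and the `Pipeline` class are invented here and do not reflect Tint's actual Java API.

    ```python
    # Minimal sketch of a CoreNLP-style annotator pipeline: each stage
    # reads and enriches one shared annotation dict. All names here are
    # hypothetical, not Tint's real interface.

    def tokenize(ann):
        ann["tokens"] = ann["text"].split()
        return ann

    def pos_tag(ann):
        # toy placeholder tagger: capitalized tokens -> PROPN, else X
        ann["pos"] = ["PROPN" if t[0].isupper() else "X" for t in ann["tokens"]]
        return ann

    class Pipeline:
        def __init__(self, stages):
            self.stages = stages  # ordered list of annotators

        def annotate(self, text):
            ann = {"text": text}
            for stage in self.stages:
                ann = stage(ann)  # each stage adds a new annotation layer
            return ann

    pipeline = Pipeline([tokenize, pos_tag])
    ann = pipeline.annotate("Tint analizza testi italiani")
    ```

    Because every stage only depends on the shared annotation object, new components (lemmatization, NER, parsing) can be appended without changing the existing ones, which is what makes such a suite extendable.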

    It Is MarkIT That Is New: An Italian Treebank of Marked Constructions

    In this paper we present MarkIT, a treebank of marked constructions in Italian, containing around 800 sentences with dependency annotation. We detail the process used to extract the sentences and manually correct them. The resource covers seven types of marked constructions plus some ambiguous sentences, whose syntax can be wrongly classified as marked. We also present a preliminary evaluation of parsing performance, comparing a model trained on existing Italian treebanks with the model obtained by adding MarkIT to the training set.

    BERToldo, the Historical BERT for Italian

    Recent works in historical language processing have shown that transformer-based models can be successfully created using historical corpora, and that using them for analysing and classifying data from the past can be beneficial compared to standard transformer models. This has led to the creation of BERT-like models for different languages trained with digital repositories from the past. In this work we introduce the Italian version of historical BERT, which we call BERToldo. We evaluate the model on the task of PoS-tagging Dante Alighieri’s works, considering not only the tagger performance but also the model size and the time needed to train it. We also address the problem of duplicated data, which is rather common for languages with a limited availability of historical corpora. We show that deduplication reduces training time without affecting performance. The model and its smaller versions are all made available to the research community.
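    The deduplication step mentioned above can be sketched as hashing a normalized form of each training sentence and keeping only the first occurrence. The normalization (lowercasing, whitespace stripping) and the sentence-level granularity are assumptions for illustration; the paper's exact procedure may differ.

    ```python
    # Hedged sketch of corpus deduplication before pretraining:
    # hash a normalized form of each sentence, keep first occurrences.
    import hashlib

    def deduplicate(sentences):
        seen = set()
        unique = []
        for s in sentences:
            # normalization choice (lowercase + strip) is illustrative
            h = hashlib.sha1(s.strip().lower().encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                unique.append(s)
        return unique

    corpus = [
        "Nel mezzo del cammin di nostra vita",
        "nel mezzo del cammin di nostra vita",  # case-variant duplicate
        "mi ritrovai per una selva oscura",
    ]
    deduped = deduplicate(corpus)
    ```

    Shrinking the corpus this way directly cuts the number of training steps per epoch, which is why deduplication reduces training time.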

    Recognizing Biographical Sections in Wikipedia

    Wikipedia is the largest collection of encyclopedic data ever written in the history of humanity. Thanks to its coverage and its availability in machine-readable format, it has become a primary resource for large-scale research in historical and cultural studies. In this work, we focus on the subset of pages describing persons, and we investigate the task of recognizing biographical sections from them: given a person’s page, we identify the list of sections where information about her/his life is present. We model this as a sequence classification problem, and propose a supervised setting, in which the training data are acquired automatically. Besides, we show that six simple features extracted only from the section titles are very informative and yield good results well above a strong baseline.
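    To make the title-only feature idea concrete, here is a toy sketch of extracting features from a section title and using them to decide whether the section is biographical. The keyword list and the specific features are invented for illustration; they are not the six features the paper actually uses.

    ```python
    # Illustrative title-only features for biographical-section detection.
    # Keyword list and feature set are hypothetical, not the paper's.
    BIO_KEYWORDS = {"life", "biography", "career", "death", "childhood"}

    def title_features(title):
        words = title.lower().split()
        return {
            "has_bio_keyword": any(w in BIO_KEYWORDS for w in words),
            "num_words": len(words),
            "starts_with_early": words[0] == "early" if words else False,
        }

    def is_biographical(title):
        # stand-in for a trained classifier: fire on any positive feature
        f = title_features(title)
        return f["has_bio_keyword"] or f["starts_with_early"]

    labels = [is_biographical(t) for t in ["Early life", "Discography", "Career"]]
    ```

    In the paper's setting these features would feed a sequence classifier over the ordered list of a page's sections rather than a per-title rule, so that context from neighbouring sections can also inform each decision.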

    Demonstrating the power of streaming and sorting for non-distributed RDF processing: RDFpro

    We demonstrate RDFpro (RDF Processor), an extensible, general-purpose, open source tool for processing large RDF datasets on a commodity machine leveraging streaming and sorting techniques. RDFpro provides out-of-the-box implementations (called processors) of common tasks such as data filtering, rule-based inference, smushing, and statistics extraction, as well as easy ways to add new processors and arbitrarily compose processors in complex pipelines.
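    The core idea that lets RDFpro handle large datasets on a commodity machine is that many RDF tasks (deduplication, set operations, statistics) become single-pass scans once the triples are sorted, so memory use stays flat. The sketch below shows that pattern on in-memory lines; the triple strings and the in-memory `sorted` call stand in for the external, disk-based sort a real tool would use.

    ```python
    # Streaming-and-sorting pattern: after sorting, duplicate triples are
    # adjacent, so they can be collapsed in one pass with O(1) memory.
    # An in-memory sort stands in here for an external/disk-based one.

    def stream_unique(sorted_lines):
        """Yield each line once from an already-sorted stream."""
        prev = None
        for line in sorted_lines:
            if line != prev:
                yield line
            prev = line

    triples = [
        '<ex:a> <ex:p> "1" .',
        '<ex:b> <ex:p> "2" .',
        '<ex:a> <ex:p> "1" .',  # duplicate
    ]
    unique = list(stream_unique(sorted(triples)))
    ```

    Because `stream_unique` only keeps the previous line, the same pass works unchanged whether the sorted input holds a thousand triples or billions streamed from disk.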

    Extracting Knowledge from Text with PIKES

    In this demonstration we showcase PIKES, a Semantic Role Labeling (SRL)-powered approach to Knowledge Extraction. PIKES implements a rule-based strategy that reinterprets SRL output in light of other linguistic analyses, such as dependency parsing and co-reference resolution, thus properly capturing and formalizing in RDF important linguistic aspects such as argument nominalization, frame-frame relations, and group entities.

    Reconciling Information in DBpedia through a Question Answering System

    Results obtained by querying the SPARQL endpoints of language-specific DBpedia chapters for the same query can be related by several heterogeneous relations, or can contain an inconsistent set of information about the same topic. To overcome this issue in question answering systems over language-specific DBpedia chapters, we propose the RADAR framework for information reconciliation. Starting from a categorization of the possible relations among the resulting instances, the framework: (i) classifies such relations, (ii) reconciles the obtained information using argumentation theory, (iii) ranks the alternative results depending on the confidence of the source in case of inconsistencies, and (iv) explains the reasons underlying the proposed ranking.
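    Step (iii) above, ranking alternatives by source confidence, can be sketched as aggregating a weight per candidate answer. The per-chapter confidence scores and the simple additive aggregation below are invented for illustration; RADAR's actual argumentation-based reconciliation is considerably richer.

    ```python
    # Hypothetical sketch of confidence-based ranking of inconsistent
    # answers from different DBpedia chapters. Weights are invented.
    SOURCE_CONFIDENCE = {"en": 0.9, "it": 0.7, "fr": 0.6}

    def rank_alternatives(answers):
        """answers: list of (source, value) pairs; return the candidate
        values ranked by the summed confidence of supporting sources."""
        support = {}
        for source, value in answers:
            support[value] = support.get(value, 0.0) + SOURCE_CONFIDENCE.get(source, 0.5)
        return sorted(support, key=support.get, reverse=True)

    # two chapters agree on "1946", one dissents with "1947"
    ranked = rank_alternatives([("en", "1946"), ("it", "1946"), ("fr", "1947")])
    ```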

    Towards Personalised Simplification based on L2 Learners' Native Language

    We present an approach to improve the selection of complex words for automatic text simplification, addressing the need to take L2 learners' native language into account during simplification. In particular, we develop a methodology that automatically identifies ‘difficult’ terms (i.e. false friends) for L2 learners in order to simplify them. We evaluate not only the quality of the detected false friends but also the impact of this methodology on text simplification compared with a standard frequency-based approach.
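    A false friend is a word pair that looks alike across the learner's native language (L1) and the target language but differs in meaning. The toy sketch below flags such candidates by combining string similarity with a meaning mismatch; the word lists, glosses, and similarity threshold are all invented for illustration and are not the paper's method.

    ```python
    # Toy false-friend candidate detection: similar surface form,
    # different meaning. Lexicons and threshold are hypothetical.
    import difflib

    ITALIAN = {"camera": "room", "caldo": "hot", "pane": "bread"}
    ENGLISH = {"camera": "photo device", "cold": "cold", "pane": "sheet of glass"}

    def false_friends(l2_lexicon, l1_lexicon, threshold=0.6):
        pairs = []
        for w2, m2 in l2_lexicon.items():
            for w1, m1 in l1_lexicon.items():
                similar = difflib.SequenceMatcher(None, w2, w1).ratio() >= threshold
                if similar and m1 != m2:
                    pairs.append((w2, w1))
        return pairs

    candidates = false_friends(ITALIAN, ENGLISH)
    ```

    In a personalised simplification pipeline, such candidates (computed against the learner's specific L1) would be prioritised for substitution, unlike a frequency-based approach that treats all learners alike.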

    Analysing the Evolution of Students’ Writing Skills and the Impact of Neo-standard Italian with the help of Computational Linguistics

    We present a project aimed at studying the evolution of students’ writing skills over a span of 15 years (from 2001 to 2016), analysing in particular the impact of neo-standard Italian. More than 2,500 essays have been transcribed and annotated by teachers according to 28 different linguistic traits. We present here the annotation process together with the first data analysis supported by NLP tools.