
    Tint, the Swiss-Army Tool for Natural Language Processing in Italian

    In this paper we present the latest version of Tint, an open-source, fast, and extendable Natural Language Processing suite for Italian based on Stanford CoreNLP. The new release includes a set of text processing components for fine-grained linguistic analysis, from tokenization to relation extraction, including part-of-speech tagging, morphological analysis, lemmatization, multi-word expression recognition, dependency parsing, named-entity recognition, keyword extraction, and much more. Tint is written in Java and freely distributed under the GPL license. Although some modules do not perform at a state-of-the-art level, Tint reaches very good accuracy in all of them and can easily be used out of the box.
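    To make the pipeline design concrete, here is a minimal, self-contained Python sketch (not Tint's actual Java API) of the CoreNLP-style architecture the abstract describes: each annotator enriches a shared document object, so stages from tokenization onward can be chained freely. The annotators below are toy stand-ins invented for illustration.

```python
# Toy annotators: each takes a document dict and adds one layer of analysis,
# mimicking how a CoreNLP-style pipeline (such as Tint) chains its modules.

def tokenize(doc):
    doc["tokens"] = doc["text"].split()  # real tokenizers are more careful
    return doc

def pos_tag(doc):
    # Invented toy tagger: capitalized tokens become PROPN, the rest X.
    doc["pos"] = ["PROPN" if t[0].isupper() else "X" for t in doc["tokens"]]
    return doc

def run_pipeline(text, annotators):
    """Run each annotator in order over a shared document object."""
    doc = {"text": text}
    for annotate in annotators:
        doc = annotate(doc)
    return doc

doc = run_pipeline("Tint elabora testi in italiano", [tokenize, pos_tag])
print(doc["pos"])  # ['PROPN', 'X', 'X', 'X', 'X']
```

    The design choice illustrated here is why such suites are easy to extend: a new module (e.g. keyword extraction) is just another function appended to the annotator list.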

    REDIT: A Tool and Dataset for Extraction of Personal Data in Documents of the Public Administration Domain

    New regulations on transparency and recent privacy policies require the public administration (PA) to make its documents available, but also to limit the diffusion of personal data. This work presents a first approach to the extraction of sensitive data from PA documents in terms of named entities and the semantic relations among them, speeding up the identification of personal data so that the items that need to be hidden can be selected easily. We also describe the process of collecting and annotating the dataset.

    KIND: an Italian Multi-Domain Dataset for Named Entity Recognition

    In this paper we present KIND, an Italian dataset for named-entity recognition. It contains more than one million annotated tokens covering three classes: person, location, and organization. Most of the dataset (around 600K tokens) consists of manual gold annotations in three different domains (news, literature, and political discourses); the remainder is annotated semi-automatically. The multi-domain coverage is the main strength of this work: the resource spans different styles and language uses, and it is the largest Italian NER dataset with manual gold annotations, making it an important resource for training NER systems for Italian. Texts and annotations are freely downloadable from the GitHub repository.
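    As a hedged illustration of how such a dataset is typically consumed, the sketch below assumes a simple two-column, tab-separated token/label layout with BIO tags and blank lines between sentences, a common convention for NER corpora; the actual KIND schema should be checked in its GitHub repository. It counts entity spans per class, a first sanity check before training.

```python
from collections import Counter

def count_entities(lines):
    """Count entity spans per class in BIO-labelled token/label lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank line = sentence boundary
        token, label = line.split("\t")
        if label.startswith("B-"):  # each B- tag opens exactly one span
            counts[label[2:]] += 1
    return counts

# Invented sample in the assumed format.
sample = [
    "Dante\tB-PER",
    "visse\tO",
    "a\tO",
    "Ravenna\tB-LOC",
    "",
]
print(count_entities(sample))  # Counter({'PER': 1, 'LOC': 1})
```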

    It Is MarkIT That Is New: An Italian Treebank of Marked Constructions

    In this paper we present MarkIT, a treebank of marked constructions in Italian containing around 800 sentences with dependency annotation. We detail the process of extracting the sentences and manually correcting them. The resource covers seven types of marked constructions, plus some ambiguous sentences whose syntax can be wrongly classified as marked. We also present a preliminary evaluation of parsing performance, comparing a model trained on existing Italian treebanks with a model obtained by adding MarkIT to the training set.

    NERMuD at EVALITA 2023: Overview of the Named-Entities Recognition on Multi-Domain Documents Task

    In this paper, we describe NERMuD, a Named-Entities Recognition (NER) shared task presented at the EVALITA 2023 evaluation campaign. NERMuD is organized into two sub-tasks: a domain-agnostic classification task and a domain-specific one. We report the evaluation of the system submitted by the only task participant, ExtremITA, which proposes a unified approach to all the EVALITA 2023 tasks and, in our case, addresses only the domain-agnostic sub-task. We also present an updated version of KIND, the dataset distributed for training. We then describe the proposed baselines and the evaluation results, and conclude with a brief discussion.

    Erase and Rewind: Manual Correction of NLP Output through a Web Interface

    In this paper, we present Tintful, an NLP annotation tool that can be used both to annotate texts manually and to fix mistakes in the output of NLP pipelines such as Stanford CoreNLP. Using a paradigm similar to wiki-like systems, a user who notices a wrong annotation can easily fix it and submit the corrected entry back to the tool developers. Moreover, Tintful can be used to annotate data from scratch. The input documents do not need to be in any particular format: starting from plain text, the sentences are first annotated with CoreNLP; the user can then edit the annotations and submit everything back through a user-friendly interface.

    BERToldo, the Historical BERT for Italian

    Recent work in historical language processing has shown that transformer-based models can be successfully trained on historical corpora, and that using them to analyse and classify data from the past can be beneficial compared to standard transformer models. This has led to the creation of BERT-like models for different languages trained on digital repositories from the past. In this work we introduce the Italian version of historical BERT, which we call BERToldo. We evaluate the model on the task of PoS-tagging Dante Alighieri's works, considering not only tagger performance but also model size and the time needed for training. We also address the problem of duplicated data, which is rather common for languages with limited availability of historical corpora, and show that deduplication reduces training time without affecting performance. The model and its smaller versions are all made available to the research community.
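    The deduplication idea can be sketched in a few lines: drop exact-duplicate documents before pre-training so the model does not repeatedly see the same text. This is a minimal stdlib-only illustration; the paper's actual procedure may differ (e.g. it may also handle near-duplicates).

```python
import hashlib

def deduplicate(docs):
    """Keep the first copy of each document, comparing whitespace-normalized text."""
    seen, unique = set(), []
    for doc in docs:
        # Hash normalized text so trivial whitespace variants also collapse.
        key = hashlib.sha1(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Nel mezzo del cammin", "Nel  mezzo del cammin", "di nostra vita"]
print(len(deduplicate(corpus)))  # 2
```

    Shrinking the corpus this way directly shortens each training epoch, which is consistent with the abstract's observation that deduplication reduces training time.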

    The impact of phrases on Italian lexical simplification

    Automated lexical simplification has so far focused only on replacing single tokens with single tokens, and this choice has affected both the development of systems and the creation of benchmarks. In this paper we argue that lexical simplification in real settings should handle both single- and multi-token terms, and we present a benchmark created for this task. We also describe how a freely available system can be tuned to cover the simplification of phrases as well, and we report an evaluation comparing different experimental settings.
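    To show why multi-token handling matters, here is a small sketch of a longest-match-first substitution: whole phrases are replaced when possible, which purely single-token substitution would miss. The lexicon entries are invented for illustration ("in ottemperanza a", roughly "in compliance with", simplified to "secondo", "according to") and do not come from the paper's resources.

```python
# Invented toy lexicon: phrase tuple -> simpler replacement tokens.
LEXICON = {
    ("in", "ottemperanza", "a"): ["secondo"],
    ("ottemperanza",): ["rispetto"],
}

def simplify(tokens, lexicon, max_len=3):
    """Replace phrases longest-first; fall back to copying the token."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):  # try the longest phrase first
            phrase = tuple(tokens[i:i + n])
            if len(phrase) == n and phrase in lexicon:
                out.extend(lexicon[phrase])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(simplify(["agire", "in", "ottemperanza", "a", "norme"], LEXICON))
# → ['agire', 'secondo', 'norme']
```

    With a single-token lexicon only, the same input would become "agire in rispetto a norme", which is not a valid simplification; matching the full phrase avoids this.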

    FBK-DH at SemEval-2020 Task 12: Using Multi-channel BERT for Multilingual Offensive Language Detection

    In this paper we present our submission to sub-task A of SemEval 2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2). For Danish, Turkish, Arabic, and Greek, we develop an architecture based on transfer learning and relying on a two-channel BERT model, in which the English BERT and the multilingual one are combined after creating a machine-translated parallel corpus for each language in the task. For English, instead, we adopt a more standard, single-channel approach. We find that, in a multilingual scenario where some languages have little training data, using parallel BERT models with machine-translated data can make systems more stable, especially when dealing with noisy data. The fact that machine translation of social media data may not be perfect does not hurt the overall classification performance.