82 research outputs found

Tint, the Swiss-Army Tool for Natural Language Processing in Italian

Author: Alessio Palmero Aprosio
Publication date: 01/01/2021

In this paper we present the latest version of Tint, an open-source, fast and extendable Natural Language Processing suite for Italian based on Stanford CoreNLP. The new release includes a set of text processing components for fine-grained linguistic analysis, from tokenization to relation extraction, including part-of-speech tagging, morphological analysis, lemmatization, multi-word expression recognition, dependency parsing, named-entity recognition, keyword extraction, and much more. Tint is written in Java and freely distributed under the GPL license. Although some modules do not perform at a state-of-the-art level, Tint reaches very good accuracy in all modules, and can be easily used out-of-the-box.

Archivio della ricerca - Fondazione Bruno Kessler

REDIT: A Tool and Dataset for Extraction of Personal Data in Documents of the Public Administration Domain

Authors: Alessio Palmero; Teresa Paccosi
Publication date: 01/01/2022

New regulations on transparency and recent privacy policies require the public administration (PA) to make its documents available, but also to limit the diffusion of personal data. The present work presents a first approach to the extraction of sensitive data from PA documents in terms of named entities and the semantic relations among them, speeding up the extraction of these personal data so that the items that need to be hidden can be easily selected. We also present the process of collection and annotation of the dataset.

Archivio della ricerca - Fondazione Bruno Kessler

KIND: an Italian Multi-Domain Dataset for Named Entity Recognition

Authors: Alessio Palmero; Teresa Paccosi
Publication date: 01/01/2022

In this paper we present KIND, an Italian dataset for named-entity recognition. It contains more than one million tokens with annotation covering three classes: person, location, and organization. Most of the dataset (around 600K tokens) contains manual gold annotations in three different domains (news, literature, and political discourses); the remaining part is semi-automatically annotated. The multi-domain feature is the main strength of the present work, offering a resource which covers different styles and language uses, as well as the largest Italian NER dataset with manual gold annotations. It represents an important resource for the training of NER systems in Italian. Texts and annotations are freely downloadable from the GitHub repository.

arXiv.org e-Print Archive

Archivio della ricerca - Fondazione Bruno Kessler

It Is MarkIT That Is New: An Italian Treebank of Marked Constructions

Authors: Paccosi Teresa; Palmero Aprosio Alessio
Publication date: 01/01/2022

In this paper we present MarkIT, a treebank of marked constructions in Italian, containing around 800 sentences with dependency annotation. We detail the process used to extract the sentences and manually correct them. The resource covers seven types of marked constructions plus some ambiguous sentences, whose syntax can be wrongly classified as marked. We also present a preliminary evaluation of parsing performance, comparing a model trained on existing Italian treebanks with the model obtained by adding MarkIT to the training set.

Archivio della ricerca - Fondazione Bruno Kessler

NERMuD at EVALITA 2023: Overview of the Named-Entities Recognition on Multi-Domain Documents Task

Authors: Alessio Palmero Aprosio; Teresa Paccosi
Publication date: 01/01/2023

In this paper, we describe NERMuD, a Named-Entities Recognition (NER) shared task presented at the EVALITA 2023 evaluation campaign. NERMuD is organized into two different sub-tasks: a domain-agnostic classification and a domain-specific one. We report the evaluation of the system presented by the only task participant, ExtremITA. ExtremITA proposes a unified approach for all the tasks of EVALITA 2023, and in our case it addresses only the domain-agnostic sub-task. We present an updated version of KIND, the dataset distributed for the training of the system. We then provide the proposed baselines, the results of the evaluation, and a brief discussion.

Archivio della ricerca - Fondazione Bruno Kessler

Erase and Rewind: Manual Correction of NLP Output through a Web Interface

Authors: Alessio Palmero; Lorenzo Bocchi; Valentino Frasnelli
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2021

In this paper, we present Tintful, an NLP annotation software that can be used both to manually annotate texts and to fix mistakes in the output of NLP pipelines, such as Stanford CoreNLP. Using a paradigm similar to wiki-like systems, a user who notices a wrong annotation can easily fix it and submit the corrected entry back to the tool developers. Moreover, Tintful can be used to easily annotate data from scratch. The input documents do not need to be in a particular format: starting from the plain text, the sentences are first annotated with CoreNLP, then the user can edit the annotations and submit everything back through a user-friendly interface.

Archivio della ricerca - Fondazione Bruno Kessler

BERToldo, the Historical BERT for Italian

Authors: Alessio Palmero Aprosio; Sara Tonelli; Stefano Menini
Publication date: 01/01/2022

Recent work in historical language processing has shown that transformer-based models can be successfully created using historical corpora, and that using them for analysing and classifying data from the past can be beneficial compared to standard transformer models. This has led to the creation of BERT-like models for different languages trained with digital repositories from the past. In this work we introduce the Italian version of historical BERT, which we call BERToldo. We evaluate the model on the task of PoS-tagging Dante Alighieri’s works, considering not only the tagger performance but also the model size and the time needed to train it. We also address the problem of duplicated data, which is rather common for languages with a limited availability of historical corpora. We show that deduplication reduces training time without affecting performance. The model and its smaller versions are all made available to the research community.

Archivio della ricerca - Fondazione Bruno Kessler

The impact of phrases on Italian lexical simplification

Authors: Mazzon Marco; Palmero Aprosio Alessio; Tonelli Sara
Publication date: 01/01/2017

Automated lexical simplification has so far focused only on the replacement of single tokens with single tokens, and this choice has affected both the development of systems and the creation of benchmarks. In this paper, we argue that lexical simplification in real settings should deal with both single and multi-token terms, and present a benchmark created for the task. In addition, we describe how a freely available system can be tuned to also cover the simplification of phrases, and we perform an evaluation comparing different experimental settings.

Crossref

ZENODO

Archivio della ricerca - Fondazione Bruno Kessler

OpenEdition

FBK-DH at SemEval-2020 Task 12: Using Multi-channel BERT for Multilingual Offensive Language Detection

Authors: Alessio Palmero Aprosio; Camilla Casula; Sara Tonelli; Stefano Menini
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2020

In this paper we present our submission to sub-task A at SemEval 2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval2). For Danish, Turkish, Arabic and Greek, we develop an architecture based on transfer learning and relying on a two-channel BERT model, in which the English BERT and the multilingual one are combined after creating a machine-translated parallel corpus for each language in the task. For English, instead, we adopt a more standard, single-channel approach. We find that, in a multilingual scenario where some languages have little training data, using parallel BERT models with machine-translated data can give systems more stability, especially when dealing with noisy data. The fact that machine translation on social media data may not be perfect does not hurt the overall classification performance.

Archivio della ricerca - Fondazione Bruno Kessler

Open Access Repository