52 research outputs found

    Tint, the Swiss-Army Tool for Natural Language Processing in Italian

    In this paper we present the latest version of Tint, an open-source, fast and extendable Natural Language Processing suite for Italian based on Stanford CoreNLP. The new release includes a set of text processing components for fine-grained linguistic analysis, from tokenization to relation extraction, including part-of-speech tagging, morphological analysis, lemmatization, multi-word expression recognition, dependency parsing, named-entity recognition, keyword extraction, and much more. Tint is written in Java and freely distributed under the GPL license. Although some modules do not perform at a state-of-the-art level, Tint reaches very good accuracy in all of them and can easily be used out of the box.
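    The CoreNLP-style architecture behind Tint can be pictured as a chain of annotators that each enrich a shared annotation object. The sketch below is purely illustrative: the stage names, the toy tagger, and the `Pipeline` class are invented here and do not reflect Tint's actual Java API.

    ```python
    # Minimal sketch of a CoreNLP-style annotator pipeline: each stage
    # reads and enriches one shared annotation dict. All names here are
    # hypothetical, not Tint's real interface.

    def tokenize(ann):
        ann["tokens"] = ann["text"].split()
        return ann

    def pos_tag(ann):
        # toy placeholder tagger: capitalized tokens -> PROPN, else X
        ann["pos"] = ["PROPN" if t[0].isupper() else "X" for t in ann["tokens"]]
        return ann

    class Pipeline:
        def __init__(self, stages):
            self.stages = stages  # ordered list of annotators

        def annotate(self, text):
            ann = {"text": text}
            for stage in self.stages:
                ann = stage(ann)  # each stage adds a new annotation layer
            return ann

    pipeline = Pipeline([tokenize, pos_tag])
    ann = pipeline.annotate("Tint analizza testi italiani")
    ```

    Because every stage only depends on the shared annotation object, new components (lemmatization, NER, parsing) can be appended without changing the existing ones, which is what makes such a suite extendable.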

    It Is MarkIT That Is New: An Italian Treebank of Marked Constructions

    In this paper we present MarkIT, a treebank of marked constructions in Italian, containing around 800 sentences with dependency annotation. We detail the process used to extract the sentences and manually correct them. The resource covers seven types of marked constructions plus some ambiguous sentences, whose syntax can be wrongly classified as marked. We also present a preliminary evaluation of parsing performance, comparing a model trained on existing Italian treebanks with the model obtained by adding MarkIT to the training set.

    BERToldo, the Historical BERT for Italian

    Recent works in historical language processing have shown that transformer-based models can be successfully created using historical corpora, and that using them for analysing and classifying data from the past can be beneficial compared to standard transformer models. This has led to the creation of BERT-like models for different languages trained with digital repositories from the past. In this work we introduce the Italian version of historical BERT, which we call BERToldo. We evaluate the model on the task of PoS-tagging Dante Alighieri’s works, considering not only the tagger performance but also the model size and the time needed to train it. We also address the problem of duplicated data, which is rather common for languages with a limited availability of historical corpora. We show that deduplication reduces training time without affecting performance. The model and its smaller versions are all made available to the research community.
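    The deduplication step mentioned above can be sketched as hashing a normalized form of each training sentence and keeping only the first occurrence. The normalization (lowercasing, whitespace stripping) and the sentence-level granularity are assumptions for illustration; the paper's exact procedure may differ.

    ```python
    # Hedged sketch of corpus deduplication before pretraining:
    # hash a normalized form of each sentence, keep first occurrences.
    import hashlib

    def deduplicate(sentences):
        seen = set()
        unique = []
        for s in sentences:
            # normalization choice (lowercase + strip) is illustrative
            h = hashlib.sha1(s.strip().lower().encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                unique.append(s)
        return unique

    corpus = [
        "Nel mezzo del cammin di nostra vita",
        "nel mezzo del cammin di nostra vita",  # case-variant duplicate
        "mi ritrovai per una selva oscura",
    ]
    deduped = deduplicate(corpus)
    ```

    Shrinking the corpus this way directly cuts the number of training steps per epoch, which is why deduplication reduces training time.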

    Recognizing Biographical Sections in Wikipedia

    Wikipedia is the largest collection of encyclopedic data ever written in the history of humanity. Thanks to its coverage and its availability in machine-readable format, it has become a primary resource for large-scale research in historical and cultural studies. In this work, we focus on the subset of pages describing persons, and we investigate the task of recognizing biographical sections from them: given a person’s page, we identify the list of sections where information about her/his life is present. We model this as a sequence classification problem, and propose a supervised setting, in which the training data are acquired automatically. Besides, we show that six simple features extracted only from the section titles are very informative and yield good results well above a strong baseline.
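    To make the title-only feature idea concrete, here is a toy sketch of extracting features from a section title and using them to decide whether the section is biographical. The keyword list and the specific features are invented for illustration; they are not the six features the paper actually uses.

    ```python
    # Illustrative title-only features for biographical-section detection.
    # Keyword list and feature set are hypothetical, not the paper's.
    BIO_KEYWORDS = {"life", "biography", "career", "death", "childhood"}

    def title_features(title):
        words = title.lower().split()
        return {
            "has_bio_keyword": any(w in BIO_KEYWORDS for w in words),
            "num_words": len(words),
            "starts_with_early": words[0] == "early" if words else False,
        }

    def is_biographical(title):
        # stand-in for a trained classifier: fire on any positive feature
        f = title_features(title)
        return f["has_bio_keyword"] or f["starts_with_early"]

    labels = [is_biographical(t) for t in ["Early life", "Discography", "Career"]]
    ```

    In the paper's setting these features would feed a sequence classifier over the ordered list of a page's sections rather than a per-title rule, so that context from neighbouring sections can also inform each decision.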

    Demonstrating the power of streaming and sorting for non-distributed RDF processing: RDFpro

    We demonstrate RDFpro (RDF Processor), an extensible, general-purpose, open source tool for processing large RDF datasets on a commodity machine leveraging streaming and sorting techniques. RDFpro provides out-of-the-box implementations (called processors) of common tasks such as data filtering, rule-based inference, smushing, and statistics extraction, as well as easy ways to add new processors and arbitrarily compose processors in complex pipelines.
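    The core idea that lets RDFpro handle large datasets on a commodity machine is that many RDF tasks (deduplication, set operations, statistics) become single-pass scans once the triples are sorted, so memory use stays flat. The sketch below shows that pattern on in-memory lines; the triple strings and the in-memory `sorted` call stand in for the external, disk-based sort a real tool would use.

    ```python
    # Streaming-and-sorting pattern: after sorting, duplicate triples are
    # adjacent, so they can be collapsed in one pass with O(1) memory.
    # An in-memory sort stands in here for an external/disk-based one.

    def stream_unique(sorted_lines):
        """Yield each line once from an already-sorted stream."""
        prev = None
        for line in sorted_lines:
            if line != prev:
                yield line
            prev = line

    triples = [
        '<ex:a> <ex:p> "1" .',
        '<ex:b> <ex:p> "2" .',
        '<ex:a> <ex:p> "1" .',  # duplicate
    ]
    unique = list(stream_unique(sorted(triples)))
    ```

    Because `stream_unique` only keeps the previous line, the same pass works unchanged whether the sorted input holds a thousand triples or billions streamed from disk.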

    Extracting Knowledge from Text with PIKES

    In this demonstration we showcase PIKES, a Semantic Role Labeling (SRL)-powered approach to Knowledge Extraction. PIKES implements a rule-based strategy that reinterprets SRL output in light of other linguistic analyses, such as dependency parsing and co-reference resolution, thus properly capturing and formalizing in RDF important linguistic aspects such as argument nominalization, frame-frame relations, and group entities.

    Reconciling Information in DBpedia through a Question Answering System

    Results obtained by querying the SPARQL endpoints of language-specific DBpedia chapters for the same query can be related by several heterogeneous relations, or can contain an inconsistent set of information about the same topic. To overcome this issue in question answering systems over language-specific DBpedia chapters, we propose the RADAR framework for information reconciliation. Starting from a categorization of the possible relations among the resulting instances, the framework: (i) classifies such relations, (ii) reconciles the obtained information using argumentation theory, (iii) ranks the alternative results depending on the confidence of the source in case of inconsistencies, and (iv) explains the reasons underlying the proposed ranking.
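    Step (iii) above, ranking alternatives by source confidence, can be sketched as aggregating a weight per candidate answer. The per-chapter confidence scores and the simple additive aggregation below are invented for illustration; RADAR's actual argumentation-based reconciliation is considerably richer.

    ```python
    # Hypothetical sketch of confidence-based ranking of inconsistent
    # answers from different DBpedia chapters. Weights are invented.
    SOURCE_CONFIDENCE = {"en": 0.9, "it": 0.7, "fr": 0.6}

    def rank_alternatives(answers):
        """answers: list of (source, value) pairs; return the candidate
        values ranked by the summed confidence of supporting sources."""
        support = {}
        for source, value in answers:
            support[value] = support.get(value, 0.0) + SOURCE_CONFIDENCE.get(source, 0.5)
        return sorted(support, key=support.get, reverse=True)

    # two chapters agree on "1946", one dissents with "1947"
    ranked = rank_alternatives([("en", "1946"), ("it", "1946"), ("fr", "1947")])
    ```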

    Towards Personalised Simplification based on L2 Learners' Native Language

    We present an approach to improve the selection of complex words for automatic text simplification, addressing the need to take L2 learners' native language into account during simplification. In particular, we develop a methodology that automatically identifies ‘difficult’ terms (i.e. false friends) for L2 learners in order to simplify them. We evaluate not only the quality of the detected false friends but also the impact of this methodology on text simplification compared with a standard frequency-based approach.
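    A false friend is a word pair that looks alike across the learner's native language (L1) and the target language but differs in meaning. The toy sketch below flags such candidates by combining string similarity with a meaning mismatch; the word lists, glosses, and similarity threshold are all invented for illustration and are not the paper's method.

    ```python
    # Toy false-friend candidate detection: similar surface form,
    # different meaning. Lexicons and threshold are hypothetical.
    import difflib

    ITALIAN = {"camera": "room", "caldo": "hot", "pane": "bread"}
    ENGLISH = {"camera": "photo device", "cold": "cold", "pane": "sheet of glass"}

    def false_friends(l2_lexicon, l1_lexicon, threshold=0.6):
        pairs = []
        for w2, m2 in l2_lexicon.items():
            for w1, m1 in l1_lexicon.items():
                similar = difflib.SequenceMatcher(None, w2, w1).ratio() >= threshold
                if similar and m1 != m2:
                    pairs.append((w2, w1))
        return pairs

    candidates = false_friends(ITALIAN, ENGLISH)
    ```

    In a personalised simplification pipeline, such candidates (computed against the learner's specific L1) would be prioritised for substitution, unlike a frequency-based approach that treats all learners alike.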

    Analysing the Evolution of Students’ Writing Skills and the Impact of Neo-standard Italian with the help of Computational Linguistics

    We present a project aimed at studying the evolution of students’ writing skills over a span of 15 years (from 2001 to 2016), analysing in particular the impact of neo-standard Italian. More than 2,500 essays have been transcribed and annotated by teachers according to 28 different linguistic traits. We present here the annotation process together with the first data analysis supported by NLP tools.