GePpeTto Carves Italian into a Language Model

Authors: Cafagna Michele; Dell'Orletta Felice; Guerini Marco; De Mattei Lorenzo; Nissim Malvina
Publication venue: CEUR-WS.org
Publication date: 01/01/2020

In the last few years, pre-trained neural architectures have provided impressive improvements across several NLP tasks. Still, generative language models are available mainly for English. We develop GePpeTto, the first generative language model for Italian, built using the GPT-2 architecture. We provide a thorough analysis of GePpeTto's quality by means of both an automatic and a human-based evaluation. The automatic assessment consists in (i) calculating perplexity across different genres and (ii) a profiling analysis over GePpeTto's writing characteristics. We find that GePpeTto's production is a sort of bonsai version of human production, with shorter but yet complex sentences. Human evaluation is performed over a sentence completion task, where GePpeTto's output is judged as natural more often than not, and much closer to the original human texts than to a simpler language model which we take as baseline.
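
The abstract mentions perplexity as the automatic measure. As a minimal sketch of what that quantity is (not the authors' evaluation code), perplexity can be computed as the exponential of the average negative log-probability a model assigns to each token. The function name and the toy probabilities below are illustrative assumptions; in practice the log-probabilities would come from the language model being evaluated.

```python
import math


def perplexity(token_log_probs):
    """Return perplexity given natural-log probabilities, one per token.

    Perplexity = exp(-(1/N) * sum(log p(token_i | context_i))).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


if __name__ == "__main__":
    # Hypothetical example: a model that gives every token probability 0.25
    # is as uncertain as a uniform choice among 4 options, so the result
    # should be close to 4.
    toy_log_probs = [math.log(0.25)] * 8
    print(perplexity(toy_log_probs))
```

Lower values mean the model finds the text more predictable, which is why comparing perplexity across genres indicates which kinds of text the model handles best.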

arXiv.org e-Print Archive

OpenEdition

Dissertations of the University of Groningen

ARTS repository - University of Groningen

arXiv.org e-Print Archive

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Archivio della ricerca - Fondazione Bruno Kessler

OpenEdition

Dissertations of the University of Groningen

Sono solo parole ChatGPT: anatomia e raccomandazioni per l’uso

Authors: Caselli Tommaso; Lieto Antonio; Nissim Malvina; Patti Viviana
Publication date: 01/01/2023

PhilPapers

GePpeTto Carves Italian into a Language Model

Authors: Cafagna Michele; Dell'Orletta Felice; Guerini Marco; De Mattei Lorenzo; Monti Johanna; Nissim Malvina; Tamburini Fabio
Publication venue: CEUR-WS.org
Publication date: 01/01/2020

In the last few years, pre-trained neural architectures have provided impressive improvements across several NLP tasks. Still, generative language models are available mainly for English. We develop GePpeTto, the first generative language model for Italian, built using the GPT-2 architecture. We provide a thorough analysis of GePpeTto’s quality by means of both an automatic and a human-based evaluation. The automatic assessment consists in (i) calculating perplexity across different genres and (ii) a profiling analysis over GePpeTto’s writing characteristics. We find that GePpeTto’s production is a sort of bonsai version of human production, with shorter but yet complex sentences. Human evaluation is performed over a sentence completion task, where GePpeTto’s output is judged as natural more often than not, and much closer to the original human texts than to a simpler language model which we take as baseline.

GePpeTto Carves Italian into a Language Model

Authors: De Mattei Lorenzo; Cafagna Michele; Dell'Orletta Felice; Nissim Malvina; Guerini Marco
Publication date: 22/03/2021

In the last few years, pre-trained neural architectures have provided impressive improvements across several NLP tasks. Still, generative language models are available mainly for English. We develop GePpeTto, the first generative language model for Italian, built using the GPT-2 architecture. We provide a thorough analysis of GePpeTto’s quality by means of both an automatic and a human-based evaluation. The automatic assessment consists in (i) calculating perplexity across different genres and (ii) a profiling analysis over GePpeTto’s writing characteristics. We find that GePpeTto’s production is a sort of bonsai version of human production, with shorter but yet complex sentences. Human evaluation is performed over a sentence completion task, where GePpeTto’s output is judged as natural more often than not, and much closer to the original human texts than to a simpler language model which we take as baseline.

Archivio della ricerca - Fondazione Bruno Kessler

Natural Language Processing for Technology Foresight Summarization and Simplification: the case of patents

Author: Casola Silvia
Publication venue: Università degli studi di Padova
Publication date: 22/09/2023

Technology foresight aims to anticipate possible developments, understand trends, and identify technologies of high impact. To this end, monitoring emerging technologies is crucial. Patents -- the legal documents that protect novel inventions -- can be a valuable source for technology monitoring. Millions of patent applications are filed yearly, with 3.4 million applications in 2021 alone. Patent documents are primarily textual documents and disclose innovative and potentially valuable inventions. However, their processing is currently under-researched. This is due to several reasons, including the high document complexity: patents are very lengthy and are written in an extremely hard-to-read language, which is a mix of technical and legal jargon. This thesis explores how Natural Language Processing -- the discipline that enables machines to process human language automatically -- can aid patent processing. Specifically, we focus on two tasks: patent summarization (i.e., we try to reduce the document length while preserving its core content) and patent simplification (i.e., we try to reduce the document's linguistic complexity while preserving its original core meaning). We found that older patent summarization approaches were not compared on shared benchmarks (thus making it hard to draw conclusions), and even the most recent abstractive dataset presents important issues that might make comparisons meaningless. We try to fill both gaps: we first document the issues related to the BigPatent dataset and then benchmark extractive, abstractive, and hybrid approaches in the patent domain. We also explore transferring summarization methods from the scientific paper domain with limited success. For the automatic text simplification task, we noticed a lack of simplified text and parallel corpora. We fill this gap by defining a method to generate a silver standard for patent simplification automatically. Lay human judges evaluated the simplified sentences in the corpus as grammatical, adequate, and simpler, and we show that the corpus can be used to train a state-of-the-art simplification model. This thesis describes the first steps toward Natural Language Processing-aided patent summarization and simplification. We hope it will encourage more research on the topic, opening doors for a productive dialog between NLP researchers and domain experts.

Archivio istituzionale della ricerca - Università di Padova

Proceedings of the Eighth Italian Conference on Computational Linguistics CLiC-it 2021

Publication venue: OpenEdition
Publication date: 15/12/2022

The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the 2020 edition, which was held in fully virtual mode due to the health emergency related to Covid-19, CLiC-it 2021 was the first opportunity for the Italian Computational Linguistics research community to meet in person after more than one year of full or partial lockdown.

Directory of Open Access Books (DOAB)