14 research outputs found

    Text Frame Detector: Slot Filling Based On Domain Knowledge Bases

    In this paper we present a system called Text Frame Detector (TFD), which aims at populating a frame-based ontology in a graph-based structure. Our system organizes textual information into frames, according to a predefined set of semantically informed patterns linking pre-coded information such as named entities and simple and complex terms. Given the semi-automatic expansion of such information with word embeddings, the system can be easily adapted to new domains.
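The semi-automatic expansion with word embeddings mentioned above can be sketched as follows: starting from a seed term list, nearby vocabulary items in the embedding space are pulled in. The toy 3-d vectors, term names, and threshold are illustrative assumptions, not TFD's actual resources.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_terms(seeds, embeddings, threshold=0.8):
    """Add to the seed set every vocabulary term whose embedding is close
    enough to some seed term's embedding (the real system would use
    trained word embeddings rather than these toy vectors)."""
    expanded = set(seeds)
    for term, vec in embeddings.items():
        if term in expanded:
            continue
        if any(cosine(vec, embeddings[s]) >= threshold
               for s in seeds if s in embeddings):
            expanded.add(term)
    return expanded

# Toy 3-d "embeddings" standing in for real word vectors.
emb = {
    "delibera":  [0.9, 0.1, 0.0],
    "determina": [0.85, 0.15, 0.05],   # close to "delibera"
    "gatto":     [0.0, 0.1, 0.95],     # unrelated
}
print(sorted(expand_terms({"delibera"}, emb)))  # ['delibera', 'determina']
```

With real embeddings the threshold would typically be tuned per domain, since similarity scores are not comparable across embedding spaces.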

    FRAQUE: a FRAme-based QUEstion-answering system for the Public Administration domain

    In this paper, we propose FRAQUE, a question answering system for factoid questions in the Public Administration domain. The system is based on semantic frames, here intended as collections of slots typed with their possible values. FRAQUE is a pattern-based system that queries unstructured data, such as documents, web pages, and social media posts. Our system can exploit the potential of different approaches: it extracts pattern elements from texts which are linguistically analysed by means of statistical methods. FRAQUE allows Italian users to query vast document repositories related to the domain of Public Administration. Given the statistical nature of most of its components, such as word embeddings, the system allows for a flexible domain and language adaptation process. FRAQUE’s goal is to associate questions with frames stored in a Knowledge Graph along with relevant document passages, which are returned as the answer. To guarantee the system’s usability, the implementation of FRAQUE is based on a user-centered design process, which allowed us to monitor the linguistic structures employed by users, as well as to find which terms were the most common in users’ questions.
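The pattern-to-frame association described above can be illustrated with a minimal sketch: a question shape is matched against hand-written patterns, and the captured text fills a typed slot. The pattern set, frame names, and slot names here are hypothetical; FRAQUE's real patterns are linguistically richer and statistically derived.

```python
import re

# Hypothetical frame patterns: each maps a question shape to a frame
# and to the slot that the captured text should fill.
PATTERNS = [
    (re.compile(r"who is the mayor of (\w+)", re.I), "CityGovernment", "city"),
    (re.compile(r"when was (\w+) founded", re.I), "Institution", "name"),
]

def match_question(question):
    """Return (frame, slot, value) for the first matching pattern, else None."""
    for regex, frame, slot in PATTERNS:
        m = regex.search(question)
        if m:
            return frame, slot, m.group(1)  # the capture fills the slot
    return None

print(match_question("Who is the mayor of Pisa?"))  # ('CityGovernment', 'city', 'Pisa')
```

In a full system the matched frame and slot value would then be looked up in the Knowledge Graph to retrieve the associated document passages.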

    BureauBERTo: adapting UmBERTo to the Italian bureaucratic language

    In this work, we introduce BureauBERTo, the first transformer-based language model adapted to the Italian Public Administration (PA) and technical-bureaucratic domains. We further pre-trained the general-purpose Italian model UmBERTo on a corpus of PA, banking, and insurance documents, and we expanded UmBERTo’s vocabulary with domain-specific terms. We show that BureauBERTo benefited from the adaptation by comparing it with UmBERTo in both an intrinsic and an extrinsic evaluation. The intrinsic evaluation was conducted through specific fill-mask experiments; the extrinsic one was carried out through a named entity recognition task on one of the BureauBERTo sub-domains.
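The fill-mask experiments used for the intrinsic evaluation boil down to checking whether the masked gold token appears among the model's top-k candidates. The sketch below computes that metric over mock predictions; in the real evaluation the ranked candidate lists would come from the language model (e.g. a masked-LM inference call), and the example tokens here are invented.

```python
def fill_mask_topk_accuracy(examples, k=5):
    """examples: list of (gold_token, ranked_predictions).
    Counts how often the masked gold token appears among the model's
    top-k candidates -- a simple intrinsic fill-mask metric."""
    hits = sum(1 for gold, preds in examples if gold in preds[:k])
    return hits / len(examples)

# Mock ranked predictions standing in for model output on masked PA sentences.
examples = [
    ("delibera", ["atto", "delibera", "decreto"]),   # hit at rank 2
    ("comune",   ["regione", "provincia", "stato"]), # miss
]
print(fill_mask_topk_accuracy(examples, k=3))  # 0.5
```

Comparing this score between the adapted and the general-purpose model, on domain sentences, is what shows whether the adaptation helped.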

    DANKMEMES @ EVALITA 2020: The Memeing of Life: Memes, Multimodality and Politics

    DANKMEMES is a shared task proposed for the 2020 EVALITA campaign, focusing on the automatic classification of Internet memes. Providing a corpus of 2,361 memes on the 2019 Italian Government Crisis, DANKMEMES features three tasks: A) Meme Detection, B) Hate Speech Identification, and C) Event Clustering. Overall, 5 groups took part in the first task, 2 in the second, and 1 in the third. The best system was proposed by the UniTor group and achieved an F1 score of 0.8501 for task A, 0.8235 for task B, and 0.2657 for task C. In this report, we describe how the task was set up, we report the system results, and we discuss them.
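The F1 scores used to rank the participating systems follow the standard definition from precision and recall. A minimal reference implementation, using the plain textbook formula rather than the task's official scorer:

```python
def f1_score(gold, pred):
    """Binary F1 from parallel gold/predicted label lists (1 = positive)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy labels: 3 positive memes, the system finds 2 of them with no false alarms.
print(f1_score([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.8
```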

    Voices of the great war: A richly annotated corpus of Italian texts on the first world war

    Voci della Grande Guerra (“Voices of the Great War”) is the first large corpus of Italian historical texts dating back to the period of the First World War. This corpus differs from other existing resources in several respects. First, from the linguistic point of view it gives account of the wide range of varieties in which Italian was articulated in that period, namely from the diastratic (educated vs. uneducated writers), diaphasic (low/informal vs. high/formal registers) and diatopic (regional varieties, dialects) points of view. From the historical perspective, through a collection of texts belonging to different genres it represents different views on the war and the various styles of narrating war events and experiences. The final corpus is balanced along various dimensions, corresponding to the textual genre, the language variety used, the author type and the typology of conveyed contents. The corpus is annotated with lemmas, part-of-speech, terminology, and named entities. Significant corpus samples representative of the different “voices” have also been enriched with meta-linguistic and syntactic information. The layer of syntactic annotation forms the first nucleus of an Italian historical treebank complying with the Universal Dependencies standard. The paper illustrates the final resource, the methodology and tools used to build it, and the Web interface for navigating it.
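Universal Dependencies treebanks such as the one described above are exchanged in the CoNLL-U format: one token per tab-separated line, with comment lines starting with `#`. A minimal reader, keeping only the first four columns (the sample sentence below is invented for illustration):

```python
def parse_conllu(block):
    """Parse one CoNLL-U sentence into a list of token dicts.
    Only ID, FORM, LEMMA and UPOS are kept in this sketch."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and sentence separators
        cols = line.split("\t")
        tokens.append({"id": cols[0], "form": cols[1],
                       "lemma": cols[2], "upos": cols[3]})
    return tokens

# Hypothetical annotated fragment (10 columns per token, as in CoNLL-U).
sample = """# text = Viva la guerra
1\tViva\tvivere\tVERB\t_\t_\t0\troot\t_\t_
2\tla\til\tDET\t_\t_\t3\tdet\t_\t_
3\tguerra\tguerra\tNOUN\t_\t_\t1\tobj\t_\t_"""

for tok in parse_conllu(sample):
    print(tok["form"], tok["upos"])
```

Real treebank files also carry morphological features and dependency heads in the remaining columns, which this sketch deliberately ignores.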

    EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

    Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it)

    Open Data in Public Administrations: Exploring Administrative Acts through Network Analysis and Data Visualization Techniques

    No full text
    This work is a project-based thesis whose goal is to implement a web interface for visualizing the data extracted from administrative act documents. The tool used to extract unstructured information from the documents is SemplicePA. Born from a project involving the University of Pisa, SemplicePA is a platform for navigating documents through a semantic search engine which, among other features, includes a section visualizing the acts as a network whose elements are the extracted persons, companies, and organizations. This thesis proposes a new prototype for that section, one that not only complies with the usability and accessibility criteria set by the Agency for Digital Italy (AgID) for public administrations, but also frames old and new functionalities, the latter inspired by network analysis techniques, within a design process grounded in information architecture. The first chapter introduces the state of open data in Italy, with particular reference to the albo pretorio, the archive of administrative acts of each Italian municipality. After a state-of-the-art review of the solutions and platforms adopted in Europe and in Italy for managing documents through semantic search engines, the features of SemplicePA are introduced. The next chapter gives an overview of the basic notions of network analysis, with particular reference to two techniques that are later implemented in the interface, backbone extraction and graphlet detection, each with its related state of the art. The third chapter describes in detail the existing SemplicePA section for network visualization, on both the server and the client side, including an analysis of its features and of the web-design tools chosen.
The fourth chapter proposes a design vision for the new interface and is divided into three parts: the first concerns information architecture, identifying user needs and the corresponding platform features; the remaining parts report the AgID usability and accessibility rules, analyze the shortcomings of the existing section, and illustrate the solutions adopted to overcome them. The fifth chapter is closely tied to the implementation phase. It illustrates the languages, frameworks, and libraries used, as well as the data flow and processing. The architecture designed in the previous pages is then revisited, and the development process of the platform is analyzed point by point.
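The backbone extraction mentioned above aims to keep only the significant edges of a dense entity network. The simplest conceivable variant is a global weight cut, sketched below; the thesis likely uses a statistical filter (e.g. the disparity filter), and the entity names and weights here are invented.

```python
def backbone(edges, threshold):
    """Keep only the edges whose weight is at least `threshold`.
    This global cut is the crudest form of backbone extraction;
    statistical filters instead test each edge against a null model."""
    return [(u, v, w) for u, v, w in edges if w >= threshold]

# Edge list: (entity A, entity B, number of acts mentioning both).
edges = [
    ("Comune di Pisa", "ACME S.p.A.", 12),
    ("Comune di Pisa", "Mario Rossi", 1),
    ("ACME S.p.A.", "Mario Rossi", 5),
]
print(backbone(edges, threshold=5))
```

A global threshold penalizes low-degree nodes, which is precisely why disparity-style filters normalize each edge weight against the local weight distribution of its endpoints.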

    Language Disparity in the Interaction with Chatbots for the Administrative Domain

    No full text
    The high impact of the Internet on citizens’ daily life and the widespread use of mobile devices have led the Italian Public Administrations to communicate through the Web and digital media. Chatbots are one of the most recent technologies adopted by public institutions. This work focuses on the interaction of citizens with a chatbot able to answer questions about the administrative domain. In particular, the main objective is to identify the relevant variables involved in the reading comprehension of texts written in the Italian administrative language. A key element of this research is its target population (i.e., second-language learners of Italian, elderly Italians, and Italians with a low literacy level), chosen in order to ease access to administrative texts for people with limited reading skills.

    Neural readability pairwise ranking for sentences in Italian administrative language

    No full text
    Automatic Readability Assessment aims at assigning a complexity level to a given text, which could help improve the accessibility of information in specific domains, such as the administrative one. In this paper, we investigate the behavior of a Neural Pairwise Ranking Model (NPRM) for sentence-level readability assessment of Italian administrative texts. To deal with data scarcity, we experiment with cross-lingual, cross-domain, and in-domain approaches, and test our models on Admin-It, a new parallel corpus in the Italian administrative language, containing sentences simplified using three different rewriting strategies. We show that NPRMs are effective in zero-shot scenarios (~0.78 ranking accuracy), especially with ranking pairs containing simplifications produced by overall rewriting at the sentence level, and that the best results are obtained by adding in-domain data (achieving perfect performance for such sentence pairs). Finally, we investigate where NPRMs failed, showing that the characteristics of the training data, rather than its size, have a greater effect on a model’s performance.
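The pairwise ranking accuracy reported above measures how often a scorer orders an original/simplified sentence pair correctly. The sketch below makes the metric concrete using sentence length as a crude stand-in for the neural model's score; the Italian example pairs are invented and the length proxy is an assumption for illustration only.

```python
def pairwise_ranking_accuracy(pairs, score):
    """pairs: list of (complex_sentence, simple_sentence).
    score: callable returning a higher value for harder text.
    Returns the fraction of pairs ordered correctly by the scorer."""
    correct = sum(1 for hard, easy in pairs if score(hard) > score(easy))
    return correct / len(pairs)

# Crude proxy scorer: longer sentences count as harder (illustration only;
# the paper uses a trained neural ranker, not sentence length).
length_score = lambda s: len(s.split())

pairs = [
    ("Il sottoscritto dichiara di aver preso visione del bando",
     "Ho letto il bando"),
    ("Si comunica che la scadenza del termine viene prorogata",
     "La scadenza è rinviata"),
]
print(pairwise_ranking_accuracy(pairs, length_score))  # 1.0
```

On real data a length baseline fails exactly where the paper says neural rankers are needed: simplifications that rewrite a sentence without shortening it.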