30 research outputs found

    Towards abstractive summarization in Hungarian

    We publish an abstractive summarizer for Hungarian, an encoder-decoder model initialized with huBERT and fine-tuned on the ELTE.DH corpus of former Hungarian news portals. The model produces fluent, on-topic output, but it hallucinates frequently. Our quantitative evaluation on automatic and human transcripts of news (with automatic and human-made punctuation) shows that the model is robust to errors in both automatic speech recognition and automatic punctuation restoration.

    Love Me, Love Me, Say (and Write!) that You Love Me: Enriching the WASABI Song Corpus with Lyrics Annotations

    Due to the COVID-19 pandemic, the 12th edition is cancelled. The next edition, the 13th, LREC 2022, will take place in Pharo on June 16-24, 2022.
    We present the WASABI Song Corpus, a large corpus of songs enriched with metadata extracted from music databases on the Web, and resulting from the processing of song lyrics and from audio analysis. More specifically, given that lyrics encode an important part of the semantics of a song, we focus here on the methods we proposed to extract relevant information from the lyrics, such as their structure segmentation, their topics, the explicitness of the lyrics content, the salient passages of a song and the emotions conveyed. The creation of the resource is still ongoing: so far, the corpus contains 1.73M songs with lyrics (1.41M unique lyrics) annotated at different levels with the output of the above-mentioned methods. These corpus labels and the provided methods can be exploited by music search engines and music professionals (e.g. journalists, radio presenters) to better handle large collections of lyrics, enabling intelligent browsing, categorization and recommendation of songs. We provide the files of the current version of the WASABI Song Corpus, the models we have built on it, and updates here: https://github.com/micbuffa/WasabiDataset

    Language Identification as part of the Text Corpus Creation Pipeline at the Language Bank of Finland

    The Language Bank of Finland hosts text corpora originating from Finland. Two of the most used ones are the Newspaper and Periodical Corpus of the National Library of Finland and the Suomi24 Corpus. The Language Bank has received considerable additions to both corpora and is currently creating new versions of them. We are debuting language identification as part of the corpus creation pipeline. As a language identifier, we are using our recently published HeLI-OTS software. This paper investigates the results and the quality of language identification. We created a new dataset for evaluating the efficacy of language identification by extracting random samples from both corpora and manually annotating their language. We were especially interested in seeing how the relatively low OCR quality of the oldest part of the KLK-fi collection affects language identification when using an off-the-shelf program like HeLI-OTS. The oldest part of the KLK-fi collection and many parts of the Suomi24 corpus contain Finnish written dialectally or otherwise differing from standard written Finnish. HeLI-OTS includes several separate language models for dialectal Finnish, and in this article, we evaluate their usefulness using the new dataset. For both corpora, the overall micro F1 scores increase when using the additional dialectal models. Additionally, we take a detailed look at some language identification errors and discuss possible solutions.

    Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

    On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    PARSEME-It: an Italian corpus annotated with verbal multiword expressions

    The paper describes the PARSEME-It corpus, developed within the PARSEME-It project, which aims to develop methods, tools and resources for processing multiword expressions (MWEs) in Italian. The project is a spin-off of a larger multilingual project covering more than 20 languages from several language families, namely the PARSEME COST Action. The first phase of the project was devoted to verbal multiword expressions (VMWEs). They are a particularly interesting lexical phenomenon because of their frequent discontinuity and long-distance dependencies. Besides, they are very challenging for deep parsing and other Natural Language Processing (NLP) tasks. Notably, MWEs are pervasive in natural languages but are particularly difficult for NLP tools to handle because of their characteristics and idiomaticity. They pose many challenges to correct identification and processing: they are a linguistic phenomenon on the edge between lexicon and grammar, their meaning is not simply the sum of the meanings of their individual constituents, and they are ambiguous, since in several cases their reading can be literal or idiomatic. Although several studies have been devoted to this topic, to the best of our knowledge, our study is the first attempt to provide a general framework for the identification of VMWEs in running text and a comprehensive corpus for the Italian language.