    PARSEME corpus release 1.3

    We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions (VMWEs). Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now represents 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

    Automatic extraction and definition of Lithuanian terms

    The monograph presents the latest research on the automated identification and definition of Lithuanian terms. This research is based on the principles of descriptive terminology and corpus linguistics. The book describes how a specialised corpus of education and science texts was compiled, which methods were used to identify candidate terms automatically, how the terms of the analysed domain were selected from these candidates, what structure is characteristic of them, and what problems were encountered when trying to determine the headword forms of terms automatically. Considerable attention is given to discussing the methodology for semi-automatically extracting from the corpus the subject information about terms that could be used to compose definitions. The monograph presents the practical results of the whole study: the Dictionary of Education and Science Terms and the Ontology of Education and Science Terms.
    This book presents the most recent advances in the field of Lithuanian terminology extraction as well as the first attempt at automatic extraction of Lithuanian term-defining contexts. The first works in descriptive terminology by Lithuanian researchers appeared in the early 2000s, namely R. Marcinkevičienė (2000) and I. Zeller (dissertation "Term recognition and their analysis", 2005). Nevertheless, most research on Lithuanian terminology is still dominated by the prescriptive view, in which much attention is given to the principles and norms of terminology, as well as to its diachronic aspects. Chapter 1 describes the differences between descriptive and prescriptive terminology. The authors emphasize that prescriptive terminology involves the standardisation and approval of terms, with decisions based on existing terminology dictionaries, documents, standards, lexicons and databases of approved terms. In corpus-based terminology management, which is one of the branches of descriptive terminology, the main focus is instead placed on the usage of terms in natural language in a corpus, rather than on standardisation. Empirical research approaches benefit from various automatic term analysis and term extraction tools, which are useful in corpus-based terminology management. Recent terminology research has shown that it is very important to harmonize the methods of prescriptive and descriptive terminology. Combining both methods allows faster processing of ever-growing data, which is highly relevant to the challenges of modern lexicography, including the quick and efficient creation of dynamic lexicographical sources.

    Edition 1.1 of the PARSEME Shared Task on automatic identification of verbal multiword expressions

    This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year's shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

    Tour de CLARIN Volume One

    Tour de CLARIN is an initiative started by CLARIN ERIC in 2016 that has been periodically highlighting prominent user involvement activities of CLARIN national consortia in the form of blog posts published on the CLARIN webpage and disseminated through the CLARIN news flash and on social media. By focusing on a different national consortium every two months and showcasing their outstanding language resources, text processing tools, user involvement events and researchers, we have been aiming to increase the visibility of the various consortia, reveal the richness of the CLARIN landscape, and display the full range of activities throughout the network that can not only inform and inspire other consortia, but also show what CLARIN has to offer to researchers, teachers, students, professionals and the general public interested in using and processing language data in various forms. In the two years we have been running the initiative, and having visited nearly half of all the CLARIN member countries, we can say that Tour de CLARIN has proved to be one of the flagship user involvement initiatives by CLARIN ERIC: highly valuable for our network and incredibly popular with our readers. This is why we have decided to collect the blog posts in a printed volume. The first volume presents all nine countries which we have visited so far: Finland, Sweden, Austria, the Netherlands, Poland, Belgium, the Czech Republic, Greece and Lithuania.
