24 research outputs found
CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of
the South Slavic languages, which is based on the Stanza natural language
processing pipeline. We describe the main improvements in CLASSLA-Stanza with
respect to Stanza, and give a detailed description of the model training
process for the latest 2.1 release of the pipeline. We also report performance
scores produced by the pipeline for different languages and varieties.
CLASSLA-Stanza exhibits consistently high performance across all the supported
languages and outperforms or expands its parent pipeline Stanza at all the
supported tasks. We also present the pipeline's new functionality enabling
efficient processing of web data and the reasons that led to its
implementation. Comment: 17 pages, 14 tables, 1 figure.
Otvoreni resursi i tehnologije za obradu srpskog jezika (Open resources and technologies for processing the Serbian language)
Open language resources and tools are very important for increasing the quality and speeding up the development of technologies for natural language processing. This paper presents a set of open resources available for processing the Serbian language. We describe several manually annotated corpora, as well as a range of computational models, including a web service designed to facilitate their use.
Sustav za davanje kontekstualiziranih preporuka na temelju rudarenja teksta (A system for providing contextualized recommendations based on text mining)
This thesis presents one way of improving search over documents written in natural language: discovering their keywords. It starts with a brief description of natural language processing, a sub-field of computer science, information engineering, and artificial intelligence that is very important for text analysis. The next chapter presents keyword extraction and the classification of its methods. The two methods used in the application are discussed in detail: TextRank and algorithm. Before the implementation is described, the tools and technologies used to build the application are listed and briefly described. A description of the implementation process follows; it consists of pre-processing, application of the algorithms (the two basic ones and three modifications of them), and post-processing. Finally, a comparison of the results is given, along with an example document and the keywords found for it.
CLARIN. The infrastructure for language resources
CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future.
The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).
CLARIN
The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure (CLARIN) for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.
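MetaLangCORP: PREDSTAVLJANJE PRVOGA KORPUSA MEDIJSKOGA METAJEZIKA NA SLOVENSKOM, HRVATSKOM I SRPSKOM I MOGUĆNOSTI NJEGOVE MEĐUDISCIPLINARNE PRIMJENE (MetaLangCORP: Presenting the first corpus of media metalanguage in Slovene, Croatian and Serbian and the possibilities of its interdisciplinary application)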
Growing interest in metalanguage, in linguistics and other disciplines, has highlighted a gap in metalanguage corpora and analytical resources, which remain among the scarcest in corpus-linguistic developments so far. This paper aims to take a step towards filling this gap, both by presenting our own metalanguage corpus resource and by using it in a short sample analysis to discuss the applications of such resources in linguistics and the social sciences. Specifically, the paper presents for the first time MetaLangCORP, a multi-element corpus of contemporary media metalanguage in the languages of three post-Yugoslav states, linguistically annotated and made available open-access at the CLARIN repository of linguistic resources. To put the corpus in context, the meaning and relevance of metalanguage research is outlined, existing efforts at compiling corpora of metalanguage are reviewed, and a preliminary sample analysis of MetaLangCORP keywords is presented to open a broader discussion on the potential applicability of metalanguage corpora. More broadly, it is hoped that making this kind of data available will prompt more nuanced analyses of metalanguage, as well as more corpus-building efforts along similar lines in Slavic and other linguistic scholarship.
hr500k - A Reference Training Corpus of Croatian
In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in CoNLL and TEI formats. We also describe the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.