
    CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

    We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, based on the Stanza natural language processing pipeline. We describe the main improvements of CLASSLA-Stanza over Stanza and give a detailed description of the model training process for the latest 2.1 release of the pipeline. We also report the performance scores the pipeline produces for the different languages and varieties. CLASSLA-Stanza exhibits consistently high performance across all the supported languages and outperforms or extends its parent pipeline Stanza on all the supported tasks. We also present the pipeline's new functionality enabling efficient processing of web data, and the reasons that led to its implementation. (Comment: 17 pages, 14 tables, 1 figure)

    Open Resources and Technologies for Processing the Serbian Language (Otvoreni resursi i tehnologije za obradu srpskog jezika)

    Open language resources and tools are very important for increasing the quality and speeding up the development of natural language processing technologies. This paper presents a set of open resources available for processing the Serbian language. We describe several manually annotated corpora, as well as a range of computational models, including a web service designed to facilitate their use.

    A System for Contextualized Recommendations Based on Text Mining (Sustav za davanje kontekstualiziranih preporuka na temelju rudarenja teksta)

    This thesis presents one way of improving search over documents written in natural language: by discovering document keywords. It begins with a brief overview of natural language processing, a sub-field of computer science, information engineering, and artificial intelligence that is central to text analysis. The next chapter presents keyword extraction and a classification of its methods. The two methods used in the application are discussed in detail: the TextRank and tf-idf algorithms. Before the implementation is described, the tools and technologies used to build the application are listed and briefly described. The implementation itself consists of pre-processing, application of the algorithms (the two basic ones and three modifications of them), and post-processing. Finally, the methods are compared and a sample document with its extracted keywords is given.
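The tf-idf scoring the thesis relies on can be sketched in a few lines. This is a minimal illustrative implementation, not the thesis code: the toy corpus and whitespace tokenization are invented for the example, and real pre-processing (lemmatization, stop-word removal) is omitted.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document by tf-idf:
    tf = term count / document length, idf = log(N / number of docs containing the term)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenization for illustration
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency counts each doc once per term
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks)
        scores.append({t: (c / total) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

# Invented mini-corpus: terms unique to one document outrank terms shared across documents.
docs = ["natural language processing of text",
        "keyword extraction from text documents",
        "natural keyword ranking"]
scores = tf_idf(docs)
```

Keywords are then the top-scoring terms per document; the thesis combines this with TextRank, a graph-based ranking over term co-occurrences.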

    CLARIN. The infrastructure for language resources

    Get PDF
    CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors, representing various fields from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future. The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).

    CLARIN

    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and the challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.

    MetaLangCORP: Presenting the First Corpus of Media Metalanguage in Slovenian, Croatian, and Serbian, and the Possibilities of Its Interdisciplinary Application

    Growing interest in metalanguage, in linguistics and other disciplines, has highlighted a gap in metalanguage corpora and analytical resources, which remain among the scarcest in corpus-linguistic work to date. This paper takes a step towards filling this gap, both by presenting our own metalanguage corpus resource and by using it in a short sample analysis to discuss the applications of such resources in linguistics and the social sciences. Specifically, the paper presents for the first time MetaLangCORP, a multi-element corpus of contemporary media metalanguage in the languages of three post-Yugoslav states, linguistically annotated and made available open access in the CLARIN repository of language resources. To put the corpus in context, the meaning and relevance of metalanguage research is outlined, existing efforts at compiling corpora of metalanguage are reviewed, and a preliminary analysis of MetaLangCORP keywords is presented to open a broader discussion of the potential applicability of metalanguage corpora. More broadly, it is hoped that making this kind of data available will prompt more nuanced analyses of metalanguage, as well as further corpus-building efforts along similar lines in Slavic and other linguistic scholarship.

    hr500k ā€“ A Reference Training Corpus of Croatian.

    Get PDF
    In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at the document, sentence, and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in the CoNLL and TEI formats. We also recount the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.
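The CoNLL encoding mentioned above (CoNLL-U, as used for Universal Dependencies releases of hr500k) stores one token per line with ten tab-separated columns. A minimal sketch of reading such a sentence block follows; the sample sentence is invented for illustration and is not taken from the corpus.

```python
def parse_conllu(block):
    """Parse one CoNLL-U sentence block into a list of token dicts.
    Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC."""
    cols = ["id", "form", "lemma", "upos", "xpos",
            "feats", "head", "deprel", "deps", "misc"]
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence-level comments and blank separators
        tokens.append(dict(zip(cols, line.split("\t"))))
    return tokens

# Invented example sentence ("Zagreb je grad." -- "Zagreb is a city.")
sample = (
    "# sent_id = 1\n"
    "# text = Zagreb je grad.\n"
    "1\tZagreb\tZagreb\tPROPN\tNpmsn\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\tVa\t_\t3\tcop\t_\t_\n"
    "3\tgrad\tgrad\tNOUN\tNcmsn\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\tZ\t_\t3\tpunct\t_\t_\n"
)
tokens = parse_conllu(sample)
```

For real use, a dedicated reader such as the `conllu` Python package handles multi-word tokens and empty nodes that this sketch ignores.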