    Domain-aware Evaluation of Named Entity Recognition Systems for Croatian

    We provide an evaluation of the currently available named entity recognition systems for Croatian. The evaluation puts special emphasis on domain dependence. To this end, we manually annotated a dataset of approximately 1 million tokens of Croatian text from various domains within the newspaper text genre. The dataset was annotated using a three-class named entity tagset denoting personal names, locations and organizations. We give insight into feature selection, domain sensitivity and the effect of increasing training set size for statistical named entity recognition using the state-of-the-art Stanford NER system. We also sketch a comparison of publicly available named entity recognition systems for Croatian with respect to domain dependence, regardless of their underlying paradigms. Our top-performing system achieved an F1-score of 0.884 in a mixed-domain testing scenario, scoring 0.925 and 0.843 in the two domains separated for the experiment. The system shows consistent state-of-the-art scores for detecting names of persons, locations and organizations.

    Babel Treebank of Public Messages in Croatian

    The paper presents the process of constructing a publicly available treebank of public messages written in Croatian. The messages were collected from various electronic sources (e-mail, blog, Facebook and SMS) and published on the Zagreb Museum of Contemporary Art LED facade within the Babel art project. The project aimed to use the facade as an open-space blog or social interface enabling citizens to publicly express their views. The construction and current state of the treebank are presented along with plans for future work. A comparison of the Babel Treebank with the Croatian Dependency Treebank and the SETimes.HR treebank regarding differing domains and annotation schemes is briefly sketched. The treebank is used as a test platform for introducing a new standard for syntactic annotation of Croatian texts. An experiment with morphosyntactic tagging and dependency parsing of the treebank is conducted, providing a first insight into the computational processing of non-standard text in Croatian.

    Tagset Reductions in Morphosyntactic Tagging of Croatian Texts

    Morphosyntactic tagging of Croatian texts is performed with stochastic taggers by using a language model built on a manually annotated corpus implementing the Multext-East version 3 specifications for Croatian. Tagging accuracy in this framework is largely predetermined, i.e. it depends on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition makes a rather small language model for stochastic tagging of free-domain texts, the paper presents an approach based on tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied to Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. The results are discussed in terms of applying different reductions in different natural language processing systems and in specific tasks defined by specific user requirements.
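
The idea of a tagset reduction can be sketched in a few lines. This is a minimal illustration, not the procedure used with CroTag. It assumes positional tags in the Multext-East style, where the first character encodes the part of speech and later characters encode further attributes. The helper names `reduce_tag` and `accuracy`, the choice of kept positions, and the toy tags are all hypothetical and have not been checked against the version 3 specifications.

```python
from typing import Iterable, Sequence


def reduce_tag(tag: str, keep: Sequence[int]) -> str:
    """Keep only the listed character positions of a positional tag.

    Positions beyond the end of a short tag are skipped.
    """
    return "".join(tag[i] for i in keep if i < len(tag))


def accuracy(gold: Iterable[str], predicted: Iterable[str], keep: Sequence[int]) -> float:
    """Token-level accuracy after mapping both tag sequences to the reduced tagset."""
    pairs = list(zip(gold, predicted))
    if not pairs:
        return 0.0
    hits = sum(reduce_tag(g, keep) == reduce_tag(p, keep) for g, p in pairs)
    return hits / len(pairs)


# Toy data (invented for illustration): the hypothetical tagger gets the case
# wrong on one noun and the part of speech wrong on the last token.
gold = ["Ncmsn", "Vmip3s", "Ncfsa", "Sl"]
pred = ["Ncmsa", "Vmip3s", "Ncfsa", "Rgp"]

full = accuracy(gold, pred, keep=range(10))  # all positions: full tagset
pos_only = accuracy(gold, pred, keep=[0])    # first position only: part of speech
```

On this toy input the full tagset scores 0.5 and the part-of-speech-only reduction scores 0.75: the coarser tagset forgives the case error but also discards the case information, which is the kind of trade-off that has to be weighed against the needs of a downstream task.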

    A Legal Perspective on Training Models for Natural Language Processing

    A significant concern in processing natural language data is the often unclear legal status of the input and output data and resources. In this paper, we investigate this problem by discussing a typical activity in Natural Language Processing: the training of a machine learning model from an annotated corpus. We examine which legal rules apply at the relevant steps and how they affect the legal status of the results, especially in terms of copyright and copyright-related rights.

    Corpus Compilation in the Digital Humanities for Languages with Limited Resources: On the Practice of Compiling Thematic Corpora from Digital Media for Serbian, Croatian and Slovenian

    The digital era has unlocked unprecedented possibilities for compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when no specific techniques of corpus linguistics are used, researchers increasingly draw on some sort of corpus for empirically grounded social-scientific analysis (sometimes dubbed 'corpus-assisted discourse analysis' or 'corpus-based critical discourse analysis', cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly because the background to these issues changes quickly, but also because there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond anglophone contexts. In this paper we discuss some possible solutions to these difficulties by presenting a step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in the social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.


    Growing interest in metalanguage, in linguistics and other disciplines, has highlighted a gap in metalanguage corpora and analytical resources, which remain among the scarcest in corpus-linguistic developments so far. This paper takes a step towards filling this gap, both by presenting our own metalanguage corpus resource and by using it in a short sample analysis to discuss the applications of such resources in linguistics and the social sciences. Specifically, the paper presents for the first time MetaLangCORP, a multielement corpus of contemporary media metalanguage in the languages of three post-Yugoslav states, linguistically annotated and made available open-access at the CLARIN repository of linguistic resources. To put the corpus in context, the meaning and relevance of metalanguage research is outlined, the existing efforts at compiling corpora of metalanguage are reviewed, and a preliminary sample analysis of MetaLangCORP keywords is presented to open a broader discussion on the potential applicability of metalanguage corpora. More broadly, it is hoped that making this kind of data available will prompt more nuanced analyses of metalanguage, as well as more corpus-building efforts along similar lines in Slavic and other linguistic scholarship.

    The Verbal Prefix o(b)- in Croatian and Bulgarian: The Semantic Network and the Challenges of Corpus-Based Research

    This study compares the verbal prefix o(b)- in two South Slavic languages, Croatian and Bulgarian, from a cognitive linguistic perspective. We focus on the problems arising when constructing the semantic network of this polysemous prefix, particularly on 1) isolating the prefix's meaning from the meaning of the base verb and 2) identifying the central meaning and the core sub-meanings across all verbs and giving them corresponding semantic labels. Our approach to morphology is based on extensive databases of prefixed verbs collected from dictionaries and a few corpora. However, our work with corpora led to a number of challenges. This study thus has two aims: a) presenting the challenges encountered in working out semantic networks of prefixes, and b) presenting the challenges related to obtaining reliable quantitative results from corpora that are either limited in size or constrained in other ways.
