Domain-aware Evaluation of Named Entity Recognition Systems for Croatian
We provide an evaluation of the currently available named entity recognition systems for Croatian. The evaluation puts special emphasis on domain dependence. To this end, we manually annotated a dataset of approximately 1 million tokens of Croatian text from various domains within the newspaper text genre. The dataset was annotated using a three-class named entity tagset denoting personal names, locations and organizations. We give insight into feature selection, domain sensitivity and the effects of increasing training set size for statistical named entity recognition using the state-of-the-art Stanford NER system. We also sketch a comparison of publicly available named entity recognition systems for Croatian considering domain dependence, regardless of their underlying paradigms. Our top-performing system achieved an F1-score of 0.884 in a mixed-domain testing scenario, scoring 0.925 and 0.843 in the two domains separated for the experiment. The system shows consistent state-of-the-art scores for detecting names of persons, locations and organizations.
Babel Treebank of Public Messages in Croatian
The paper presents the process of constructing a publicly available treebank of public messages written in Croatian. The messages were collected from various electronic sources (e-mail, blog, Facebook and SMS) and published on the Zagreb Museum of Contemporary Art LED facade within the Babel art project. The project aimed to use the facade as an open-space blog or social interface enabling citizens to publicly express their views. The construction and current state of the treebank are presented along with future work plans. A comparison of Babel Treebank with Croatian Dependency Treebank and SETimes.HR treebank regarding differing domains and annotation schemes is briefly sketched. The treebank is used as a test platform for introducing a new standard for syntactic annotation of Croatian texts. An experiment with morphosyntactic tagging and dependency parsing of the treebank is conducted, providing a first insight into computational processing of non-standard text in Croatian.
Tagset Reductions in Morphosyntactic Tagging of Croatian Texts
Morphosyntactic tagging of Croatian texts is performed with stochastic taggers by using a language model built on a manually annotated corpus implementing the Multext-East version 3 specifications for Croatian. Tagging accuracy in this framework is basically predefined, i.e. dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition makes a rather small language model in terms of stochastic tagging of free domain texts, the paper presents an approach dealing with tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied on Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. Obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
A Legal Perspective on Training Models for Natural Language Processing
A significant concern in processing natural language data is the often unclear legal status of the input and output data/resources. In this paper, we investigate this problem by discussing a typical activity in Natural Language Processing: the training of a machine learning model from an annotated corpus. We examine which legal rules apply at relevant steps and how they affect the legal status of the results, especially in terms of copyright and copyright-related rights.
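Kompiliranje korpusa u digitalnim humanističkim znanostima u jezicima s ograničenim resursima: o praksi kompiliranja tematskih korpusa iz digitalnih medija za srpski, hrvatski i slovenski (Compiling corpora in the digital humanities for languages with limited resources: on the practice of compiling thematic corpora from digital media for Serbian, Croatian and Slovenian)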
The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, researchers increasingly draw on some sort of corpus for empirically grounded social-scientific analysis (sometimes dubbed "corpus-assisted discourse analysis" or "corpus-based critical discourse analysis", cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly due to the fast-changing background of these issues, but also due to the fact that there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond anglophone contexts.
In this paper we aim to discuss some possible solutions to these difficulties by presenting a step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of
South Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.
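MetaLangCORP: Predstavljanje prvoga korpusa medijskoga metajezika na slovenskom, hrvatskom i srpskom i mogućnosti njegove međudisciplinarne primjene (MetaLangCORP: Presenting the first corpus of media metalanguage in Slovenian, Croatian and Serbian and the possibilities of its interdisciplinary application)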
Growing interest in metalanguage, in linguistics and other disciplines, has highlighted a gap in metalanguage corpora and analytical resources, which remain among the scarcest in corpus-linguistic developments so far. This paper aims to take a step towards filling this gap, both by presenting our own metalanguage corpus resource and by using it in a short sample analysis to discuss the applications of such resources in linguistics and social sciences. Specifically, the paper presents for the first time MetaLangCORP, a multielement corpus of contemporary media metalanguage in languages of three post-Yugoslav states, linguistically annotated and made available open-access in the CLARIN repository of linguistic resources. To put the corpus in context, the meaning and relevance of metalanguage research is outlined, the existing efforts at compiling corpora of metalanguage are reviewed, and a sample preliminary analysis of MetaLangCORP keywords is presented to open a broader discussion on the potential applicability of metalanguage corpora. More broadly, it is hoped that making this kind of data available will prompt more nuanced analyses of metalanguage, as well as more corpus-building efforts along similar lines in Slavic and other linguistic scholarship.
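Glagolski prefiks o(b)- u hrvatskome i bugarskome: semantička mreža i izazovi korpusno utemeljena istraživanja (The verbal prefix o(b)- in Croatian and Bulgarian: the semantic network and the challenges of corpus-based research)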
This study compares the verbal prefix o(b)- in two South Slavic languages, Croatian and Bulgarian, from a cognitive linguistic perspective. We focus on the problems arising when constructing the semantic network of this polysemous prefix, particularly on 1) isolating the prefix's meaning from the meaning of the base verb and 2) identifying core/dominant sub-meanings for all verbs and giving them corresponding semantic labels. Our approach to morphology is based on extensive databases of verbs collected from dictionaries and a few corpora. However, our work with corpora led to a number of challenges. This study thus has two aims: a) presenting challenges encountered in working out semantic networks of prefixes, and b) presenting challenges related to obtaining reliable (quantitative) results
from the corpora, which are either limited in size or subject to other kinds of limitations.