
3,713 research outputs found

Domain-aware Evaluation of Named Entity Recognition Systems for Croatian

Author: Bozo Bekavac
Zeljko Agic
Publication venue: University of Zagreb - University Computing Centre
Publication date: 01/01/2013

We provide an evaluation of the currently available named entity recognition systems for Croatian. The evaluation puts special emphasis on domain dependence. To this end, we manually annotated a dataset of approximately 1 million tokens of Croatian text from various domains within the newspaper text genre. The dataset was annotated using a three-class named entity tagset – denoting personal names, locations and organizations. We give insight into feature selection, domain sensitivity and the effects of increasing training set size for statistical named entity recognition using the state-of-the-art Stanford NER system. We also sketch a comparison of publicly available named entity recognition systems for Croatian considering domain dependence, regardless of their underlying paradigms. Our top-performing system achieved an F1-score of 0.884 in a mixed-domain testing scenario, scoring 0.925 and 0.843 in the two domains separated for the experiment. The system shows consistency in state-of-the-art scores for detecting names of persons, locations and organizations.

Crossref

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Babel Treebank of Public Messages in Croatian

Author: Agić Ana
Agić Željko
Merkler Danijela
Publication venue: The Authors. Published by Elsevier Ltd.
Publication date: 25/10/2013

The paper presents the process of constructing a publicly available treebank of public messages written in Croatian. The messages were collected from various electronic sources – e-mail, blog, Facebook and SMS – and published on the Zagreb Museum of Contemporary Art LED facade within the Babel art project. The project aimed to use the facade as an open-space blog or social interface for enabling citizens to publicly express their views. The construction and current state of the treebank are presented along with future work plans. A comparison of Babel Treebank with Croatian Dependency Treebank and SETimes.HR treebank regarding differing domains and annotation schemes is briefly sketched. The treebank is used as a test platform for introducing a new standard for syntactic annotation of Croatian texts. An experiment with morphosyntactic tagging and dependency parsing of the treebank is conducted, providing first insight into computational processing of non-standard text in Croatian.

Elsevier - Publisher Connector

Tagset Reductions in Morphosyntactic Tagging of Croatian Texts

Author: Agić Željko
Dovedan Zdravko
Tadić Marko
Publication venue: Department of Information Sciences, Faculty of Humanities and Social Sciences, University of Zagreb
Publication date: 01/11/2009

Morphosyntactic tagging of Croatian texts is performed with stochastic taggers by using a language model built on a manually annotated corpus implementing the Multext East version 3 specifications for Croatian. Tagging accuracy in this framework is basically predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Being that the 100 kw Croatia Weekly newspaper corpus by definition makes a rather small language model in terms of stochastic tagging of free domain texts, the paper presents an approach dealing with tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied on Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. Obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.

Repozitorij Filozofskog fakulteta u Zagrebu at University of Zagreb

Digitalni arhiv Filozofskog fakulteta u Zagrebu

A Legal Perspective on Training Models for Natural Language Processing

Author: Dore Giulia
Eckart de Castilho Richard
Gurevych Iryna
Labropoulou Penny
Margoni Thomas
Publication date: 01/01/2018

A significant concern in processing natural language data is the often unclear legal status of the input and output data/resources. In this paper, we investigate this problem by discussing a typical activity in Natural Language Processing: the training of a machine learning model from an annotated corpus. We examine which legal rules apply at relevant steps and how they affect the legal status of the results, especially in terms of copyright and copyright-related rights.

TUbiblio

Enlighten

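Kompiliranje korpusa u digitalnim humanističkim znanostima u jezicima s ograničenim resursima: o praksi kompiliranja tematskih korpusa iz digitalnih medija za srpski, hrvatski i slovenski [Compiling corpora in the digital humanities in languages with limited resources: on the practice of compiling thematic corpora from digital media for Serbian, Croatian and Slovenian]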

Author: Batanović Vuk
Bogetić Ksenija
Ljubešić Nikola
Publication venue: Hrvatsko filolosko drustvo (Croatian Philological Society)
Publication date: 01/01/2022

The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, drawing on some sort of corpus is increasingly resorted to for empirically-grounded social-scientific analysis (sometimes dubbed ‘corpus-assisted discourse analysis’ or ‘corpus-based critical discourse analysis’, cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist – partly due to the fast-changing background of these issues, but also due to the fact that there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond the anglophone contexts. In this paper we aim to discuss some possible solutions to these difficulties, by presenting one step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South-Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

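MetaLangCORP: PREDSTAVLJANJE PRVOGA KORPUSA MEDIJSKOGA METAJEZIKA NA SLOVENSKOM, HRVATSKOM I SRPSKOM I MOGUĆNOSTI NJEGOVE MEĐUDISCIPLINARNE PRIMJENE [MetaLangCORP: presenting the first corpus of media metalanguage in Slovenian, Croatian and Serbian and the possibilities of its interdisciplinary application]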

Author: Ksenija Bogetić
Publication venue: Faculty of Humanities and Social Sciences University of Rijeka
Publication date: 01/01/2021

Growing interest in metalanguage, in linguistics and other disciplines, has highlighted a gap in metalanguage corpora and analytical resources, which remain among the scarcest in corpus-linguistic developments so far. This paper is aimed at making a step towards filling this gap, both by presenting our own metalanguage corpus resource and using it in a short sample analysis to discuss the applications of such resources in linguistics and social sciences. Specifically, the paper presents for the first time MetaLangCORP, a multielement corpus of contemporary media metalanguage in languages of three post-Yugoslav states, linguistically annotated and made available open-access at the CLARIN repository of linguistic resources. To put the corpus in context, the meaning and relevance of metalanguage research is outlined, the existing efforts at compiling corpora of metalanguage are reviewed, and a sample preliminary analysis of MetaLangCORP keywords is presented to open a broader discussion on the potential applicability of metalanguage corpora. More broadly, it is hoped that making this kind of data available will prompt more nuanced analyses of metalanguage, as well as more corpus-building efforts along similar lines in Slavic and other linguistic scholarship.

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Serbo-Croatian LVCSR on the dictation and broadcast news domain

Author: Geutner Petra
Scheytt Peter
Waibel Alex
Publication date: 02/08/2007

KITopen

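Glagolski prefiks o(b)- u hrvatskome i bugarskome: semantička mreža i izazovi korpusno utemeljena istraživanja [The verbal prefix o(b)- in Croatian and Bulgarian: the semantic network and the challenges of corpus-based research]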

Author: Ljiljana Šarić
Svetlana Nedelcheva
Publication venue: Croatian Philological Society
Publication date: 01/01/2015

This study compares the verbal prefix o(b)- in two South Slavic languages, Croatian and Bulgarian, from a cognitive linguistic perspective. We focus on the problems arising when constructing the semantic network of this polysemous prefix, particularly on 1) isolating the prefix’s meaning from the meaning of the base verb and 2) identifying core/dominant sub-meanings for all verbs and giving them corresponding semantic labels. Our approach to morphology is based on extensive databases of verbs collected from dictionaries and a few corpora. However, our work with corpora led to a number of challenges. This study thus has two aims: a) presenting challenges encountered in working out semantic networks of prefixes, and b) presenting challenges related to obtaining reliable (quantitative) results from the corpora.

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia