
7 research outputs found

First Steps Toward Developing a System for Terminology Extraction

Author: Bago Petra
Boras Damir
Ljubešić Nikola
Publication venue: Department of Information Sciences, Faculty of Humanities and Social Sciences, University of Zagreb
Publication date: 01/11/2009

The aim of this paper is to describe the first steps in developing a system for terminology extraction. First, a data sample is built from synopses of doctoral theses accepted at the Faculty of Humanities and Social Sciences, University of Zagreb, in the period from 2004 to 2009 and written mostly in Croatian. The data sample consists of 420 documents and 338,706 tokens. A small sample was manually tagged for terminology to be used in an initial experiment. The approach to terminology extraction is knowledge-driven and consists of differential analysis of reference and domain-specific corpora. The specific method used is the log-likelihood ratio test. The experiment deals with different reference corpora and linguistic pre-processing. The first results are promising. Guidelines for further research are discussed.
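The differential analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common two-term log-likelihood formulation for comparing a term's frequency in a domain corpus against a reference corpus; the abstract does not state which variant the authors used, and the function names and toy counts below are hypothetical.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood score for one term (an assumed formulation).

    a: frequency of the term in the domain corpus
    b: frequency of the term in the reference corpus
    c: total number of tokens in the domain corpus
    d: total number of tokens in the reference corpus
    """
    # Expected frequencies if the term were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll


def rank_terms(domain_counts, reference_counts):
    """Rank terms that are overrepresented in the domain corpus.

    Both arguments are dicts mapping a term to its raw frequency;
    both corpora are assumed to be non-empty.
    """
    c = sum(domain_counts.values())
    d = sum(reference_counts.values())
    scored = []
    for term, a in domain_counts.items():
        b = reference_counts.get(term, 0)
        # Keep only terms relatively more frequent in the domain corpus.
        if a / c > b / d:
            scored.append((term, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy counts, not data from the paper.
domain = {"korpus": 8, "i": 12}
reference = {"korpus": 2, "i": 120, "je": 78}
candidates = rank_terms(domain, reference)
```

In this toy input, "i" has the same relative frequency in both corpora and is filtered out, while "korpus" is kept as a term candidate. Swapping the reference corpus changes the expected frequencies, which is one way to read the abstract's comparison of different reference corpora.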

Repozitorij Filozofskog fakulteta u Zagrebu at University of Zagreb

Digitalni arhiv Filozofskog fakulteta u Zagrebu

Error Analysis in Croatian Morphosyntactic Tagging

Author: Agić Željko
Dovedan Zdravko
Tadić Marko
Publication venue: Srce - University of Zagreb, University Computing Centre
Publication date: 01/01/2009

In this paper, we provide detailed insight into the properties of errors generated by a stochastic morphosyntactic tagger assigning Multext-East morphosyntactic descriptions to Croatian texts. Tagging the Croatia Weekly newspaper corpus with the CroTag tagger in stochastic mode revealed that approximately 85 percent of all tagging errors occur on nouns, adjectives, pronouns and verbs. Moreover, approximately 50 percent of these are shown to be incorrect assignments of case values. We provide various other distributional properties of errors in assigning morphosyntactic descriptions for these and other parts of speech. On the basis of these properties, we propose rule-based and stochastic strategies which could be integrated in the tagging module, creating a hybrid procedure in order to raise overall tagging accuracy for Croatian.

Repozitorij Filozofskog fakulteta u Zagrebu at University of Zagreb

Crossref

Digitalni arhiv Filozofskog fakulteta u Zagrebu

Tagset Reductions in Morphosyntactic Tagging of Croatian Texts

Author: Agić Željko
Dovedan Zdravko
Tadić Marko
Publication venue: Department of Information Sciences, Faculty of Humanities and Social Sciences, University of Zagreb
Publication date: 01/11/2009

Morphosyntactic tagging of Croatian texts is performed with stochastic taggers by using a language model built on a manually annotated corpus implementing the Multext East version 3 specifications for Croatian. Tagging accuracy in this framework is basically predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition makes a rather small language model in terms of stochastic tagging of free domain texts, the paper presents an approach dealing with tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied on Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. The results obtained are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
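As a rough illustration of why a smaller tagset can raise measured accuracy, the sketch below truncates positional morphosyntactic tags to their first few positions before comparing gold and predicted labels. This is an assumption made for illustration only: the abstract does not say how the authors defined their tagset subsets, and the helper names and example tags are hypothetical.

```python
def reduce_msd(msd, keep):
    """Keep only the first `keep` positions of a positional tag.

    With keep=1 only the part-of-speech letter remains (assumed scheme).
    """
    return msd[:keep]


def accuracy(gold, predicted, keep=None):
    """Share of tokens whose (optionally reduced) tags match."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    if keep is not None:
        gold = [reduce_msd(tag, keep) for tag in gold]
        predicted = [reduce_msd(tag, keep) for tag in predicted]
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Illustrative tags only; the first token differs in the final (case) position.
gold_tags = ["Ncmsn", "Vmip3s", "Ncfsa"]
predicted_tags = ["Ncmsa", "Vmip3s", "Ncfsa"]

full_accuracy = accuracy(gold_tags, predicted_tags)
pos_only_accuracy = accuracy(gold_tags, predicted_tags, keep=1)
```

Here the single case error counts against the full tagset but disappears once tags are reduced to the part of speech, so the reduced score is higher at the cost of less informative output.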

Repozitorij Filozofskog fakulteta u Zagrebu at University of Zagreb

Digitalni arhiv Filozofskog fakulteta u Zagrebu

hrWaC and slWaC: Compiling web corpora for Croatian and Slovene

Author: Nikola Ljubešić
Tomaž Erjavec
Publication venue: Springer
Publication date: 01/01/2011

Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified standard "Web as Corpus" pipeline, bearing in mind the limited amount of available web data. The modifications are described in the paper, focusing on content extraction from HTML pages, which combines high precision of extracted language content with decent recall. The paper also investigates the text types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.

CiteSeerX

Building a Croatian language stemmer

Author: Ivan Pandžić
Publication venue: Institute of Croatian Language and Linguistics
Publication date: 01/01/2015

The paper describes the development of two stemmers for Croatian (k2 and k3) that use the derivational suffixes of nouns, adjectives and verbs to determine the stems of tokens. We tested the assumption that these stemmers would achieve better results than other similar stemmers for Croatian by comparing their precision, recall and F1-measures with the same values for the initial stemmer (k1). For this purpose, a manually verified corpus of 9775 tokens with assigned lemmas and morphosyntactic tags was used. The paper also addresses problems related to the terminology used in the field of stemming.

The paper presents two conservative Croatian language stemmers, k2 and k3. These stemmers are based on the k1 stemmer, an aggressive Croatian language stemmer presented by Nikola Ljubešić in a 2007 paper. By introducing an expanded set of rules that use derivational morphemes of nouns, verbs, and adjectives to determine the stems of words, we hoped to create a more efficient stemmer. In order to test whether the k2 and k3 stemmers were more efficient than the k1 stemmer, we calculated their precision, recall, and F1-score using a 9775-token corpus, and compared the results with the precision, recall, and F1-score of the k1 stemmer.

Directory of Open Access Journals

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

The Future of Information Sciences : INFuture2009 : Digital Resources and Knowledge Sharing

Publication venue: Department of Information Sciences, Faculty of Humanities and Social Sciences, University of Zagreb
Publication date: 01/11/2009

Repozitorij Filozofskog fakulteta u Zagrebu at University of Zagreb