First Steps Toward Developing a System for Terminology Extraction
The aim of this paper is to describe the first steps in developing a system for terminology extraction. First, a data sample is built from synopses of doctoral theses accepted at the Faculty of Humanities and Social Sciences, University of Zagreb, in the period from 2004 to 2009 and written mostly in Croatian. The data sample consists of 420 documents and 338,706 tokens. A small sample was manually tagged for terminology to be used in an initial experiment. The approach to terminology extraction is knowledge-driven and consists of a differential analysis of reference and domain-specific corpora. The specific method used is the log-likelihood ratio test. The experiment covers different reference corpora and different linguistic pre-processing. The first results are promising. Guidelines for further research are discussed.
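The differential analysis described above can be illustrated with a minimal sketch. It assumes the common two-term form of the log-likelihood (G2) keyness statistic and plain whitespace-tokenized input; the function names `log_likelihood` and `rank_candidates` are hypothetical and are not taken from the paper, which does not specify its exact formulation.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 score for a word seen a times in a domain corpus of c tokens
    and b times in a reference corpus of d tokens (two-term form)."""
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def rank_candidates(domain_tokens, reference_tokens):
    """Return (word, G2) pairs for words overrepresented in the domain
    corpus, sorted from the strongest term candidate to the weakest."""
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    c = sum(dom.values())
    d = sum(ref.values())
    if c == 0 or d == 0:
        raise ValueError("both corpora must contain at least one token")
    scored = []
    for word, a in dom.items():
        b = ref.get(word, 0)
        # Keep only words that are relatively more frequent in the domain.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A word with the same relative frequency in both corpora scores zero, so only words that are markedly more frequent in the domain corpus rise to the top of the candidate list. Changing the reference corpus or lemmatizing the tokens first changes the counts that go in, which is the kind of variation the experiment examines.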
Error Analysis in Croatian Morphosyntactic Tagging
In this paper, we provide detailed insight into the properties of errors generated by a stochastic morphosyntactic tagger assigning Multext-East morphosyntactic descriptions to Croatian texts. Tagging the Croatia Weekly newspaper corpus with the CroTag tagger in stochastic mode revealed that approximately 85 percent of all tagging errors occur on nouns, adjectives, pronouns and verbs. Moreover, approximately 50 percent of these are shown to be incorrect assignments of case values. We provide various other distributional properties of errors in assigning morphosyntactic descriptions for these and other parts of speech. On the basis of these properties, we propose rule-based and stochastic strategies which could be integrated into the tagging module, creating a hybrid procedure in order to raise overall tagging accuracy for Croatian.
Tagset Reductions in Morphosyntactic Tagging of Croatian Texts
Morphosyntactic tagging of Croatian texts is performed with stochastic taggers by using a language model built on a manually annotated corpus implementing the Multext-East version 3 specifications for Croatian. Tagging accuracy in this framework is basically predefined, i.e. it depends on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Given that the 100 kw Croatia Weekly newspaper corpus by definition makes for a rather small language model in terms of stochastic tagging of free-domain texts, the paper presents an approach based on tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied to Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. The results obtained are discussed in terms of applying different reductions in different natural language processing systems and in specific tasks defined by specific user requirements.
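The effect of a tagset reduction on measured accuracy can be sketched as follows. This is only an illustration, assuming that a reduction keeps the first few positions of a positional Multext-East tag such as `Ncmsn`; the subsets actually defined in the paper are not given here, and the names `reduce_msd` and `accuracy` are hypothetical. The sketch also reduces tags after tagging, whereas a tagger could instead be trained on the reduced tagset.

```python
def reduce_msd(msd, keep):
    """Truncate a positional morphosyntactic tag to its first `keep`
    positions, e.g. 'Ncmsn' -> 'N' for keep=1 (part of speech only)."""
    return msd[:keep]


def accuracy(gold, predicted, keep=None):
    """Share of tokens whose predicted tag matches the gold tag,
    optionally after reducing both tags to `keep` positions."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    if keep is not None:
        gold = [reduce_msd(tag, keep) for tag in gold]
        predicted = [reduce_msd(tag, keep) for tag in predicted]
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data: one case error (Ncmsn vs Ncmsa) and one category error.
gold_tags = ["Ncmsn", "Vmip3s", "Afpfsg"]
predicted_tags = ["Ncmsa", "Vmip3s", "Ncfsg"]

full_accuracy = accuracy(gold_tags, predicted_tags)
pos_accuracy = accuracy(gold_tags, predicted_tags, keep=1)
```

On this toy data the case error disappears once the tags are reduced to the part of speech, so the reduced accuracy is higher than the full-tag accuracy. This is the trade-off between tag granularity and accuracy that the abstract describes.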
hrWaC and slWaC: Compiling web corpora for Croatian and Slovene
Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified standard "Web as Corpus" pipeline, adapted to the limited amount of available web data. The modifications are described in the paper, focusing on content extraction from HTML pages, which combines high precision of the extracted language content with decent recall. The paper also investigates the text types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.
Building a Croatian language stemmer
The paper presents the construction of two stemmers for Croatian (k2 and k3) that use the derivational suffixes of nouns, adjectives, and verbs to determine the stems of tokens. We tested the assumption that these stemmers would achieve better results than other similar stemmers for Croatian by comparing their precision, recall, and F1-measure with the same values for the initial stemmer (k1). For this purpose, we used a manually verified corpus of 9775 tokens with assigned lemmas and morphosyntactic tags. The paper also addresses problems related to the terminology used in the field of stemming. The paper presents two conservative Croatian language stemmers, k2 and k3. These stemmers are based on the k1 stemmer, an aggressive Croatian language stemmer presented by Nikola Ljubešić in a 2007 paper. By introducing an expanded set of rules that use derivational morphemes of nouns, verbs, and adjectives to determine the stems of words, we hoped to create a more efficient stemmer. In order to test whether the k2 and k3 stemmers were more efficient than the k1 stemmer, we calculated their precision, recall, and F1-score using a 9775-token corpus, and compared the results with the precision, recall, and F1-score of the k1 stemmer.
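One way such a comparison could be computed is sketched below. The abstract does not say how precision and recall were defined, so the pairwise definition used here is an assumption: two word forms count as correctly conflated when they share both a stem and a gold lemma. The function `pairwise_prf` and the two toy stemmers are hypothetical and are not the k1, k2, or k3 rule sets.

```python
from itertools import combinations


def pairwise_prf(form_to_lemma, stem):
    """Pairwise precision, recall and F1 of a stemmer.

    form_to_lemma maps each word form to its gold lemma;
    stem is a function from a word form to its stem."""
    tp = fp = fn = 0
    for a, b in combinations(sorted(form_to_lemma), 2):
        same_lemma = form_to_lemma[a] == form_to_lemma[b]
        same_stem = stem(a) == stem(b)
        if same_stem and same_lemma:
            tp += 1   # correctly conflated pair
        elif same_stem:
            fp += 1   # conflated, but the lemmas differ (overstemming)
        elif same_lemma:
            fn += 1   # same lemma, but stems differ (understemming)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy gold data: three forms of "kuća" and two forms of "kum".
gold = {"kuća": "kuća", "kuće": "kuća", "kući": "kuća",
        "kum": "kum", "kuma": "kum"}

# Toy stemmers that simply keep a fixed-length prefix.
aggressive = pairwise_prf(gold, lambda word: word[:2])
conservative = pairwise_prf(gold, lambda word: word[:3])
```

On this toy data the two-character stemmer merges every form into one class, which keeps recall high but lowers precision, while the three-character stemmer separates the two lemmas. Real rule sets based on derivational suffixes would replace the prefix functions.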