    Word Sense Disambiguation Based on Large Scale Polish CLARIN Heterogeneous Lexical Resources

    Lexical resources can be applied in many different Natural Language Engineering tasks, but the most fundamental task is the recognition of word senses used in text contexts. The problem is difficult and not yet fully solved, and different lexical resources provide varied support for it. Polish CLARIN lexical semantic resources are based on plWordNet, a very large wordnet for Polish, as a central structure that serves as a basis for linking together several resources of different types. In this paper, several Word Sense Disambiguation (henceforth WSD) methods developed for Polish that utilise plWordNet are discussed. Textual sense descriptions in a traditional lexicon can be compared with text contexts using Lesk’s algorithm in order to find the best matching senses. In the case of a wordnet, lexico-semantic relations provide the main description of word senses. Thus, first, we adapted and applied to Polish a WSD method based on PageRank. In this method, text words are mapped onto their senses in the plWordNet graph and the PageRank algorithm is run to find the senses with the highest scores. The method yields results lower than, but comparable to, those reported for English. The error analysis showed that the main problems are fine-grained sense distinctions in plWordNet and the limited number of connections between words of different parts of speech. In the second approach, plWordNet expanded with a mapping onto SUMO ontology concepts was used. Two scenarios for WSD were investigated: two-step disambiguation and disambiguation based on a combined network of plWordNet and SUMO. In the former scenario, words are first assigned SUMO concepts and then plWordNet senses are disambiguated. In the latter, plWordNet and SUMO are combined into one large network that is then used for sense disambiguation. The additional knowledge sources used in WSD improved the performance. The obtained results and potential further lines of development are discussed.
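
    The graph-based step described in the abstract (map context words onto their candidate senses, run PageRank over the sense graph, keep the highest-scoring sense of each word) can be illustrated with a minimal sketch. It is not the authors' implementation: the sense identifiers, the edges, the damping factor and the function names below are all hypothetical, and the toy graph merely stands in for plWordNet. The sketch assumes the personalized variant of PageRank, in which the restart probability is spread uniformly over the candidate senses of the context words.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph, restarting on `seeds`."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seeds = set(seeds)
    nodes = sorted(set(neighbours) | seeds)
    # Restart distribution: uniform over the seed senses, zero elsewhere.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            if neighbours[n]:
                # Spread this node's score evenly over its neighbours.
                share = damping * rank[n] / len(neighbours[n])
                for m in neighbours[n]:
                    new_rank[m] += share
            else:
                # Isolated node: hand its score back to the seeds.
                for s in seeds:
                    new_rank[s] += damping * rank[n] / len(seeds)
        rank = new_rank
    return rank


def disambiguate(context_words, candidate_senses, edges):
    """Pick, for every context word, its candidate sense with the highest rank."""
    seeds = {s for w in context_words for s in candidate_senses[w]}
    rank = personalized_pagerank(edges, seeds)
    return {w: max(candidate_senses[w], key=lambda s: rank[s]) for w in context_words}


# Hypothetical toy data: "zamek" is ambiguous between a castle and a door lock.
CANDIDATES = {
    "zamek": ["zamek#castle", "zamek#lock"],
    "klucz": ["klucz#key"],
    "drzwi": ["drzwi#door"],
}
EDGES = [
    ("zamek#lock", "klucz#key"),
    ("zamek#lock", "drzwi#door"),
    ("zamek#castle", "budowla#building"),
    ("drzwi#door", "budowla#building"),
]

senses = disambiguate(["zamek", "klucz", "drzwi"], CANDIDATES, EDGES)
print(senses["zamek"])  # zamek#lock
```

    In this toy context the lock sense of "zamek" wins because both of its neighbours are senses of other context words, whereas the castle sense is reachable only through an unrelated node. With few edges between the senses of context words, as the error analysis reports for words of different parts of speech, the scores of competing senses would be harder to separate.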

    The construction of reference corpus of contemporary Serbian

    This thesis considers the methods and tools for constructing a corpus of contemporary Serbian as a reference language resource. The thesis consists of three parts. The first part considers general questions related to the definition, history, parameters and classification of corpora, as well as corpus linguistics as a methodology in language research. Special attention is paid to the representativeness and balance of a corpus as a language sample. The influence of the Internet/Web on the critical re-examination of the definition of a corpus is also considered in detail. The following corpus parameters are analysed in particular: storage medium, domain and purpose, size, time span, source/medium (mode of communication), annotation and multilinguality. Possible classifications of corpora based on these parameters are described, with emphasis on national corpora as general reference corpora that aim to represent the language of a country. National corpora of Slavic languages are analysed in detail. A special section is dedicated to the history of Serbian corpus linguistics. The goals of the thesis are listed at the end of the first part: to consider the possibilities for constructing a general corpus of Serbian that is electronic, dynamic, synchronic, balanced and annotated (morphologically, structurally and bibliographically), as well as the possibilities for constructing accompanying multilingual parallel corpora with Serbian as the source or target language...