14 research outputs found

    Eesti Wordnet’i struktuuri analüüsist

    The article offers a general approach for finding and ordering closed sets of relational systems and demonstrates the use of this method on Estonian Wordnet. The structure of Estonian Wordnet is analysed on the basis of its semantic relations. The idea behind the information-processing method used for the analysis is explained, and the problem to be solved is formulated informally. The successive steps of applying the method are presented. The data processing yields visual objects of analysis (pictures) that reveal the structure of Wordnet in a way that lets a lexicographer assess the peculiarities hidden in the structures. The final part of the article uses examples to give hints about possible problems and their solutions. DOI: http://dx.doi.org/10.5128/ERYa8.09

    The Distribution Index Calculator for Estonian

    Lexicographers working with such morphologically rich languages as Estonian face the task of detecting the lexicographic status of some word forms that look like case forms of nouns but can behave as function words to a certain degree. Hence, a measurable criterion for making a word form an autonomous headword is needed. The present paper describes the idea and development of a tool called the Distribution Index Calculator (DIC) for Estonian. It is a web-based application which finds the frequency data of word forms and lemmas from an annotated corpus and retrieves a statistic called the Distribution Index (DI). The DI indicates the relative prominence of a word form as compared to its expected normative level of salience. The application is described in detail and some illustrations of its performance are provided. The evaluation of its quality is as follows: a higher than critical level of DI can be trusted as an indicator of the relative autonomy of a word form, while a lower than critical level of DI does not preclude such autonomy. The DIC thus gives relative heuristics rather than absolute ratings or true-value decisions.

    How to create order in large closed subsets of WordNet-type dictionaries

    This article presents a new two-step method to handle and study large closed subsets of WordNet-type dictionaries with the goal of finding possible structural inconsistencies. The notion of closed subset is explained using a WordNet tree. A novel and very fast method to order large relational systems is described and compared with some other fast methods. All the presented methods have been tested using the largest closed sets of the Estonian and Princeton WordNets. DOI: http://dx.doi.org/10.5128/ERYa9.10

    Käändevormist sõnaks : mida näitab sagedus?

    This study is motivated by the need for a statistical benchmark that would help the lexicographer to judge a morphological form for its grammaticalization stage to the degree of an independent lexeme. The focus of this article is on Estonian substantives and in particular their forms in the 11 semantic cases. The main research question of this study is: is there a statistical sign indicating that a case form of a noun is emerging as a potentially independent lexeme? Based on the normal distribution of nominal case form frequencies, we established a statistic that determines a case form’s elicitation in a corpus – the distribution index (D-index). The D-index can be used as an indicator of the correspondence of a particular form’s actual frequency with the predicted elicitation degree. The article addresses the question of noun forms becoming independent, starting from the needs of lexicography. Assuming that abstract case forms are characterised by a general, stable proportion of occurrence in a corpus, we propose a statistical measure, the distribution index, for deciding whether the usage frequency of a word form is sufficient to consider it emancipated from its paradigm and thus a candidate for an independent headword. The index takes into account the relative frequency of the form in the corpus, comparing the actual usage frequency with the one expected on the basis of the norm, and makes it possible to place cases with very different absolute frequencies on the same scale. We illustrate the performance of the distribution index by comparing the paradigms of ordinary nouns with rich inflection with those of words that yield ambiforms. We set a provisional threshold value for the index; a form with a value above it can be regarded as an independent lexeme. The index and the threshold value are tested on data from different corpora (EtTenTen13, ÜK 2019).

    From experiments to an application : The first prototype of an adjective detector for Estonian

    In this study, we discuss the process of developing a multi-parameter application – the adjective similarity calculator (ASC) – that determines the relative adjectivity of a word or a word form. The tool relates the statistical summary of a word (form)’s corpus behaviour to the most typical and central aspects of the Estonian adjective: the adjectival corpus profile. To establish this profile, we use close-context patterns that characterise adjectives and are detectable in the corpus (see the experiments in Tuulik et al. 2022, Paulsen et al. 2022, and Vainik et al. 2023). The first prototype of the ASC will be evaluated based on clear cases of adjectives and PoS representatives overlapping with adjectival properties, but also based on words representing more distant classes. The main purpose of the application is to support lexicographic work in categorising lexical categories that partly overlap with the adjective, particularly in such ambiguous cases as adjectivised participles, nouns and adverbs.

    Catching lexemes : The case of Estonian noun-based ambiforms

    The aim of this study is to test a statistic relying on corpus data, the distributional index (D-index): a statistical benchmark that helps lexicographers judge if a morphological form has been conventionalised to the degree of becoming an independent lexeme. Our focus is on the decategorisation type that originates from a case form of a noun and is directed to an adverb, adposition or adjective. The words or inflected forms corresponding to more than one word class interpretation are in this study termed ambiforms. The analysis compares the D-index levels of ambiforms categorised as nouns and another PoS. The results suggest that for the outcome to be most authentic, the noun-based ambiforms should be analysed without the decategorisation influence, i.e. the D-index analysis should be applied in the pre-PoS-disambiguation stage.