15 research outputs found

    Is It Possible to Create a Very Large WordNet in 100 days? -- an Evaluation

    Wordnets are large-scale lexical databases of related words and concepts, useful for language-aware software applications. They have recently been built for many languages using various approaches. The Finnish wordnet, FinnWordNet (FiWN), was created by translating the more than 200,000 word senses in the English Princeton WordNet (PWN) 3.0 in 100 days. To ensure quality, they were translated by professional translators. The direct translation approach was based on the assumption that most synsets in PWN represent language-independent real-world concepts. The semantic relations between synsets were therefore also assumed to be mostly language-independent, so the structure of PWN could be reused as well. This approach allowed the creation of an extensive Finnish wordnet directly aligned with PWN and also provided us with a translation relation and thus a bilingual wordnet usable as a dictionary. In this paper, we address several concerns raised about our approach, many of them for the first time. We evaluate the craftsmanship of the translators by checking the spelling and translation quality, the viability of the approach by assessing the synonym quality at both the lexeme and the concept level, and the usefulness of the resulting lexical resource both for humans and in a language-technological task. We discovered no new problems compared with those already known in PWN. As a whole, the paper contributes to the scientific discourse on what it takes to create a very large wordnet. As a side effect of the evaluation, we extended FiWN to contain 208,645 word senses in 120,449 synsets, effectively making version 2.0 of FiWN currently the largest wordnet in the world by these statistics. Peer reviewed

    Comparing two thesaurus representations for Russian

    © 2018 Global WordNet Association. All Rights Reserved. In this paper we present a new Russian wordnet, RuWordNet, which was obtained semi-automatically by transforming the existing Russian thesaurus RuThes. In the first step, the basic structure of wordnets was reproduced: the synset hierarchy for each part of speech and the basic set of relations between synsets (hyponym-hypernym, part-whole, antonyms). In the second step, we added causation, entailment and domain relations between synsets. Derivation relations were also established for single words, and the component structure was established for phrases included in RuWordNet. The described transformation procedure highlights the specific features of each type of thesaurus representation.

    FIN-CLARIN - a humanities research infrastructure with emphasis on language

    Billions of words and thousands of hours of audio and video are needed as material for research in the humanities, and for language research in particular. Researchers also need tools to refine their own data collections and compare them with general data collections. When a research project ends, storage and distribution sites are needed to make raw data, tools and research results available and usable. Data, tools and shared possibilities for use together form a research infrastructure, which makes it possible to verify earlier results and to make new findings more efficiently, since not everyone has to start from scratch collecting data and building analysis tools. Not peer reviewed

    FIN-CLARIN – en humanistisk forskningsinfrastruktur med betoning på språk [FIN-CLARIN: a humanities research infrastructure with emphasis on language]

    Billions of words and thousands of hours of audio and video are needed as material for research in the humanities, and for language research in particular. Researchers also need tools to refine their own data collections and compare them with general data collections. When a research project ends, storage and distribution sites are needed to make raw data, tools and research results available and usable. Data, tools and shared possibilities for use together form a research infrastructure, which makes it possible to verify earlier results and to make new findings more efficiently, since not everyone has to start from scratch collecting data and building analysis tools.

    Trawling and trolling for terrorists in the digital Gulf of Bothnia : Cross-lingual text mining for the emergence of terrorism in Swedish and Finnish newspapers, 1780–1926

    In pursuing the historical emergence of the discourse on terrorism, this study trawls the “digital Gulf of Bothnia” in the form of a corpus of combined Swedish and Finnish digitized newspaper texts. Through a cross-lingual exploration of the uses of the concept of terrorism in historical Swedish and Finnish news, we examine meanings anchored in the two culturally close but still decidedly different national political contexts. The study is an outcome of an integrative interdisciplinary effort. Peer reviewed

    FiST – towards a Free Semantic Tagger of Modern Standard Finnish

    This paper introduces work in progress on implementing a free full-text semantic tagger for Finnish, FiST. The tagger is based on a 46,226-lexeme semantic lexicon of Finnish that was published in 2016. The basis of the semantic lexicon was developed in the early 2000s in the EU-funded project Benedict (Löfberg et al., 2005). Löfberg (2017) describes the compilation of the lexicon and evaluates a proprietary version of the Finnish Semantic Tagger, the FST. The FST and its lexicon were developed using the English Semantic Tagger (the EST) of Lancaster University as a model. This semantic tagger was developed at the University Centre for Corpus Research on Language (UCREL) at Lancaster University as part of the UCREL Semantic Analysis System (USAS) framework. The semantic lexicon of the USAS framework is based on the modified and enriched categories of the Longman Lexicon of Contemporary English (McArthur, 1981). We have implemented a basic working version of a new full-text semantic tagger for Finnish based on freely available components. The implementation uses Omorfi and FinnPos for morphological analysis of Finnish words. After the morphological recognition phase, words from the 46K semantic lexicon are matched against the morphologically unambiguous base forms. In our comprehensive tests, the lexical tagging coverage of the current implementation is around 82–90% with different text types. The present version still needs some enhancements: at least the processing of semantic ambiguity of words and the analysis of compounds, and perhaps also the treatment of multiword expressions. A semantically marked ground-truth evaluation collection should also be established for evaluating the tagger. Peer reviewed

    Eri meetodeid wordnet-tüüpi sõnastiku kontrolliks Eesti Wordneti näitel [Various methods for checking a wordnet-type dictionary, using the Estonian Wordnet as an example]

    https://www.ester.ee/record=b5358502*es