27 research outputs found

    HFST—Framework for Compiling and Applying Morphologies

    HFST, Helsinki Finite-State Technology (hfst.sf.net), is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently connects some of the most important finite-state tools for creating morphologies and spellers into one open-source platform and supports extending and improving the descriptions with weights to accommodate the modeling of statistical information. HFST offers a path from language descriptions to efficient language applications in key environments and operating systems. HFST also provides an opportunity to exchange transducers between different software providers in order to get the best out of each finite-state library. Peer reviewed

    Implementation of replace rules using preference operator

    We explain the implementation of replace rules with the .r-glc. operator and preference relations. Our modular approach combines various preference constraints to form different replace rules. In addition to describing the method, we present illustrative examples. Peer reviewed

    HFST Training Environment and Recent Additions

    HFST, the Helsinki Finite-State Technology toolkit, was launched in 2009 (Lindén et al., 2009) and has since been used for developing a number of rule-based morphologies for processing natural language. To promote the uptake of the toolkit, a training environment in which linguists can learn to use HFST has been designed in Jupyter. This paper presents an overview of the training environment and some of the recent features that have been added to HFST to keep the run-time size of the transducer reasonably small despite the exceptions and negative constraints that need to be added during practical FST development. Peer reviewed

    FIN-CLARIN – en humanistisk forskningsinfrastruktur med betoning på språk

    Billions of words and thousands of hours of audio and video are needed as material for humanities research, and for language research in particular. In addition, researchers need tools to refine their own data collections and to compare them with general data collections. When a research project ends, storage and distribution facilities are needed to make raw data, tools and research results available and usable. Data, tools and shared opportunities for use together form a research infrastructure that makes it possible to verify earlier results and to reach new findings more efficiently, since not everyone has to start from scratch collecting data and building analysis tools.

    A Humor új Fo(r)mája

    Over the past decades, morphological databases have been created in numerous languages for MorphoLogic's Humor morphological analyzer. Some of these offer very good coverage and accuracy, while others make automatic morphological analysis possible for languages for which no other comparable resource exists. However, the closed licence of the Humor analyzer software did not permit these language resources to be distributed freely. In addition, the implementation of the Humor analyzer does not allow the analysis of unknown words (morphological guessing), nor does it allow frequency information to be assigned to individual words or the model to be weighted in other ways. We solved these problems by converting the Humor morphological resources into a finite-state description that resolves all of these issues and also has an open-source implementation.

    Predictive Text Entry for Agglutinative Languages Using Unsupervised Morphological Segmentation

    Host publication title: Computational Linguistics and Intelligent Text Processing. Host publication sub-title: 13th International Conference, CICLing 2012. Peer reviewed

    Machine Translation for Crimean Tatar to Turkish

    This paper presents a machine translation system for Crimean Tatar to Turkish. To our knowledge, this is the first machine translation system made available for public use for Crimean Tatar, and the first such system released as free and open-source software. The system was built using Apertium, a free and open-source machine translation system, and is currently unidirectional from Crimean Tatar to Turkish. We describe our translation system, evaluate it on parallel corpora, and compare its performance with a neural machine translation system trained on the limited amount of corpora available.

    FIN-CLARIN - a humanities research infrastructure with emphasis on language

    Billions of words and thousands of hours of audio and video are needed as material for humanities research, and for language research in particular. In addition, researchers need tools to refine their own data collections and to compare them with general data collections. When a research project ends, storage and distribution facilities are needed to make raw data, tools and research results available and usable. Data, tools and shared opportunities for use together form a research infrastructure that makes it possible to verify earlier results and to reach new findings more efficiently, since not everyone has to start from scratch collecting data and building analysis tools. Not peer reviewed

    Finite-state Relations Between Two Historically Closely Related Languages

    Regular correspondences between historically related languages can be modelled using finite-state transducers (FSTs). A new method is presented by demonstrating it with a bidirectional experiment between Finnish and Estonian. An artificial representation (resembling a proto-language) is established between two related languages. This representation, AFE (Aligned Finnish-Estonian), is based on the letter-by-letter alignment of the two languages and uses mechanically constructed morphophonemes which represent the corresponding characters. By describing the constraints of this AFE using two-level rules, one may construct useful mappings between the languages. In this way, the badly ambiguous FSTs from Finnish and Estonian to AFE can be composed into a practically unambiguous transducer from Finnish to Estonian. The inverse mapping from Estonian to Finnish is mildly ambiguous. Steps according to the proposed method could be repeated as such with dialectal or older written texts. Choosing a set of model words, aligning them, recording the mechanical correspondences and designing rules for the constraints could be done with limited effort. For the purposes of indexing and searching, the mild ambiguity may be tolerable as such. The ambiguity can be further reduced by composing the resulting FST with a speller or morphological analyser of the standard language. Peer reviewed