129 research outputs found

    CORLEONE - Core Linguistic Entity Online Extraction

    This report presents CORLEONE (Core Linguistic Entity Online Extraction), a pool of loosely coupled, general-purpose, lightweight linguistic processing resources that can be used independently to identify core linguistic entities and their features in free text. Currently, CORLEONE consists of five processing resources: (a) a basic tokenizer, (b) a tokenizer that performs fine-grained token classification, (c) a component for morphological analysis, (d) a memory-efficient database-like dictionary look-up component, and (e) a sentence splitter. Linguistic resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant to the task of name variant matching. CORLEONE has been developed in the Java programming language and makes heavy use of state-of-the-art finite-state techniques. Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system, both of which were developed by the Web Mining and Intelligence Group of the Support to External Security Unit of IPSC. This report constitutes an end-user guide for CORLEONE and provides scientifically interesting details of how it was implemented. JRC.G.2-Support to external securit
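The abstract does not say which string distance metrics the library contains. As an illustration of the kind of metric used for name variant matching, here is a minimal Python sketch of Levenshtein edit distance. The choice of metric and the function names `levenshtein` and `similarity` are assumptions made for this example; they are not CORLEONE's Java API.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Only the previous row of the dynamic-programming table is kept.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Two name variants could then be treated as a match when `similarity` exceeds a threshold chosen for the application.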

    Painolliset äärellistilaiset menetelmät oikaisulukuun (Weighted finite-state methods for spell-checking)

    This dissertation is a large-scale study of spell-checking and correction using finite-state technology. Finite-state spell-checking is a key method for handling morphologically complex languages in a computationally efficient manner. This dissertation discusses the technological and practical considerations that are required for finite-state spell-checkers to be at the same level as state-of-the-art non-finite-state spell-checkers. Three aspects of spell-checking are considered in the thesis: modelling of correctly written words and word-forms with finite-state language models, applying statistical information to finite-state language models with a specific focus on morphologically complex languages, and modelling misspellings and typing errors using finite-state automata-based error models. The usability of finite-state spell-checkers as a viable alternative to traditional non-finite-state solutions is demonstrated in a large-scale evaluation of spell-checking speed and quality using languages with morphologically different natures. The selected languages display a full range of typological complexity, from isolating English to polysynthetic Greenlandic, with agglutinative Finnish and the Saami languages somewhere in between.
    In this dissertation I study the use of finite-state methods in spell-checking. Finite-state methods make it possible for spell-checking applications to handle smoothly the vocabulary of languages with more complex word formation, such as Finnish or Greenlandic. I discuss the scientific and practical implementations that are needed so that such languages can be handled in spell-checking as efficiently as simpler languages, such as English or other Indo-European languages, are handled now. The thesis presents three central research problems concerning spell-checking for languages with more complex word structure: how to model correctly spelled word-forms with finite-state models, how to apply statistical modelling to complex word structures such as compounds, and how to model spelling errors with finite-state methods. As the result of the thesis I present finite-state spell-checking methods as a suitable alternative to current spell-checkers. As evidence I present measurements showing that the methods I use work as well as current methods for structurally simple languages such as English, and well enough for structurally more complex languages such as Finnish, Saami and even Greenlandic to be used in typical spell-checkers.
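The three components named in the abstract (a language model of correct word-forms, statistical weights, and an error model) can be illustrated with a small Python sketch. It uses plain sets and dictionaries rather than finite-state automata, so it is not the dissertation's method. The lexicon `LEXICON`, its frequency counts, and the function names are made up for this example.

```python
import string

# Hypothetical toy lexicon; the frequency counts are invented and stand in
# for a statistical language model.
LEXICON = {"state": 50, "stale": 5, "slate": 8, "skate": 3, "finite": 20}


def edits1(word, alphabet=string.ascii_lowercase):
    """Error model: every string one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    replaces = {
        left + char + right[1:] for left, right in splits if right for char in alphabet
    }
    inserts = {left + char + right for left, right in splits for char in alphabet}
    return deletes | transposes | replaces | inserts


def suggest(word, lexicon=LEXICON):
    """Return the word itself if it is known, else ranked corrections."""
    if word in lexicon:
        return [word]
    candidates = {candidate for candidate in edits1(word) if candidate in lexicon}
    # Most frequent candidate first; ties are broken alphabetically.
    return sorted(candidates, key=lambda w: (-lexicon[w], w))
```

In a finite-state setting the lexicon and the error model would each be an automaton or transducer, and candidate generation would be done by composing them instead of enumerating edits. Enumeration is used here only to keep the sketch short.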

    Regional versus global finite-state error repair

    [Abstract] We focus on the domain of a regional least-cost strategy in order to illustrate the viability of non-global repair models over finite-state architectures. Our interest is justified by the difficulty, shared by all repair proposals, of determining how far to validate. A short validation may fail to gather sufficient information, while in a long one most of the effort can be wasted. The goal is to show that our approach can provide, in practice, performance and quality comparable to those attained by global criteria, with a significant saving in time and space. To the best of our knowledge, this is the first discussion of its kind. Ministerio de Educación y Ciencia; TIN2004-07246-C03-02. Ministerio de Educación y Ciencia; HP2002-0081. Xunta de Galicia; PGIDIT03SIN30501PR. Xunta de Galicia; PGIDIT02SIN01

    Modelling of a Gazetteer Look-up Component

    Abstract: This paper compares two storage models for gazetteers: the standard one, based on numbered indexing automata associated with an auxiliary storage device, and a pure finite-state model. The latter is superior in terms of space and time complexity.
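To make the idea of keeping gazetteer information inside the automaton concrete, here is a minimal Python sketch of a character trie whose final states carry the entry labels directly. It implements neither of the two models compared in the paper: a trie is an acyclic deterministic automaton without suffix sharing or minimisation. The class name `Gazetteer` and the labels are hypothetical.

```python
class Gazetteer:
    """Character trie in which final states store the labels of an entry."""

    def __init__(self):
        self._root = {}

    def add(self, name, label):
        node = self._root
        for char in name:
            node = node.setdefault(char, {})
        # The None key marks a final state and holds the attached labels.
        node.setdefault(None, set()).add(label)

    def lookup(self, name):
        """Return the labels stored for name, or an empty set if absent."""
        node = self._root
        for char in name:
            node = node.get(char)
            if node is None:
                return set()
        return set(node.get(None, ()))
```

A short usage example:

```python
gazetteer = Gazetteer()
gazetteer.add("Paris", "city")
gazetteer.add("Paris", "given_name")
print(gazetteer.lookup("Paris"))
```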

    Analysing the Efficiency of Algorithms for Compiling Finite-State Morphologies

    Finite-state morphologies (FSMs) are computer programs that model the structure of words in a language (morphology) with networks containing a number of string pairs (finite-state transducers). FSMs can be used, for example, to implement search programs that find all inflected forms of a word in a document when given only the base form. FSMs are also useful in compiling statistics on a text, i.e. finding out how often a word occurs and in which forms. Constructing FSMs is a complex process involving many tasks, one of which is transducer minimisation. Common minimisation algorithms include Brzozowski's (BRZ) and Hopcroft's (HOP) algorithms. There have been claims in the literature that the difference between BRZ and HOP is insignificant when compiling FSMs. However, no study has systematically tested the performance of BRZ or compared it with HOP. In this thesis, we compiled two open-source morphologies, OMorFi for Finnish and Morphisto for German, with the HFST software. HFST is based on two open-source transducer software packages, SFST and OpenFst, the former using BRZ and the latter HOP as its minimisation algorithm. BRZ turned out to be much slower than HOP on both the Finnish and the German morphologies. The slowness of BRZ was evident in transducers that contained large-scale cyclicity, i.e. transitions leading from the vicinity of the final states to the vicinity of the initial state. Such transducers often occur in morphologies that have a compounding mechanism. If a choice must be made between HOP and BRZ, the former is the better choice of minimisation algorithm. BRZ is sometimes faster than HOP, but in those cases the difference is quite small. In the cases where BRZ is slower than HOP, the difference is much bigger: BRZ is sometimes as much as 50 times slower than HOP. BRZ is, however, much easier to implement, since it is based on two basic operations, determinisation and reversal. If implementing HOP is considered too demanding a task, the developers of open-source transducer libraries can use OpenFst's minimisation algorithm. The transducers can be converted to OpenFst format, minimised with OpenFst and converted back to the original format. This solution is also intended to be used in future versions of HFST
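The two basic operations that BRZ relies on can be shown in a few lines. The following Python sketch implements Brzozowski's algorithm for plain finite automata (acceptors, not transducers). The representation, a dictionary mapping `(state, symbol)` to a set of target states, and all function names are assumptions made for this example; the code is not taken from SFST, OpenFst or HFST.

```python
from collections import defaultdict


def reverse(initial, final, transitions):
    """Flip every transition and swap the initial and final state sets."""
    reversed_transitions = defaultdict(set)
    for (source, symbol), targets in transitions.items():
        for target in targets:
            reversed_transitions[(target, symbol)].add(source)
    return set(final), set(initial), dict(reversed_transitions)


def determinise(initial, final, transitions):
    """Subset construction that builds only the reachable subsets."""
    final = set(final)
    symbols = {symbol for (_, symbol) in transitions}
    start = frozenset(initial)
    seen = {start}
    stack = [start]
    dfa_transitions = {}
    while stack:
        subset = stack.pop()
        for symbol in symbols:
            target = frozenset(
                t for s in subset for t in transitions.get((s, symbol), ())
            )
            if not target:
                continue  # the dead state is left implicit
            dfa_transitions[(subset, symbol)] = {target}
            if target not in seen:
                seen.add(target)
                stack.append(target)
    dfa_final = {subset for subset in seen if subset & final}
    return {start}, dfa_final, dfa_transitions


def brzozowski(initial, final, transitions):
    """Minimise by reversing and determinising twice."""
    once = determinise(*reverse(initial, final, transitions))
    return determinise(*reverse(*once))


def accepts(initial, final, transitions, word):
    """Run the automaton on word, tracking the set of active states."""
    current = set(initial)
    for symbol in word:
        current = {t for s in current for t in transitions.get((s, symbol), ())}
    return bool(current & set(final))
```

Each call to `determinise` can produce exponentially many subsets in the worst case. That is one general reason BRZ can be slow, although the sketch says nothing about the specific timings reported in the thesis.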

    Proceedings of the Eindhoven FASTAR Days 2004 : Eindhoven, The Netherlands, September 3-4, 2004

    The Eindhoven FASTAR Days (EFD) 2004 were organized by the Software Construction group of the Department of Mathematics and Computer Science at the Technische Universiteit Eindhoven. On September 3rd and 4th 2004, over thirty participants, hailing from the Czech Republic, Finland, France, The Netherlands, Poland and South Africa, gathered at the Department to attend the EFD. The EFD were organized in connection with the research on finite automata by the FASTAR Research Group, which is centered in Eindhoven and at the University of Pretoria, South Africa. FASTAR (Finite Automata Systems - Theoretical and Applied Research) is an international research group that aims to lead in all areas related to finite state systems. The work in FASTAR includes both core and applied parts of this field. The EFD therefore focused on the field of finite automata, with an emphasis on practical aspects and applications. Eighteen presentations, mostly on subjects within this field, were given by researchers as well as students from participating universities and industrial research facilities. This report contains the proceedings of the conference, in the form of papers for twelve of the presentations at the EFD. Most of them were initially reviewed and distributed as handouts during the EFD. After the EFD took place, the papers were revised for publication in these proceedings. We would like to thank the participants for their attendance and presentations, making the EFD 2004 as successful as they were. Based on this success, it is our intention to make the EFD into a recurring event. Eindhoven, December 2004. Loek Cleophas, Bruce W. Watso
