41 research outputs found

    Damerau Levenshtein Distance for Indonesian Spelling Correction

    Word correction is used to find incorrect words in written text. Levenshtein distance is one algorithm for correcting typing errors: it calculates the difference between two strings using three operations, namely insertion, deletion, and substitution. However, this algorithm has a disadvantage: it does not handle two switched letters in the same word as a single operation. The Damerau-Levenshtein algorithm addresses this issue. This research aims to analyse the Damerau-Levenshtein algorithm as applied to Indonesian spelling correction. The dataset consists of two fairy tale stories with a total of 1266 words and 100 typing errors. Across the two algorithms, accuracy reaches up to 73% with Levenshtein distance and 75% with Damerau-Levenshtein
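
    The difference between the two distances can be illustrated with a minimal sketch. It implements the restricted variant of Damerau-Levenshtein (often called optimal string alignment), which may not be the exact variant used in the paper. The function name `edit_distance`, the `transpositions` flag, and the sample words are assumptions made for illustration only.

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Edit distance between a and b.

    With transpositions=False this is plain Levenshtein distance
    (insert, delete, substitute). With transpositions=True it also counts
    a swap of two adjacent letters as one operation (restricted
    Damerau-Levenshtein, also known as optimal string alignment).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: "ka" <-> "ak"
            if (
                transpositions
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Hypothetical typo: "mkaan" typed instead of the Indonesian word "makan".
print(edit_distance("mkaan", "makan", transpositions=False))  # Levenshtein
print(edit_distance("mkaan", "makan", transpositions=True))   # Damerau-Levenshtein
```

    For the swapped letters in this hypothetical typo, plain Levenshtein needs two substitutions while the transposition-aware version needs one operation, which is why the latter can rank the intended word closer among correction candidates.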

    Proceedings of the Eindhoven FASTAR Days 2004 : Eindhoven, The Netherlands, September 3-4, 2004

    The Eindhoven FASTAR Days (EFD) 2004 were organized by the Software Construction group of the Department of Mathematics and Computer Science at the Technische Universiteit Eindhoven. On September 3rd and 4th 2004, over thirty participants, hailing from the Czech Republic, Finland, France, The Netherlands, Poland and South Africa, gathered at the Department to attend the EFD. The EFD were organized in connection with the research on finite automata by the FASTAR Research Group, which is centered in Eindhoven and at the University of Pretoria, South Africa. FASTAR (Finite Automata Systems: Theoretical and Applied Research) is an international research group that aims to lead in all areas related to finite state systems. The work in FASTAR includes both core and applied parts of this field. The EFD therefore focused on the field of finite automata, with an emphasis on practical aspects and applications. Eighteen presentations, mostly on subjects within this field, were given by researchers as well as students from participating universities and industrial research facilities. This report contains the proceedings of the conference, in the form of papers for twelve of the presentations at the EFD. Most of them were initially reviewed and distributed as handouts during the EFD. After the EFD took place, the papers were revised for publication in these proceedings. We would like to thank the participants for their attendance and presentations, making the EFD 2004 as successful as they were. Based on this success, it is our intention to make the EFD into a recurring event. Eindhoven, December 2004. Loek Cleophas, Bruce W. Watso

    Spell checkers and correctors : a unified treatment

    The aim of this dissertation is to provide a unified treatment of various spell checkers and correctors. Firstly, the spell checking and correcting problems are formally described in mathematical terms in order to provide a better understanding of these tasks. The approach adopted is similar to the way in which denotational semantics is used to describe programming languages. Secondly, the various attributes of existing spell checking and correcting techniques are discussed. Extensive studies on selected spell checking/correcting algorithms and packages are then performed. Lastly, an empirical investigation of various spell checking/correcting packages is presented. It provides a comparison and suggests a classification of these packages in terms of their functionalities, implementation strategies, and performance. The investigation was conducted on packages for spell checking and correcting in English as well as in Northern Sotho and Chinese. The classification provides a unified presentation of the strengths and weaknesses of the techniques studied in the research. The findings provide a better understanding of these techniques in order to assist in improving some existing spell checking/correcting applications and future spell checking/correcting package designs and implementations. Dissertation (MSc)--University of Pretoria, 2009. Computer Science. unrestricte

    Merging Business Process Models (Äriprotsessimudelite ühildamine)

    The electronic version of the dissertation does not contain the publications. Companies with years of experience in business process management often have process repositories that may contain hundreds or even thousands of business process models. These models originate from different sources and have been created and modified by different parties with different modelling skills and practices. One common practice is to create new models from existing ones by copying fragments from them and then modifying those fragments. This in turn leads to a situation where the process model repository contains models with identical fragments that refer to the same subprocess. If such fragments are left unconsolidated, inconsistencies may arise in the repository: one and the same subprocess may be described differently in different processes. Companies often have models that serve similar purposes but are intended for different customers, products, business units, or geographic regions. For example, the business processes for home insurance and car insurance have the same business goal. Naturally, the models of these processes contain several identical sub-fragments (such as checking policy data), yet the processes differ at several points. Managing these processes separately is inefficient and creates redundancy. In this doctoral thesis we sought an answer to the question: how can recurring model fragments be identified in a process model repository and, more generally, how can similarities in large business process model repositories be found and consolidated? The thesis introduces two complementary methods for consolidating business process models, namely merging process models into a single model and extracting model fragments. The first takes two or more process models as input and constructs from them a single consolidated process model that contains the behaviour of all input models. This approach lets analysts manage a whole family of similar models at once and modify them in a synchronised way. The second approach, subprocess extraction, involves identifying frequently occurring fragments (finding clones in process models) and encapsulating them as subprocesses

    Dealing with spelling variation in Early Modern English texts

    Early English Books Online contains facsimiles of virtually every English work printed between 1473 and 1700; some 125,000 publications. In September 2009, the Text Creation Partnership released the second instalment of transcriptions of the EEBO collection, bringing the total number of transcribed works to 25,000. It has been estimated that this transcribed portion contains 1 billion words of running text. With such large datasets and the increasing variety of historical corpora available from the Early Modern English period, the opportunities for historical corpus linguistic research have never been greater. However, it has been observed in prior research, and quantified on a large scale for the first time in this thesis, that texts from this period contain significant amounts of spelling variation until the eventual standardisation of orthography in the 18th century. The problems caused by this historical spelling variation are the focus of this thesis. It will be shown that the high levels of spelling variation found have a significant impact on the accuracy of two widely used automatic corpus linguistic methods: Part-of-Speech annotation and key word analysis. The development of historical spelling normalisation methods which can alleviate these issues will then be presented. Methods will be based on techniques used in modern spellchecking, with various analyses of Early Modern English spelling variation dictating how the techniques are applied. With the methods combined into a single procedure, automatic normalisation can be performed on an entire corpus of any size. Evaluation of the normalisation performance shows that after training, 62% of required normalisations are made, with a precision rate of 95%

    Representation and Processing of Composition, Variation and Approximation in Language Resources and Tools

    In my habilitation dissertation, meant to validate my capacity of and maturity for directing research activities, I present a panorama of several topics in computational linguistics, linguistics and computer science. Over the past decade, I was notably concerned with the phenomena of compositionality and variability of linguistic objects. I illustrate the advantages of a compositional approach to the language in the domain of emotion detection and I explain how some linguistic objects, most prominently multi-word expressions, defy the compositionality principles. I demonstrate that the complex properties of MWEs, notably variability, are partially regular and partially idiosyncratic. This fact places the MWEs on the frontiers between different levels of linguistic processing, such as lexicon and syntax. I show the highly heterogeneous nature of MWEs by citing their two existing taxonomies. After an extensive state-of-the-art study of MWE description and processing, I summarize Multiflex, a formalism and a tool for lexical high-quality morphosyntactic description of MWUs. It uses a graph-based approach in which the inflection of a MWU is expressed in function of the morphology of its components, and of morphosyntactic transformation patterns. Due to unification the inflection paradigms are represented compactly. Orthographic, inflectional and syntactic variants are treated within the same framework. The proposal is multilingual: it has been tested on six European languages of three different origins (Germanic, Romance and Slavic), and I believe that many others can also be successfully covered. Multiflex proves interoperable. It adapts to different morphological language models, token boundary definitions, and underlying modules for the morphology of single words. It has been applied to the creation and enrichment of linguistic resources, as well as to morphosyntactic analysis and generation. It can be integrated into other NLP applications requiring the conflation of different surface realizations of the same concept. Another chapter of my activity concerns named entities, most of which are particular types of MWEs. Their rich semantic load turned them into a hot topic in the NLP community, which is documented in my state-of-the-art survey. I present the main assumptions, processes and results issued from large annotation tasks at two levels (for named entities and for coreference), parts of the National Corpus of Polish construction. I have also contributed to the development of both rule-based and probabilistic named entity recognition tools, and to an automated enrichment of Prolexbase, a large multilingual database of proper names, from open sources. With respect to multi-word expressions, named entities and coreference mentions, I pay special attention to nested structures. This problem sheds new light on the treatment of complex linguistic units in NLP. When these units start being modeled as trees (or, more generally, as acyclic graphs) rather than as flat sequences of tokens, long-distance dependencies, discontinuities, overlapping and other frequent linguistic properties become easier to represent. This calls for more complex processing methods which control larger contexts than what usually happens in sequential processing. Thus, both named entity recognition and coreference resolution come very close to parsing, and named entities or mentions with their nested structures are analogous to multi-word expressions with embedded complements. My parallel activity concerns finite-state methods for natural language and XML processing. My main contribution in this field, co-authored with 2 colleagues, is the first full-fledged method for tree-to-language correction, and more precisely for correcting XML documents with respect to a DTD. We have also produced interesting results in incremental finite-state algorithmics, particularly relevant to data evolution contexts such as dynamic vocabularies or user updates. Multilingualism is the leitmotif of my research. I have applied my methods to several natural languages, most importantly to Polish, Serbian, English and French. I have been among the initiators of a highly multilingual European scientific network dedicated to parsing and multi-word expressions. I have used multilingual linguistic data in experimental studies. I believe that it is particularly worthwhile to design NLP solutions taking declension-rich (e.g. Slavic) languages into account, since this leads to more universal solutions, at least as far as nominal constructions (MWUs, NEs, mentions) are concerned. For instance, once Multiflex had been developed with Polish in mind, it could be applied as such to French, English, Serbian and Greek. Also, a French-Serbian collaboration led to substantial modifications in morphological modeling in Prolexbase in its early development stages. This allowed for its later application to Polish with very few adaptations of the existing model. Other researchers also stress the advantages of NLP studies on highly inflected languages since their morphology encodes much more syntactic information than is the case e.g. in English. In this dissertation I am also supposed to demonstrate my ability of playing an active role in shaping the scientific landscape, on a local, national and international scale. I describe my: (i) various scientific collaborations and supervision activities, (ii) roles in over 10 regional, national and international projects, (iii) responsibilities in collective bodies such as program and organizing committees of conferences and workshops, PhD juries, and the National University Council (CNU), (iv) activity as an evaluator and a reviewer of European collaborative projects. The issues addressed in this dissertation open interesting scientific perspectives, in which special emphasis is placed on links among various domains and communities. These perspectives include: (i) integrating fine-grained language data into the linked open data, (ii) deep parsing of multi-word expressions, (iii) modeling multi-word expression identification in a treebank as a tree-to-language correction problem, and (iv) a taxonomy and an experimental benchmark for tree-to-language correction approaches

    Incorporating translation quality-oriented features into log-linear models of machine translation

    The current state-of-the-art approach to Machine Translation (MT) has limitations which could be alleviated by the use of syntax-based models. Although the benefits of syntax use in MT are becoming clear with the ongoing improvements in string-to-tree and tree-to-string systems, tree-to-tree systems such as Data Oriented Translation (DOT) have, until recently, suffered from lack of training resources, and as a consequence are currently immature, lacking key features compared to Phrase-Based Statistical MT (PB-SMT) systems. In this thesis we propose avenues to bridge the gap between our syntax-based DOT model and state-of-the-art PB-SMT systems. Noting that both types of systems score translations using probabilities not necessarily related to the quality of the translations they produce, we introduce a training mechanism which takes translation quality into account by averaging the edit distance between a translation unit and translation units used in oracle translations. This training mechanism could in principle be adapted to a very broad class of MT systems. In particular, we show how, when translating Spanish sentences into English, it leads to improvements in the translation quality of both PB-SMT and DOT. In addition, we show how our method leads to a PB-SMT system which uses significantly fewer resources and translates significantly faster than the original, while maintaining the improvements in translation quality. We then address the issue of the limited feature set in DOT by defining a new DOT model which is able to exploit features of the complete source sentence. We introduce a feature into this new model which conditions each target word on the source context it is associated with, and we also make the first attempt at incorporating a language model (LM) into a DOT system. We investigate different estimation methods for our lexical feature (namely Maximum Entropy and improved Kneser-Ney), reporting on their empirical performance. After describing methods which enable us to improve the efficiency of our system, and which allow us to scale to larger training data sizes, we evaluate the performance of our new model on English-to-Spanish translation, obtaining significant translation quality improvements compared to the original DOT system