9 research outputs found

    SQL extensions for domain agnostic data representational consistency

    Get PDF

    Painolliset äärellistilaiset menetelmät oikaisulukuun

    Get PDF
    This dissertation is a large-scale study of spell-checking and correction using finite-state technology. Finite-state spell-checking is a key method for handling morphologically complex languages in a computationally efficient manner. This dissertation discusses the technological and practical considerations that are required for finite-state spell-checkers to be at the same level as state-of-the-art non-finite-state spell-checkers. Three aspects of spell-checking are considered in the thesis: modelling of correctly written words and word-forms with finite-state language models, applying statistical information to finite-state language models with a specific focus on morphologically complex languages, and modelling misspellings and typing errors using finite-state automata-based error models. The usability of finite-state spell-checkers as a viable alternative to traditional non-finite-state solutions is demonstrated in a large-scale evaluation of spell-checking speed and the quality using languages with morphologically different natures. The selected languages display a full range of typological complexity, from isolating English to polysynthetic Greenlandic with agglutinative Finnish and the Saami languages somewhere in between.Tässä väitöskirjassa tutkin äärellistilaisten menetelmien käyttöä oikaisuluvussa. Äärellistilaiset menetelmät mahdollistavat sananmuodostukseltaan monimutkaisempien kielten, kuten suomen tai grönlannin, sanaston sujuvan käsittelyn oikaisulukusovelluksissa. Käsittelen tutkielmassani tieteellisiä ja käytännöllisiä toteutuksia, jotka ovat tarpeen, jotta tällaisia sananmuodostukseltaan monimutkallisempia kieliä voisi käsitellä oikaisuluvussa yhtä tehokkaasti kuin yksinkertaisempia kieliä, kuten englantia tai muita indo-eurooppalaisia kieliä nyt käsitellään. Tutkielmassa esitellään kolme keskeistä tutkimusongelmaa, jotka koskevat oikaisuluvun toteuttamista sanarakenteeltaan monimutkaisemmille kielille: miten mallintaa oikeinkirjoitetut sanamuodot äärellistilaisin mallein, miten soveltaa tilastollista mallinnusta monimutkaisiin sanarakenteisiin kuten yhdyssanoihin, ja miten mallintaa kirjoitusvirheitä äärellistilaisin mentelmin. Tutkielman tuloksena esitän äärellistilaisia oikaisulukumenetelmiä soveltuvana vaihtoehtona nykyisille oikaisulukimille, tämän todisteena esitän mittaustuloksia, jotka näyttävät, että käyttämäni menetelmät toimivat niin rakenteellisesti yksinkertaisille kielille kuten englannille yhtä hyvin kuin nykyiset menetelmät että rakenteellisesti monimutkaisemmille kielille kuten suomelle, saamelle ja jopa grönlannille riittävän hyvin tullakseen käytetyksi tyypillisissä oikaisulukimissa

    Translation Alignment and Extraction Within a Lexica-Centered Iterative Workflow

    Get PDF
    This thesis addresses two closely related problems. The first, translation alignment, consists of identifying bilingual document pairs that are translations of each other within multilingual document collections (document alignment); identifying sentences, titles, etc, that are translations of each other within bilingual document pairs (sentence alignment); and identifying corresponding word and phrase translations within bilingual sentence pairs (phrase alignment). The second is extraction of bilingual pairs of equivalent word and multi-word expressions, which we call translation equivalents (TEs), from sentence- and phrase-aligned parallel corpora. While these same problems have been investigated by other authors, their focus has been on fully unsupervised methods based mostly or exclusively on parallel corpora. Bilingual lexica, which are basically lists of TEs, have not been considered or given enough importance as resources in the treatment of these problems. Human validation of TEs, which consists of manually classifying TEs as correct or incorrect translations, has also not been considered in the context of alignment and extraction. Validation strengthens the importance of infrequent TEs (most of the entries of a validated lexicon) that otherwise would be statistically unimportant. The main goal of this thesis is to revisit the alignment and extraction problems in the context of a lexica-centered iterative workflow that includes human validation. Therefore, the methods proposed in this thesis were designed to take advantage of knowledge accumulated in human-validated bilingual lexica and translation tables obtained by unsupervised methods. Phrase-level alignment is a stepping stone for several applications, including the extraction of new TEs, the creation of statistical machine translation systems, and the creation of bilingual concordances. Therefore, for phrase-level alignment, the higher accuracy of human-validated bilingual lexica is crucial for achieving higher quality results in these downstream applications. There are two main conceptual contributions. The first is the coverage maximization approach to alignment, which makes direct use of the information contained in a lexicon, or in translation tables when this is small or does not exist. The second is the introduction of translation patterns which combine novel and old ideas and enables precise and productive extraction of TEs. As material contributions, the alignment and extraction methods proposed in this thesis have produced source materials for three lines of research, in the context of three PhD theses (two of them already defended), all sharing with me the supervision of my advisor. The topics of these lines of research are statistical machine translation, algorithms and data structures for indexing and querying phrase-aligned parallel corpora, and bilingual lexica classification and generation. Four publications have resulted directly from the work presented in this thesis and twelve from the collaborative lines of research

    The Persistent Effects of Brief Interactions: Evidence from Immigrant Ships

    Get PDF
    This paper shows that brief social interactions can have a large impact on economic outcomes when they occur in high-stakes decision contexts. I study this question using a high frequency and detailed geolocalized dataset of matched immigrants-ships from the age of mass migration. Individuals exogenously travelling with (previously unrelated) higher quality shipmates end up being employed in higher quality jobs at destination. Several findings suggest that shipmates provide access and/or information about employment opportunities. Firstly, immigrants' sector of employment and place of residence are affected by those of their shipmates' contacts. Secondly, the baseline effects are stronger for individuals travelling alone and with fewer connections at destination. Thirdly, immigrants are affected more strongly by shipmates who share their language. These findings underline the sizeable effects of even brief social connections, provided that they occur during critical life junctures
    corecore