24 research outputs found

    Tagging and morphological disambiguation of Turkish text

    Get PDF
    Ankara : Department of Computer Engineering and Information Science and Institute of Engineering and Science, Bilkent University, 1994.Thesis (Master's) -- -Bilkent University, 1994.Includes bibliographical refences.A part-of-speech (POS) tagger is a system that uses various sources of information to assign possibly unique POS to words. Automatic text tagging is an important component in higher level analysis of text corpora. Its output can also be used in many natural language processing applications. In languages like Turkish or Finnish, with agglutinative morphology, morphological disambiguation is a very crucial process in tagging as the structures of many lexical forms are morphologically ambiguous. This thesis present a POS tagger for Turkish text based on a full-scale two-level specification of Turkish morphology. The tagger is augmented with a multi-word and idiomatic construct recognizer, and most importantly morphological disambiguator based on local lexical neighborhood constraints, heuristics and limited amount of statistical information. The tagger also has additional functionality for statistics compilation and fine tuning of the morphological analyzer, such as logging erroneous morphological parses, commonly used roots, etc. Test results indicate that the tagger can tag about 97/% to 99% of the texts accurately with very minimal user intervention. Furthermore for sentences morphologically disambiguated with the tagger, an LFG parser developed for Turkish, on the average, generates 50% less ambiguous parses almost 2.5 times faster.Kuruöz, İlkerM.S

    Mimicking Word Embeddings using Subword RNNs

    Full text link
    Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK does not require re-training on the original word embedding corpus; instead, learning is performed at the type level. Intrinsic and extrinsic evaluations demonstrate the power of this simple approach. On 23 languages, MIMICK improves performance over a word-based baseline for tagging part-of-speech and morphosyntactic attributes. It is competitive with (and complementary to) a supervised character-based model in low-resource settings.Comment: EMNLP 201

    Statistical Morphological Disambiguation for Kazakh Language

    Get PDF
    This paper presents the results of developing a statistical model for morphological disambiguation of Kazakh text. Starting with basic assumptions we tried to cope with the complex morphology of Kazakh language by breaking up lexical forms across their derivational boundaries into inflectional groups and modeling their behavior with statistical methods. We also provide maximum likelihood estimates for the parameters and an effective way to perform disambiguation with the Viterbi algorithm

    Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction

    Get PDF
    Error-tolerant recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite state recognizer. Such recognition has applications in error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval. After a description of the concepts and algorithms involved, we give examples from two applications: In the context of morphological analysis, error-tolerant recognition allows misspelled input word forms to be corrected, and morphologically analyzed concurrently. We present an application of this to error-tolerant analysis of agglutinative morphology of Turkish words. The algorithm can be applied to morphological analysis of any language whose morphology is fully captured by a single (and possibly very large) finite state transducer, regardless of the word formation processes and morphographemic phenomena involved. In the context of spelling correction, error-tolerant recognition can be used to enumerate correct candidate forms from a given misspelled string within a certain edit distance. Again, it can be applied to any language with a word list comprising all inflected forms, or whose morphology is fully described by a finite state transducer. We present experimental results for spelling correction for a number of languages. These results indicate that such recognition works very efficiently for candidate generation in spelling correction for many European languages such as English, Dutch, French, German, Italian (and others) with very large word lists of root and inflected forms (some containing well over 200,000 forms), generating all candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a SparcStation 10/41. For spelling correction in Turkish, error-tolerantComment: Replaces 9504031. gzipped, uuencoded postscript file. To appear in Computational Linguistics Volume 22 No:1, 1996, Also available as ftp://ftp.cs.bilkent.edu.tr/pub/ko/clpaper9512.ps.

    A free/open-source hybrid morphological disambiguation tool for Kazakh

    Get PDF
    This paper presents the results of developing a morphological disambiguation tool for Kazakh. Starting with a previously developed rule-based approach, we tried to cope with the complex morphology of Kazakh by breaking up lexical forms across their derivational boundaries into inflectional groups and modeling their behavior with statistical methods. A hybrid rule-based/statistical approach appears to benefit morphological disambiguation demonstrating a per-token accuracy of 91% in running text
    corecore