24 research outputs found
Tagging and morphological disambiguation of Turkish text
Ankara : Department of Computer Engineering and Information Science and Institute of Engineering and Science, Bilkent University, 1994.Thesis (Master's) -- -Bilkent University, 1994.Includes bibliographical refences.A part-of-speech (POS) tagger is a system that uses various sources of information to assign possibly unique POS to words. Automatic text tagging is an important component in higher level analysis of text corpora. Its output can also be used in many natural language processing applications. In languages like Turkish or Finnish, with agglutinative morphology, morphological disambiguation is a very crucial process in tagging as the structures of many lexical forms are morphologically ambiguous. This thesis present a POS tagger for Turkish text based on a full-scale two-level specification of Turkish morphology. The tagger is augmented with a multi-word and idiomatic construct recognizer, and most importantly morphological disambiguator based on local lexical neighborhood constraints, heuristics and limited amount of statistical information. The tagger also has additional functionality for statistics compilation and fine tuning of the morphological analyzer, such as logging erroneous morphological parses, commonly used roots, etc. Test results indicate that the tagger can tag about 97/% to 99% of the texts accurately with very minimal user intervention. Furthermore for sentences morphologically disambiguated with the tagger, an LFG parser developed for Turkish, on the average, generates 50% less ambiguous parses almost 2.5 times faster.Kuruöz, İlkerM.S
Mimicking Word Embeddings using Subword RNNs
Word embeddings improve generalization over lexical features by placing each
word in a lower-dimensional space, using distributional information obtained
from unlabeled data. However, the effectiveness of word embeddings for
downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which
embeddings do not exist. In this paper, we present MIMICK, an approach to
generating OOV word embeddings compositionally, by learning a function from
spellings to distributional embeddings. Unlike prior work, MIMICK does not
require re-training on the original word embedding corpus; instead, learning is
performed at the type level. Intrinsic and extrinsic evaluations demonstrate
the power of this simple approach. On 23 languages, MIMICK improves performance
over a word-based baseline for tagging part-of-speech and morphosyntactic
attributes. It is competitive with (and complementary to) a supervised
character-based model in low-resource settings.Comment: EMNLP 201
Statistical Morphological Disambiguation for Kazakh Language
This paper presents the results of developing a statistical model for morphological
disambiguation of Kazakh text. Starting with basic assumptions we tried
to cope with the complex morphology of Kazakh language by breaking up lexical
forms across their derivational boundaries into inflectional groups and modeling
their behavior with statistical methods. We also provide maximum likelihood estimates
for the parameters and an effective way to perform disambiguation with
the Viterbi algorithm
Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction
Error-tolerant recognition enables the recognition of strings that deviate
mildly from any string in the regular set recognized by the underlying finite
state recognizer. Such recognition has applications in error-tolerant
morphological processing, spelling correction, and approximate string matching
in information retrieval. After a description of the concepts and algorithms
involved, we give examples from two applications: In the context of
morphological analysis, error-tolerant recognition allows misspelled input word
forms to be corrected, and morphologically analyzed concurrently. We present an
application of this to error-tolerant analysis of agglutinative morphology of
Turkish words. The algorithm can be applied to morphological analysis of any
language whose morphology is fully captured by a single (and possibly very
large) finite state transducer, regardless of the word formation processes and
morphographemic phenomena involved. In the context of spelling correction,
error-tolerant recognition can be used to enumerate correct candidate forms
from a given misspelled string within a certain edit distance. Again, it can be
applied to any language with a word list comprising all inflected forms, or
whose morphology is fully described by a finite state transducer. We present
experimental results for spelling correction for a number of languages. These
results indicate that such recognition works very efficiently for candidate
generation in spelling correction for many European languages such as English,
Dutch, French, German, Italian (and others) with very large word lists of root
and inflected forms (some containing well over 200,000 forms), generating all
candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a
SparcStation 10/41. For spelling correction in Turkish, error-tolerantComment: Replaces 9504031. gzipped, uuencoded postscript file. To appear in
Computational Linguistics Volume 22 No:1, 1996, Also available as
ftp://ftp.cs.bilkent.edu.tr/pub/ko/clpaper9512.ps.
A free/open-source hybrid morphological disambiguation tool for Kazakh
This paper presents the results of developing a
morphological disambiguation tool for Kazakh. Starting with a
previously developed rule-based approach, we tried to cope with
the complex morphology of Kazakh by breaking up lexical forms
across their derivational boundaries into inflectional groups
and modeling their behavior with statistical methods. A hybrid
rule-based/statistical approach appears to benefit morphological
disambiguation demonstrating a per-token accuracy of 91% in
running text