35 research outputs found
A free/open-source hybrid morphological disambiguation tool for Kazakh
This paper presents the results of developing a
morphological disambiguation tool for Kazakh. Starting with a
previously developed rule-based approach, we tried to cope with
the complex morphology of Kazakh by breaking up lexical forms
across their derivational boundaries into inflectional groups
and modeling their behavior with statistical methods. A hybrid
rule-based/statistical approach appears to benefit morphological
disambiguation, demonstrating a per-token accuracy of 91% in
running text.
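The idea of breaking an analysis at its derivational boundaries can be sketched as follows. This is a minimal illustration only, not the tool's code: the analysis format (`+Tag` for tags, `^DB` for a derivational boundary), the tag names, and the placeholder root `root` are all assumptions, since the abstract does not specify a notation.

```python
def inflectional_groups(analysis, boundary="^DB"):
    """Split one morphological analysis into its root and inflectional groups.

    Assumed (hypothetical) format: 'root+Tag+Tag^DB+Tag+Tag', where '^DB'
    marks a derivational boundary.
    """
    first, *rest = analysis.split(boundary)
    root, _, first_tags = first.partition("+")
    # The first group holds the tags attached directly to the root.
    groups = [first_tags.split("+") if first_tags else []]
    # Each later group starts after a derivational boundary.
    groups += [part.strip("+").split("+") for part in rest]
    return root, groups


root, groups = inflectional_groups("root+Noun+A3sg^DB+Adj+With")
print(root, groups)
```

A statistical model can then score sequences of such groups instead of whole tag strings, which keeps the number of distinct events to estimate manageable.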
A morphological-syntactical analysis approach for Arabic textual tagging
Part-of-Speech (POS) tagging is the process of labeling or classifying each word in
written text with its grammatical category or part-of-speech, i.e. noun, verb, preposition,
adjective, etc. It is the most common disambiguation process in the field of
Natural Language Processing (NLP). POS tagging systems are often preprocessors in
many NLP applications.
The Arabic language has a valuable and important feature, called diacritics, which
are marks placed above and below the letters of the word. An Arabic text is partially vocalised¹
when a diacritical mark is assigned to one or at most two letters in the
word.
Diacritics in Arabic texts are extremely important, especially at the end of the word.
They help not only to determine the correct POS tag for each word in the sentence,
but also to provide full information about inflectional features, such as tense,
number, gender, etc., for the words of the sentence. They add semantic information to words
which helps with resolving ambiguity in the meaning of words. Furthermore, diacritics
ascribe grammatical functions to the words, differentiating the word from other words,
and determining the syntactic position of the word in the sentence.
1. Vocalisation (also referred to as diacritisation or vowelisation).
This thesis presents a rule-based Part-of-Speech tagging system called AMT - short
for Arabic Morphosyntactic Tagger. The main function of the AMT system is to assign
the correct tag to each word in an untagged raw partially-vocalised Arabic corpus,
and to produce a POS tagged corpus without using a manually tagged or untagged
lexicon (dictionary) for training. Two different techniques were used in this work: the
pattern-based technique and the lexical and contextual technique.
The rules in the pattern-based technique are based on the pattern of the
word being tested. A novel algorithm, the Pattern-Matching Algorithm (PMA), has been designed
and introduced in this work. The aim of this algorithm is to match the word being
tested with its correct pattern in the pattern lexicon.
The lexical and contextual technique, on the other hand, is used to assist the pattern-based
technique in assigning the correct tag to those words that do not have a pattern to
follow. The rules in the lexical and contextual technique are based on the character(s),
the last diacritical mark, the word itself, and the tags of the surrounding words.
The importance of utilizing the diacritic feature of the Arabic language to reduce the
lexical ambiguity in POS tagging has been addressed. In addition, a new Arabic tag
set and a new partially-vocalised Arabic corpus to test AMT have been compiled and
presented in this work. The AMT system has achieved an average accuracy of 91%.
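Matching a word against a pattern lexicon can be illustrated with a toy sketch. This is not the PMA described in the thesis, whose details the abstract does not give. It assumes unvocalised input, treats the letters ف, ع and ل in a pattern as root slots that match any letter, and uses a three-entry lexicon whose tags are illustrative only.

```python
# Letters that stand for root consonants in a pattern (assumed convention).
ROOT_SLOTS = {"ف", "ع", "ل"}

# Hypothetical mini pattern lexicon: pattern -> illustrative tag.
PATTERN_LEXICON = {
    "فاعل": "noun (active participle)",
    "مفعول": "noun (passive participle)",
    "يفعل": "verb (imperfect)",
}


def matches(word, pattern):
    """True if word has the pattern's length and agrees on all non-slot letters."""
    if len(word) != len(pattern):
        return False
    return all(p in ROOT_SLOTS or p == w for w, p in zip(word, pattern))


def tag_by_pattern(word):
    """Return (pattern, tag) for the first matching pattern, or None."""
    for pattern, tag in PATTERN_LEXICON.items():
        if matches(word, pattern):
            return pattern, tag
    return None


print(tag_by_pattern("كاتب"))   # matches the pattern فاعل
print(tag_by_pattern("مكتوب"))  # matches the pattern مفعول
```

Words for which `tag_by_pattern` returns `None` are the ones a lexical and contextual rule set would have to handle.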
Using Multiple Sources of Information for Constraint-Based Morphological Disambiguation
This thesis presents a constraint-based morphological disambiguation approach
that is applicable to languages with complex morphology--specifically
agglutinative languages with productive inflectional and derivational
morphological phenomena. For morphologically complex languages like Turkish,
automatic morphological disambiguation involves selecting for each token
morphological parse(s), with the right set of inflectional and derivational
markers. Our system combines corpus independent hand-crafted constraint rules,
constraint rules that are learned via unsupervised learning from a training
corpus, and additional statistical information obtained from the corpus to be
morphologically disambiguated. The hand-crafted rules are linguistically
motivated and tuned to improve precision without sacrificing recall. In certain
respects, our approach has been motivated by Brill's recent work, but with the
observation that his transformational approach is not directly applicable to
languages like Turkish. Our system also handles unknown words in a novel way,
employing a secondary morphological processor which recovers any
relevant inflectional and derivational information from a lexical item whose
root is unknown. With this approach, well below 1% of the tokens remain
unknown in the texts we have experimented with. Our results indicate that by
combining these hand-crafted, statistical and learned information sources, we
can attain a recall of 96 to 97% with a corresponding precision of 93 to 94%,
and ambiguity of 1.02 to 1.03 parses per token.
Comment: M.Sc. Thesis submitted to the Department of Computer Engineering and
Information Science, Bilkent University, Ankara, Turkey. Also available as:
ftp://ftp.cs.bilkent.edu.tr/pub/tech-reports/1996/BU-CEIS-9615ps.
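The three figures reported in this abstract (recall, precision, and ambiguity) can be computed as in the sketch below. The definitions are the ones commonly used when a disambiguator may leave more than one parse per token; the abstract does not spell them out, so they are an assumption here, as are the function and variable names.

```python
def evaluate(retained, gold):
    """Score a disambiguator that may keep several parses per token.

    retained: list of sets, the parses kept for each token
    gold:     list with the single correct parse for each token
    """
    n_tokens = len(gold)
    n_parses = sum(len(kept) for kept in retained)
    n_correct = sum(1 for kept, g in zip(retained, gold) if g in kept)
    return {
        "recall": n_correct / n_tokens,      # tokens whose correct parse survived
        "precision": n_correct / n_parses,   # correct parses among all kept parses
        "ambiguity": n_parses / n_tokens,    # average parses left per token
    }


retained = [{"a"}, {"b", "c"}, {"d"}, {"e"}]
gold = ["a", "b", "x", "e"]
print(evaluate(retained, gold))
```

In this toy input, three of four tokens keep their correct parse and five parses remain in total, so recall is 0.75, precision is 0.6, and ambiguity is 1.25 parses per token.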
Statistical and Computational Models for Whole Word Morphology
The goal of this work is to formulate an approach to the machine learning of language morphology in which morphology is modelled as string transformations on whole words, rather than as a decomposition of words into smaller structural units. The contribution consists of two main parts. First, a computational model is formulated in which morphological rules are defined as functions on strings. Such functions can easily be translated into finite-state transducers, which provides a solid algorithmic foundation for the approach. Second, a statistical model for graphs of word derivations is introduced. Inference in this model is performed with the Monte Carlo Expectation Maximization algorithm, and the expectations over graphs are approximated by a Metropolis-Hastings sampler. The model is evaluated on a number of practical tasks: clustering of inflected forms, learning lemmatization, predicting the part of speech of unknown words, and generating new words.
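A morphological rule defined as a function on strings might look like the sketch below. It is only an illustration of the general idea, restricted to rules that rewrite a prefix and a suffix around an unchanged middle. The helper name `affix_rule` and the German example rule are assumptions, not the thesis's formalism.

```python
from typing import Callable, Optional

# A rule maps a whole word to another whole word, or to None if it does not apply.
Rule = Callable[[str], Optional[str]]


def affix_rule(prefix_in: str, suffix_in: str, prefix_out: str, suffix_out: str) -> Rule:
    """Build a rule of the form  prefix_in X suffix_in  ->  prefix_out X suffix_out."""
    def apply(word: str) -> Optional[str]:
        # Require a non-empty middle part X.
        if len(word) <= len(prefix_in) + len(suffix_in):
            return None
        if not (word.startswith(prefix_in) and word.endswith(suffix_in)):
            return None
        middle = word[len(prefix_in): len(word) - len(suffix_in)]
        return prefix_out + middle + suffix_out
    return apply


# Hypothetical rule: X+en -> ge+X+t (German infinitive to past participle).
participle = affix_rule("", "en", "ge", "t")
print(participle("machen"))  # gemacht
print(participle("Haus"))    # None
```

Because each rule relates two whole words, applying rules across a vocabulary yields edges between words, which is the kind of derivation graph a statistical model can then be defined over.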
Unsupervised multilingual learning
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 241-254).
For centuries, scholars have explored the deep links among human languages. In this thesis, we present a class of probabilistic models that exploit these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for lost language decipherment. We test this model on the ancient Ugaritic language. Our results show that we can automatically uncover much of the historical relationship between Ugaritic and Biblical Hebrew, a known related language.
by Benjamin Snyder. Ph.D.
Proceedings
Proceedings of the Ninth International Workshop
on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15891
Developing Methods and Resources for Automated Processing of the African Language Igbo
Natural Language Processing (NLP) research is still in its infancy in Africa. Most
languages in Africa have few or no NLP resources available, and Igbo is among those
with none. In this study, we develop NLP resources to support NLP-based research in
the Igbo language. The springboard is the development of a new part-of-speech (POS)
tagset for Igbo (IgbTS) based on a slight adaptation of the EAGLES guidelines, made
necessary by language-internal features not recognized in EAGLES. The tagset comes in three
granularities: fine-grain (85 tags), medium-grain (70 tags) and coarse-grain (15 tags). The
medium-grained tagset strikes a balance between the other two for practical
purposes. Following this is the preprocessing of Igbo electronic texts through normalization
and tokenization processes. The tokenizer is developed in this study using the tagset
definition of a word token and the outcome is an Igbo corpus (IgbC) of about one million
tokens.
This IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus
(IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an
inter-annotator agreement (IAA) exercise was undertaken, which led to the revision of the
IgbTS where necessary. A novel automatic method was developed to bootstrap a manual
annotation process by exploiting the by-products of this IAA exercise, in order to improve
the IgbTC. To further improve the quality of the IgbTC, a committee-of-taggers approach
was adopted to flag likely erroneous instances in the IgbTC for correction. A novel automatic
method was also developed and used that draws on knowledge of affixes to flag and correct
all morphologically inflected words in the IgbTC whose tags wrongly mark them as not
morphologically inflected.
Experiments towards the development of an automatic POS tagging system for Igbo
using IgbTC show good accuracy scores, comparable to those for other languages that these taggers
have been tested on, such as English. Accuracy on words unseen during
the taggers’ training (also called unknown words) is considerably lower, and lower still
on unknown words that are morphologically complex, which indicates difficulty in
handling morphologically complex words in Igbo. This was improved by adopting a
morphological reconstruction method (a linguistically-informed segmentation into stems
and affixes) that reformatted these morphologically-complex words into patterns learnable
by machines. This enables taggers to use the knowledge of stems and associated affixes
of these morphologically-complex words during the tagging process to predict their
appropriate tags. Interestingly, this method outperforms other methods that existing
taggers use in handling unknown words, and achieves a substantial increase in
accuracy on morphologically inflected unknown words and on unknown words overall.
Together, these developments constitute the first NLP toolkit for the Igbo language and a step towards
achieving the objective of Basic Language Resources Kits (BLARK) for the language. This
IgboNLP toolkit will be made available for the NLP community and should encourage
further research and development for the language.
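The comparison between known and unknown words reported above depends on splitting test tokens by whether they occurred in the training data. A minimal sketch of that bookkeeping follows. The function name, the placeholder tokens `w1` to `w4`, and the tags are hypothetical and are not taken from the IgbTC or the IgbTS.

```python
def accuracy_by_vocabulary(train_vocab, tokens, predicted, gold):
    """Tagging accuracy overall, on known words, and on unknown words.

    A token counts as unknown if it never occurred in the training data.
    """
    counts = {"known": [0, 0], "unknown": [0, 0], "overall": [0, 0]}
    for token, p, g in zip(tokens, predicted, gold):
        bucket = "known" if token in train_vocab else "unknown"
        for key in (bucket, "overall"):
            counts[key][0] += int(p == g)   # correct predictions
            counts[key][1] += 1             # total tokens
    # Report None for an empty bucket instead of dividing by zero.
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}


train_vocab = {"w1", "w2"}
tokens = ["w1", "w2", "w3", "w4"]
gold = ["N", "V", "N", "V"]
predicted = ["N", "V", "N", "N"]
print(accuracy_by_vocabulary(train_vocab, tokens, predicted, gold))
```

In this toy run both known words are tagged correctly while one of the two unknown words is not, so the accuracies are 1.0 on known words, 0.5 on unknown words, and 0.75 overall.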