An automatic part-of-speech tagger for Middle Low German
Syntactically annotated corpora are highly important for enabling large-scale diachronic and diatopic language research. Such corpora have recently been developed for a variety of historical languages, or are still under development. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which is aimed at facilitating research into the highly under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalised spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them.
Morphological Disambiguation by Voting Constraints
We present a constraint-based morphological disambiguation system in which
individual constraints vote on matching morphological parses, and
disambiguation of all the tokens in a sentence is performed at the end by
selecting parses that receive the highest votes. This constraint application
paradigm makes the outcome of the disambiguation independent of the rule
sequence, and hence relieves the rule developer from worrying about potentially
conflicting rule sequencing. Our results for disambiguating Turkish indicate
that using about 500 constraint rules and some additional simple statistics, we
can attain a recall of 95-96% and a precision of 94-95% with about 1.01 parses
per token. Our system is implemented in Prolog and we are currently
investigating an efficient implementation based on finite state transducers.
Comment: 8 pages, LaTeX source. To appear in Proceedings of ACL/EACL'97.
Compressed PostScript also available as
ftp://ftp.cs.bilkent.edu.tr/pub/ko/acl97.ps.
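The voting idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' Prolog system: the function `disambiguate`, the two toy rules, the vote weights, and the tag names are all hypothetical and chosen only for illustration. Because every constraint votes on every candidate parse and the votes are summed before any parse is selected, the order in which the rules are listed does not affect the result.

```python
def disambiguate(sentence, constraints):
    """Pick, for each token, the parses with the highest total vote.

    sentence: list of (token, [candidate parses]) pairs.
    constraints: callables (sentence, index, parse) -> integer vote.
    """
    result = []
    for i, (token, parses) in enumerate(sentence):
        # Every constraint votes on every candidate parse. Votes are summed,
        # so the order of the constraints cannot change the totals.
        votes = {p: sum(c(sentence, i, p) for c in constraints) for p in parses}
        best = max(votes.values(), default=0)
        # Keep all parses tied for the top score (ambiguity may remain).
        result.append((token, [p for p in parses if votes[p] == best]))
    return result


# Hypothetical toy rules: each returns a vote for one candidate parse.
def prefer_noun_after_det(sentence, i, parse):
    if i > 0 and parse == "Noun" and "Det" in sentence[i - 1][1]:
        return 2
    return 0


def penalize_verb_after_det(sentence, i, parse):
    if i > 0 and parse == "Verb" and "Det" in sentence[i - 1][1]:
        return -1
    return 0


SENTENCE = [("the", ["Det"]), ("walk", ["Noun", "Verb"])]
RULES = [prefer_noun_after_det, penalize_verb_after_det]

print(disambiguate(SENTENCE, RULES))
```

Reversing `RULES` yields the same selection, which is the property the abstract attributes to the voting paradigm.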
Dependency parsing of Turkish
The suitability of different parsing methods for different languages is an important topic in
syntactic parsing. Especially lesser-studied languages, typologically different from the languages
for which methods have originally been developed, pose interesting challenges in this respect.
This article presents an investigation of data-driven dependency parsing of Turkish, an agglutinative
free constituent order language that can be seen as the representative of a wider class
of languages of similar type. Our investigations show that morphological structure plays an
essential role in finding syntactic relations in such a language. In particular, we show that
employing sublexical representations called inflectional groups, rather than word forms, as the
basic parsing units improves parsing accuracy. We compare two different parsing methods, one
based on a probabilistic model with beam search, the other based on discriminative classifiers and
a deterministic parsing strategy, and show that the usefulness of sublexical units holds regardless
of parsing method. We examine the impact of morphological and lexical information in detail and
show that, properly used, this kind of information can improve parsing accuracy substantially.
Applying the techniques presented in this article, we achieve the highest reported accuracy for
parsing the Turkish Treebank.
Improving the quality of Gujarati-Hindi Machine Translation through part-of-speech tagging and stemmer-assisted transliteration
Machine translation for Indian languages is an emerging research area. Transliteration is one module that has to be designed when building a translation system. Transliteration means mapping source-language text into the target language. Simple mapping decreases the efficiency of the overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved by using transliteration assisted by part-of-speech tagging and stemming. We show that much of the content in Gujarati gets transliterated while being processed for translation into Hindi.
Morphological annotation of Korean with Directly Maintainable Resources
This article describes an exclusively resource-based method of morphological
annotation of written Korean text. Korean is an agglutinative language. Our
annotator is designed to process text before the operation of a syntactic
parser. In its present state, it annotates one-stem words only. The output is a
graph of morphemes annotated with accurate linguistic information. The
granularity of the tagset is 3 to 5 times higher than that of usual tagsets. A
comparison with a reference annotated corpus showed that it achieves 89% recall
without any corpus training. The language resources used by the system are
lexicons of stems, transducers of suffixes and transducers of generation of
allomorphs. All can be easily updated, which allows users to control the
evolution of the performance of the system. It has been claimed that
morphological annotation of Korean text could only be performed by a
morphological analysis module accessing a lexicon of morphemes. We show that it
can also be performed directly with a lexicon of words and without applying
morphological rules at annotation time, which speeds up annotation to 1,210
words/s. The lexicon of words is obtained from the maintainable language
resources through a fully automated compilation process.
A free/open-source hybrid morphological disambiguation tool for Kazakh
This paper presents the results of developing a
morphological disambiguation tool for Kazakh. Starting with a
previously developed rule-based approach, we tried to cope with
the complex morphology of Kazakh by breaking up lexical forms
across their derivational boundaries into inflectional groups
and modeling their behavior with statistical methods. A hybrid
rule-based/statistical approach appears to benefit morphological
disambiguation, demonstrating a per-token accuracy of 91% in
running text.
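As a rough illustration of breaking a lexical form at its derivational boundaries, the sketch below splits an analysis string into inflectional groups. The boundary marker `^DB`, the `+`-separated tag notation, the function name `inflectional_groups`, and the example analysis are assumptions made for this sketch; the abstract does not specify the tool's actual format.

```python
def inflectional_groups(analysis, boundary="^DB"):
    """Split a morphological analysis into inflectional groups.

    Assumes (hypothetically) that derivational boundaries are marked
    with `boundary` and that tags are joined with '+'.
    """
    parts = analysis.split(boundary)
    # Drop the '+' left dangling at the edges of each group and skip
    # any empty fragments.
    return [part.strip("+") for part in parts if part.strip("+")]


# Hypothetical analysis: a noun stem derived into an adjective.
ANALYSIS = "stem+Noun+A3sg^DB+Adj+With"
print(inflectional_groups(ANALYSIS))
```

Each resulting group can then be modelled as its own unit, for example by estimating statistics over groups rather than over whole word forms.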
Minimally supervised induction of morphology through bitexts
A knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. Consequently, there have been many attempts to reduce this cost through the development of unsupervised or minimally supervised algorithms and learning methods for the acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner, but one that will be more linguistically informed than previous unsupervised approaches. That is, this study will attempt to induce clusters of words from an unannotated text that are inflectional variants of each other. Then a set of inflectional suffixes by part-of-speech will be induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language (the source language) to another language (the target). This approach has a further advantage in that it allows a reduction in the amount of training data without a significant degradation in performance, making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed that of the current roster of unsupervised or minimally supervised approaches to morphology acquisition, this study attempts to integrate more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics.