3 research outputs found
Morphological annotation of Korean with Directly Maintainable Resources
This article describes an exclusively resource-based method of morphological
annotation of written Korean text. Korean is an agglutinative language. Our
annotator is designed to process text before the operation of a syntactic
parser. In its present state, it annotates one-stem words only. The output is a
graph of morphemes annotated with accurate linguistic information. The
granularity of the tagset is 3 to 5 times finer than that of usual tagsets. A
comparison with a reference annotated corpus showed that it achieves 89% recall
without any corpus training. The language resources used by the system are
lexicons of stems, transducers of suffixes and transducers of generation of
allomorphs. All can be easily updated, which allows users to control the
evolution of the system's performance. It has been claimed that
morphological annotation of Korean text could only be performed by a
morphological analysis module accessing a lexicon of morphemes. We show that it
can also be performed directly with a lexicon of words and without applying
morphological rules at annotation time, which speeds up annotation to 1,210
words/s. The lexicon of words is obtained from the maintainable language
resources through a fully automated compilation process.
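The compile-then-look-up idea in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the stems, suffix table, tag names, and the function names `compile_word_lexicon` and `annotate` are all hypothetical, the transducers are reduced to a plain table, and each analysis is a flat list of morphemes rather than a graph. Allomorph choice is limited to the well-known alternation that depends on whether the stem ends in a final consonant.

```python
from collections import defaultdict

# Toy resources (hypothetical; not the paper's lexicons or tagset).
STEMS = {"학교": "Noun", "책": "Noun"}

# tag -> (allomorph after a consonant-final stem, allomorph after a vowel-final stem)
SUFFIXES = {
    "Accusative": ("을", "를"),
    "Topic": ("은", "는"),
    "Locative": ("에서", "에서"),
}


def ends_in_consonant(word):
    """Return True if the last Hangul syllable carries a final consonant."""
    code = ord(word[-1]) - 0xAC00
    if not 0 <= code <= 11171:
        raise ValueError("last character is not a precomposed Hangul syllable")
    # Precomposed syllables are laid out in groups of 28 finals; 0 means no final.
    return code % 28 != 0


def compile_word_lexicon(stems, suffixes):
    """Expand stems and suffixes into a full-form lexicon: word -> analyses."""
    lexicon = defaultdict(list)
    for stem, stem_tag in stems.items():
        lexicon[stem].append([(stem, stem_tag)])
        closed = ends_in_consonant(stem)
        for tag, (after_consonant, after_vowel) in suffixes.items():
            form = after_consonant if closed else after_vowel
            lexicon[stem + form].append([(stem, stem_tag), (form, tag)])
    return dict(lexicon)


def annotate(text, lexicon):
    """Annotate by lookup only; no morphological rule runs at this stage."""
    return [(token, lexicon.get(token, [])) for token in text.split()]


# Compilation happens once, ahead of annotation time.
WORD_LEXICON = compile_word_lexicon(STEMS, SUFFIXES)
```

Calling `annotate("책을 학교에서", WORD_LEXICON)` returns each word with its precompiled morpheme sequence. Unknown words get an empty list, which is where a real system would need a fallback.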
Machine Aided Error-Correction Environment for Korean Morphological Analysis and Part-of-Speech Tagging
Statistical methods require very large, high-quality corpora, but building a large, error-free annotated corpus is very difficult. This paper proposes an efficient method for constructing a part-of-speech tagged corpus. A rule-based error correction method is proposed to find and correct errors semi-automatically using user-defined rules. We also make use of the user's correction log to incorporate feedback. Experiments were carried out to show the efficiency of the error correction process of this workbench. The results show that about 63.2% of tagging errors can be corrected.
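A minimal sketch of rule-based tag correction with a correction log, in the spirit of this abstract. The abstract does not give the rule format, so the `Rule` fields, the `apply_rules` function, and the tag names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, tag)


@dataclass(frozen=True)
class Rule:
    """Hypothetical user-defined rule: rewrite wrong_tag to right_tag."""
    word: Optional[str]      # None matches any word
    wrong_tag: str
    prev_tag: Optional[str]  # None matches any left context
    right_tag: str


def apply_rules(tokens: List[Token], rules: List[Rule]):
    """Apply the first matching rule to each token and log every change."""
    corrected: List[Token] = []
    log = []  # entries: (position, word, old_tag, new_tag)
    for i, (word, tag) in enumerate(tokens):
        # Left context uses the already corrected tag of the previous token.
        prev = corrected[i - 1][1] if i > 0 else "<s>"
        for rule in rules:
            if (rule.wrong_tag == tag
                    and rule.word in (None, word)
                    and rule.prev_tag in (None, prev)):
                log.append((i, word, tag, rule.right_tag))
                tag = rule.right_tag
                break
        corrected.append((word, tag))
    return corrected, log
```

In a semi-automatic setting, the log would be shown to the annotator for confirmation, and confirmed entries could suggest new rules.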
Robust Part of Speech Tagging
Generally, NLP tools use well-formed and annotated data to learn patterns by using
machine learning techniques. However, in this work we will focus on the language
used in an on-line platform for machine translation. In this area it is usual to have a
framework such as the following: a web page that offers a translation service between
pairs of languages. The problem is that casual users use the service to translate
any type of text (cut and paste, single words, bad formatting, snippets, informal
language, pre-translations, etc.). Hence, in this situation we will very often find words
with mistakes that make the system provide a bad translation because it is not able
to understand the input. The main goal of our work, once we have identified the problem of dealing with
non-standard input, is to develop a robust PoS tagger from the SVMTagger