2,696 research outputs found
Automated Morphological Segmentation and Evaluation
In this paper we introduce (i) a new method for morphological segmentation of part of speech labelled German words and (ii) some measures related to the MDL principle for evaluation of morphological segmentations. The segmentation algorithm is capable to discover hierarchical structure and to retrieve new morphemes. It achieved 75 % recall and 99 % precision. Regarding MDL based evaluation, a linear combination of vocabulary size and size of reduced deterministic finite state automata matching exactly the segmentation output turned out to be an appropriate measure to rank segmentation models according to their quality
Morphology-Syntax interface for Turkish LFG
This paper investigates the use of sublexical units as a solution to handling the complex morphology with productive derivational processes, in the development of a lexical functional grammar for Turkish. Such sublexical units make it possible to expose the internal structure of words with multiple derivations to the grammar rules in a uniform manner. This in turn leads to more succinct and manageable rules. Further, the semantics of the derivations can also be systematically reflected in a compositional way by constructing PRED values on the fly. We illustrate how we use sublexical units for handling simple productive derivational morphology and more interesting cases such as causativization, etc., which change verb valency. Our priority is to handle several linguistic phenomena in order to observe the effects of our approach on both the c-structure and the f-structure representation, and grammar writing, leaving the coverage and evaluation issues aside for the moment
Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English
The necessity of using a fixed-size word vocabulary in order to control the
model complexity in state-of-the-art neural machine translation (NMT) systems
is an important bottleneck on performance, especially for morphologically rich
languages. Conventional methods that aim to overcome this problem by using
sub-word or character-level representations solely rely on statistics and
disregard the linguistic properties of words, which leads to interruptions in
the word structure and causes semantic and syntactic losses. In this paper, we
propose a new vocabulary reduction method for NMT, which can reduce the
vocabulary of a given input corpus at any rate while also considering the
morphological properties of the language. Our method is based on unsupervised
morphology learning and can be, in principle, used for pre-processing any
language pair. We also present an alternative word segmentation method based on
supervised morphological analysis, which aids us in measuring the accuracy of
our model. We evaluate our method in Turkish-to-English NMT task where the
input language is morphologically rich and agglutinative. We analyze different
representation methods in terms of translation accuracy as well as the semantic
and syntactic properties of the generated output. Our method obtains a
significant improvement of 2.3 BLEU points over the conventional vocabulary
reduction technique, showing that it can provide better accuracy in open
vocabulary translation of morphologically rich languages.Comment: The 20th Annual Conference of the European Association for Machine
Translation (EAMT), Research Paper, 12 page
Morphological Analysis as Classification: an Inductive-Learning Approach
Morphological analysis is an important subtask in text-to-speech conversion,
hyphenation, and other language engineering tasks. The traditional approach to
performing morphological analysis is to combine a morpheme lexicon, sets of
(linguistic) rules, and heuristics to find a most probable analysis. In
contrast we present an inductive learning approach in which morphological
analysis is reformulated as a segmentation task. We report on a number of
experiments in which five inductive learning algorithms are applied to three
variations of the task of morphological analysis. Results show (i) that the
generalisation performance of the algorithms is good, and (ii) that the lazy
learning algorithm IB1-IG performs best on all three tasks. We conclude that
lazy learning of morphological analysis as a classification task is indeed a
viable approach; moreover, it has the strong advantages over the traditional
approach of avoiding the knowledge-acquisition bottleneck, being fast and
deterministic in learning and processing, and being language-independent.Comment: 11 pages, 5 encapsulated postscript figures, uses non-standard NeMLaP
proceedings style nemlap.sty; inputs ipamacs (international phonetic
alphabet) and epsf macro
A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations
Recognizing analogies, synonyms, antonyms, and associations appear to be four\ud
distinct tasks, requiring distinct NLP algorithms. In the past, the four\ud
tasks have been treated independently, using a wide variety of algorithms.\ud
These four semantic classes, however, are a tiny sample of the full\ud
range of semantic phenomena, and we cannot afford to create ad hoc algorithms\ud
for each semantic phenomenon; we need to seek a unified approach.\ud
We propose to subsume a broad range of phenomena under analogies.\ud
To limit the scope of this paper, we restrict our attention to the subsumption\ud
of synonyms, antonyms, and associations. We introduce a supervised corpus-based\ud
machine learning algorithm for classifying analogous word pairs, and we\ud
show that it can solve multiple-choice SAT analogy questions, TOEFL\ud
synonym questions, ESL synonym-antonym questions, and similar-associated-both\ud
questions from cognitive psychology
Word Classes in Indonesian: A Linguistic Reality or a Convenient Fallacy in Natural Language Processing?
This paper looks at Indonesian (Bahasa Indonesia), and the claim that there is no noun-verb distinction within the language as it is spoken in regions such as Riau and Jakarta. We test this claim for the language as it is written by a variety of Indonesian speakers using empirical methods traditionally used in part-of-speech induction. In this study we use only morphological patterns that we generate from a pre-existing morphological analyser. We find that once the distribution of the data points in our experiments match the distribution of the text from which we gather our data, we obtain significant results that show a distinction between the class of nouns and the class of verbs in Indonesian. Furthermore it shows promise that the labelling of word classes may be achieved only with morphological features, which could be applied to out-of-vocabulary items
- …