
    Automated Morphological Segmentation and Evaluation

    In this paper we introduce (i) a new method for morphological segmentation of part-of-speech-labelled German words and (ii) some measures related to the minimum description length (MDL) principle for evaluating morphological segmentations. The segmentation algorithm is capable of discovering hierarchical structure and retrieving new morphemes. It achieved 75% recall and 99% precision. Regarding MDL-based evaluation, a linear combination of vocabulary size and the size of reduced deterministic finite state automata matching exactly the segmentation output turned out to be an appropriate measure for ranking segmentation models by quality.
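
    As a rough illustration of how recall and precision figures like these can be computed for a segmenter, the sketch below scores predicted morpheme boundaries against gold boundaries. The abstract does not state how its scores were defined, so boundary-level scoring, the function names, and the toy German words are all assumptions made for illustration.

```python
def boundaries(segments):
    """Return the set of internal boundary offsets of a segmented word."""
    positions = set()
    offset = 0
    for seg in segments[:-1]:  # the end of the last segment is not a boundary
        offset += len(seg)
        positions.add(offset)
    return positions


def boundary_precision_recall(predicted, gold):
    """Boundary-level precision and recall over parallel lists of segmentations.

    Assumption: a predicted boundary counts as correct only if the gold
    segmentation of the same word has a boundary at the same offset.
    """
    tp = fp = fn = 0
    for pred_segs, gold_segs in zip(predicted, gold):
        p = boundaries(pred_segs)
        g = boundaries(gold_segs)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical toy data: the segmenter misses one boundary in the first word.
gold = [["un", "glaub", "lich"], ["Kind", "er"]]
predicted = [["un", "glaublich"], ["Kind", "er"]]
precision, recall = boundary_precision_recall(predicted, gold)
```

    A conservative segmenter of this kind proposes few boundaries but gets them right, which gives the high-precision, lower-recall profile reported in the abstract.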

    Morphology-Syntax interface for Turkish LFG

    This paper investigates the use of sublexical units as a solution to handling the complex morphology with productive derivational processes, in the development of a lexical functional grammar for Turkish. Such sublexical units make it possible to expose the internal structure of words with multiple derivations to the grammar rules in a uniform manner. This in turn leads to more succinct and manageable rules. Further, the semantics of the derivations can also be systematically reflected in a compositional way by constructing PRED values on the fly. We illustrate how we use sublexical units for handling simple productive derivational morphology and more interesting cases such as causativization, which change verb valency. Our priority is to handle several linguistic phenomena in order to observe the effects of our approach on both the c-structure and the f-structure representation, and on grammar writing, leaving the coverage and evaluation issues aside for the moment.

    Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English

    The necessity of using a fixed-size word vocabulary in order to control the model complexity in state-of-the-art neural machine translation (NMT) systems is an important bottleneck on performance, especially for morphologically rich languages. Conventional methods that aim to overcome this problem by using sub-word or character-level representations rely solely on statistics and disregard the linguistic properties of words, which leads to interruptions in the word structure and causes semantic and syntactic losses. In this paper, we propose a new vocabulary reduction method for NMT, which can reduce the vocabulary of a given input corpus at any rate while also considering the morphological properties of the language. Our method is based on unsupervised morphology learning and can, in principle, be used for pre-processing any language pair. We also present an alternative word segmentation method based on supervised morphological analysis, which aids us in measuring the accuracy of our model. We evaluate our method on a Turkish-to-English NMT task where the input language is morphologically rich and agglutinative. We analyze different representation methods in terms of translation accuracy as well as the semantic and syntactic properties of the generated output. Our method obtains a significant improvement of 2.3 BLEU points over the conventional vocabulary reduction technique, showing that it can provide better accuracy in open vocabulary translation of morphologically rich languages.
    Comment: The 20th Annual Conference of the European Association for Machine Translation (EAMT), Research Paper, 12 pages

    Morphological Analysis as Classification: an Inductive-Learning Approach

    Morphological analysis is an important subtask in text-to-speech conversion, hyphenation, and other language engineering tasks. The traditional approach to performing morphological analysis is to combine a morpheme lexicon, sets of (linguistic) rules, and heuristics to find the most probable analysis. In contrast, we present an inductive learning approach in which morphological analysis is reformulated as a segmentation task. We report on a number of experiments in which five inductive learning algorithms are applied to three variations of the task of morphological analysis. Results show (i) that the generalisation performance of the algorithms is good, and (ii) that the lazy learning algorithm IB1-IG performs best on all three tasks. We conclude that lazy learning of morphological analysis as a classification task is indeed a viable approach; moreover, it has strong advantages over the traditional approach: it avoids the knowledge-acquisition bottleneck, is fast and deterministic in learning and processing, and is language-independent.
    Comment: 11 pages, 5 encapsulated postscript figures, uses non-standard NeMLaP proceedings style nemlap.sty; inputs ipamacs (international phonetic alphabet) and epsf macros
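
    To make the idea of treating segmentation as classification concrete, the sketch below turns each letter of a word into one instance (a fixed-width window of surrounding letters) labelled by whether a morpheme starts there, and classifies new windows with a memory-based nearest-neighbour lookup. This is only a minimal sketch: the window width, the padding symbol, the toy English training words, and the plain unweighted overlap distance are assumptions, and it does not implement the information-gain weighting that distinguishes IB1-IG.

```python
def make_instances(segments, width=2, pad="_"):
    """Build one (window, label) instance per letter of the word.

    The label is 1 if a non-initial morpheme starts at that letter, else 0.
    The window holds `width` letters of context on each side (an assumption).
    """
    word = "".join(segments)
    starts = set()
    offset = 0
    for seg in segments[:-1]:
        offset += len(seg)
        starts.add(offset)
    padded = pad * width + word + pad * width
    instances = []
    for i in range(len(word)):
        window = tuple(padded[i : i + 2 * width + 1])
        instances.append((window, 1 if i in starts else 0))
    return instances


def overlap_distance(a, b):
    """Number of window positions at which the two windows differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(window, memory):
    """Return the label of the nearest stored instance (first one on ties)."""
    nearest = min(memory, key=lambda inst: overlap_distance(window, inst[0]))
    return nearest[1]


# Hypothetical training memory built from three segmented words.
memory = []
for segs in (["walk", "ed"], ["talk", "ing"], ["jump", "ed"]):
    memory.extend(make_instances(segs))

# Classify every letter of an unseen word; a 1 marks a predicted morpheme start.
windows = [w for w, _ in make_instances(["talked"])]
labels = [classify(w, memory) for w in windows]
```

    Nothing is learned ahead of time here: all training instances are simply stored, and the work happens at lookup, which is what "lazy learning" refers to.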

    A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations

    Recognizing analogies, synonyms, antonyms, and associations appear to be four distinct tasks, requiring distinct NLP algorithms. In the past, the four tasks have been treated independently, using a wide variety of algorithms. These four semantic classes, however, are a tiny sample of the full range of semantic phenomena, and we cannot afford to create ad hoc algorithms for each semantic phenomenon; we need to seek a unified approach. We propose to subsume a broad range of phenomena under analogies. To limit the scope of this paper, we restrict our attention to the subsumption of synonyms, antonyms, and associations. We introduce a supervised corpus-based machine learning algorithm for classifying analogous word pairs, and we show that it can solve multiple-choice SAT analogy questions, TOEFL synonym questions, ESL synonym-antonym questions, and similar-associated-both questions from cognitive psychology.

    Word Classes in Indonesian: A Linguistic Reality or a Convenient Fallacy in Natural Language Processing?

    This paper looks at Indonesian (Bahasa Indonesia), and the claim that there is no noun-verb distinction within the language as it is spoken in regions such as Riau and Jakarta. We test this claim for the language as it is written by a variety of Indonesian speakers using empirical methods traditionally used in part-of-speech induction. In this study we use only morphological patterns that we generate from a pre-existing morphological analyser. We find that once the distribution of the data points in our experiments matches the distribution of the text from which we gather our data, we obtain significant results that show a distinction between the class of nouns and the class of verbs in Indonesian. Furthermore, this shows promise that word classes may be labelled using morphological features alone, an approach that could be applied to out-of-vocabulary items.