Building Morphological Chains for Agglutinative Languages
In this paper, we build morphological chains for agglutinative languages by
using a log-linear model for the morphological segmentation task. The model is
based on the unsupervised morphological segmentation system called
MorphoChains. We extend the MorphoChains log-linear model by expanding the
candidate space recursively to cover more split points for agglutinative
languages such as Turkish; in the original model, candidates are
generated by considering only binary segmentations of each word. The results
show that we improve the state-of-the-art Turkish scores by 12%, reaching an F-measure
of 72%, and improve the English scores by 3%, reaching an F-measure of 74%.
Eventually, the system outperforms both MorphoChains and other well-known
unsupervised morphological segmentation systems. The results indicate that
candidate generation plays an important role in such an unsupervised log-linear
model that is learned using contrastive estimation with negative samples.
Comment: 10 pages, accepted and presented at CICLing 2017 (18th
International Conference on Intelligent Text Processing and Computational
Linguistics).
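As an illustration of the candidate-space expansion this abstract describes, recursive enumeration of split points might look like the following sketch. The function name and minimum-stem length are assumptions for illustration, not the paper's implementation:

```python
def candidates(word, min_len=2):
    """Recursively enumerate segmentation candidates of a word.

    Unlike binary segmentation (one split point per word), the
    right-hand remainder is split again, so multi-suffix sequences of
    agglutinative words such as Turkish are covered.  Illustrative
    sketch only; the actual MorphoChains extension scores these
    candidates with a log-linear model.
    """
    results = [(word,)]  # the unsegmented word itself
    for i in range(min_len, len(word)):
        stem, rest = word[:i], word[i:]
        for tail in candidates(rest, min_len):
            results.append((stem,) + tail)
    return results
```

For example, `candidates("evlerde")` includes the three-way split `("ev", "ler", "de")`, which a single binary split could never produce.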
Paradigm Completion for Derivational Morphology
The generation of complex derived word forms has been an overlooked problem
in NLP; we fill this gap by applying neural sequence-to-sequence models to the
task. We overview the theoretical motivation for a paradigmatic treatment of
derivational morphology, and introduce the task of derivational paradigm
completion as a parallel to inflectional paradigm completion. State-of-the-art
neural models, adapted from the inflection task, are able to learn a range of
derivation patterns, and outperform a non-neural baseline by 16.4%. However,
due to semantic, historical, and lexical considerations involved in
derivational morphology, future work will be needed to achieve performance
parity with inflection-generating systems.
Comment: EMNLP 201
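The seq2seq framing can be illustrated by how a single training instance might be encoded as character-level transduction, parallel to the inflection task. The tag format below is hypothetical, not the paper's actual inventory:

```python
def encode_example(source_word, derivation_tag, target_word):
    """Encode one derivational paradigm-completion instance as a
    character-level sequence pair: the source word's characters plus a
    derivation tag map to the target word's characters.  Tag names are
    illustrative assumptions, not the paper's tag set.
    """
    src = list(source_word) + [derivation_tag]
    tgt = list(target_word)
    return src, tgt

# e.g. deriving a noun from an adjective (hypothetical tag "ADJ:NOUN")
src, tgt = encode_example("kind", "ADJ:NOUN", "kindness")
```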
Tree Structured Dirichlet Processes for Hierarchical Morphological Segmentation
This article presents a probabilistic hierarchical clustering model for morphological segmentation. In contrast to existing approaches to morphology learning, our method allows learning a hierarchical organization of word morphology as a collection of tree-structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet process. Tree hierarchies are learned simultaneously with the corresponding morphological paradigms. Our model is evaluated on Morpho Challenge and shows competitive performance compared to state-of-the-art unsupervised morphological segmentation systems. Although we apply this model to morphological segmentation, the model itself can also be used for hierarchical clustering of other types of data.
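The Dirichlet-process machinery underlying such clustering models can be sketched with a single Chinese-restaurant-process draw; the paper's tree-structured hierarchical variant nests these processes, so this shows only the base building block:

```python
import random

def crp_assign(seats, alpha=1.0, rng=random):
    """One Chinese-restaurant-process draw: join an existing cluster
    with probability proportional to its current size, or open a new
    cluster with probability proportional to the concentration alpha.
    Minimal sketch of DP-based clustering, not the paper's model.
    """
    total = sum(seats) + alpha
    r = rng.uniform(0, total)
    for k, n in enumerate(seats):
        if r < n:
            return k
        r -= n
    return len(seats)  # open a new cluster
```

Repeated draws concentrate mass on large clusters while always leaving alpha's worth of probability for new ones, which is what lets the number of paradigms grow with the data.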
Optimization of the Morpher Morphology Engine Using Knowledge Base Reduction Techniques
Morpher is a novel morphological rule-induction engine designed and developed for agglutinative languages. The engine models inflection using general string-based transformation rules and can also learn multiple arbitrary affix types. In order to scale the engine to training sets containing millions of examples, we need efficient management of the generated rule base. In this paper we present several optimization techniques that eliminate rules based on context-length, support, and cardinality parameters. Evaluation shows that with the proposed optimizations we can reduce the average inflection time to 0.52%, the average lemmatization time to 2.59%, and the number of rules to 2.25% of the original values, while retaining a high correctness ratio of 98%. The optimized model can execute inflection and lemmatization in acceptable time after training on millions of items, unlike existing methods such as Morfessor, MORSEL, or MorphoChain.
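The rule-elimination idea can be sketched as a simple filter over the rule base. The field names and thresholds here are illustrative assumptions, not Morpher's actual data model:

```python
def prune_rules(rules, min_support=2, max_context=3):
    """Drop transformation rules whose support (number of training
    examples that generated the rule) falls below a threshold, or
    whose context string is longer than allowed.  Sketch only; the
    engine's real rule representation and cardinality criterion are
    richer than this.
    """
    return [
        r for r in rules
        if r["support"] >= min_support and len(r["context"]) <= max_context
    ]

rules = [
    {"context": "at", "support": 5},    # kept
    {"context": "xyzq", "support": 7},  # dropped: context too long
    {"context": "a", "support": 1},     # dropped: too rare
]
```

Pruning rare, over-specific rules is what shrinks the rule base enough to make lookup fast while barely touching accuracy.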
Incorporating word embeddings in unsupervised morphological segmentation
We investigate the use of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use mathematical models such as maximum likelihood estimation (MLE) and maximum a posteriori estimation (MAP), incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised; it requires only a small amount of raw data together with pretrained word embeddings for training. The results show that using dense vector representations helps morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. Our proposed models could also be used for any other low-resource language with concatenative morphology.
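The core intuition — that a valid stem should stay semantically close to the derived word — can be sketched with embedding cosine similarity. The hard threshold and toy vectors below are illustrative assumptions; the paper folds this signal into MLE/MAP objectives rather than applying a cut-off:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantically_related(vec, word, stem, threshold=0.4):
    """Accept a split word -> stem + suffix only if the embeddings of
    the word and its candidate stem are close, on the premise that
    derived words stay semantically related to their bases.
    Illustrative sketch; threshold and lookup table are assumptions.
    """
    if word not in vec or stem not in vec:
        return False
    return cosine(vec[word], vec[stem]) >= threshold
```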
Modeling morpheme triplets with a three-level hierarchical Dirichlet process
This is an accepted manuscript of an article published by IEEE in 2016 International Conference on Asian Language Processing (IALP) on 13/03/2017, available online: https://ieeexplore.ieee.org/document/7876007
The accepted version of the publication may differ from the final published version.
Morphemes are not independent units; they attach to each other according to morphotactics. However, most models in the literature assume morphemes are independent of each other to cope with the complexity. We introduce a language-independent model for unsupervised morphological segmentation using a hierarchical Dirichlet process (HDP). We model morpheme dependencies in terms of morpheme trigrams in each word. Trigrams, bigrams, and unigrams are modeled within a three-level HDP, where the trigram Dirichlet process (DP) uses the bigram DP as its base distribution, and the bigram DP in turn uses the unigram DP. The results show that modeling morpheme dependencies improves the F-measure noticeably in English, Turkish, and Finnish.
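The three-level structure can be sketched as a predictive probability in which each level uses the next-lower level as its base distribution. The concentration constants and uniform base here are illustrative assumptions, not the paper's exact posterior:

```python
def hdp_backoff(counts3, counts2, counts1, context, morpheme,
                alpha3=1.0, alpha2=1.0, alpha1=1.0, vocab=1000):
    """Predictive probability of a morpheme given its two predecessors
    under a three-level hierarchy: the trigram estimate backs off to
    the bigram estimate, which backs off to the unigram estimate,
    which backs off to a uniform base over `vocab` morphemes.  This
    mirrors how each DP in the paper uses the next-lower DP as its
    base distribution; constants and base are illustrative.
    """
    # unigram level, smoothed toward a uniform base
    p1 = (counts1.get(morpheme, 0) + alpha1 / vocab) / \
         (sum(counts1.values()) + alpha1)
    # bigram level, backing off to the unigram estimate
    c2 = counts2.get(context[-1:], {})
    p2 = (c2.get(morpheme, 0) + alpha2 * p1) / (sum(c2.values()) + alpha2)
    # trigram level, backing off to the bigram estimate
    c3 = counts3.get(context, {})
    return (c3.get(morpheme, 0) + alpha3 * p2) / (sum(c3.values()) + alpha3)
```

With empty trigram and bigram tables the estimate degrades gracefully to the unigram level, which is the practical benefit of chaining the base distributions.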