5 research outputs found
A Trie-Structured Bayesian Model for Unsupervised Morphological Segmentation
In this paper, we introduce a trie-structured Bayesian model for unsupervised
morphological segmentation. We adopt prior information from different sources
in the model. We use neural word embeddings to discover words that are
morphologically derived from each other and are therefore semantically
similar. We use letter successor variety counts obtained from tries that are
built using neural word embeddings. Our results show that using different
information sources such as neural word embeddings and letter successor variety
as prior information improves morphological segmentation in a Bayesian model.
Our model outperforms other unsupervised morphological segmentation models on
Turkish and gives promising results on English and German for scarce resources.
Comment: 12 pages, accepted and presented at CICLing 2017, the 18th
International Conference on Intelligent Text Processing and Computational
Linguistics
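As a rough illustration of the letter successor variety counts mentioned in the abstract above, the following minimal sketch counts how many distinct letters can follow each prefix of a word in a trie. It is not the authors' implementation: the trie here is built from a plain word list rather than from word sets selected by neural embeddings, the function names `build_trie` and `successor_variety` are hypothetical, and counting the end-of-word marker as a successor is an assumption.

```python
def build_trie(words):
    """Build a nested-dict trie; the key '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}  # end-of-word marker (assumed to count as a successor)
    return root


def successor_variety(trie, word):
    """Return, for each prefix of `word`, the number of distinct
    symbols that follow that prefix in the trie."""
    counts = []
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:  # prefix not present in the trie
            break
        counts.append(len(node))
    return counts


# Toy word list (illustrative only, not data from the paper).
words = ["walk", "walks", "walked", "walking", "wall"]
trie = build_trie(words)
print(successor_variety(trie, "walked"))  # [1, 1, 2, 4, 1, 1]
```

In this toy example the count peaks after the prefix "walk", which is the kind of signal that suggests a morpheme boundary between "walk" and "ed".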
Unsupervised learning of allomorphs in Turkish
© 2017 The Author. Published by The Scientific and Technological Research Council of Turkey. This is an open access article available under a Creative Commons licence.
The published version can be accessed at the following link on the publisher’s website: https://journals.tubitak.gov.tr/elektrik/issues/elk-17-25-4/elk-25-4-57-1605-216.pdf
One morpheme may have several surface forms that correspond to allomorphs. In English, ed and d are
surface forms of the past tense morpheme, and s, es, and ies are surface forms of the plural or present tense morpheme.
Turkish has a large number of allomorphs due to its morphophonemic processes. One morpheme can have tens of different
surface forms in Turkish. This leads to a sparsity problem in natural language processing tasks in Turkish. Detection
of allomorphs has not been studied much because of its difficulty. For example, tü and di are Turkish allomorphs (i.e.,
of the past tense morpheme), but all of their letters are different. This paper presents an unsupervised model to extract the
allomorphs in Turkish. We are able to obtain an F-measure of 73.71% in the detection of allomorphs, and our model
outperforms previous unsupervised models on morpheme clustering.
Modeling morpheme triplets with a three-level hierarchical Dirichlet process
This is an accepted manuscript of an article published by IEEE in 2016 International Conference on Asian Language Processing (IALP) on 13/03/2017, available online: https://ieeexplore.ieee.org/document/7876007
The accepted version of the publication may differ from the final published version.
Morphemes are not independent units; they attach to each other according to morphotactics. However, most models in the literature assume that they are independent of each other in order to cope with the complexity. We introduce a language-independent model for unsupervised morphological segmentation using a hierarchical Dirichlet process (HDP). We model the morpheme dependencies in terms of morpheme trigrams in each word. Trigrams, bigrams and unigrams are modeled within a three-level HDP, where the trigram Dirichlet process (DP) uses the bigram DP, and the bigram DP uses the unigram DP, as the base distribution. The results show that modeling morpheme dependencies improves the F-measure noticeably in English, Turkish and Finnish.
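The three-level structure described in the abstract above can be written out roughly as follows. This is a sketch under assumptions rather than the paper's exact formulation: the concentration parameters $\alpha_1, \alpha_2, \alpha_3$, the base distribution $H$ over morpheme strings, and the context indexing (a trigram context $(u, v)$ backing off to the bigram context $v$) are illustrative choices that follow the usual hierarchical n-gram construction.

```latex
\begin{align*}
G_{\emptyset} &\sim \mathrm{DP}(\alpha_1, H)
  && \text{unigram level} \\
G_{v} &\sim \mathrm{DP}(\alpha_2, G_{\emptyset})
  && \text{bigram level, one per preceding morpheme } v \\
G_{u,v} &\sim \mathrm{DP}(\alpha_3, G_{v})
  && \text{trigram level, one per preceding pair } (u, v) \\
m_i \mid m_{i-2} = u,\; m_{i-1} = v &\sim G_{u,v}
  && \text{each morpheme given its two predecessors}
\end{align*}
```

Each level uses the level below it as its base distribution, so a rarely seen trigram context falls back on bigram statistics, and those in turn fall back on unigram statistics.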
Methods and algorithms for unsupervised learning of morphology
This is an accepted manuscript of a chapter published by Springer in Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403 in 2014 available online: https://doi.org/10.1007/978-3-642-54906-9_15
The accepted version of the publication may differ from the final published version.
This paper is a survey of methods and algorithms for unsupervised learning of morphology. We describe the methods and algorithms used for morphological segmentation from a computational linguistics point of view. We survey morphological segmentation methods based on MDL (minimum description length), MLE (maximum likelihood estimation), MAP (maximum a posteriori), and parametric and non-parametric Bayesian approaches. A review of the evaluation schemes for unsupervised morphological segmentation is also provided, along with a summary of results from the Morpho Challenge evaluations.
Probabilistic Hierarchical Clustering Of Morphological Paradigms
We propose a novel method for learning morphological paradigms that are structured within a hierarchy. The hierarchical structuring of paradigms groups morphologically similar words close to each other in a tree structure. This makes morphological similarities easy to detect, which leads to improved morphological segmentation. Our evaluation using the Kurimo et al. (2011a; 2011b) dataset shows that our method performs competitively when compared with current state-of-the-art systems.