MORSE: Semantic-ally Drive-n MORpheme SEgment-er
We present in this paper a novel framework for morpheme segmentation which
uses the morpho-syntactic regularities preserved by word representations, in
addition to orthographic features, to segment words into morphemes. This
framework is the first to consider vocabulary-wide syntactico-semantic
information for this task. We also analyze the deficiencies of available
benchmarking datasets and introduce our own dataset that was created on the
basis of compositionality. We validate our algorithm across datasets and
present state-of-the-art results.
A Trie-Structured Bayesian Model for Unsupervised Morphological Segmentation
In this paper, we introduce a trie-structured Bayesian model for unsupervised
morphological segmentation. We adopt prior information from different sources
in the model. We use neural word embeddings to discover words that are
morphologically derived from each other and are therefore semantically
similar. We use letter successor variety counts obtained from tries that are
built using neural word embeddings. Our results show that using different
information sources such as neural word embeddings and letter successor variety
as prior information improves morphological segmentation in a Bayesian model.
Our model outperforms other unsupervised morphological segmentation models on
Turkish and gives promising results on English and German for scarce resources.
Comment: 12 pages, accepted and presented at CICLing 2017 (18th
International Conference on Intelligent Text Processing and Computational
Linguistics)
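Letter successor variety, mentioned in the abstract above, can be illustrated with a minimal sketch: for each prefix, count how many distinct letters follow it in a word list. A peak in that count hints at a morpheme boundary. The word list, the `successor_variety` helper, and the `$` end-of-word marker are assumptions made for illustration; the paper builds its tries from words related through neural word embeddings, which this sketch does not do.

```python
from collections import defaultdict


def successor_variety(words):
    """Map each prefix to the number of distinct symbols that follow it."""
    followers = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            # word[:i] is the prefix, word[i] is the letter that follows it
            followers[word[:i]].add(word[i])
        # hypothetical end-of-word marker, so a complete word counts as a branch
        followers[word].add("$")
    return {prefix: len(chars) for prefix, chars in followers.items()}


# Toy vocabulary (an assumption, not data from the paper)
words = ["walk", "walks", "walked", "walking", "wall"]
lsv = successor_variety(words)

for prefix in ["wa", "wal", "walk", "walke"]:
    print(prefix, lsv[prefix])
```

In this toy vocabulary the count rises at `walk`, which is followed by several different continuations, and drops again at `walke`, consistent with a boundary after the stem.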
Building Morphological Chains for Agglutinative Languages
In this paper, we build morphological chains for agglutinative languages by
using a log-linear model for the morphological segmentation task. The model is
based on the unsupervised morphological segmentation system called
MorphoChains. We extend the MorphoChains log-linear model by expanding the
candidate space recursively to cover more split points for agglutinative
languages such as Turkish, whereas in the original model candidates are
generated by considering only binary segmentation of each word. The results
show that we improve the state-of-the-art Turkish scores by 12%, reaching an
F-measure of 72%, and improve the English scores by 3%, reaching an F-measure of 74%.
Overall, the system outperforms both MorphoChains and other well-known
unsupervised morphological segmentation systems. The results indicate that
candidate generation plays an important role in such an unsupervised log-linear
model that is learned using contrastive estimation with negative samples.
Comment: 10 pages, accepted and presented at CICLing 2017 (18th
International Conference on Intelligent Text Processing and Computational
Linguistics)
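The difference between binary and recursive candidate generation described in the abstract above can be sketched as follows. This is only an illustration of how the candidate space grows, not the MorphoChains implementation: the function names, the `max_parts` limit, and the example word are assumptions.

```python
def binary_splits(word):
    """All ways to cut a word into exactly two non-empty parts."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def recursive_splits(word, max_parts):
    """All segmentations of a word into at most max_parts non-empty parts."""
    results = [(word,)]  # the unsegmented word is always a candidate
    if max_parts > 1:
        for i in range(1, len(word)):
            # fix the first part, then recursively segment the remainder
            for rest in recursive_splits(word[i:], max_parts - 1):
                results.append((word[:i],) + rest)
    return results


# Hypothetical example: Turkish "evler" (ev + ler, "houses")
print(len(binary_splits("evler")))        # candidates with one split point
print(len(recursive_splits("evler", 3)))  # candidates with up to two split points
```

For agglutinative words with several suffixes, only the recursive variant can propose a segmentation with more than one boundary in a single step, at the cost of a candidate set that grows quickly with word length and `max_parts`.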
Unsupervised morpheme segmentation in a non-parametric Bayesian framework
Learning morphemes from plain text is an emerging research area in natural language processing. Knowledge about the process of word formation is helpful in devising automatic segmentation of words into their constituent morphemes. This thesis applies an unsupervised morpheme induction method, based on the statistical behavior of words, to induce morphemes for word segmentation. The morpheme cache used for this purpose is based on the Dirichlet Process (DP) and stores frequency information of the induced morphemes and their occurrences in a Zipfian distribution.
This thesis uses a number of empirical, morpheme-level grammar models to classify the induced morphemes under the labels prefix, stem and suffix. These grammar models capture the different structural relationships among the morphemes. Furthermore, the morphemic categorization reduces the problem of over-segmentation. The output of the strategy demonstrates a significant improvement over the baseline system.
Finally, the thesis measures the performance of the unsupervised morphology learning system for Nepali.
Automated Morphological Segmentation and Evaluation
In this paper we introduce (i) a new method for morphological segmentation of part-of-speech-labelled German words and (ii) some measures related to the MDL principle for the evaluation of morphological segmentations. The segmentation algorithm is capable of discovering hierarchical structure and retrieving new morphemes. It achieved 75% recall and 99% precision. Regarding MDL-based evaluation, a linear combination of vocabulary size and the size of the reduced deterministic finite-state automaton matching exactly the segmentation output turned out to be an appropriate measure for ranking segmentation models according to their quality.