53 research outputs found
A Case Study of Algorithms for Morphosyntactic Tagging of Polish Language
The paper presents an evaluation of several part-of-speech taggers, representing main tagging algorithms, applied to corpus of frequency dictionary of the contemporary Polish language. We report our results considering two tagging schemes: IPI PAN positional tagset and its simplified version. Tagging accuracy is calculated for different training sets and takes into account many subcategories (accuracy on known and unknown tokens, word segments, sentences etc.) The comparison of results with other inflecting and analytic languages is done. Performance aspects (time demands) of used tagging tools are also discussed
Learning morphology with Morfette
Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using the Maximum Entropy classifier. The third module dynamically combines the predictions of the Maximum-Entropy models and outputs a probability distribution over tag-lemma pair sequences. The lemmatization module exploits the idea of recasting lemmatization as a classification task by using class labels which encode mappings from wordforms to lemmas. Experimental evaluation results and error analysis on three morphologically rich languages show that the system achieves high accuracy with no language-specific
feature engineering or additional resources
A free/open-source hybrid morphological disambiguation tool for Kazakh
This paper presents the results of developing a
morphological disambiguation tool for Kazakh. Starting with a
previously developed rule-based approach, we tried to cope with
the complex morphology of Kazakh by breaking up lexical forms
across their derivational boundaries into inflectional groups
and modeling their behavior with statistical methods. A hybrid
rule-based/statistical approach appears to benefit morphological
disambiguation demonstrating a per-token accuracy of 91% in
running text
A free/open-source hybrid morphological disambiguation tool for Kazakh
This paper presents the results of developing a
morphological disambiguation tool for Kazakh. Starting with a
previously developed rule-based approach, we tried to cope with
the complex morphology of Kazakh by breaking up lexical forms
across their derivational boundaries into inflectional groups
and modeling their behavior with statistical methods. A hybrid
rule-based/statistical approach appears to benefit morphological
disambiguation demonstrating a per-token accuracy of 91% in
running text
Comparison of Latent Semantic Analysis and Probabilistic Latent Semantic Analysis for Documents Clustering
In this paper we compare usefulness of statistical techniques of dimensionality reduction for improving clustering of documents in Polish. We start with partitional and agglomerative algorithms applied to Vector Space Model. Then we investigate two transformations: Latent Semantic Analysis and Probabilistic Latent Semantic Analysis. The obtained results showed advantage of Latent Semantic Analysis technique over probabilistic model. We also analyse time and memory consumption aspects of these transformations and present runtime details for IBM BladeCenter HS21 machine
Voted Approach for Part of Speech Tagging in Bengali
PACLIC 23 / City University of Hong Kong / 3-5 December 200
Methods and algorithms for unsupervised learning of morphology
This is an accepted manuscript of a chapter published by Springer in Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403 in 2014 available online: https://doi.org/10.1007/978-3-642-54906-9_15
The accepted version of the publication may differ from the final published version.This paper is a survey of methods and algorithms for unsupervised learning of morphology. We provide a description of the methods and algorithms used for morphological segmentation from a computational linguistics point of view. We survey morphological segmentation methods covering methods based on MDL (minimum description length), MLE (maximum likelihood estimation), MAP (maximum a posteriori), parametric and non-parametric Bayesian approaches. A review of the evaluation schemes for unsupervised morphological segmentation is also provided along with a summary of evaluation results on the Morpho Challenge evaluations.Published versio
- …