269 research outputs found
A Case Study of Algorithms for Morphosyntactic Tagging of Polish Language
The paper presents an evaluation of several part-of-speech taggers, representing the main tagging algorithms, applied to the corpus of the Frequency Dictionary of Contemporary Polish. We report results for two tagging schemes: the IPI PAN positional tagset and its simplified version. Tagging accuracy is calculated for different training sets and broken down into several subcategories (accuracy on known and unknown tokens, word segments, sentences, etc.). The results are compared with those reported for other inflecting and analytic languages. Performance aspects (time demands) of the tagging tools used are also discussed.
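To make the accuracy subcategories concrete, the following is a minimal sketch of how accuracy might be split into known and unknown tokens. It is not the paper's evaluation code: the function name `tagging_accuracy`, the input layout, and the simplified tags in the usage note are assumptions made for illustration only. A token counts as "known" here if it occurred in the training data.

```python
def tagging_accuracy(gold, predicted, training_vocab):
    """Return overall, known-token and unknown-token accuracy.

    gold, predicted: equal-length lists of (token, tag) pairs (assumed layout).
    training_vocab: set of tokens that occurred in the training data.
    """
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (token, gold_tag), (_, pred_tag) in zip(gold, predicted):
        key = "known" if token in training_vocab else "unknown"
        counts[key][1] += 1
        if gold_tag == pred_tag:
            counts[key][0] += 1

    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    result = {"overall": correct / total if total else 0.0}
    for key, (c, t) in counts.items():
        # Report 0.0 for an empty subcategory instead of dividing by zero.
        result[key] = c / t if t else 0.0
    return result
```

For example, with three known tokens of which two are tagged correctly and one unknown token tagged correctly, the function reports an overall accuracy of 3/4, a known-token accuracy of 2/3, and an unknown-token accuracy of 1.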
Benchmarking High Performance Architectures With Natural Language Processing Algorithms
Natural Language Processing algorithms are resource demanding, especially when they must be tuned to an inflective language like Polish. The paper presents the time and memory requirements of part-of-speech tagging and clustering algorithms applied to two corpora of the Polish language. The algorithms are benchmarked on three high performance platforms of different architectures. Additionally, sequential versions and OpenMP implementations of the clustering algorithms are compared.
Application of Weighted Voting Taggers to Languages Described with Large Tagsets
The paper presents baseline and complex part-of-speech taggers applied to the modified corpus of the Frequency Dictionary of Contemporary Polish, annotated with a large tagset. First, the paper examines the accuracy of 6 baseline part-of-speech taggers. The main part of the work presents simple weighted voting and complex voting taggers. Special attention is paid to lexical voting methods and to the issues of ties and fallbacks. The TagPair and WPDV voting methods achieve the top accuracy among all considered methods. The error reduction of 10.8% with respect to the best baseline tagger for the large tagset is comparable with other authors' results for small tagsets.
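As an illustration of simple weighted voting with tie handling, here is a minimal sketch. It is a generic scheme, not the TagPair or WPDV method and not necessarily the exact procedure used in the paper. The function name `weighted_vote`, the tagger names, the weights (for instance, each tagger's accuracy on held-out data), and the choice of fallback tagger are all assumptions made for the example.

```python
from collections import defaultdict

def weighted_vote(proposals, weights, fallback):
    """Pick one tag for a token from several taggers' proposals.

    proposals: dict mapping tagger name -> proposed tag (assumed layout).
    weights:   dict mapping tagger name -> voting weight.
    fallback:  name of the tagger trusted when the vote is tied.
    """
    scores = defaultdict(float)
    for tagger, tag in proposals.items():
        scores[tag] += weights[tagger]

    best = max(scores.values())
    winners = [tag for tag, score in scores.items() if score == best]
    if len(winners) == 1:
        return winners[0]
    # Tie: prefer the fallback tagger's tag if it is among the winners,
    # otherwise choose deterministically among the tied tags.
    if proposals[fallback] in winners:
        return proposals[fallback]
    return sorted(winners)[0]
```

With three hypothetical taggers weighted 0.9, 0.95 and 0.88, two votes for one tag (total weight 1.78) outweigh a single vote from the strongest tagger (0.95). When two equally weighted taggers disagree, the fallback tagger's proposal is returned.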
Increasing Quality of the Corpus of Frequency Dictionary of Contemporary Polish for Morphosyntactic Tagging of the Polish Language
The paper is devoted to correcting the erroneous and ambiguous corpus of the Frequency Dictionary of Contemporary Polish (FDCP) and to its application to morphosyntactic tagging of the Polish language. Several stages of corpus transformation are presented, and baseline part-of-speech tagging algorithms are also evaluated.
Comparison of Latent Semantic Analysis and Probabilistic Latent Semantic Analysis for Documents Clustering
In this paper we compare the usefulness of statistical dimensionality-reduction techniques for improving the clustering of documents in Polish. We start with partitional and agglomerative algorithms applied to the Vector Space Model. Then we investigate two transformations: Latent Semantic Analysis and Probabilistic Latent Semantic Analysis. The results show an advantage of the Latent Semantic Analysis technique over the probabilistic model. We also analyse the time and memory consumption of these transformations and present runtime details for an IBM BladeCenter HS21 machine.
- …