Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

Ustaszewski, Michael

Authors: Michael Ustaszewski
Publication date: 1 January 2016

Abstract

In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
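The idea of feeding cluster information to a supervised tagger can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not name the clustering method, the tagger, or the feature templates, so the cluster IDs (`c17`, `c42`), the `WORD_CLUSTERS` table and the `token_features` function below are all hypothetical. The sketch only shows how a word that is absent from the labelled training data can still receive an informative feature if it was seen in the unlabelled text.

```python
from typing import Dict, List

# Hypothetical cluster assignments, as if induced from unlabelled text.
# The IDs and groupings are invented for illustration only.
WORD_CLUSTERS: Dict[str, str] = {
    "domu": "c17",   # assumed cluster of similarly inflected nouns
    "kota": "c17",
    "czyta": "c42",  # assumed cluster of similarly inflected verbs
    "pisze": "c42",
}


def token_features(tokens: List[str], i: int, clusters: Dict[str, str]) -> Dict[str, str]:
    """Build a feature dictionary for the token at position i.

    Besides surface features (word form, suffix), the cluster ID of the
    current and neighbouring tokens is added. Words without a cluster
    get the placeholder value "NONE".
    """
    word = tokens[i].lower()
    feats = {
        "word": word,
        "suffix3": word[-3:],
        "cluster": clusters.get(word, "NONE"),
    }
    if i > 0:
        feats["prev_cluster"] = clusters.get(tokens[i - 1].lower(), "NONE")
    else:
        feats["prev_cluster"] = "BOS"  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_cluster"] = clusters.get(tokens[i + 1].lower(), "NONE")
    else:
        feats["next_cluster"] = "EOS"  # end of sentence
    return feats


sentence = ["Anna", "pisze", "list"]
features = [token_features(sentence, i, WORD_CLUSTERS) for i in range(len(sentence))]
```

Feature dictionaries of this kind could then be passed to any feature-based probabilistic sequence tagger. Which tagger and which cluster-derived features were actually used is described in the full text, not in this abstract.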
