Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

Abstract

In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural lan- guage in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a su- pervised probabilistic tagger with morphologically informed features. Our evalua- tions on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages

    Similar works