
173 research outputs found

Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

Author: Ustaszewski Michael
Publication date: 01/01/2016

In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital para la Docencia y la Investigación

Modeling Target-Side Inflection in Neural Machine Translation

Author: Fraser Alexander
Marco Marion Weller-Di
Tamchyna Aleš
Publication date: 01/01/2017

NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to solving this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization over the rich vocabulary in morphologically rich languages with strong inflectional phenomena. We introduce a simple approach to overcome this problem by training a system to produce the lemma of a word and its morphologically rich POS tag, which is then followed by a deterministic generation step. We apply this strategy for English-Czech and English-German translation scenarios, obtaining improvements in both settings. We furthermore show that the improvement is not due to only adding explicit morphological information. Comment: Accepted as a research paper at WMT17. (Updated version with corrected references.)

arXiv.org e-Print Archive

Crossref

A Case Study of Algorithms for Morphosyntactic Tagging of Polish Language

Author: Chrzaszcz Paweł
Kitowski Jacek
Kuta Marcin
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 30/01/2012

The paper presents an evaluation of several part-of-speech taggers, representing the main tagging algorithms, applied to the corpus of the frequency dictionary of the contemporary Polish language. We report our results for two tagging schemes: the IPI PAN positional tagset and its simplified version. Tagging accuracy is calculated for different training sets and takes into account many subcategories (accuracy on known and unknown tokens, word segments, sentences, etc.). The results are compared with those for other inflecting and analytic languages. Performance aspects (time demands) of the tagging tools used are also discussed.

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

BLOCKING AND EXTENDED EXPONENCE OF SUFFIX PRONOUNS IN ARABIC PERFECTIVE VERB CONJUGATION

Author: Hizbullah Nur
Mardiah Zaqiatul
Niswah Awaliyah Ainun
Publication venue: Universitas Indonesia, Directorate of Research and Public Service
Publication date: 31/07/2020

Noyer (1997) utilized blocking and extended exponence to encode pronouns in the conjugation of imperfect verbs in Arabic. His findings were criticized by Stump (2001) and Xu (2010) because the formulation was considered too complex. Xu (2010) offered a unified integrated account based on Optimality Theory while still relying on blocking and extended exponence. However, their formulation focuses only on the pronouns of imperfect verb conjugations. So far, the optimality of the conjugations of perfective Arabic verbs, which are also complex in nature, has not been considered in their studies. This study extends the work of Xu (2010) by developing the formulation of the optimal forms of the suffix pronouns of the Arabic perfective verb conjugations. The results of the study reveal that several exponences can each realize several assignments in different situations. Conversely, there is an assignment that is realized by more than one exponence.

International Review of Humanities Studies (IRHS)