
    Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

    In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.

    Modeling Target-Side Inflection in Neural Machine Translation

    NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to solving this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization over the rich vocabulary in morphologically rich languages with strong inflectional phenomena. We introduce a simple approach to overcome this problem by training a system to produce the lemma of a word and its morphologically rich POS tag, which is then followed by a deterministic generation step. We apply this strategy for English-Czech and English-German translation scenarios, obtaining improvements in both settings. We furthermore show that the improvement is not due to only adding explicit morphological information. Comment: Accepted as a research paper at WMT17. (Updated version with corrected references.)
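
    The two-step idea in this abstract (predict a lemma and a morphological tag, then generate the surface form deterministically) can be illustrated with a minimal sketch. The lookup table, the tag strings, and the function names `generate` and `realize` are hypothetical and chosen only for illustration; the abstract does not say how the generation step is implemented.

```python
# Hypothetical toy inflection table: (lemma, tag) -> surface form.
# The entries and the tag format are illustrative assumptions,
# not the tagset or generator used in the paper.
INFLECTION_TABLE = {
    ("Haus", "NOUN.Dat.Pl"): "Häusern",
    ("Haus", "NOUN.Nom.Sg"): "Haus",
    ("gehen", "VERB.3.Sg.Pres"): "geht",
}


def generate(lemma, tag):
    """Deterministically map a predicted lemma and tag to a surface form.

    Falls back to the bare lemma when the pair is not in the table.
    """
    return INFLECTION_TABLE.get((lemma, tag), lemma)


def realize(pairs):
    """Turn a sequence of (lemma, tag) predictions into a surface string."""
    return " ".join(generate(lemma, tag) for lemma, tag in pairs)
```

    Because the generation step is a pure function of the lemma and tag, the translation model only has to choose between lemmas and tags, not between every inflected form.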

    A Case Study of Algorithms for Morphosyntactic Tagging of Polish Language

    The paper presents an evaluation of several part-of-speech taggers, representing the main tagging algorithms, applied to the corpus of the frequency dictionary of contemporary Polish. We report our results for two tagging schemes: the IPI PAN positional tagset and its simplified version. Tagging accuracy is calculated for different training sets and takes into account many subcategories (accuracy on known and unknown tokens, word segments, sentences, etc.). The results are compared with those for other inflecting and analytic languages. Performance aspects (time demands) of the tagging tools used are also discussed.
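
    The split between accuracy on known and unknown tokens mentioned in this abstract can be computed as in the following sketch. It assumes that "known" means the token occurred in the training data, and the function name `split_accuracy` and its input layout are hypothetical, not taken from the paper.

```python
def split_accuracy(gold, predicted, train_vocab):
    """Tagging accuracy reported separately for known and unknown tokens.

    gold, predicted: position-aligned lists of (token, tag) pairs.
    train_vocab: set of tokens seen in training (assumed definition of "known").
    Returns a dict with accuracies, or None for a group with no tokens.
    """
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (token, gold_tag), (_, pred_tag) in zip(gold, predicted):
        group = "known" if token in train_vocab else "unknown"
        counts[group][1] += 1
        counts[group][0] += int(gold_tag == pred_tag)
    return {
        group: (correct / total if total else None)
        for group, (correct, total) in counts.items()
    }
```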

    BLOCKING AND EXTENDED EXPONENCE OF SUFFIX PRONOUNS IN ARABIC PERFECTIVE VERB CONJUGATION

    Noyer (1997) utilized blocking and extended exponence to encode pronouns in the conjugation of imperfect verbs in Arabic. His findings were criticized by Stump (2001) and Xu (2010), because the formulation was considered too complex. Xu (2010) offered a unified integrated account based on Optimality Theory while still relying on blocking and extended exponence. However, their formulation focuses only on the pronouns of imperfect verb conjugations. So far, the optimality of the conjugations of perfective Arabic verbs, which are also complex in nature, has not been considered in these studies. This study extends the work of Xu (2010) by developing the formulation of the optimal forms of the suffix pronouns of the Arabic perfective verb conjugations. The results of the study reveal that several exponences can each realize several assignments in different situations. Conversely, there is an assignment that is realized by more than one exponence.

    Language. An introduction to the study of speech
