7 research outputs found

    Neural morphosyntactic tagging for Rusyn

    The paper presents experiments on part-of-speech and full morphological tagging of the Slavic minority language Rusyn. The proposed approach relies on transfer learning and uses only annotated resources from related Slavic languages, namely Russian, Ukrainian, Slovak, Polish, and Czech. It does not require any annotated Rusyn training data, nor parallel data or bilingual dictionaries involving Rusyn. Compared to earlier work, we improve tagging performance by using a neural network tagger and larger training data from the neighboring Slavic languages. We experiment with various data preprocessing and sampling strategies and evaluate the impact of multitask learning strategies and of pretrained word embeddings. Overall, while genre discrepancies between training and test data have a negative impact, we improve full morphological tagging by 9% absolute micro-averaged F1 as compared to previous research.
    Peer reviewed

    Natural language processing for similar languages, varieties, and dialects: A survey

    There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim of improving the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with a focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.
    Non peer reviewed

    New Developments in Tagging Pre-modern Orthodox Slavic Texts

    Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we further explore the normalization path; at the same time, we use the statistical CRF-tagger MarMoT and a newly developed neural network tagger that cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our experiments show that while transfer experiments could not improve tagging performance significantly, state-of-the-art taggers reach between 90% and more than 95% tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community.
    Peer reviewed

    Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

    Peer reviewed

    A quantitative and typological study of Early Slavic participle clauses and their competition

    This thesis investigates the semantic and pragmatic properties of Early Slavic participle constructions (conjunct participles and dative absolutes) to understand the principles motivating their selection over one another and over their main finite competitor (jegda-clauses). The issue is tackled by adopting two broadly different approaches, which inform the division of the thesis into two parts. The first part of the thesis uses detailed linguistic annotation on Early Slavic corpora at the morphosyntactic, dependency, information-structural, and lexical levels to obtain indirect evidence for different potential functions of participle clauses and their main finite competitor. The goal of this part of the thesis is to understand the roles of compositionality and default discourse reasoning as explanations for the distribution of participle constructions and jegda-clauses in the Early Slavic corpus. The investigation shows that the competition between conjunct participles, absolute constructions, and jegda-clauses occurs at the level of discourse organization, where the main determining factor in their distribution is the distinction between background and foreground content of an (elementary or complex) discourse unit. The analysis also shows that the major common denominator between the three constructions is that all of them can function as frame-setting devices (i.e. background clauses), albeit to very different extents. In fact, conjunct participles are more typically associated with the foreground constituent of a discourse unit, whereas dative absolutes and jegda-clauses are typically associated with the background content. The second part of the thesis uses massively parallel data, including Old Church Slavonic and Ancient Greek, and analyses typological variation in how languages express the semantic space of English when, whose scope encompasses that of Early Slavic participle constructions and jegda-clauses. 
To do so, probabilistic semantic maps are generated and statistical methods (including Kriging, Gaussian Mixture Modelling, precision and recall analysis) are used to induce cross-linguistically salient dimensions from the parallel corpus and to study conceptual variation within the semantic space of the hypothetical concept when. Clear typological correspondences and differences with Early Slavic from linguistic phenomena in other languages are then exploited to corroborate and refine observations made on the core semantic-pragmatic properties of participle constructions and jegda-clauses on the basis of annotated Early Slavic data. The analysis shows that 'null' constructions (juxtaposed clauses such as participles and converbs, or independent clauses) consistently cluster in particular regions of the semantic map cross-linguistically, which clearly indicates that participle clauses are not equally viable as alternatives to any use of when, but carry particular meanings that make them less suitable for some of its functions. The investigation helped identify genealogically and areally unrelated languages that seem typologically very similar to Old Church Slavonic in the way they divide the semantic space of when between overtly subordinated and 'null' constructions. Comparison with these languages reveals great similarities between the functions of Early Slavic participle constructions and of linguistic phenomena in some of these languages (particularly clause chaining, bridging, insubordination, and switch reference). Crucially, new clear correspondences are found between these phenomena and 'non-canonical' usages of participle constructions (i.e. coreferential dative absolutes, syntactically independent absolutes and conjunct participles, and participle constructions with no apparent matrix clause), which had often been written off as 'aberrations' by previous literature on Early Slavic.
