411 research outputs found

    Joint Morphological and Syntactic Analysis for Richly Inflected Languages

    Morphological Disambiguation from Stemming Data

    Morphological analysis and disambiguation is an important task and a crucial preprocessing step in natural language processing of morphologically rich languages. Kinyarwanda, a morphologically rich language, currently lacks tools for automated morphological analysis. While linguistically curated finite-state tools can be easily developed for morphological analysis, the morphological richness of the language allows many ambiguous analyses to be produced, requiring effective disambiguation. In this paper, we propose learning to morphologically disambiguate Kinyarwanda verbal forms from a new stemming dataset collected through crowd-sourcing. Using feature engineering and a classifier based on a feed-forward neural network, we achieve about 89% non-contextualized disambiguation accuracy. Our experiments reveal that inflectional properties of stems and morpheme association rules are the most discriminative features for disambiguation.

    Systematic Comparison of Cross-Lingual Projection Techniques for Low-Density NLP Under Strict Resource Constraints

    The field of low-density NLP is often approached from an engineering perspective, and evaluations are typically haphazard, considering different architectures, different languages, and different available resources, without a systematic comparison. The resulting architectures are then tested on the unique corpus and language for which each approach was designed. This makes it difficult to evaluate which approach is truly the best, or which approaches are best for a given language. In this dissertation, several state-of-the-art architectures and approaches to low-density language part-of-speech tagging are reimplemented; all of these techniques exploit a relationship between a high-density (HD) language and a low-density (LD) language. As a novel contribution, a testbed is created using a representative sample of seven (HD-LD) language pairs, all drawn from the same massively parallel corpus, Europarl, and selected for their particular linguistic features. With this testbed in place, never-before-possible comparisons are conducted to evaluate which broad approach performs best for particular language pairs, and to investigate whether particular language features should suggest a particular NLP approach. A survey of the field suggested some unexplored approaches with potential to yield better performance, be quicker to implement, and require less intensive linguistic resources. Under strict resource limitations, which are typical for low-density NLP environments, these characteristics are important. The approaches investigated in this dissertation are each a form of language-ifier, which modifies an LD corpus to be more like an HD corpus, or alternatively, modifies an HD corpus to be more like an LD corpus, prior to supervised training. Each relying on relatively few linguistic resources, four variations of language-ifier designs have been implemented and evaluated in this dissertation: lexical replacement, affix replacement, cognate replacement, and exemplar replacement. Based on linguistic properties of the languages drawn from the Europarl corpus, predictions were made about which prior and novel approaches would be most effective for languages with specific linguistic properties, and these predictions were tested systematically on the testbed of languages. The results of this dissertation serve as guidance for future researchers who must select an appropriate cross-lingual projection approach (and a high-density language from which to project) for a given low-density language. Finally, all the languages drawn from the Europarl corpus are actually HD, but for the sake of the evaluation testbed in this dissertation, certain languages are treated as if they were LD (ignoring any available HD resources). In order to evaluate how the various approaches perform on an actual LD language, a case study was conducted in which part-of-speech taggers were implemented for Tajiki, harnessing linguistic resources from a related HD language, Farsi, using all of the prior and novel approaches investigated in this dissertation. Insights from this case study were documented so that future researchers can anticipate what their experience might be in implementing NLP tools for an LD language given the strict resource limitations considered in this dissertation.
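
As a rough illustration of the lexical replacement idea named in the abstract above, the following minimal sketch rewrites LD tokens into HD tokens using a bilingual word list, so that a tagger trained on HD data could then be applied. The function name `lexical_replacement`, the toy lexicon, and the placeholder tokens are all hypothetical and are not taken from the dissertation, whose actual design may differ.

```python
from typing import Dict, List


def lexical_replacement(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Replace each LD token found in the lexicon with its HD counterpart.

    Tokens without a lexicon entry are passed through unchanged.
    Lookup is case-insensitive (an assumption made for this sketch).
    """
    return [lexicon.get(tok.lower(), tok) for tok in tokens]


# Hypothetical toy data: placeholder strings, not real words of any language.
toy_lexicon = {"ld_dog": "hd_dog", "ld_runs": "hd_runs"}
ld_sentence = ["LD_dog", "ld_runs", "ld_fast"]

# The converted sentence keeps the same length, so tags predicted for the
# HD-like tokens can be mapped back to the original LD tokens by position.
converted = lexical_replacement(ld_sentence, toy_lexicon)
print(converted)
```

Keeping a one-to-one token alignment is a deliberate simplification here; it lets tags assigned to the rewritten sentence be copied back onto the original tokens without any further alignment step.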

    Cliticization as Unselective Attract

    The purpose of this article is to provide an explanatory account of the divide between enclisis and proclisis in pronominal clitic constructions in Romance and Semitic languages. The analysis is based on two fundamental assumptions: (i) clitics do not target designated prelabelled positions, but take maximal advantage of the available categorial structure; (ii) cliticization patterns are tightly dependent on the inflectional properties of the language, more specifically, on the feature content of the two functional categories, Infl and v. We show that the various asymmetries in clitic behavior can elegantly be explained in terms of the minimalist theory of movement, combined with certain formal hypotheses about the building of phrase structure and about the relation of morphology to syntax. Relying on certain ideas about uninterpretable features, Attract and Agree, we argue that cliticization patterns can be made to follow from the strategies made available by U.G. to check the uninterpretable feature of the category Infl and from the derivational origin of the tense and person-number features. A principle, the Unselective Attract Principle, is introduced according to which an uninterpretable feature is a potential attractor for all the features which are of the same type as the one which it selectively attracts. In Romance and in Semitic, clitic phi-sets are unselectively attracted by Infl. Two additional principles, the Priority Principle and the Single Licensing Condition, ensure that at some point in the derivation a clitic can incorporate into Infl only if Infl doesn't already host an attracted inflectional morpheme. This idea holds the key for the enclisis/proclisis divide. Enclisis, i.e. clitic incorporation into Infl, is disallowed in Romance finite clauses where the uninterpretable feature of Infl selectively attracts the person-number agreement phi-set; it is legitimate in Semitic and European Portuguese finite clauses in which the same feature is checked through Agree.