778 research outputs found

    A framework for lexical representation

    In this paper we present a unification-based lexical platform designed for highly inflected languages (such as the Romance languages). A formalism is proposed for encoding a lemma-based lexical source, well suited for linguistic generalizations. From this source, we automatically generate an allomorph-indexed dictionary, adequate for efficient processing. A set of software tools has been implemented around this formalism: access libraries, morphological processors, etc.

    An extended spell checker for unknown words

    SMM: Detailed, Structured Morphological Analysis for Spanish

    We present a morphological analyzer for Spanish called SMM. SMM is implemented in the grammar development framework Malaga, which is based on the formalism of Left-Associative Grammar. We briefly present the Malaga framework, describe the implementation decisions for some interesting morphological phenomena of Spanish, and report on the evaluation results from the analysis of corpora. SMM was originally designed only for analyzing word forms; in this article we outline two approaches for using SMM and the facilities provided by Malaga to also generate verbal paradigms. SMM can also be embedded into applications by making use of the Malaga programming interface; we briefly discuss some application scenarios.

    TectoMT – a deep-linguistic core of the combined Chimera MT system

    Chimera is a machine translation system that combines the TectoMT deep-linguistic core with the phrase-based MT system Moses. For the English–Czech pair it also uses the Depfix post-correction system. All the components run on Unix/Linux platforms and are open source (available from the Perl repository CPAN and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. The development is currently supported by the QTLeap 7th FP project (http://qtleap.eu).

    Grammar Enhanced Biliteracy: Naskapi Language Structures For Facilitating Reading In Naskapi

    The Naskapi language is the language of instruction in the early primary grades of the school in the Naskapi community. Only recently have Naskapi-speaking teachers received formal instruction in pedagogy, with a cohort of Naskapi teachers following courses for their Bachelor of Education degree towards careers teaching in the Naskapi language in their local school. These adults are highly motivated to become literate in their mother tongue in order to teach or prepare curriculum materials in the Naskapi language. This thesis explores how basic grammatical structures can be mastered, and provides insight into the form that pedagogical grammatical instruction should take, in order to equip these individuals to become adequately literate in their mother tongue.

    Towards an optimal solution to lemmatization in Arabic

    Lemmatization, the computation of the canonical forms of words in running text, is an important component in any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to the absence of short vowels in writing. In this paper, we introduce a new lemmatizer tool that combines a machine-learning-based approach with a lemmatization dictionary, the latter providing increased accuracy, robustness, and flexibility to the former. Our evaluations yield a performance of over 98% for the entire lemmatization pipeline. The lemmatizer tools are freely downloadable for private and research purposes.