261 research outputs found

    Towards a machine-learning architecture for lexical functional grammar parsing

    Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and to find solutions which will generalize robustly across multiple languages. The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks, we can reduce the amount of manual specification and improve robustness, accuracy and domain- and language-independence for LFG parsing systems. Function labels can often be mapped relatively straightforwardly to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve the acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing. In a lexicalized grammatical formalism such as LFG, a large amount of syntactically relevant information comes from lexical entries. It is therefore important to be able to perform morphological analysis accurately and robustly for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text, and obtain competitive or improved results on a range of typologically diverse languages.
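
    As an illustration of the "lemmatization classes" mentioned above, the following Python sketch induces simple suffix-rewrite rules from (form, lemma) pairs and applies the most frequent rule for an unseen form's final trigram. It is a minimal toy, not the data-driven method proposed in the thesis; the training pairs and the 3-character suffix key are illustrative assumptions.

        from collections import Counter, defaultdict

        def suffix_rule(form, lemma):
            """Derive (suffix to strip, suffix to append) from the longest common prefix."""
            i = 0
            while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
                i += 1
            return form[i:], lemma[i:]

        def apply_rule(form, rule):
            strip, append = rule
            base = form[:len(form) - len(strip)] if strip else form
            return base + append

        # Hypothetical training pairs (word form, lemma).
        train = [("walked", "walk"), ("talked", "talk"), ("cities", "city")]

        # For each final-trigram key, count which rewrite rule it selects.
        rules_by_suffix = defaultdict(Counter)
        for form, lemma in train:
            rules_by_suffix[form[-3:]][suffix_rule(form, lemma)] += 1

        def lemmatize(form):
            counter = rules_by_suffix.get(form[-3:])
            if not counter:
                return form  # back off to the identity rule
            return apply_rule(form, counter.most_common(1)[0][0])

        print(lemmatize("milked"))  # -> "milk", via the rule learned from "walked"/"talked"

    A real system of the kind the abstract describes would learn such classes jointly with morphological features from a treebank rather than from a handful of pairs.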

    An investigation into lemmatization in Southern Sotho

    Lemmatization refers to the process whereby a lexicographer assigns a specific place in a dictionary to a word which he regards as the most basic form amongst other related forms. The fact that in Bantu languages formative elements can be added to one another in an often seemingly interminable series until quite long words are produced evokes curiosity as far as lemmatization is concerned. Given the productive nature of Southern Sotho, it is interesting to observe how lexicographers handle the morphological complexities they normally face in the process of arranging lexical items. This study has shown that some difficulties are encountered in adhering to the traditional method of alphabetization. It does not aim at proposing solutions, but it does point out some considerations which should be borne in mind in the process of lemmatization. African Languages. M.A. (African Languages).

    AFRILEX-ALASA 2009 Conference Book


    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf This paper presents the new TXM software platform giving online access to Old French text manuscript images and tagged transcriptions for concordancing and text mining. This platform is able to import medieval sources encoded in XML according to the TEI Guidelines for linking manuscript images to transcriptions, and to encode several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are produced automatically during the import process. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine to allow efficient word-pattern searches for building different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the TIGERSearch engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, ancient Greek, Old Slavonic and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotations).
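
    The word-property indexes and concordances mentioned above can be pictured with a small Python sketch. This is not TXM or CQP, only an illustration of a KWIC concordance over tokens carrying the properties (word, pos, lemma) that an import process would attach; the token data are made up.

        # Illustrative tokens; a real corpus would come from the TEI import.
        tokens = [
            {"word": "li",         "pos": "DET", "lemma": "le"},
            {"word": "chevaliers", "pos": "NOM", "lemma": "chevalier"},
            {"word": "est",        "pos": "VER", "lemma": "estre"},
            {"word": "venuz",      "pos": "VER", "lemma": "venir"},
        ]

        def concordance(tokens, prop, value, context=2):
            """Return KWIC lines for tokens whose property `prop` equals `value`."""
            lines = []
            for i, tok in enumerate(tokens):
                if tok.get(prop) == value:
                    left = " ".join(t["word"] for t in tokens[max(0, i - context):i])
                    right = " ".join(t["word"] for t in tokens[i + 1:i + 1 + context])
                    lines.append(f"{left:>20} [{tok['word']}] {right}")
            return lines

        for line in concordance(tokens, "lemma", "chevalier"):
            print(line)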

    The compilation of corpus-based Setswana dictionaries

    The aim of this thesis is to describe how corpus-based Setswana dictionaries should be compiled. The challenge to the modern Setswana lexicographer is to compile very practical, descriptive and user-friendly dictionaries. A detailed evaluation of existing Setswana dictionaries will be performed in terms of macrostructural and microstructural aspects: coverage of frequently used words; effective use of dictionary space; use of standard dictionary conventions; and the choice, ordering and composition of translation equivalent paradigms. The focus will be on material collection and corpus building. Informants will be used to compile an oral corpus of 100,000 tokens, and all ethical requirements, such as informed consent (see Appendix 1), will be honoured. Since the text corpus is an organic corpus, not a designed corpus aimed at balance and representativeness, the oral corpus will be constructed in the same way, i.e. with only basic selection criteria: mother-tongue speakers of Setswana; adults (to be on a par with the authors of the written sources in the text corpus); ages ranging from 20 to 60 years; male and female. A critical analysis of all currently available Setswana dictionaries will be done with special reference to the dictionaries of Brown (1987) (SESD), Snyman et al. (1990), Matumo (1993) (MSED), Kgasa (1976) (THAND) and Kgasa and Tsonope (1995) (THAN). In all these cases the evaluation strategy follows the theoretical criteria and best practices drawn from a broad theoretical survey of core aspects of dictionary compilation. Finally, the study will be concluded with an analysis of corpus integrity and stability of Setswana corpora based on the model introduced by Prinsloo and De Schryver (2001a). Thesis (DLitt)--University of Pretoria, 2009. African Languages. Unrestricted.
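
    To make the "coverage of frequently used words" criterion concrete, the sketch below derives a raw frequency list from a corpus sample, the kind of list against which a dictionary's macrostructure can be checked. It is only an illustration, not the thesis' procedure; the sample text and the cut-off are placeholders (a real corpus would be Setswana).

        import re
        from collections import Counter

        # Placeholder text standing in for a Setswana corpus.
        corpus = "the cat sat on the mat and the dog sat on the cat"

        tokens = re.findall(r"\w+", corpus.lower())
        freq = Counter(tokens)

        # The most frequent items are candidate lemmas for coverage checks.
        for word, count in freq.most_common(5):
            print(f"{word}\t{count}")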

    ORTHOGRAPHIC ENRICHMENT FOR ARABIC GRAMMATICAL ANALYSIS

    Thesis (Ph.D.) - Indiana University, Linguistics, 2010. Arabic orthography is problematic in two ways: (1) it lacks the short vowels, which leads to ambiguity because the same orthographic form can be pronounced in many different ways, each of which can have its own grammatical category, and (2) an Arabic word may contain several units, such as pronouns, conjunctions, articles and prepositions, without intervening white space. These two problems lead to difficulties in the automatic processing of Arabic. The thesis proposes a pre-processing scheme that applies word segmentation and word vocalization for the purpose of grammatical analysis: part-of-speech tagging and parsing. The thesis examines the impact of human-produced vocalization and segmentation on the grammatical analysis of Arabic, then applies a pipeline of automatic vocalization and segmentation for Arabic part-of-speech tagging. The pipeline is then used, along with the POS tags produced, for dependency parsing, which produces grammatical relations between the words in a sentence. The study uses the memory-based algorithm for vocalization, segmentation and part-of-speech tagging, and the natural language parser MaltParser for dependency parsing. The thesis represents the first approach to the processing of real-world Arabic, and has found that through the correct choice of features and algorithms, the need for pre-processing for grammatical analysis can be minimized.
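
    The pre-processing pipeline described above (segmentation, vocalization, POS tagging, then dependency parsing) can be sketched as a chain of stages. The stage bodies below are trivial placeholders, not the memory-based models or the MaltParser integration used in the thesis; the sample input is a toy transliterated form.

        def segment(raw):
            # Placeholder: real segmentation splits clitics (conjunctions,
            # prepositions, pronouns) off the host word.
            return raw.split()

        def vocalize(tokens):
            # Placeholder: real vocalization restores the short vowels
            # that the orthography omits.
            return tokens

        def pos_tag(tokens):
            # Placeholder: the thesis uses a supervised memory-based tagger.
            return [(tok, "UNK") for tok in tokens]

        def parse(tagged):
            # Placeholder: returns (token index, head index, relation) triples;
            # here everything attaches to a dummy root 0. The thesis instead
            # feeds the tagged tokens to MaltParser.
            return [(i, 0, "dep") for i, _ in enumerate(tagged, start=1)]

        tagged = pos_tag(vocalize(segment("wsyktbhA")))  # toy transliterated input
        print(parse(tagged))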

    CLARIN

    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and the challenges that CLARIN will tackle in the future. The book is published ten years after the establishment of CLARIN as a European Research Infrastructure Consortium.