8 research outputs found

    Clause structure, pro-drop and control in Wolof: an LFG/XLE perspective

    This paper provides a formal description of the syntactic analysis of core constructions of Wolof clausal/verbal morphosyntax within the Lexical-Functional Grammar formalism. This includes the basic phrase structure, pro-drop, and control relations. The Wolof grammar is implemented in XLE and uses a cascade of finite-state transducers for morphological analysis and tokenization. This work is part of an ongoing effort to build language resources and tools for Wolof, in particular a computational grammar.

    LFG parse disambiguation for Wolof

    This paper presents several techniques for managing ambiguity in LFG parsing of Wolof, a less-resourced Niger-Congo language. Ambiguity is pervasive in Wolof, and this raises a number of theoretical and practical issues for managing ambiguity associated with different objectives. From a theoretical perspective, the main aim is to design a large-scale grammar for Wolof that is able to make linguistically motivated disambiguation decisions, and to find appropriate ways of controlling ambiguity at important interface representations. The practical aim is to develop disambiguation strategies to improve the performance of the grammar in terms of efficiency, robustness and coverage. To achieve these goals, different avenues are explored to manage ambiguity in the Wolof grammar, including the formal encoding of noun class indeterminacy, lexical specifications, the use of Constraint Grammar models (Karlsson 1990) for morphological disambiguation, the application of the c-structure pruning mechanism (Cahill et al. 2007, 2008; Crouch et al. 2013), and the use of optimality marks for preferences (Frank et al. 1998, 2001). The parsing system is further controlled by packing ambiguities. In addition, discriminant-based techniques for parse disambiguation (Rosén et al. 2007) are applied for treebanking purposes.
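The preference mechanism mentioned above (optimality marks) can be illustrated with a toy ranking: each candidate analysis carries marks, dispreference marks penalize it, preference marks reward it, and the highest-scoring analysis wins. This is a simplified sketch, not the actual XLE mark-ranking implementation; the mark names and candidate analyses are invented for illustration.

```python
# Toy illustration of preference ranking with optimality-style marks.
# The mark names and candidate analyses below are hypothetical.

PREFER = {"NounClassAgr"}                # hypothetical preference marks
DISPREFER = {"Fragment", "GuessedPOS"}   # hypothetical dispreference marks

def score(marks):
    """Higher is better: +1 per preference mark, -1 per dispreference mark."""
    return (sum(1 for m in marks if m in PREFER)
            - sum(1 for m in marks if m in DISPREFER))

def best_parse(candidates):
    """Return the candidate analysis with the highest mark score."""
    return max(candidates, key=lambda c: score(c["marks"]))

candidates = [
    {"reading": "verb-complement", "marks": {"NounClassAgr"}},
    {"reading": "fragment", "marks": {"Fragment"}},
]
print(best_parse(candidates)["reading"])
```

In XLE itself, marks are ranked rather than simply counted; the counting scheme here only conveys the general idea of filtering dispreferred analyses before they reach the parse forest.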

    Finite-State Tokenization for a Deep Wolof LFG Grammar

    This paper presents a finite-state transducer (FST) for tokenizing and normalizing natural texts that are input to a large-scale LFG grammar for Wolof. In the early stage of grammar development, a language-independent tokenizer was used to split the input stream into a unique sequence of tokens. This simple transducer took into account general character classes, without using any language-specific information. However, at a later stage of grammar development, previously uncovered, non-trivial tokenization issues arose, including issues related to multi-word expressions (MWEs), clitics and text normalization. As a consequence, the tokenizer was extended by integrating FST components. This extension was crucial for scaling the hand-written grammar to free text and for enhancing the performance of the parser.
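The move from whitespace splitting to language-aware tokenization can be sketched as a small rule-based tokenizer that normalizes case, splits off a clitic, and merges a multi-word expression into one token. The Wolof-specific rules here (the MWE "ci biir" and the clitic "-la") are invented placeholders standing in for the paper's FST rules, not a reimplementation of them.

```python
import re

# Sketch: rule-based tokenization beyond whitespace splitting.
# The MWE and clitic rules below are hypothetical placeholders.

MWES = {("ci", "biir"): "ci_biir"}        # hypothetical MWE, joined into one token
CLITIC = re.compile(r"^(\w+)(-la)$")      # hypothetical clitic "-la", split off

def tokenize(text):
    # pass 1: lowercase normalization and clitic splitting
    tokens = []
    for tok in text.lower().split():
        m = CLITIC.match(tok)
        if m:
            tokens.extend([m.group(1), m.group(2)])
        else:
            tokens.append(tok)
    # pass 2: merge known multi-word expressions
    merged, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in MWES:
            merged.append(MWES[tuple(tokens[i:i + 2])])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(tokenize("Mu ngi ci biir kër-la"))
```

A real FST composes such rules into a single transducer applied in one pass; the two-pass Python version only illustrates the kinds of decisions (normalization, clitics, MWEs) the transducer has to make.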

    Design and Development of Part-of-Speech-Tagging Resources for Wolof (Niger-Congo, spoken in Senegal)

    Dione CMB, Kuhn J, Zarrieß S. Design and Development of Part-of-Speech-Tagging Resources for Wolof (Niger-Congo, spoken in Senegal). In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). Valletta, Malta: European Language Resources Association (ELRA); 2010. In this paper, we report on the design of a part-of-speech-tagset for Wolof and on the creation of a semi-automatically annotated gold standard. In order to achieve high-quality annotation relatively fast, we first generated an accurate lexicon that draws on existing word and name lists and takes into account inflectional and derivational morphology. The main motivation for the tagged corpus is to obtain data for training automatic taggers with machine learning approaches. Hence, we took machine learning considerations into account during tagset design and we present training experiments as part of this paper. The best automatic tagger achieves an accuracy of 95.2% in cross-validation experiments. We also wanted to create a basis for experimenting with annotation projection techniques, which exploit parallel corpora. For this reason, it was useful to use a part of the Bible as the gold standard corpus, for which sentence-aligned parallel versions in many languages are easy to obtain. We also report on preliminary experiments exploiting a statistical word alignment of the parallel text.
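The evaluation setup described above (training a tagger and scoring it by cross-validation) can be sketched with a minimal unigram tagger: each word is assigned its most frequent tag from the training folds, unseen words fall back to a default tag. The tagged sentences are invented toy data, not the Wolof gold standard, and a unigram model is far simpler than the taggers the paper trains.

```python
from collections import Counter, defaultdict

# Minimal unigram POS tagger with k-fold cross-validation.
# The toy tagged sentences below are invented, not real Wolof data.

def train(sentences):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, sentences, default="NOUN"):
    correct = total = 0
    for sent in sentences:
        for word, tag in sent:
            correct += model.get(word, default) == tag
            total += 1
    return correct / total

def cross_validate(sentences, k=2):
    """Average tagging accuracy over k held-out folds."""
    folds = [sentences[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train_data = [s for j, f in enumerate(folds) if j != i for s in f]
        scores.append(accuracy(train(train_data), held_out))
    return sum(scores) / len(scores)

toy = [
    [("xale", "NOUN"), ("bi", "DET"), ("lekk", "VERB")],
    [("xale", "NOUN"), ("yi", "DET"), ("dem", "VERB")],
    [("lekk", "VERB"), ("bi", "DET")],
    [("dem", "VERB"), ("yi", "DET")],
]
print(cross_validate(toy, k=2))
```

On four toy sentences the score is of course low; the point is only the fold construction (every sentence is held out exactly once) that underlies the 95.2% figure reported for the real corpus.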

    ParGramBank: The ParGram Parallel Treebank

    This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across languages and language families. This output forms the basis of a parallel treebank covering a diverse set of phenomena. The treebank is publicly available via the INESS treebanking environment, which also allows for the alignment of language pairs. We thus present a unique, multilayered parallel treebank that represents more and different types of languages than are available in other treebanks, that represents deep linguistic knowledge and that allows for the alignment of sentences at several levels: dependency structures, constituency structures and POS information.

    MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

    African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically diverse African languages.
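The source-language selection that drives the reported gains can be sketched as a simple lookup: given development-set F1 scores for models transferred from each candidate source language, pick the best source instead of defaulting to English. The languages and scores below are invented placeholder numbers, not results from the paper.

```python
# Sketch: pick the best transfer source language by dev-set F1.
# The scores below are hypothetical, not results from MasakhaNER 2.0.

dev_f1 = {"english": 0.55, "swahili": 0.68, "hausa": 0.71}

def best_source(scores):
    """Return the source language whose transferred model scores highest."""
    return max(scores, key=scores.get)

src = best_source(dev_f1)
print(src, round(dev_f1[src] - dev_f1["english"], 2))  # gain over English baseline
```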