2,845 research outputs found

    Left Recursion in Parsing Expression Grammars

    Full text link
    Parsing Expression Grammars (PEGs) are a formalism that can describe all deterministic context-free languages through a set of rules that specify a top-down parser for some language. PEGs are easy to use, and there are efficient implementations of PEG libraries in several programming languages. A frequently missed feature of PEGs is left recursion, which is commonly used in Context-Free Grammars (CFGs) to encode left-associative operations. We present a simple conservative extension to the semantics of PEGs that gives useful meaning to direct and indirect left-recursive rules, and show that our extensions make it easy to express left-recursive idioms from CFGs in PEGs, with similar results. We prove the conservativeness of these extensions, and also prove that they work with any left-recursive PEG. PEGs can also be compiled to programs in a low-level parsing machine. We present an extension to the semantics of the operations of this parsing machine that let it interpret left-recursive PEGs, and prove that this extension is correct with regards to our semantics for left-recursive PEGs.Comment: Extended version of the paper "Left Recursion in Parsing Expression Grammars", that was published on 2012 Brazilian Symposium on Programming Language

    Preparing, restructuring, and augmenting a French treebank: lexicalised parsers or coherent treebanks?

    Get PDF
    We present the Modified French Treebank (MFT), a completely revamped French Treebank, derived from the Paris 7 Treebank (P7T), which is cleaner, more coherent, has several transformed structures, and introduces new linguistic analyses. To determine the effect of these changes, we investigate how theMFT fares in statistical parsing. Probabilistic parsers trained on the MFT training set (currently 3800 trees) already perform better than their counterparts trained on five times the P7T data (18,548 trees), providing an extreme example of the importance of data quality over quantity in statistical parsing. Moreover, regression analysis on the learning curve of parsers trained on the MFT lead to the prediction that parsers trained on the full projected 18,548 tree MFT training set will far outscore their counterparts trained on the full P7T. These analyses also show how problematic data can lead to problematic conclusionsā€“in particular, we find that lexicalisation in the probabilistic parsing of French is probably not as crucial as was once thought (Arun and Keller (2005))

    The ModelCC Model-Driven Parser Generator

    Full text link
    Syntax-directed translation tools require the specification of a language by means of a formal grammar. This grammar must conform to the specific requirements of the parser generator to be used. This grammar is then annotated with semantic actions for the resulting system to perform its desired function. In this paper, we introduce ModelCC, a model-based parser generator that decouples language specification from language processing, avoiding some of the problems caused by grammar-driven parser generators. ModelCC receives a conceptual model as input, along with constraints that annotate it. It is then able to create a parser for the desired textual syntax and the generated parser fully automates the instantiation of the language conceptual model. ModelCC also includes a reference resolution mechanism so that ModelCC is able to instantiate abstract syntax graphs, rather than mere abstract syntax trees.Comment: In Proceedings PROLE 2014, arXiv:1501.0169

    C-structures and f-structures for the British national corpus

    Get PDF
    We describe how the British National Corpus (BNC), a one hundred million word balanced corpus of British English, was parsed into Lexical Functional Grammar (LFG) c-structures and f-structures, using a treebank-based parsing architecture. The parsing architecture uses a state-of-the-art statistical parser and reranker trained on the Penn Treebank to produce context-free phrase structure trees, and an annotation algorithm to automatically annotate these trees into LFG f-structures. We describe the pre-processing steps which were taken to accommodate the differences between the Penn Treebank and the BNC. Some of the issues encountered in applying the parsing architecture on such a large scale are discussed. The process of annotating a gold standard set of 1,000 parse trees is described. We present evaluation results obtained by evaluating the c-structures produced by the statistical parser against the c-structure gold standard. We also present the results obtained by evaluating the f-structures produced by the annotation algorithm against an automatically constructed f-structure gold standard. The c-structures achieve an f-score of 83.7% and the f-structures an f-score of 91.2%
    • ā€¦
    corecore