545 research outputs found

    Three Generative, Lexicalised Models for Statistical Parsing

    In this paper we first propose a new statistical parsing model, which is a generative model of lexicalised context-free grammar. We then extend the model to include a probabilistic treatment of both subcategorisation and wh-movement. Results on Wall Street Journal text show that the parser performs at 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96). Comment: 8 pages, to appear in Proceedings of ACL/EACL 97

    Improving dependency label accuracy using statistical post-editing: A cross-framework study

    We present a statistical post-editing method for modifying the dependency labels in a dependency analysis. We test the method using two English datasets, three parsing systems and three labelled dependency schemes. We demonstrate how it can be used both to improve dependency label accuracy in parser output and to highlight problems with, and differences between, constituency-to-dependency conversions.

    Can Subcategorisation Probabilities Help a Statistical Parser?

    Research into the automatic acquisition of lexical information from corpora is starting to produce large-scale computational lexicons containing data on the relative frequencies of subcategorisation alternatives for individual verbal predicates. However, the empirical question of whether this type of frequency information can in practice improve the accuracy of a statistical parser has not yet been answered. In this paper we describe an experiment with a wide-coverage statistical grammar and parser for English and subcategorisation frequencies acquired from ten million words of text which shows that this information can significantly improve parse accuracy. Comment: 9 pages, uses colacl.st

    C-structures and f-structures for the British National Corpus

    We describe how the British National Corpus (BNC), a one hundred million word balanced corpus of British English, was parsed into Lexical Functional Grammar (LFG) c-structures and f-structures, using a treebank-based parsing architecture. The parsing architecture uses a state-of-the-art statistical parser and reranker trained on the Penn Treebank to produce context-free phrase structure trees, and an annotation algorithm to automatically annotate these trees into LFG f-structures. We describe the pre-processing steps which were taken to accommodate the differences between the Penn Treebank and the BNC. Some of the issues encountered in applying the parsing architecture on such a large scale are discussed. The process of annotating a gold standard set of 1,000 parse trees is described. We present the results of evaluating the c-structures produced by the statistical parser against the c-structure gold standard, and of evaluating the f-structures produced by the annotation algorithm against an automatically constructed f-structure gold standard. The c-structures achieve an f-score of 83.7% and the f-structures an f-score of 91.2%.

    Inducing Compact but Accurate Tree-Substitution Grammars

    Tree substitution grammars (TSGs) are a compelling alternative to context-free grammars for modelling syntax. However, many popular techniques for estimating weighted TSGs (under the moniker of Data Oriented Parsing) suffer from the problems of inconsistency and over-fitting. We present a theoretically principled model which solves these problems using a Bayesian non-parametric formulation. Our model learns compact and simple grammars, uncovering latent linguistic structures (e.g., verb subcategorisation), and in doing so far out-performs a standard PCFG.

    Exploiting multi-word units in statistical parsing and generation

    Syntactic parsing is an important prerequisite for many natural language processing (NLP) applications. The task refers to the process of generating the tree of syntactic nodes with associated phrase category labels corresponding to a sentence. Our objective is to improve upon statistical models for syntactic parsing by leveraging multi-word units (MWUs) such as named entities and other classes of multi-word expressions. Multi-word units are phrases that are lexically, syntactically and/or semantically idiosyncratic in that they are to at least some degree non-compositional. If such units are identified prior to, or as part of, the parsing process, their boundaries can be exploited as islands of certainty within the very large (and often highly ambiguous) search space. Luckily, certain types of MWUs can be readily identified in an automatic fashion (using a variety of techniques) to a near-human level of accuracy. We carry out a number of experiments which integrate knowledge about different classes of MWUs in several commonly deployed parsing architectures. In a supplementary set of experiments, we attempt to exploit these units in the converse operation to statistical parsing, namely statistical generation (in our case, surface realisation from Lexical-Functional Grammar f-structures). We show that, by exploiting knowledge about MWUs, certain classes of parsing and generation decisions are more accurately resolved. This translates to improvements in overall parsing and generation results which, although modest, are demonstrably significant.

    Arc-Standard Spinal Parsing with Stack-LSTMs

    We present a neural transition-based parser for spinal trees, a dependency representation of constituent trees. The parser uses Stack-LSTMs that compose constituent nodes with dependency-based derivations. In experiments, we show that this model adapts to different styles of dependency relations, but this choice has little effect on predicting constituent structure, suggesting that LSTMs induce useful states by themselves. Comment: IWPT 201