96 research outputs found

    Can Subcategorisation Probabilities Help a Statistical Parser?

    Research into the automatic acquisition of lexical information from corpora is starting to produce large-scale computational lexicons containing data on the relative frequencies of subcategorisation alternatives for individual verbal predicates. However, the empirical question of whether this type of frequency information can in practice improve the accuracy of a statistical parser has not yet been answered. In this paper we describe an experiment with a wide-coverage statistical grammar and parser for English and subcategorisation frequencies acquired from ten million words of text, which shows that this information can significantly improve parse accuracy. Comment: 9 pages, uses colacl.st

    Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation

    We describe an implemented system for robust domain-independent syntactic parsing of English, using a unification-based grammar of part-of-speech and punctuation labels coupled with a probabilistic LR parser. We present evaluations of the system's performance along several different dimensions; these enable us to assess the contribution that each individual part is making to the success of the system as a whole, and thus prioritise the effort to be devoted to its further enhancement. Currently, the system is able to parse around 80% of sentences in a substantial corpus of general text containing a number of distinct genres. On a random sample of 250 such sentences the system has a mean crossing bracket rate of 0.71 and recall and precision of 83% and 84% respectively when evaluated against manually-disambiguated analyses. Comment: 10 pages, 1 Postscript figure. To Appear in Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, May 199
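
The crossing bracket rate, recall and precision quoted in this abstract are bracket-based parser evaluation measures. As a rough illustration only, the sketch below computes unlabelled versions of them for a single sentence, treating each constituent as a `(start, end)` word span. The function names and the example spans are hypothetical, and the paper's exact scoring conventions (labelled or unlabelled brackets, treatment of punctuation) are not stated in the abstract, so this should not be read as the authors' procedure.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def bracket_scores(gold, test):
    """Return (precision, recall, crossing_brackets) for one sentence.

    gold, test: iterables of (start, end) spans, end exclusive.
    Assumption: unlabelled brackets, with duplicate spans collapsed.
    """
    gold, test = set(gold), set(test)
    matched = len(gold & test)
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0
    # A test bracket counts once if it crosses any gold bracket.
    crossing = sum(1 for t in test if any(crosses(t, g) for g in gold))
    return precision, recall, crossing


# Hypothetical five-word sentence.
gold_spans = {(0, 5), (0, 2), (2, 5), (3, 5)}
test_spans = {(0, 5), (0, 3), (3, 5), (1, 3)}
scores = bracket_scores(gold_spans, test_spans)
```

A mean crossing bracket rate such as the one reported would then be the average of the third value over all sentences in the sample.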

    Three Generative, Lexicalised Models for Statistical Parsing

    In this paper we first propose a new statistical parsing model, which is a generative model of lexicalised context-free grammar. We then extend the model to include a probabilistic treatment of both subcategorisation and wh-movement. Results on Wall Street Journal text show that the parser performs at 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96). Comment: 8 pages, to appear in Proceedings of ACL/EACL 97

    Optimality Theory as a Framework for Lexical Acquisition

    This paper re-investigates a lexical acquisition system initially developed for French. We show that, interestingly, the architecture of the system reproduces and implements the main components of Optimality Theory. However, we formulate the hypothesis that some of its limitations are mainly due to a poor representation of the constraints used. Finally, we show how a better representation of the constraints used would yield better results.

    A distributional investigation of German verbs

    This dissertation provides an empirical investigation of German verbs conducted on the basis of statistical descriptions acquired from a large corpus of German text. In a brief overview of the linguistic theory pertaining to the lexical semantics of verbs, I outline the idea that verb meaning is composed of argument structure (the number and types of arguments that co-occur with a verb) and aspectual structure (properties describing the temporal progression of an event referenced by the verb). I then produce statistical descriptions of verbs according to these two distinct facets of meaning. In particular, I examine verbal subcategorisation, selectional preferences, and aspectual type. All three of these modelling strategies are evaluated on a common task, automatic verb classification. I demonstrate that automatically acquired features capturing verbal lexical aspect are beneficial for an application that concerns argument structure, namely semantic role labelling. Furthermore, I demonstrate that features capturing verbal argument structure perform well on the task of classifying a verb for its aspectual type. These findings suggest that these two facets of verb meaning are related in an underlying way.

    Robust Grammatical Analysis for Spoken Dialogue Systems

    We argue that grammatical analysis is a viable alternative to concept spotting for processing spoken input in a practical spoken dialogue system. We discuss the structure of the grammar, and a model for robust parsing which combines linguistic sources of information and statistical sources of information. We discuss test results suggesting that grammatical processing allows fast and accurate processing of spoken input. Comment: Accepted for JNL

    Automatic Extraction of Subcategorization from Corpora

    We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount. Comment: 8 pages; requires aclap.sty. To appear in ANLP-9
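
To make concrete what "relative frequency of occurrence" of subcategorization classes could look like as a dictionary entry, here is a minimal sketch that turns observed (verb, frame) pairs into per-verb relative frequencies. The frame labels, the input pairs and the function name are all hypothetical. The system in the abstract extracts its observations from corpora, a step this sketch skips entirely by assuming the pairs are already given.

```python
from collections import Counter, defaultdict


def relative_frequencies(observations):
    """Map each verb to {frame: relative frequency} from (verb, frame) pairs."""
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    return {
        verb: {frame: n / sum(frames.values()) for frame, n in frames.items()}
        for verb, frames in counts.items()
    }


# Hypothetical observations with made-up frame labels.
observed = [
    ("give", "NP_NP"),
    ("give", "NP_PP"),
    ("give", "NP_NP"),
    ("give", "NP_NP"),
    ("sleep", "INTRANS"),
]
lexicon = relative_frequencies(observed)
```

With these made-up observations, the entry for `give` assigns 0.75 to `NP_NP` and 0.25 to `NP_PP`.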

    Corpus Annotation for Parser Evaluation

    We describe a recently developed corpus annotation scheme for evaluating parsers that avoids shortcomings of current methods. The scheme encodes grammatical relations between heads and dependents, and has been used to mark up a new public-domain corpus of naturally occurring English text. We show how the corpus can be used to evaluate the accuracy of a robust parser, and relate the corpus to extant resources. Comment: 7 pages, LaTeX (uses eaclap.sty

    Parsing with automatically acquired, wide-coverage, robust, probabilistic LFG approximations

    Traditionally, rich, constraint-based grammatical resources have been hand-coded. Scaling such resources beyond toy fragments to unrestricted, real text is knowledge-intensive, time-consuming and expensive. The work reported in this thesis is part of a larger project to automate as much as possible the construction of wide-coverage, deep, constraint-based grammatical resources from treebanks. The Penn-II treebank is a large collection of parse-annotated newspaper text. We have designed a Lexical-Functional Grammar (LFG) (Kaplan and Bresnan, 1982) f-structure annotation algorithm to automatically annotate this treebank with f-structure information approximating to basic predicate-argument or dependency structures (Cahill et al., 2002c, 2004a). We then use the f-structure-annotated treebank resource to automatically extract grammars and lexical resources for parsing new text into f-structures. We have designed and implemented the Treebank Tool Suite (TTS) to support the linguistic work that seeds the automatic f-structure annotation algorithm (Cahill and van Genabith, 2002) and the F-Structure Annotation Tool (FSAT) to validate and visualise the results of automatic f-structure annotation. We have designed and implemented two PCFG-based probabilistic parsing architectures for parsing unseen text into f-structures: the pipeline and the integrated model. Both architectures parse raw text into basic, but possibly incomplete, predicate-argument structures (“proto f-structures”) with long distance dependencies (LDDs) unresolved (Cahill et al., 2002c). We have designed and implemented a method for automatically resolving LDDs at f-structure level based on a finite approximation of functional uncertainty equations (Kaplan and Zaenen, 1989) automatically acquired from the f-structure-annotated treebank resource (Cahill et al., 2004b). To date, the best result achieved by our own Penn-II induced grammars is a dependency f-score of 80.33% against the PARC 700, an improvement of 0.73% over the best hand-crafted grammar of Kaplan et al. (2004). The processing architecture developed in this thesis is highly flexible: using external, state-of-the-art parsing technologies (Charniak, 2000) in our pipeline model, we achieve a dependency f-score of 81.79% against the PARC 700, an improvement of 2.19% over the results reported in Kaplan et al. (2004). We have also ported our grammar induction methodology to German and the TIGER treebank resource (Cahill et al., 2003a). We have developed a method for treebank-based, wide-coverage, deep, constraint-based grammar acquisition. The resulting PCFG-based LFG approximations parse the Penn-II treebank with wider coverage (measured in terms of complete spanning parse) and parsing results comparable to or better than those achieved by the best hand-crafted grammars, with, we believe, considerably less grammar development effort. We believe that our approach successfully addresses the knowledge-acquisition bottleneck (familiar from rule-based approaches to AI and NLP) in wide-coverage, constraint-based grammar development. Our approach can provide an attractive, wide-coverage, multilingual, deep, constraint-based grammar acquisition paradigm.