35 research outputs found

    Compacting the Penn Treebank Grammar

    Treebanks, such as the Penn Treebank (PTB), offer a simple approach to obtaining a broad-coverage grammar: one can simply read the grammar off the parse trees in the treebank. While such a grammar is easy to obtain, a square-root rate of growth of the rule set with corpus size suggests that the derived grammar is far from complete and that much more treebanked text would be required to obtain a complete grammar, if one exists at some limit. However, we offer an alternative explanation in terms of the underspecification of structures within the treebank. This hypothesis is explored by applying an algorithm to compact the derived grammar by eliminating redundant rules -- rules whose right-hand sides can be parsed by other rules. The size of the resulting compacted grammar, which is significantly less than that of the full treebank grammar, is shown to approach a limit. However, such a compacted grammar does not yield very good performance figures. A version of the compaction algorithm taking rule probabilities into account is proposed, which is argued to be more linguistically motivated. Combined with simple thresholding, this method can be used to give a 58% reduction in grammar size without significant change in parsing performance, and can produce a 69% reduction with some gain in recall, but a loss in precision. (Comment: 5 pages, 2 figures.)
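
    To make the rule-parsing notion concrete: a rule is redundant when its right-hand side can be derived by the remaining rules of the grammar. Below is a minimal sketch of such a redundancy check, assuming rules are given as (LHS, RHS-tuple) pairs and that the grammar contains no unary rule cycles; the paper's actual algorithm and its probabilistic variant are not reproduced here.

```python
from functools import lru_cache

def make_redundancy_checker(rules):
    """rules: list of (lhs, rhs) pairs, with rhs a tuple of categories.
    Returns a predicate that tests whether a rule's RHS can be parsed
    by the *other* rules, i.e. whether the rule is redundant.
    Sketch only: assumes no unary rule cycles (A -> B, B -> A)."""
    by_lhs = {}
    for lhs, rhs in rules:
        by_lhs.setdefault(lhs, []).append(rhs)

    def is_redundant(rule):
        lhs, seq = rule

        @lru_cache(maxsize=None)
        def derives(cat, i, j):
            # A single category trivially covers itself.
            if j - i == 1 and seq[i] == cat:
                return True
            return any(covers(body, i, j)
                       for body in by_lhs.get(cat, [])
                       if (cat, body) != rule and len(body) <= j - i)

        def covers(body, i, j):
            # Can the symbols of `body` partition seq[i:j], in order?
            if not body:
                return i == j
            first, rest = body[0], body[1:]
            return any(derives(first, i, k) and covers(rest, k, j)
                       for k in range(i + 1, j - len(rest) + 1))

        return len(seq) > 1 and derives(lhs, 0, len(seq))

    return is_redundant

rules = [("NP", ("DT", "NN")),
         ("NP", ("NP", "PP")),
         ("NP", ("DT", "NN", "PP"))]
check = make_redundancy_checker(rules)
print(check(("NP", ("DT", "NN", "PP"))))  # True: first two rules cover it
```

    In the demo, the flat rule NP -> DT NN PP is judged redundant because NP -> DT NN and NP -> NP PP together parse its right-hand side.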

    Evaluating two methods for Treebank grammar compaction

    Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad-coverage grammars. In the simplest case, rules can simply be ‘read off’ the parse annotations of the corpus, producing either a simple or a probabilistic context-free grammar. Such grammars, however, can be very large, with consequent computational costs for parsing under the grammar. In this paper, we explore ways in which a treebank grammar can be reduced in size, or ‘compacted’, involving two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision.
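
    Technique (i) is simple enough to sketch directly: drop rules below an occurrence threshold and renormalise the surviving counts into a PCFG. The threshold and data layout below are illustrative, not the paper's settings.

```python
from collections import Counter

def threshold_pcfg(rule_counts, min_count=2):
    """Drop rules seen fewer than min_count times in the treebank, then
    renormalise the surviving counts into per-LHS rule probabilities.
    rule_counts: dict mapping (lhs, rhs) -> treebank frequency."""
    kept = {r: c for r, c in rule_counts.items() if c >= min_count}
    totals = Counter()
    for (lhs, _), c in kept.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in kept.items()}

counts = {("NP", ("DT", "NN")): 40, ("NP", ("NP", "PP")): 9,
          ("NP", ("DT", "JJ", "JJ", "NN", "PP")): 1}
print(threshold_pcfg(counts))  # the singleton flat rule is gone
```

    The reason aggressive thresholds can be survivable is Zipfian: most rule types are rare, so removing them discards many types but little probability mass.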

    Experiments in Structure-Preserving Grammar Compaction

    Structure-preserving grammar compaction (SPC) is a simple CFG compaction technique originally described in (van Genabith et al., 1999a, 1999b). It works by generalising category labels and, in so doing, plugs holes in the grammar. To date the method has been tested on small corpora only. In the present research we apply SPC to a large grammar extracted from the Penn Treebank and examine its effects on treebank grammar size and on rule accession rates (as an indicator of grammar completeness).

    Treebanks and resources compiled from treebanks are potentially very useful in NLP. Grammars extracted from treebanks, so-called treebank grammars (Charniak, 1996), can form the basis of large-coverage NLP systems. Such treebank grammars, however, can suffer from several shortcomings: they commonly feature a large number of flat, highly specific rules that may be rarely used, with ensuing costs for processing (load) under the grammar.
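
    To illustrate the core of SPC: generalising labels makes previously distinct rules identical, so deduplication shrinks the grammar, and the generalised rules also license configurations unseen in training, which is how the method plugs holes. A minimal sketch with a hypothetical label map, not the generalisations the authors actually use:

```python
def compact(rules, label_map):
    """Structure-preserving compaction, schematically: rewrite every
    category through a many-to-one label_map and deduplicate; rules that
    differed only in generalised labels collapse into a single rule."""
    def g(c):
        return label_map.get(c, c)
    return {(g(lhs), tuple(g(c) for c in rhs)) for (lhs, rhs) in rules}

# Hypothetical generalisation: strip PTB-style function tags.
label_map = {"NP-SBJ": "NP", "NP-OBJ": "NP", "ADVP-TMP": "ADVP"}
rules = {("S", ("NP-SBJ", "VP")), ("S", ("NP", "VP"))}
print(compact(rules, label_map))  # {('S', ('NP', 'VP'))}: one rule left
```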

    From treebank resources to LFG F-structures

    We present two methods for automatically annotating treebank resources with functional structures. Both methods define systematic patterns of correspondence between partial PS configurations and functional structures. These are applied to PS rules extracted from treebanks, or directly to constraint-set encodings of treebank PS trees.
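
    As a rough picture of such correspondence patterns, the sketch below maps partial PS configurations (mother, daughter pairs) to LFG functional annotations; the patterns and equations are invented for illustration and are not the paper's actual rules.

```python
# Hypothetical correspondence patterns: a (mother, daughter category)
# configuration maps to an LFG functional annotation.
PATTERNS = {
    ("S", "NP"):  "(↑ SUBJ) = ↓",
    ("VP", "NP"): "(↑ OBJ) = ↓",
    ("VP", "PP"): "↓ ∈ (↑ ADJUNCT)",
}

def annotate_rule(lhs, rhs):
    """Attach a functional annotation to each daughter of a PS rule by
    matching partial configurations; unmatched daughters default to the
    head equation (a simplification of real annotation principles)."""
    return [(cat, PATTERNS.get((lhs, cat), "↑ = ↓")) for cat in rhs]

print(annotate_rule("S", ("NP", "VP")))
# [('NP', '(↑ SUBJ) = ↓'), ('VP', '↑ = ↓')]
```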

    Parsing with PCFGs and automatic f-structure annotation

    The development of large-coverage, rich unification- (constraint-) based grammar resources is very time-consuming, expensive and requires extensive linguistic expertise. In this paper we report initial results on a new methodology that attempts to partially automate the development of substantial parts of large-coverage, rich unification- (constraint-) based grammar resources. The method is based on a treebank resource (in our case Penn-II) and an automatic f-structure annotation algorithm that annotates treebank trees with proto-f-structure information. Based on these, we present two parsing architectures: in our pipeline architecture we first extract a PCFG from the treebank following the method of (Charniak, 1996), use the PCFG to parse new text, automatically annotate the resulting trees with our f-structure annotation algorithm and generate proto-f-structures. By contrast, in the integrated architecture we first automatically annotate the treebank trees with f-structure information and then extract an annotated PCFG (A-PCFG) from the treebank. We then use the A-PCFG to parse new text and generate proto-f-structures directly. Currently our best parsers achieve more than 81% f-score on the 2,400 trees in section 23 of the Penn-II treebank and more than 60% f-score on gold-standard proto-f-structures for 105 randomly selected trees from section 23.
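
    Schematically, the two architectures differ only in whether annotation happens before or after grammar extraction. The sketch below treats extract_pcfg, parse and annotate as placeholder callables, not the authors' actual components:

```python
def pipeline(extract_pcfg, parse, annotate, treebank, sentence):
    """Pipeline architecture: read a plain PCFG off the treebank, parse
    the sentence, then annotate the resulting tree with f-equations."""
    return annotate(parse(extract_pcfg(treebank), sentence))

def integrated(extract_pcfg, parse, annotate, treebank, sentence):
    """Integrated architecture: annotate the treebank trees first, so the
    extracted grammar (the A-PCFG) carries f-equations in its category
    labels and parsing yields annotated trees directly."""
    return parse(extract_pcfg([annotate(t) for t in treebank]), sentence)
```

    One visible trade-off: the A-PCFG splits categories by their annotations, so it is larger and sparser, while the pipeline keeps the grammar small but must annotate possibly erroneous parser output.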

    Automatic F-Structure Annotation from the AP Treebank

    We present a method for automatically annotating treebank resources with functional structures. The method defines systematic patterns of correspondence between partial PS configurations and functional structures, which are applied to PS rules extracted from treebanks. The set of techniques we have developed constitutes a methodology for corpus-guided grammar development. Despite the widespread belief that treebank representations are not very useful in grammar development, we show that systematic patterns of c-structure to f-structure correspondence can be simply and successfully stated over such rules. The method is partial in that it requires manual correction of the annotated grammar rules.

    Treebank vs. xbar-based automatic f-structure annotation

    Manual, large-scale (computational) grammar development is time-consuming, expensive and requires extensive linguistic expertise. More recently, a number of alternatives based on treebank resources (such as Penn-II, Susanne and the AP treebank) have been explored. The idea is to automatically "induce", or rather read off, (P)CFG grammars from the parse-annotated treebank resources and to use the treebank grammars thus obtained in (probabilistic) parsing or as a starting point for further grammar development. The approach is cheap, fast, automatic, large-scale, "data-driven" and based on real language resources.

    Treebank grammars typically involve large sets of lexical tags and non-lexical categories, as syntactic information tends to be encoded in monadic category symbols. They feature flat rules (trees) that can "underspecify" attachment possibilities, and they do not in general follow Xbar architectural design principles (which is not to say that treebank grammars lack design principles). As a consequence, treebank grammars tend to have very large CFG rule bases (e.g. Penn-II yields more than 17,000 CFG rules for about 1 million words of text), often with only minimally differing rules. Even though treebank grammars are large, they are still incomplete, exhibiting unabated rule accession rates.

    From a grammar-engineering point of view, the size of the rule base poses problems for maintainability, extendability and, if a treebank grammar is to be used as a CF-base in an LFG grammar, for functional (feature-structure) annotation. From the point of view of theoretical linguistics, flat treebank trees and the grammars extracted from them do not express linguistic generalisations. From the perspective of empirical and corpus linguistics, however, flat trees are well motivated: they allow underspecification of subtle and often time-consuming attachment decisions. Indeed, it is sometimes doubted whether highly general Xbar schemata usefully scale to "real" language.

    In previous work we developed methodologies for the automatic feature-structure annotation of grammars extracted from treebanks. Automatic annotation of "raw" treebank grammars is difficult, as annotation rules must identify subsequences in the RHSs of flat treebank rules in order to encode head, complement and modifier relations explicitly. Xbar-based CFG rules should substantially facilitate automatic feature-structure annotation of grammar rules.

    In the present paper we conduct a number of experiments to explore a space of possible grammars based on a small fragment of the AP treebank resource. Starting from the original treebank fragment, we automatically extract a CFG G. We then apply an automatic structure-preserving grammar compaction step, which generalises categories in the original treebank fragment and reduces the number of rules extracted, resulting in a generalised treebank fragment and a compacted grammar Gc. The generalised fragment is then manually corrected to catch missed constituents (and the like), resulting in an automatically extracted, compacted and (effectively manually) corrected grammar Gc,m. Manual correction proceeds in the "spirit" of treebank grammars (we do not introduce Xbar analyses). We then explore how many of the manual correction steps on treebank trees can be achieved automatically: we develop, implement and test an automatic treebank "grooming" methodology which, applied to the generalised treebank fragment, yields a compacted and automatically corrected grammar Gc,a.

    Grammars Gc,m and Gc,a are very similar to compiled-out "flat" LFG-82 style grammars. We explore regular-expression-based compaction (both manual and automatic) to relate Gc,m to an LFG-82 style grammar design. Finally, we manually recode a subsection of the generalised and manually corrected treebank fragment into "vanilla-flavour" Xbar-based trees, from which we extract a compacted, manually corrected, Xbar-based grammar Gc,m,x. We evaluate our grammars and methods using standard labelled bracketing measures and according to how well they perform under automatic feature-structure annotation tasks.
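
    Since rule accession rates serve throughout these papers as the indicator of grammar (in)completeness, here is a minimal sketch of how such a curve can be computed; extract_rules is a placeholder for reading the local CFG rules off a tree.

```python
def accession_curve(trees, extract_rules, step=1000):
    """Track distinct rule types seen after every `step` trees. A curve
    that keeps climbing (an 'unabated accession rate') signals that the
    extracted grammar is still far from complete."""
    seen, curve = set(), []
    for i, tree in enumerate(trees, 1):
        seen.update(extract_rules(tree))
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Toy demo: 'trees' here are already lists of rules, so extraction is identity.
toy = [[("S", ("NP", "VP"))], [("NP", ("DT", "NN"))], [("S", ("NP", "VP"))]]
print(accession_curve(toy, lambda t: t, step=1))  # [(1, 1), (2, 2), (3, 2)]
```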

    Noun phrase recognition with tree patterns

    This paper presents a machine-learning method for noun phrase recognition in Hungarian natural-language texts. The approach learns noun phrase tree patterns, described by regular expressions, from an annotated corpus; the patterns are then assigned probability values derived from error statistics. The noun phrase recogniser searches for the best-fitting trees for a sentence using a backtracking technique. The results are used in an information extraction toolchain.
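
    One way to picture the search: if the learned tree patterns are regular expressions over a POS-tag string, the recogniser can backtrack over "match a pattern here" versus "skip this tag". The sketch below invents its patterns and probabilities; it is not the system's learned model.

```python
import re

# Hypothetical tree patterns over a space-terminated POS-tag string, each
# with a probability derived from error statistics (values invented here).
PATTERNS = [(re.compile(r"(?:Det )?(?:Adj )*Noun "), 0.9),
            (re.compile(r"Noun Noun "), 0.6)]

def best_cover(tags, i=0):
    """Backtracking search for the highest-scoring segmentation of the
    tag string into NP matches and skipped tags. Exponential without
    memoisation; fine for a sketch, not for production."""
    if i >= len(tags):
        return 0.0, []
    # Option 1: skip one tag (every tag must be followed by a space).
    best, chunks = best_cover(tags, tags.index(" ", i) + 1)
    # Option 2: start an NP with any pattern matching at this position.
    for pattern, prob in PATTERNS:
        m = pattern.match(tags, i)
        if m:
            score, rest = best_cover(tags, m.end())
            # '>=' prefers an NP match over skipping on score ties.
            if prob + score >= best:
                best, chunks = prob + score, [tags[i:m.end()].strip()] + rest
    return best, chunks

print(best_cover("Det Adj Noun Verb Noun Noun "))
# (~2.7, ['Det Adj Noun', 'Noun', 'Noun'])
```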

    Large-scale induction and evaluation of lexical resources from the Penn-II treebank

    In this paper we present a methodology for extracting subcategorisation frames based on an automatic LFG f-structure annotation algorithm for the Penn-II Treebank. We extract abstract syntactic function-based subcategorisation frames (LFG semantic forms), traditional CFG category-based subcategorisation frames, as well as mixed function/category-based frames, with or without preposition information for obliques and particle information for particle verbs. Our approach does not predefine frames, associates probabilities with frames conditional on the lemma, distinguishes between active and passive frames, and fully reflects the effects of long-distance dependencies in the source data structures. We extract 3,586 verb lemmas and 14,348 semantic form types (an average of 4 per lemma) with 577 frame types. We present a large-scale evaluation of the complete set of forms extracted against the full COMLEX resource.
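
    The frame probabilities conditional on the lemma are relative frequencies over the extracted semantic forms. A minimal sketch, with hypothetical (lemma, frame) observations standing in for the annotation algorithm's output:

```python
from collections import Counter, defaultdict

def frame_probs(observations):
    """P(frame | lemma) as relative frequencies. `observations` is an
    iterable of (lemma, frame) pairs harvested from annotated parses."""
    counts = defaultdict(Counter)
    for lemma, frame in observations:
        counts[lemma][frame] += 1
    return {lemma: {f: c / sum(fs.values()) for f, c in fs.items()}
            for lemma, fs in counts.items()}

obs = [("give", ("SUBJ", "OBJ", "OBJ2")),
       ("give", ("SUBJ", "OBJ")),
       ("give", ("SUBJ", "OBJ"))]
print(frame_probs(obs)["give"])
# {('SUBJ', 'OBJ', 'OBJ2'): 0.333..., ('SUBJ', 'OBJ'): 0.666...}
```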

    Ontological Engineering For Source Code Generation

    Source Code Generation (SCG) is a sub-domain of Automatic Programming (AP) that helps programmers to program at a high level of abstraction. Recently, many researchers have investigated techniques for SCG. The problem is choosing the technique appropriate to a given purpose and set of inputs. This paper presents a review and analysis of SCG techniques, together with comparisons covering: technique mapping, Natural Language Processing (NLP), knowledge bases, ontologies, the Specification Configuration Template (SCT) model and deep learning.