Can Subcategorisation Probabilities Help a Statistical Parser?
Research into the automatic acquisition of lexical information from corpora
is starting to produce large-scale computational lexicons containing data on
the relative frequencies of subcategorisation alternatives for individual
verbal predicates. However, the empirical question of whether this type of
frequency information can in practice improve the accuracy of a statistical
parser has not yet been answered. In this paper we describe an experiment with
a wide-coverage statistical grammar and parser for English and
subcategorisation frequencies acquired from ten million words of text which
shows that this information can significantly improve parse accuracy. Comment: 9 pages, uses colacl.st
Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation
We describe an implemented system for robust domain-independent syntactic
parsing of English, using a unification-based grammar of part-of-speech and
punctuation labels coupled with a probabilistic LR parser. We present
evaluations of the system's performance along several different dimensions;
these enable us to assess the contribution that each individual part is making
to the success of the system as a whole, and thus prioritise the effort to be
devoted to its further enhancement. Currently, the system is able to parse
around 80% of sentences in a substantial corpus of general text containing a
number of distinct genres. On a random sample of 250 such sentences the system
has a mean crossing bracket rate of 0.71 and recall and precision of 83% and
84% respectively when evaluated against manually-disambiguated analyses. Comment: 10 pages, 1 Postscript figure. To Appear in Proceedings of the
Conference on Empirical Methods in Natural Language Processing, University of
Pennsylvania, May 199
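The evaluation measures named in this abstract (bracket precision, bracket recall and crossing brackets) can be illustrated with a minimal sketch. It is not the authors' evaluation code: the span encoding (half-open word offsets), the function names and the toy bracketings are all assumptions made for illustration. A corpus-level mean crossing bracket rate would be the average of the per-sentence crossing counts.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # hypothetical encoding: half-open word span [start, end)


def crosses(a: Span, b: Span) -> bool:
    """True if the spans overlap and neither one contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def bracket_scores(test: Set[Span], gold: Set[Span]):
    """Return (precision, recall, crossing count) for one sentence."""
    matched = len(test & gold)
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0
    # A test bracket counts once if it crosses any gold bracket.
    crossing = sum(1 for t in test if any(crosses(t, g) for g in gold))
    return precision, recall, crossing


# Toy bracketings for a five-word sentence (invented data).
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
test = {(0, 5), (0, 2), (1, 3), (3, 5)}
precision, recall, crossing = bracket_scores(test, gold)
print(precision, recall, crossing)
```

In the toy example three of the four proposed brackets match the gold analysis, and the bracket (1, 3) crosses the gold bracket (0, 2).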
Three Generative, Lexicalised Models for Statistical Parsing
In this paper we first propose a new statistical parsing model, which is a
generative model of lexicalised context-free grammar. We then extend the model
to include a probabilistic treatment of both subcategorisation and wh-movement.
Results on Wall Street Journal text show that the parser performs at 88.1/87.5%
constituent precision/recall, an average improvement of 2.3% over (Collins 96). Comment: 8 pages, to appear in Proceedings of ACL/EACL 97
Optimality Theory as a Framework for Lexical Acquisition
This paper re-investigates a lexical acquisition system initially developed
for French. We show that the architecture of the system reproduces and
implements the main components of Optimality Theory. However, we hypothesise
that some of its limitations are mainly due to a poor representation of the
constraints used. Finally, we show how a better representation of these
constraints would yield better results.
A distributional investigation of German verbs
This dissertation provides an empirical investigation of German verbs conducted on the basis of statistical descriptions acquired from a large corpus of German text. In a brief overview of the linguistic theory pertaining to the lexical semantics of verbs, I outline the idea that verb meaning is composed of argument structure (the number and types of arguments that co-occur with a verb) and aspectual structure (properties describing the temporal progression of an event referenced by the verb).
I then produce statistical descriptions of verbs according to these two distinct facets of meaning. In particular, I examine verbal subcategorisation, selectional preferences, and aspectual type. All three of these modelling strategies are evaluated on a common task, automatic verb classification. I demonstrate that automatically acquired features capturing verbal lexical aspect are beneficial for an application that concerns argument structure, namely semantic role labelling. Furthermore, I demonstrate that features capturing verbal argument structure perform well on the task of classifying a verb for its aspectual type. These findings suggest that these two facets of verb meaning are related in an underlying way.
Robust Grammatical Analysis for Spoken Dialogue Systems
We argue that grammatical analysis is a viable alternative to concept
spotting for processing spoken input in a practical spoken dialogue system. We
discuss the structure of the grammar, and a model for robust parsing which
combines linguistic sources of information and statistical sources of
information. We discuss test results suggesting that grammatical processing
allows fast and accurate processing of spoken input. Comment: Accepted for JNL
Statistical model of human lexical category disambiguation
Research in Sentence Processing is concerned with discovering the mechanism by
which linguistic utterances are mapped onto meaningful representations within the
human mind. Models of the Human Sentence Processing Mechanism (HSPM) can
be divided into those in which such mapping is performed by a number of limited
modular processes and those in which there is a single interactive process. A further,
and increasingly important, distinction is between models which rely on innate
preferences to guide decision processes and those which make use of experience-based
statistics.
In this context, the aims of the current thesis are two-fold:
• To argue that the correct architecture of the HSPM is both modular and
statistical - the Modular Statistical Hypothesis (MSH).
• To propose and provide empirical support for a position in which human
lexical category disambiguation occurs within a modular process, distinct
from syntactic parsing and guided by a statistical decision process.
Arguments are given for why a modular statistical architecture should be preferred
on both methodological and rational grounds. We then turn to the (often ignored)
problem of lexical category disambiguation and propose the existence of a pre-syntactic
Statistical Lexical Category Module (SLCM). A number of variants of the
SLCM are introduced. By empirically investigating this particular architecture we
also hope to provide support for the more general hypothesis - the MSH.
The SLCM has some interesting behavioural properties; the remainder of the thesis
empirically investigates whether these behaviours are observable in human sentence
processing. We first consider whether the results of existing studies might be
attributable to SLCM behaviour. Such evaluation provides support for an HSPM
architecture that includes this SLCM and allows us to determine which SLCM
variant is empirically most plausible. Predictions are made, using this variant, to
determine SLCM behaviour in the face of novel utterances; these predictions are then
tested using a self-paced reading paradigm. The results of this experimentation fully
support the inclusion of the SLCM in a model of the HSPM and are not compatible
with other existing models.
As the SLCM is a modular and statistical process, empirical evidence for the SLCM
also directly supports an HSPM architecture which is modular and statistical. We
therefore conclude that our results strongly support both the SLCM and the MSH.
However, more work is needed, both to produce further evidence and to define the
model further.
Automatic Extraction of Subcategorization from Corpora
We describe a novel technique and implemented system for constructing a
subcategorization dictionary from textual corpora. Each dictionary entry
encodes the relative frequency of occurrence of a comprehensive set of
subcategorization classes for English. An initial experiment, on a sample of 14
verbs which exhibit multiple complementation patterns, demonstrates that the
technique achieves accuracy comparable to previous approaches, which are all
limited to a highly restricted set of subcategorization classes. We also
demonstrate that a subcategorization dictionary built with the system improves
the accuracy of a parser by an appreciable amount. Comment: 8 pages; requires aclap.sty. To appear in ANLP-9
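To make concrete what a dictionary entry encoding "the relative frequency of occurrence" of subcategorization classes might look like, here is a minimal sketch. It assumes the (verb, class) observations have already been extracted from parsed text; the class labels, the function name and the sample data are hypothetical, and the filtering of unreliable observations that a real acquisition system would need is left out.

```python
from collections import Counter, defaultdict


def relative_frequencies(observations):
    """Map each verb to {subcategorization class: relative frequency}."""
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    return {
        verb: {frame: n / sum(frames.values()) for frame, n in frames.items()}
        for verb, frames in counts.items()
    }


# Invented observations; the class labels are illustrative only.
observations = [
    ("give", "NP_NP"),
    ("give", "NP_PP"),
    ("give", "NP_NP"),
    ("give", "NP_NP"),
    ("sleep", "INTRANS"),
]
lexicon = relative_frequencies(observations)
print(lexicon["give"])
```

Each entry sums to 1 over the classes observed for that verb, so the values can be read as estimated probabilities of the verb taking each complementation pattern.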
Corpus Annotation for Parser Evaluation
We describe a recently developed corpus annotation scheme for evaluating
parsers that avoids shortcomings of current methods. The scheme encodes
grammatical relations between heads and dependents, and has been used to mark
up a new public-domain corpus of naturally occurring English text. We show how
the corpus can be used to evaluate the accuracy of a robust parser, and relate
the corpus to extant resources. Comment: 7 pages, LaTeX (uses eaclap.sty)
Treebank-based grammar acquisition for German
Manual development of deep linguistic resources is time-consuming and costly and therefore often described as a bottleneck for traditional rule-based NLP. In my PhD thesis I present a treebank-based method for the automatic acquisition of LFG resources for German. The method automatically creates deep and rich linguistic presentations from labelled data (treebanks) and can be applied to large data sets.
My research is based on and substantially extends previous work on automatically acquiring wide-coverage, deep, constraint-based grammatical resources from the English Penn-II treebank (Cahill et al., 2002; Burke et al., 2004; Cahill, 2004). Best results for English show a dependency f-score of 82.73% (Cahill et al., 2008) against the PARC 700 dependency bank, outperforming the best hand-crafted grammar of Kaplan et al. (2004). Preliminary work has been carried out to test the approach on languages other than English, providing proof of concept for the applicability of the method (Cahill et al., 2003; Cahill, 2004; Cahill et al., 2005).
While first results have been promising, a number of important research questions have been raised. The original approach presented first in Cahill et al. (2002) is strongly tailored to English and the data structures provided by the Penn-II treebank (Marcus et al., 1993).
English is configurational and rather poor in inflectional forms. German, by contrast, features semi-free word order and a much richer morphology. Furthermore, treebanks for German differ considerably from the Penn-II treebank as regards data structures and encoding schemes underlying the grammar acquisition task.
In my thesis I examine the impact of language-specific properties of German as well as linguistically motivated treebank design decisions on PCFG parsing and LFG grammar acquisition. I present experiments investigating the influence of treebank design on PCFG parsing and show which types of representation are useful for the PCFG and LFG grammar acquisition tasks. Furthermore, I present a novel approach to cross-treebank comparison, measuring the effect of controlled error insertion on treebank trees and parser output from different treebanks. I complement the cross-treebank comparison by providing a human evaluation using TePaCoC, a new test suite for testing parser performance on complex grammatical constructions. Manual evaluation on TePaCoC data provides new insights on the impact of flat vs. hierarchical annotation schemes on data-driven parsing. I present treebank-based LFG acquisition methodologies for two German treebanks. An extensive evaluation along different dimensions complements the investigation and provides valuable insights for the future development of treebanks.