12,318 research outputs found
Parsing without grammar - using complete trees instead
The definition of similarity between sentences is formulated on the levels of words, POS tags, and chunks (Abney 91; Abney 96). The evaluation of this approach shows that while precision and recall based on the PARSEVAL measures (Black et al. 91) do not reach state of the art Parsers yet (F1=87.19 on syntactic constituents, F1=77.78 including functionargument structure), the parser shows a very reliable performance where function-argument structure is concerned (F1=96.52). The lower F-scores are very often due to unattached constituents
Evaluating two methods for Treebank grammar compaction
Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad coverage grammars. In the simplest case, rules can simply be ‘read off’ the parse-annotations of the corpus, producing either a simple or probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational costs of parsing under the grammar.
In this paper, we explore ways by which a treebank grammar can be reduced in size or ‘compacted’, which involve the use of two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision
Compacting the Penn Treebank Grammar
Treebanks, such as the Penn Treebank (PTB), offer a simple approach to
obtaining a broad coverage grammar: one can simply read the grammar off the
parse trees in the treebank. While such a grammar is easy to obtain, a
square-root rate of growth of the rule set with corpus size suggests that the
derived grammar is far from complete and that much more treebanked text would
be required to obtain a complete grammar, if one exists at some limit. However,
we offer an alternative explanation in terms of the underspecification of
structures within the treebank. This hypothesis is explored by applying an
algorithm to compact the derived grammar by eliminating redundant rules --
rules whose right hand sides can be parsed by other rules. The size of the
resulting compacted grammar, which is significantly less than that of the full
treebank grammar, is shown to approach a limit. However, such a compacted
grammar does not yield very good performance figures. A version of the
compaction algorithm taking rule probabilities into account is proposed, which
is argued to be more linguistically motivated. Combined with simple
thresholding, this method can be used to give a 58% reduction in grammar size
without significant change in parsing performance, and can produce a 69%
reduction with some gain in recall, but a loss in precision.Comment: 5 pages, 2 figure
Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing
In a lexicalized grammar formalism such as Lexicalized Tree-Adjoining Grammar
(LTAG), each lexical item is associated with at least one elementary structure
(supertag) that localizes syntactic and semantic dependencies. Thus a parser
for a lexicalized grammar must search a large set of supertags to choose the
right ones to combine for the parse of the sentence. We present techniques for
disambiguating supertags using local information such as lexical preference and
local lexical dependencies. The similarity between LTAG and Dependency grammars
is exploited in the dependency model of supertag disambiguation. The
performance results for various models of supertag disambiguation such as
unigram, trigram and dependency-based models are presented.Comment: ps file. 8 page
The ModelCC Model-Driven Parser Generator
Syntax-directed translation tools require the specification of a language by
means of a formal grammar. This grammar must conform to the specific
requirements of the parser generator to be used. This grammar is then annotated
with semantic actions for the resulting system to perform its desired function.
In this paper, we introduce ModelCC, a model-based parser generator that
decouples language specification from language processing, avoiding some of the
problems caused by grammar-driven parser generators. ModelCC receives a
conceptual model as input, along with constraints that annotate it. It is then
able to create a parser for the desired textual syntax and the generated parser
fully automates the instantiation of the language conceptual model. ModelCC
also includes a reference resolution mechanism so that ModelCC is able to
instantiate abstract syntax graphs, rather than mere abstract syntax trees.Comment: In Proceedings PROLE 2014, arXiv:1501.0169
An Efficient Implementation of the Head-Corner Parser
This paper describes an efficient and robust implementation of a
bi-directional, head-driven parser for constraint-based grammars. This parser
is developed for the OVIS system: a Dutch spoken dialogue system in which
information about public transport can be obtained by telephone.
After a review of the motivation for head-driven parsing strategies, and
head-corner parsing in particular, a non-deterministic version of the
head-corner parser is presented. A memoization technique is applied to obtain a
fast parser. A goal-weakening technique is introduced which greatly improves
average case efficiency, both in terms of speed and space requirements.
I argue in favor of such a memoization strategy with goal-weakening in
comparison with ordinary chart-parsers because such a strategy can be applied
selectively and therefore enormously reduces the space requirements of the
parser, while no practical loss in time-efficiency is observed. On the
contrary, experiments are described in which head-corner and left-corner
parsers implemented with selective memoization and goal weakening outperform
`standard' chart parsers. The experiments include the grammar of the OVIS
system and the Alvey NL Tools grammar.
Head-corner parsing is a mix of bottom-up and top-down processing. Certain
approaches towards robust parsing require purely bottom-up processing.
Therefore, it seems that head-corner parsing is unsuitable for such robust
parsing techniques. However, it is shown how underspecification (which arises
very naturally in a logic programming environment) can be used in the
head-corner parser to allow such robust parsing techniques. A particular robust
parsing model is described which is implemented in OVIS.Comment: 31 pages, uses cl.st
Data-Oriented Language Processing. An Overview
During the last few years, a new approach to language processing has started
to emerge, which has become known under various labels such as "data-oriented
parsing", "corpus-based interpretation", and "tree-bank grammar" (cf. van den
Berg et al. 1994; Bod 1992-96; Bod et al. 1996a/b; Bonnema 1996; Charniak
1996a/b; Goodman 1996; Kaplan 1996; Rajman 1995a/b; Scha 1990-92; Sekine &
Grishman 1995; Sima'an et al. 1994; Sima'an 1995-96; Tugwell 1995). This
approach, which we will call "data-oriented processing" or "DOP", embodies the
assumption that human language perception and production works with
representations of concrete past language experiences, rather than with
abstract linguistic rules. The models that instantiate this approach therefore
maintain large corpora of linguistic representations of previously occurring
utterances. When processing a new input utterance, analyses of this utterance
are constructed by combining fragments from the corpus; the
occurrence-frequencies of the fragments are used to estimate which analysis is
the most probable one.
In this paper we give an in-depth discussion of a data-oriented processing
model which employs a corpus of labelled phrase-structure trees. Then we review
some other models that instantiate the DOP approach. Many of these models also
employ labelled phrase-structure trees, but use different criteria for
extracting fragments from the corpus or employ different disambiguation
strategies (Bod 1996b; Charniak 1996a/b; Goodman 1996; Rajman 1995a/b; Sekine &
Grishman 1995; Sima'an 1995-96); other models use richer formalisms for their
corpus annotations (van den Berg et al. 1994; Bod et al., 1996a/b; Bonnema
1996; Kaplan 1996; Tugwell 1995).Comment: 34 pages, Postscrip
- …