Amalia -- A Unified Platform for Parsing and Generation
Contemporary linguistic theories (in particular, HPSG) are declarative in
nature: they specify constraints on permissible structures, not how such
structures are to be computed. Grammars designed under such theories are,
therefore, suitable for both parsing and generation. However, practical
implementations of such theories don't usually support bidirectional processing
of grammars. We present a grammar development system that includes a compiler
of grammars (for parsing and generation) to abstract machine instructions, and
an interpreter for the abstract machine language. The generation compiler
inverts input grammars (designed for parsing) to a form more suitable for
generation. The compiled grammars are then executed by the interpreter using
one control strategy, regardless of whether the grammar is the original or the
inverted version. We thus obtain a unified, efficient platform for developing
reversible grammars.
Comment: 8 pages postscrip
An Abstract Machine for Unification Grammars
This work describes the design and implementation of an abstract machine,
Amalia, for the linguistic formalism ALE, which is based on typed feature
structures. This formalism is one of the most widely accepted in computational
linguistics and has been used for designing grammars in various linguistic
theories, most notably HPSG. Amalia is composed of data structures and a set of
instructions, augmented by a compiler from the grammatical formalism to the
abstract instructions, and a (portable) interpreter of the abstract
instructions. The effect of each instruction is defined using a low-level
language that can be executed on ordinary hardware.
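As a rough illustration of the core operation that a formalism based on feature structures relies on, the sketch below unifies two feature structures represented as nested Python dictionaries. It is a deliberately simplified assumption-laden example, not Amalia's instruction set or ALE's algorithm: it ignores types, the type hierarchy, and reentrancy (structure sharing), and the name `unify` is hypothetical.

```python
def unify(a, b):
    """Unify two untyped feature structures.

    A feature structure is either an atomic value (here: a string) or a
    dict mapping feature names to feature structures. Returns the unified
    structure, or None if the two structures clash.
    Simplification (assumption): no types and no reentrancy are modelled.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so the inputs are left untouched
        for feature, value in b.items():
            if feature in result:
                sub = unify(result[feature], value)
                if sub is None:
                    return None  # a clash below propagates upward
                result[feature] = sub
            else:
                result[feature] = value  # feature present only in b
        return result
    # Atomic values (or atom vs. complex structure) unify only if equal.
    return a if a == b else None


noun_phrase = {"agr": {"num": "sg"}}
verb_demand = {"agr": {"per": "3"}, "cat": "np"}
print(unify(noun_phrase, verb_demand))
print(unify({"agr": {"num": "sg"}}, {"agr": {"num": "pl"}}))
```

The first call merges compatible agreement information; the second fails because the two values for `num` clash. A compiled abstract machine would presumably avoid this kind of recursive interpretation by emitting specialised instructions per grammar rule, which is the performance argument the abstract makes.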
The advantages of the abstract machine approach are twofold. From a
theoretical point of view, the abstract machine gives a well-defined
operational semantics to the grammatical formalism. This ensures that grammars
specified using our system are endowed with a well-defined meaning. It makes it
possible, for example, to formally verify the correctness of a compiler for HPSG, given
an independent definition. From a practical point of view, Amalia is the first
system that employs a direct compilation scheme for unification grammars that
are based on typed feature structures. The use of Amalia results in a much
improved performance over existing systems.
In order to test the machine on a realistic application, we have developed a
small-scale, HPSG-based grammar for a fragment of the Hebrew language, using
Amalia as the development platform. This is the first application of HPSG to a
Semitic language.
Comment: Doctoral Thesis, 96 pages, many postscript figures, uses pstricks,
pst-node, psfig, fullname and a macros fil
The syntactic processing of particles in Japanese spoken language
Particles fulfill several distinct central roles in the Japanese language.
They can mark arguments as well as adjuncts, can be functional or have semantic
functions. There is, however, no straightforward matching from particles to
functions, as, e.g., GA can mark the subject, the object or an adjunct of a
sentence. Particles can co-occur. Verbal arguments that could be identified by
particles can be eliminated in the Japanese sentence. And finally, in spoken
language particles are often omitted. A proper treatment of particles is thus
necessary to make an analysis of Japanese sentences possible. Our treatment is
based on an empirical investigation of 800 dialogues. We set up a type
hierarchy of particles motivated by their subcategorizational and
modificational behaviour. This type hierarchy is part of the Japanese syntax in
VERBMOBIL.
Comment: 8 page
Wide-coverage deep statistical parsing using automatic dependency structure annotation
A number of researchers (Lin 1995; Carroll, Briscoe, and Sanfilippo 1998; Carroll et al. 2002; Clark and Hockenmaier 2002; King et al. 2003; Preiss 2003; Kaplan et al. 2004; Miyao and Tsujii 2004) have convincingly argued for the use of dependency (rather than CFG-tree) representations
for parser evaluation. Preiss (2003) and Kaplan et al. (2004) conducted a number of experiments comparing “deep” hand-crafted wide-coverage with “shallow” treebank- and machine-learning based parsers at the level of dependencies, using simple and automatic methods to convert tree output generated by the shallow parsers into dependencies. In this article, we revisit the experiments
in Preiss (2003) and Kaplan et al. (2004), this time using the sophisticated automatic LFG f-structure annotation methodologies of Cahill et al. (2002b, 2004) and Burke (2006), with surprising results. We compare various PCFG and history-based parsers (based on Collins, 1999; Charniak, 2000; Bikel, 2002) to find a baseline parsing system that fits best into our automatic dependency structure annotation technique. This combined system of syntactic parser and dependency structure annotation is compared to two hand-crafted, deep constraint-based parsers (Carroll and Briscoe 2002; Riezler et al. 2002). We evaluate using dependency-based gold standards (DCU 105, PARC 700, CBS 500 and dependencies for WSJ Section 22) and use the Approximate Randomization Test (Noreen 1989) to test the statistical significance of the results. Our experiments show that machine-learning-based shallow grammars augmented with sophisticated automatic dependency annotation technology outperform hand-crafted, deep, wide-coverage constraint grammars. Currently our best system achieves an f-score of 82.73% against the PARC 700 Dependency Bank (King et al. 2003), a statistically significant improvement of 2.18% over the most recent results of 80.55% for the hand-crafted LFG grammar and XLE parsing system of Riezler et al. (2002), and an f-score of 80.23% against the CBS 500 Dependency Bank (Carroll, Briscoe, and Sanfilippo 1998), a statistically significant 3.66% improvement over the 76.57% achieved by the hand-crafted RASP grammar and parsing system of Carroll and
Briscoe (2002)
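To make the evaluation measure concrete, the following is a minimal sketch of how an f-score over dependencies is commonly computed: parser output and gold standard are each treated as a set of (relation, head, dependent) triples, and precision and recall are taken over exact matches. The triple format, the function name `dependency_f_score`, and the toy data are assumptions for illustration; they are not the evaluation software or data used in the article.

```python
def dependency_f_score(gold, predicted):
    """Balanced f-score over two collections of dependency triples.

    Each dependency is a hashable triple such as
    ("subj", "sleeps", "dog"); only exact matches count as correct.
    """
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (invented data): three of four dependencies agree.
gold = [
    ("subj", "sleeps", "dog"),
    ("det", "dog", "the"),
    ("adjunct", "sleeps", "soundly"),
    ("tense", "sleeps", "pres"),
]
predicted = [
    ("subj", "sleeps", "dog"),
    ("det", "dog", "the"),
    ("adjunct", "dog", "soundly"),  # wrong attachment
    ("tense", "sleeps", "pres"),
]
print(dependency_f_score(gold, predicted))  # precision = recall = 3/4
```

The improvements reported in the abstract are absolute differences between such scores: 82.73 - 80.55 = 2.18 points on PARC 700 and 80.23 - 76.57 = 3.66 points on CBS 500.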
Principle Based Semantics for HPSG
The paper presents a constraint-based semantic formalism for HPSG. The
advantages of the formalism are shown with respect to a grammar for a fragment
of German that deals with (i) quantifier scope ambiguities triggered by
scrambling and/or movement and (ii) ambiguities that arise from the
collective/distributive distinction of plural NPs. The syntax-semantics
interface directly implements syntactic conditions on quantifier scoping and
distributivity. The construction of semantic representations is guided by
general principles governing the interaction between syntax and semantics. Each
of these principles acts as a constraint to narrow down the set of possible
interpretations of a sentence. Meanings of ambiguous sentences are represented
by single partial representations (so-called U(nderspecified) D(iscourse)
R(epresentation) S(tructure)s) to which further constraints can be added
monotonically to gain more information about the content of a sentence. There
is no need to build up a large number of alternative representations of the
sentence which are then filtered by subsequent discourse and world knowledge.
The advantage of UDRSs is not only that they allow for monotonic incremental
interpretation but also that they are equipped with truth conditions and a
proof theory that allows for inferences to be drawn directly on structures
where quantifier scope is not resolved
Treebank-based acquisition of a Chinese lexical-functional grammar
Scaling wide-coverage, constraint-based grammars such as Lexical-Functional Grammars (LFG) (Kaplan and Bresnan, 1982; Bresnan, 2001) or Head-Driven Phrase Structure Grammars (HPSG) (Pollard and Sag, 1994) from fragments to naturally occurring unrestricted text is knowledge-intensive, time-consuming and (often prohibitively) expensive. A number of researchers have recently presented methods to automatically acquire wide-coverage, probabilistic constraint-based grammatical resources from treebanks (Cahill et al., 2002, Cahill et al., 2003; Cahill et al., 2004; Miyao et al., 2003; Miyao et al., 2004; Hockenmaier and Steedman, 2002; Hockenmaier, 2003), addressing the knowledge acquisition bottleneck in constraint-based grammar development. Research to date has concentrated on English and German. In this paper we report on an experiment to induce wide-coverage, probabilistic LFG grammatical and lexical resources for Chinese from the Penn Chinese Treebank (CTB) (Xue et al., 2002) based on an automatic f-structure annotation algorithm. Currently 96.751% of the CTB trees receive a single, covering and connected f-structure, 0.112% do not receive an f-structure due to feature clashes, while 3.137% are associated with multiple f-structure fragments. From the f-structure-annotated CTB we extract a total of 12975 lexical entries with 20 distinct subcategorisation frame types. Of these 3436 are verbal entries with a total of 11 different frame types. We extract a number of PCFG-based LFG approximations. 
Currently our best automatically induced grammars achieve an f-score of 81.57% against the trees in unseen articles 301-325; 86.06% f-score (all grammatical functions) and 73.98% (preds-only) against the dependencies derived from the f-structures automatically generated for the original trees in 301-325 and 82.79% (all grammatical functions) and 67.74% (preds-only) against the dependencies derived from the manually annotated gold-standard f-structures for 50 trees randomly selected from articles 301-325
A prototype for projecting HPSG syntactic lexica towards LMF
The comparative evaluation of Arabic HPSG grammar lexica requires a deep
study of their linguistic coverage. The complexity of this task results mainly
from the heterogeneity of the descriptive components within those lexica
(underlying linguistic resources and different data categories, for example).
It is therefore essential to define more homogeneous representations, which in
turn will enable us to compare them and eventually merge them. In this context,
we present a method for comparing HPSG lexica based on a rule system. This
method is implemented within a prototype for the projection from Arabic HPSG to
a normalised pivot language compliant with LMF (ISO 24613 - Lexical Markup
Framework) and serialised using a TEI (Text Encoding Initiative) based
representation. The design of this system is based on an initial study of the
HPSG formalism looking at its adequacy for the representation of Arabic, and
from this, we identify the appropriate feature structures corresponding to each
Arabic lexical category and their possible LMF counterparts