The Wall Street Journal experiment (and useful programs)
This document reports parsing experiments on the standard Wall Street Journal corpus ("standard" meaning that the corpus has been widely used to test parsing models). The syntactic models tested are standard Stochastic Context-Free Grammars, standard Tree Substitution Grammars, Gibbsian Context-Free Grammars and Gibbsian Tree Substitution Grammars. The experiments are described in enough detail for the reader to repeat them from scratch (i.e. preparing the database, then training and evaluating the models). The programs developed for these experiments are also described.
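As a rough illustration of the simplest model listed above, a stochastic context-free grammar scores a parse tree as the product of the probabilities of the rules used in it. The sketch below assumes a made-up toy grammar; the rule table `RULE_PROB`, the tuple encoding of trees and the function name `tree_probability` are hypothetical and are not taken from the programs the document describes.

```python
# Hypothetical toy grammar (an assumption for illustration only).
# Probabilities of the rules sharing a left-hand side sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.5,
    ("NP", ("fish",)): 0.5,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("eats",)): 1.0,
}


def tree_probability(tree):
    """Return the probability of a parse tree under RULE_PROB.

    A tree is a tuple (label, child, ...) where each child is either
    a subtree (another tuple) or a word (a string).
    """
    label, *children = tree
    # The right-hand side of the rule is the sequence of child labels or words.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    # Multiply in the probability of every subtree.
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p


tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
print(tree_probability(tree))
```

A tree that uses a rule missing from the table raises `KeyError` here; a real parser would typically treat such a tree as having probability zero or apply smoothing.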
Lexicalization and Grammar Development
In this paper we present a fully lexicalized grammar formalism as a
particularly attractive framework for the specification of natural language
grammars. We discuss in detail Feature-based, Lexicalized Tree Adjoining
Grammars (FB-LTAGs), a representative of the class of lexicalized grammars. We
illustrate the advantages of lexicalized grammars in various contexts of
natural language processing, ranging from wide-coverage grammar development to
parsing and machine translation. We also present a method for compact and
efficient representation of lexicalized trees. Comment: ps file. English w/ German abstract. 10 pages
Developing and applying heterogeneous phylogenetic models with XRate
Modeling sequence evolution on phylogenetic trees is a useful technique in
computational biology. Especially powerful are models which take account of the
heterogeneous nature of sequence evolution according to the "grammar" of the
encoded gene features. However, beyond a modest level of model complexity,
manual coding of models becomes prohibitively labor-intensive. We demonstrate,
via a set of case studies, the new built-in model-prototyping capabilities of
XRate (macros and Scheme extensions). These features allow rapid implementation
of phylogenetic models which would have previously been far more
labor-intensive. XRate's new capabilities for lineage-specific models,
ancestral sequence reconstruction, and improved annotation output are also
discussed. XRate's flexible model-specification capabilities and computational
efficiency make it well-suited to developing and prototyping phylogenetic
grammar models. XRate is available as part of the DART software package:
http://biowiki.org/DART . Comment: 34 pages, 3 figures, glossary of XRate model terminology
On external presentations of infinite graphs
The vertices of a finite state system are usually a subset of the natural
numbers. Most algorithms for these systems use this fact only to select
vertices.
For infinite state systems, however, the situation is different: in
particular, for such systems having a finite description, each state of the
system is a configuration of some machine. Then most algorithmic approaches
rely on the structure of these configurations. Such characterisations are said
to be internal. In order to apply algorithms detecting a structural property (like
identifying connected components) one may first have to transform the system
to fit the description needed by the algorithm. The problem with internal
characterisation is that it hides structural properties, and each solution
becomes ad hoc relative to the form of the configurations.
By contrast, external characterisations avoid explicit naming of the
vertices. Such characterisations are mostly defined via graph transformations.
In this paper we present two kinds of external characterisations:
deterministic graph rewriting, which in turn characterises regular graphs,
deterministic context-free languages, and rational graphs. Inverse substitution
from a generator (like the complete binary tree) provides characterisations for
prefix-recognizable graphs, the Caucal Hierarchy and rational graphs. We
illustrate how these characterisations provide an efficient tool for the
representation of infinite state systems.
D-Tree Grammars
DTG are designed to share some of the advantages of TAG while overcoming some
of its limitations. DTG involve two composition operations called subsertion
and sister-adjunction. The most distinctive feature of DTG is that, unlike TAG,
there is complete uniformity in the way that the two DTG operations relate
lexical items: subsertion always corresponds to complementation and
sister-adjunction to modification. Furthermore, DTG, unlike TAG, can provide a
uniform analysis for wh-movement in English and Kashmiri, despite the fact
that the wh element in Kashmiri appears in sentence-second position, and not
sentence-initial position as in English. Comment: Latex source, needs aclap.sty, 8 pages, to appear in ACL-9
Multiple Context-Free Tree Grammars: Lexicalization and Characterization
Multiple (simple) context-free tree grammars are investigated, where "simple"
means "linear and nondeleting". Every multiple context-free tree grammar that
is finitely ambiguous can be lexicalized; i.e., it can be transformed into an
equivalent one (generating the same tree language) in which each rule of the
grammar contains a lexical symbol. Due to this transformation, the rank of the
nonterminals increases at most by 1, and the multiplicity (or fan-out) of the
grammar increases at most by the maximal rank of the lexical symbols; in
particular, the multiplicity does not increase when all lexical symbols have
rank 0. Multiple context-free tree grammars have the same tree generating power
as multi-component tree adjoining grammars (provided the latter can use a
root-marker). Moreover, every multi-component tree adjoining grammar that is
finitely ambiguous can be lexicalized. Multiple context-free tree grammars have
the same string generating power as multiple context-free (string) grammars and
admit polynomial time parsing algorithms. A tree language can be generated by a
multiple context-free tree grammar if and only if it is the image of a regular
tree language under a deterministic finite-copying macro tree transducer.
Multiple context-free tree grammars can be used as a synchronous translation
device. Comment: 78 pages, 13 figures
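The abstract above defines a lexicalized grammar as one in which each rule contains a lexical symbol. A minimal sketch of that check follows, simplified to ordinary string context-free rules rather than the tree grammars the paper studies. The rule encoding, the example grammars and the function name `is_lexicalized` are assumptions made for illustration.

```python
def is_lexicalized(rules, lexical_symbols):
    """Return True if every rule's right-hand side contains a lexical symbol.

    rules: iterable of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    lexical_symbols: set of symbols treated as lexical (terminal) items.
    """
    return all(
        any(symbol in lexical_symbols for symbol in rhs)
        for _lhs, rhs in rules
    )


# Hypothetical example grammars (not taken from the paper).
lexicon = {"a", "b"}
not_lexicalized = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
lexicalized = [("S", ("a", "B")), ("B", ("b",))]

print(is_lexicalized(not_lexicalized, lexicon))
print(is_lexicalized(lexicalized, lexicon))
```

The first grammar fails the check because the rule S -> A B contains no lexical symbol. The second generates the same single string "a b" and passes.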
Capturing CFLs with Tree Adjoining Grammars
We define a decidable class of TAGs that is strongly equivalent to CFGs and
is cubic-time parsable. This class serves to lexicalize CFGs in the same manner
as the LCFGs of Schabes and Waters but with considerably less restriction on
the form of the grammars. The class provides a normal form for TAGs that
generate local sets in much the same way that regular grammars provide a normal
form for CFGs that generate regular sets. Comment: 8 pages, 3 figures. To appear in proceedings of ACL'9
Korean to English Translation Using Synchronous TAGs
It is often argued that accurate machine translation requires reference to
contextual knowledge for the correct treatment of linguistic phenomena such as
dropped arguments and accurate lexical selection. One of the historical
arguments in favor of the interlingua approach has been that, since it revolves
around a deep semantic representation, it is better able to handle the types of
linguistic phenomena that are seen as requiring a knowledge-based approach. In
this paper we present an alternative approach, exemplified by a prototype
system for machine translation of English and Korean which is implemented in
Synchronous TAGs. This approach is essentially transfer based, and uses
semantic feature unification for accurate lexical selection of polysemous
verbs. The same semantic features, when combined with a discourse model which
stores previously mentioned entities, can also be used for the recovery of
topicalized arguments. In this paper we concentrate on the translation of
Korean to English. Comment: ps file. 8 pages