An Abstract Machine for Unification Grammars
This work describes the design and implementation of an abstract machine,
Amalia, for the linguistic formalism ALE, which is based on typed feature
structures. This formalism is one of the most widely accepted in computational
linguistics and has been used for designing grammars in various linguistic
theories, most notably HPSG. Amalia is composed of data structures and a set of
instructions, augmented by a compiler from the grammatical formalism to the
abstract instructions, and a (portable) interpreter of the abstract
instructions. The effect of each instruction is defined using a low-level
language that can be executed on ordinary hardware.
The advantages of the abstract machine approach are twofold. From a
theoretical point of view, the abstract machine gives a well-defined
operational semantics to the grammatical formalism. This ensures that grammars
specified using our system are endowed with a well-defined meaning. It makes it
possible, for example, to formally verify the correctness of a compiler for HPSG, given
an independent definition. From a practical point of view, Amalia is the first
system that employs a direct compilation scheme for unification grammars that
are based on typed feature structures. The use of Amalia results in a much
improved performance over existing systems.
In order to test the machine on a realistic application, we have developed a
small-scale, HPSG-based grammar for a fragment of the Hebrew language, using
Amalia as the development platform. This is the first application of HPSG to a
Semitic language.
Comment: Doctoral Thesis, 96 pages, many PostScript figures; uses pstricks,
pst-node, psfig, fullname and a macros file
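The unification operation that such a machine must execute can be sketched compactly. The following is a minimal illustration of destructive unification over typed feature structures, with a toy type hierarchy of our own devising; it is not Amalia's actual instruction set or data layout:

```python
# A minimal sketch of destructive unification over typed feature
# structures. The types, features, and flat type hierarchy below are
# illustrative assumptions, not taken from ALE or Amalia.

class FS:
    """A typed feature structure: a type plus named sub-structures."""
    def __init__(self, type_, feats=None):
        self.type = type_
        self.feats = feats or {}
        self.forward = None  # set when this node is unified away

    def deref(self):
        node = self
        while node.forward is not None:
            node = node.forward
        return node

# A toy join (least upper bound) over a flat type hierarchy.
JOIN = {("sign", "word"): "word", ("word", "sign"): "word"}

def join(t1, t2):
    if t1 == t2:
        return t1
    return JOIN.get((t1, t2))  # None means the types are incompatible

def unify(a, b):
    """Unify two feature structures in place; return the result or None."""
    a, b = a.deref(), b.deref()
    if a is b:
        return a
    t = join(a.type, b.type)
    if t is None:
        return None
    a.type = t
    b.forward = a  # b now points at a (union-find style)
    for feat, sub in b.feats.items():
        if feat in a.feats:
            if unify(a.feats[feat], sub) is None:
                return None
        else:
            a.feats[feat] = sub
    return a

# Example: unify a generic sign carrying AGR with a word.
x = unify(FS("sign", {"AGR": FS("sg")}), FS("word"))
print(x.type, sorted(x.feats))  # word ['AGR']
```

Compiling each grammar rule down to a fixed sequence of such low-level steps, rather than interpreting feature structures at run time, is what a direct compilation scheme amounts to.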
A Note on the Complexity of Restricted Attribute-Value Grammars
The recognition problem for attribute-value grammars (AVGs) was shown to be
undecidable by Johnson in 1988. Therefore, the general form of AVGs is of no
practical use. In this paper we study a very restricted form of AVG, for which
the recognition problem is decidable (though still NP-complete), the R-AVG. We
show that the R-AVG formalism captures all of the context free languages and
more, and introduce a variation on the so-called `off-line parsability
constraint', the `honest parsability constraint', which lets different types of
R-AVG coincide precisely with well-known time complexity classes.
Comment: 18 pages, also available by (1) anonymous ftp at
ftp://ftp.fwi.uva.nl/pub/theory/illc/researchReports/CT-95-02.ps.gz ; (2) WWW
from http://www.fwi.uva.nl/~mtrautwe
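For readers unfamiliar with the formalism, an attribute-value grammar augments context-free rules with feature constraints. The toy sketch below (our own illustration, not the paper's R-AVG definition) shows a rule that is admitted only when its feature equations hold:

```python
# A toy attribute-value grammar check: a context-free rule
# S -> NP VP is admitted only if its feature equations hold,
# here NP.AGR = VP.AGR. Feature names and values are illustrative.

def satisfies(constraints, *avms):
    """Check path equations like (0, 'AGR') == (1, 'AGR') over
    attribute-value matrices given as plain dicts."""
    for (i, attr_i), (j, attr_j) in constraints:
        if avms[i].get(attr_i) != avms[j].get(attr_j):
            return False
    return True

rule_constraints = [((0, "AGR"), (1, "AGR"))]  # subject-verb agreement

np = {"CAT": "NP", "AGR": "3sg"}
vp = {"CAT": "VP", "AGR": "3sg"}
print(satisfies(rule_constraints, np, vp))  # True

vp_pl = {"CAT": "VP", "AGR": "pl"}
print(satisfies(rule_constraints, np, vp_pl))  # False
```

Undecidability arises in the general formalism because unrestricted feature equations can encode arbitrary computation; restrictions such as R-AVG bound that power, which is why recognition drops back to NP-complete.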
A Feature-Based Lexicalized Tree Adjoining Grammar for Korean
This document describes an on-going project of developing a grammar of Korean, the Korean XTAG grammar, written in the TAG formalism and implemented for use with the XTAG system enriched with a Korean morphological analyzer. The Korean XTAG grammar described in this report is based on the TAG formalism (Joshi et al. (1975)), which has been extended to include lexicalization (Schabes et al. (1988)) and unification-based feature structures (Vijay-Shanker and Joshi (1991)).
The document first describes the modifications that we have made to the XTAG system (The XTAG-Group (1998)) to handle the rich inflectional morphology of Korean. It then describes various syntactic phenomena that can currently be handled, including adverb modification, relative clauses, complex noun phrases, auxiliary verb constructions, gerunds and adjunct clauses. The work reported here is a first step towards the development of an implemented TAG grammar for Korean, which is continuously updated with the addition of new analyses and the modification of old ones.
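The adjoining operation that distinguishes TAG from context-free derivation can be sketched in a few lines. The trees below are toy English examples of our own, not structures from the Korean XTAG grammar:

```python
# A minimal sketch of TAG's adjoining operation, using nested lists
# (label, children...) for trees. "VP*" marks the foot node of the
# auxiliary tree. The example trees are illustrative assumptions.

def adjoin(tree, aux, label):
    """Adjoin `aux` at the topmost node labeled `label` on each branch."""
    node_label, *children = tree
    if node_label == label:
        return plug(aux, tree, label + "*")
    return [node_label] + [adjoin(c, aux, label) if isinstance(c, list) else c
                           for c in children]

def plug(aux, subtree, foot):
    """Replace the foot node of `aux` with the excised subtree."""
    node_label, *children = aux
    if node_label == foot:
        return subtree
    return [node_label] + [plug(c, subtree, foot) if isinstance(c, list) else c
                           for c in children]

initial = ["S", ["NP", "John"], ["VP", ["V", "sleeps"]]]
aux = ["VP", ["Adv", "often"], ["VP*"]]  # auxiliary tree for an adverb

result = adjoin(initial, aux, "VP")
print(result)
# ['S', ['NP', 'John'], ['VP', ['Adv', 'often'], ['VP', ['V', 'sleeps']]]]
```

In a feature-based TAG like XTAG, each node additionally carries top and bottom feature structures that are unified during adjunction; that bookkeeping is omitted here.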
Baldwinian accounts of language evolution
Since Hinton & Nowlan published their seminal paper (Hinton & Nowlan 1987), the
neglected evolutionary process of the Baldwin effect has been widely acknowledged.
Especially in the field of language evolution, the Baldwin effect (Baldwin 1896d,
Simpson 1953) has been expected to salvage the long-lasting deadlocked situation of
modern linguistics: i.e., it may shed light on the relationship between environment
and innateness in the formation of language.
However, as intensive research on this evolutionary theory has proceeded, certain robust
difficulties have become apparent. One example is genotype-phenotype correlation.
Using computer simulations, both Yamauchi (1999, 2001) and Mayley (1996) show
that for the Baldwin effect to work legitimately, correlation between genotypes and
phenotypes is the most essential underpinning. This is because this type of
the Baldwin effect adopts as its core mechanism Waddington's (1975) "genetic
assimilation". In this mechanism, phenocopies have to be genetically closer to the
innately predisposed genotype. Unfortunately, this is an overly naive assumption
for the theory of language evolution: since linguistic ability is a highly complex
cognitive capacity, the possibility that this type of genotype-phenotype
correlation exists in its domain is vanishingly small.
In this thesis, we develop a new type of mechanism, called "Baldwinian Niche
Construction" (BNC), that has rich explanatory power and can potentially
overcome this bewildering problem of the Baldwin effect. BNC is based on the theory
of niche construction that has been developed by Odling-Smee et al. (2003). The
incorporation of the theory into the Baldwin effect was first suggested by Deacon
(1997) and briefly introduced by Godfrey-Smith (2003). However, its formulation
is as yet incomplete.
In the thesis, we first review the studies of the Baldwin effect in both biology
and the study of language evolution. Then the theory of BNC is more rigorously
developed. Linguistic communication has an intrinsic property that is naturally
captured by the theory of niche construction. This leads us to the
theoretical necessity of BNC in language evolution. By creating a new linguistic
niche, learning discloses a previously hidden genetic variance on which the Baldwin
'canalizing' effect can take place. It requires no genetic modification in a given
gene pool; the genes responsible for learning need not even occupy the
same loci as the genes for innate linguistic knowledge. These and other aspects of
BNC are presented with some results from computer simulations.
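The classic simulation behind this line of work can be reproduced in a few dozen lines. The sketch below follows the allele proportions and fitness formula of Hinton and Nowlan's (1987) model, but the generation count is reduced for brevity and the remaining details are our own simplifications, not the thesis's simulations:

```python
# A compact sketch of the Hinton & Nowlan (1987) model of the Baldwin
# effect. Genotypes range over {0, 1, ?}; '?' alleles are plastic and
# are settled by random guessing during a lifetime of learning trials.
# The target is taken (as in the paper) to be the all-1 configuration.
import random

L, POP_SIZE, TRIALS, GENS = 20, 1000, 1000, 10
rng = random.Random(0)

def random_genome():
    # Original allele proportions: 1/4 zeros, 1/4 ones, 1/2 plastic.
    return [rng.choice([0, 1, "?", "?"]) for _ in range(L)]

def fitness(genome):
    """Lifetime fitness: 1 plus a bonus for learning the target fast."""
    if 0 in genome:               # a wrong innate allele is unfixable
        return 1.0
    p = 0.5 ** genome.count("?")  # chance one guess settles all '?'s
    for g in range(TRIALS):
        if rng.random() < p:      # target learned on trial g
            return 1.0 + 19.0 * (TRIALS - g) / TRIALS
    return 1.0

def crossover(a, b):
    cut = rng.randrange(1, L)     # single-point crossover, no mutation
    return a[:cut] + b[cut:]

def evolve():
    pop = [random_genome() for _ in range(POP_SIZE)]
    for _ in range(GENS):
        fits = [fitness(g) for g in pop]
        pop = [crossover(*rng.choices(pop, weights=fits, k=2))
               for _ in range(POP_SIZE)]
    return pop

final = [allele for genome in evolve() for allele in genome]
print("fraction correct (1):", final.count(1) / len(final))
print("fraction plastic (?):", final.count("?") / len(final))
```

The genotype-phenotype correlation criticized in the text is baked into this model: a learned phenocopy differs from the innate genotype only in its '?' positions, which is exactly the assumption BNC is designed to do without.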
Supervised Training on Synthetic Languages: A Novel Framework for Unsupervised Parsing
This thesis focuses on unsupervised dependency parsing—parsing sentences of a language into dependency trees without accessing the training data of that language. Different from most prior work that uses unsupervised learning to estimate the parsing parameters, we estimate the parameters by supervised training on synthetic languages. Our parsing framework has three major components: Synthetic language generation gives a rich set of training languages by mix-and-match over the real languages; surface-form feature extraction maps an unparsed corpus of a language into a fixed-length vector as the syntactic signature of that language; and, finally, language-agnostic parsing incorporates the syntactic signature during parsing so that the decision on each word token is reliant upon the general syntax of the target language.
The fundamental question we are trying to answer is whether some useful information about the syntax of a language could be inferred from its surface-form evidence (unparsed corpus). This is the same question that has been implicitly asked by previous papers on unsupervised parsing, which assume only an unparsed corpus to be available for the target language. We show that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well.
This thesis contains several large-scale experiments requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of unsupervised parsing yet attempted. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous works' interpretable typological features, which require parsed corpora or expert categorization of the language.
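As a concrete illustration of a surface-form syntactic signature, a corpus of POS sequences can be reduced to a fixed-length vector of bigram frequencies. The thesis's actual feature extractor is learned end-to-end by a neural network; the hand-crafted features and toy tag set below are only our own stand-in:

```python
# An illustrative sketch of mapping an unparsed corpus of POS
# sequences to a fixed-length "syntactic signature" vector.
from itertools import product

TAGS = ["NOUN", "VERB", "ADJ", "ADP", "DET"]  # a toy tag set

def signature(corpus):
    """Map POS-tagged sentences to normalized bigram frequencies."""
    index = {bigram: i for i, bigram in enumerate(product(TAGS, repeat=2))}
    vec = [0.0] * len(index)
    total = 0
    for sent in corpus:
        for bigram in zip(sent, sent[1:]):
            vec[index[bigram]] += 1.0
            total += 1
    return [v / total for v in vec] if total else vec

corpus = [["DET", "NOUN", "VERB"], ["DET", "ADJ", "NOUN", "VERB"]]
sig = signature(corpus)
print(len(sig))  # 25  (fixed length regardless of corpus size)
```

Because the vector has the same length for every language, a single language-agnostic parser can condition on it, which is what allows training on synthetic languages to transfer to an unseen target language.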