Automatic Acquisition of Lexical-Functional Grammar Resources from a Japanese Dependency Corpus
PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 200
DCU 250 Arabic dependency bank: an LFG gold standard resource for the Arabic Penn treebank
This paper describes the construction of a dependency bank gold standard for Arabic, DCU 250 Arabic Dependency Bank (DCU 250), based on the Arabic Penn Treebank Corpus (ATB) (Bies and Maamouri, 2003; Maamouri and Bies, 2004) within the theoretical framework of Lexical Functional Grammar (LFG). For parsing and automatically extracting grammatical and lexical resources from treebanks, it is necessary to evaluate against established gold standard resources. Gold standards for various languages have been developed, but to our knowledge, such a resource has not yet been constructed for Arabic. The construction of the DCU 250 marks the first step towards the creation of an automatic LFG f-structure annotation algorithm for the ATB, and for the extraction of Arabic grammatical and lexical resources
Treebank-based acquisition of wide-coverage, probabilistic LFG resources: project overview, results and evaluation
This paper presents an overview of a project to acquire wide-coverage, probabilistic Lexical-Functional Grammar (LFG) resources from treebanks. Our approach is based on an automatic annotation algorithm that annotates "raw" treebank trees with LFG f-structure information approximating to basic predicate-argument/dependency structure. From the f-structure-annotated treebank we extract probabilistic unification grammar resources. We present the annotation algorithm, the extraction of lexical information and the acquisition of wide-coverage and robust PCFG-based LFG approximations including long-distance dependency resolution. We show how the methodology can be applied to multilingual, treebank-based unification grammar acquisition. Finally, we show how simple (quasi-)logical forms can be derived automatically from the f-structures generated for the treebank trees
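To make the idea of reading a probabilistic grammar off a treebank concrete, the following is a minimal sketch of relative-frequency PCFG estimation in Python. It is not the project's implementation: the toy trees, the nested-tuple encoding and the function names `rules` and `estimate_pcfg` are assumptions made for illustration, and the sketch ignores the f-structure annotations that the abstract describes.

```python
from collections import Counter

# Hypothetical toy treebank. Each tree is a nested tuple
# (label, child, child, ...); leaves are plain strings.
TOY_TREEBANK = [
    ("S", ("NP", "John"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "Mary"), ("VP", ("V", "sees"), ("NP", "John"))),
]


def rules(tree):
    """Yield one (lhs, rhs) pair for every local tree in `tree`."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_totals = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}


pcfg = estimate_pcfg(TOY_TREEBANK)
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.3f}")
```

The probabilities of all rules sharing a left-hand side sum to one, which is the defining property of a PCFG estimated this way.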
Treebank-based acquisition of LFG parsing resources for French
Motivated by the expense in time and other resources to produce hand-crafted grammars, there has been increased interest in automatically obtained wide-coverage grammars from treebanks for natural language processing. In particular, recent years have seen the growth in interest in automatically obtained deep resources that can represent information absent from simple CFG-type structured treebanks and which are considered to produce more language-neutral linguistic representations, such as dependency syntactic trees. As is often the case in early pioneering work on natural language processing, English has provided the focus of first efforts towards acquiring deep-grammar resources, followed by successful treatments of, for example, German, Japanese, Chinese and Spanish. However, no comparable large-scale automatically acquired deep-grammar resources have been obtained for French to date. The goal of this paper is to present the application of treebank-based language acquisition to the case of French. We show that with modest changes to the established parsing architectures, encouraging results can be obtained for French, with a best dependency structure f-score of 86.73%
Optimality Theory as a Framework for Lexical Acquisition
This paper re-investigates a lexical acquisition system initially developed for French. We show that, interestingly, the architecture of the system reproduces and implements the main components of Optimality Theory. However, we formulate the hypothesis that some of its limitations are mainly due to a poor representation of the constraints used. Finally, we show how a better representation of the constraints used would yield better results
Automatic acquisition of Spanish LFG resources from the Cast3LB treebank
In this paper, we describe the automatic annotation of the Cast3LB Treebank with LFG f-structures for the subsequent extraction of Spanish probabilistic grammar and lexical resources. We adapt the approach and methodology of Cahill et al. (2004), O'Donovan et al. (2004) and elsewhere for English to Spanish and the Cast3LB treebank encoding. We report on the quality and coverage of the automatic f-structure annotation. Following the pipeline and integrated models of Cahill et al. (2004), we extract wide-coverage probabilistic LFG approximations and parse unseen Spanish text into f-structures. We also extend Bikel's (2002) Multilingual Parse Engine to include a Spanish language module. Using the retrained Bikel parser in the pipeline model gives the best results against a manually constructed gold standard (73.20% preds-only f-score). We also extract Spanish lexical resources: 4090 semantic form types with 98 frame types. Subcategorised prepositions and particles are included in the frames
Treebank-based automatic acquisition of wide coverage, deep linguistic resources for Japanese
The objective of this thesis is to design, implement and evaluate a methodology for the automatic acquisition of wide-coverage treebank-based deep linguistic resources for Japanese, as part of the GramLab project which focuses on the automatic treebank-based induction of multilingual resources in the framework of Lexical-Functional Grammar (LFG).
After introducing the basic framework of LFG in Chapter 2, I describe the core syntactic and morphological aspects of Japanese in Chapter 3: non-configurationality; the concept of "bunsetsu" or syntactic units and their dependency relationship represented in Directed Acyclic Graphs (DAGs); topicalisation by a particular particle; and frequent use of zero pronouns with or without overt antecedents. Inflecting parts-of-speech and non-inflecting parts-of-speech of Japanese are also described with examples.
In Chapter 4, I provide the linguistic representation of core grammatical features and functions of Japanese in the framework of LFG. I use Directed Acyclic Graphs (DAGs) as a framework for the unified representation of surface syntactic, morphological and lexical information in an LFG f-structure.
In Chapters 5 and 6, I describe the automatic annotation algorithm of LFG f-structure functional equations (i.e. labelled dependencies) to the Kyoto Text Corpus version 4.0 (KTC4) and the output of the Kurohashi-Nagao Parser (KNP); KTC4 and KNP provide unlabelled dependencies only. The method presented in this dissertation also includes zero pronoun identification.
Finally in Chapter 7 I evaluate the performance of the f-structure annotation algorithm with zero-pronoun identification for KTC4 against a manually-corrected Gold Standard of 500 sentences randomly chosen from KTC4. Using KTC4 treebank trees, currently my method achieves a pred-only dependency f-score of 94.72%. The parsing experiments using KNP output yield a pred-only dependency f-score of 82.38%
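As an illustration of how a dependency f-score of this kind is typically computed, here is a minimal Python sketch that compares predicted dependency triples against gold-standard ones. The (relation, head, dependent) triple format, the romanised example words and the function name `dependency_prf` are assumptions made for this sketch; the thesis's own evaluation software and triple encoding are not described in the abstract.

```python
def dependency_prf(gold, predicted):
    """Return (precision, recall, f-score) over two collections of triples.

    Triples are treated as sets, which assumes each triple occurs at most
    once per sentence; a corpus-level evaluation would sum the counts over
    all sentences before computing the scores.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical triples for one sentence: (relation, head, dependent).
gold = {
    ("subj", "yomu", "gakusei"),
    ("obj", "yomu", "hon"),
    ("adjunct", "hon", "atarashii"),
}
predicted = {
    ("subj", "yomu", "gakusei"),
    ("obj", "yomu", "hon"),
    ("adjunct", "yomu", "atarashii"),  # wrong head, so counted as an error
}

p, r, f = dependency_prf(gold, predicted)
print(f"precision={p:.4f} recall={r:.4f} f-score={f:.4f}")
```

In this toy case two of the three predicted triples match the gold standard, so precision, recall and f-score all come out at two thirds.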
Treebank-based acquisition of Chinese LFG resources for parsing and generation
This thesis describes a treebank-based approach to automatically acquire robust, wide-coverage Lexical-Functional Grammar (LFG) resources for Chinese parsing and generation, which is part of a larger project on the rapid construction of deep, large-scale, constraint-based, multilingual grammatical resources. I present an application-oriented LFG analysis for Chinese core linguistic phenomena and (in cooperation with PARC) develop a gold-standard dependency-bank of Chinese f-structures for evaluation. Based on the Penn Chinese Treebank, I design and implement two architectures for inducing Chinese LFG resources, one annotation-based and the other dependency conversion-based. I then apply the f-structure acquisition algorithm together with external, state-of-the-art parsers to parsing new text into "proto" f-structures. In order to convert "proto" f-structures into "proper" f-structures or deep dependencies, I present a novel Non-Local Dependency (NLD) recovery algorithm using subcategorisation frames and f-structure paths linking antecedents and traces in NLDs extracted from the automatically-built LFG f-structure treebank. Based on the grammars extracted from the f-structure annotated treebank, I develop a PCFG-based chart generator and a new n-gram based pure dependency generator to realise Chinese sentences from LFG f-structures.
The work reported in this thesis is the first effort to scale treebank-based, probabilistic Chinese LFG resources from proof-of-concept research to unrestricted, real text. Although this thesis concentrates on Chinese and LFG, many of the methodologies, e.g. the acquisition of predicate-argument structures, NLD resolution and the PCFG- and dependency n-gram-based generation models, are largely language and formalism independent and should generalise to diverse languages as well as to labelled bilexical dependency representations other than LFG
Treebank-Based Deep Grammar Acquisition for French Probabilistic Parsing Resources
Motivated by the expense in time and other resources to produce hand-crafted grammars, there has been increased interest in wide-coverage grammars automatically obtained from treebanks. In particular, recent years have seen a move towards acquiring deep (LFG, HPSG and CCG) resources that can represent information absent from simple CFG-type structured treebanks and which are considered to produce more language-neutral linguistic representations, such as syntactic dependency trees. As is often the case in early pioneering work in natural language processing, English has been the focus of attention in the first efforts towards acquiring treebank-based deep-grammar resources, followed by treatments of, for example, German, Japanese, Chinese and Spanish. However, to date no comparable large-scale automatically acquired deep-grammar resources have been obtained for French. The goal of the research presented in this thesis is to develop, implement, and evaluate treebank-based deep-grammar acquisition techniques for French. Along the way towards achieving this goal, this thesis presents the derivation of a new treebank for French from the Paris 7 Treebank, the Modified French Treebank, a cleaner, more coherent treebank with several transformed structures and new linguistic analyses. Statistical parsers trained on this data outperform those trained on the original Paris 7 Treebank, which has five times the amount of data. The Modified French Treebank is the data source used for the development of treebank-based automatic deep-grammar acquisition for LFG parsing resources for French, based on an f-structure annotation algorithm for this treebank. LFG CFG-based parsing architectures are then extended and tested, achieving a competitive best f-score of 86.73% for all features. The CFG-based parsing architectures are then complemented with an alternative dependency-based statistical parsing approach, obviating the CFG-based parsing step, and instead directly parsing strings into f-structures
Automatic Scaling of Text for Training Second Language Reading Comprehension
For children learning their first language, reading is one of the most effective ways to acquire new vocabulary. Studies link students who read more with larger and more complex vocabularies. For second language learners, there is a substantial barrier to reading. Even the books written for early first language readers assume a base vocabulary of nearly 7000 word families and a nuanced understanding of grammar. This project will look at ways that technology can help second language learners overcome this high barrier to entry, and the effectiveness of learning through reading for adults acquiring a foreign language. Through the implementation of Dokusha, an automatic graded reader generator for Japanese, this project will explore how advancements in natural language processing can be used to automatically simplify text for extensive reading in Japanese as a foreign language