Rapid Development of Morphological Descriptions for Full Language Processing Systems
I describe a compiler and development environment for feature-augmented
two-level morphology rules integrated into a full NLP system. The compiler is
optimized for a class of languages including many or most European ones, and
for rapid development and debugging of descriptions of new languages. The key
design decision is to compose morphophonological and morphosyntactic
information, but not the lexicon, when compiling the description. This results
in typical compilation times of about a minute, and has allowed a reasonably
full, feature-based description of French inflectional morphology to be
developed in about a month by a linguist new to the system.
Comment: 8 pages, LaTeX (2.09 preferred); eaclap.sty; Procs of Euro ACL-9
Learning tree patterns for syntactic parsing
This paper presents a method for parsing Hungarian texts using a machine learning approach. The method collects the initial grammar for a learner from an annotated corpus with the help of tree shapes. The PGS algorithm, an improved version of the RGLearn algorithm, was developed and applied to learning tree patterns with various phrase types described by regular expressions. The method also calculates the probability values of the learned tree patterns. A syntactic parser for the learned grammar uses the Viterbi algorithm to search quickly for the most probable derivation of a sentence. The results were built into an information extraction pipeline.
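As a rough illustration of what a Viterbi search for the most probable derivation involves, the sketch below runs a probabilistic CKY pass over a toy grammar in Chomsky normal form. It is not the PGS algorithm and does not model the regular-expression tree patterns described in the abstract; the function name `viterbi_cky`, the rule tables, and the English toy sentence are all hypothetical and chosen only to keep the example short.

```python
import math
from collections import defaultdict


def viterbi_cky(words, lexical, binary, start="S"):
    """Return (log_prob, tree) for the most probable derivation, or None.

    lexical: {word: {label: prob}}           (hypothetical toy format)
    binary:  {(parent, left, right): prob}   (hypothetical toy format)
    """
    n = len(words)
    # best[(i, j)][label] = (log_prob, tree) for the span words[i:j]
    best = defaultdict(dict)

    # Width-1 spans come straight from the lexical rules.
    for i, word in enumerate(words):
        for label, prob in lexical.get(word, {}).items():
            best[(i, i + 1)][label] = (math.log(prob), (label, word))

    # Wider spans: try every split point and every binary rule,
    # keeping only the highest-scoring analysis per label.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = best[(i, j)]
            for k in range(i + 1, j):
                left_cell = best[(i, k)]
                right_cell = best[(k, j)]
                for (parent, left, right), prob in binary.items():
                    if left in left_cell and right in right_cell:
                        score = (math.log(prob)
                                 + left_cell[left][0]
                                 + right_cell[right][0])
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (
                                score,
                                (parent, left_cell[left][1], right_cell[right][1]),
                            )

    return best[(0, n)].get(start)


# Toy grammar (assumed for illustration only).
lexical = {
    "dogs": {"NP": 1.0},
    "chase": {"V": 0.9, "NP": 0.1},
    "cats": {"NP": 1.0},
}
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}

result = viterbi_cky(["dogs", "chase", "cats"], lexical, binary)
```

Here `result` holds the log-probability and the bracketed tree of the best parse, and the function returns `None` when no derivation rooted in the start symbol covers the whole sentence. Working in log space avoids numerical underflow on longer sentences.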
A Feature-Based Lexicalized Tree Adjoining Grammar for Korean
This document describes an ongoing project of developing a grammar of Korean, the Korean XTAG grammar, written in the TAG formalism and implemented for use with the XTAG system enriched with a Korean morphological analyzer. The Korean XTAG grammar described in this report is based on the TAG formalism (Joshi et al. (1975)), which has been extended to include lexicalization (Schabes et al. (1988)), and unification-based feature structures (Vijay-Shanker and Joshi (1991)). The document first describes the modifications that we have made to the XTAG system (The XTAG-Group (1998)) to handle rich inflectional morphology in Korean. Then various syntactic phenomena that can be currently handled are described, including adverb modification, relative clauses, complex noun phrases, auxiliary verb constructions, gerunds and adjunct clauses. The work reported here is a first step towards the development of an implemented TAG grammar for Korean, which is continuously updated with the addition of new analyses and modification of old ones.
Wide-coverage parsing for Turkish
Wide-coverage parsing is an area that attracts much attention in natural language processing
research. This is because it is the first step to many other applications
in natural language understanding, such as question answering.
Supervised learning using human-labelled data is currently the best performing
method. Therefore, there is great demand for annotated data. However, human annotation
is very expensive, and the amount of annotated data is always much less than
is needed to train well-performing parsers. This is the motivation behind making the
best use of the data available. Turkish presents a challenge both because syntactically
annotated Turkish data is relatively scarce and because Turkish is highly agglutinative, hence
unusually sparse at the whole word level.
The METU-Sabancı Treebank is a dependency treebank of 5620 sentences with surface
dependency relations and morphological analyses for words. We show that including
even the crudest forms of morphological information extracted from the data boosts
the performance of both generative and discriminative parsers, contrary to received
opinion concerning English.
We induce word-based and morpheme-based CCG grammars from the Turkish dependency
treebank. We use these grammars to train a state-of-the-art CCG parser that
predicts long-distance dependencies in addition to the ones that other parsers are capable
of predicting. We also use the correct CCG categories as simple features in a
graph-based dependency parser and show that this improves the parsing results.
We show that a morpheme-based CCG lexicon for Turkish is able to solve many
problems such as conflicts of semantic scope, recovering long-range dependencies,
and obtaining smoother statistics from the models. CCG handles linguistic phenomena
such as local and long-range dependencies more naturally and effectively than other linguistic
theories while potentially supporting semantic interpretation in parallel. Using
morphological information and a morpheme-cluster based lexicon improves performance
both quantitatively and qualitatively for Turkish.
We also provide an improved version of the treebank which will be released by
kind permission of METU and Sabancı