17,761 research outputs found
Minimally-Supervised Morphological Segmentation using Adaptor Grammars
This paper explores the use of Adaptor Grammars, a nonparametric Bayesian modelling framework, for minimally supervised morphological segmentation. We compare three training methods: unsupervised training, semi-supervised training, and a novel model selection method. In the model selection method, we train unsupervised Adaptor Grammars using an over-articulated metagrammar, then use a small labelled data set to select which potential morph boundaries identified by the metagrammar should be returned in the final output. We evaluate on five languages and show that semi-supervised training provides a boost over unsupervised training, while the model selection method yields the best average results over all languages and is competitive with state-of-the-art semi-supervised systems. Moreover, this method provides the potential to tune performance according to different evaluation metrics or downstream tasks.
Automatic acquisition of LFG resources for German - as good as it gets
We present data-driven methods for the acquisition of LFG resources from two German treebanks. We discuss problems specific to semi-free word order languages as well as problems arising from the data structures determined by the design of the different treebanks. We compare two ways of encoding semi-free word order, as done in the two German treebanks, and argue that the design of the TiGer treebank is more adequate for the acquisition of LFG resources. Furthermore, we describe an architecture for LFG grammar acquisition for German, based on the two German treebanks, and compare our results with a hand-crafted German LFG grammar.
Use of Weighted Finite State Transducers in Part of Speech Tagging
This paper addresses issues in part of speech disambiguation using
finite-state transducers and presents two main contributions to the field. One
of them is the use of finite-state machines for part of speech tagging.
Linguistic and statistical information is represented in terms of weights on
transitions in weighted finite-state transducers. Another contribution is the
successful combination of techniques -- linguistic and statistical -- for word
disambiguation, compounded with the notion of word classes.
Lexicalization and Grammar Development
In this paper we present a fully lexicalized grammar formalism as a
particularly attractive framework for the specification of natural language
grammars. We discuss in detail Feature-based, Lexicalized Tree Adjoining
Grammars (FB-LTAGs), a representative of the class of lexicalized grammars. We
illustrate the advantages of lexicalized grammars in various contexts of
natural language processing, ranging from wide-coverage grammar development to
parsing and machine translation. We also present a method for compact and
efficient representation of lexicalized trees.
Some Novel Applications of Explanation-Based Learning to Parsing Lexicalized Tree-Adjoining Grammars
In this paper we present some novel applications of the Explanation-Based
Learning (EBL) technique to parsing Lexicalized Tree-Adjoining Grammars. The
novel aspects are (a) immediate generalization of parses in the training set,
(b) generalization over recursive structures and (c) representation of
generalized parses as Finite State Transducers. A highly impoverished parser
called a "stapler" has also been introduced. We present experimental results
using EBL for different corpora and architectures to show the effectiveness of
our approach.
Syntactic parsing of unrestricted Spanish text
This research focuses on the syntactic parsing of morphologically
tagged corpora. A proposal for a corpus-oriented Spanish grammar is
presented in this document. This work has been developed in the
framework of the ITEM project, whose main goal is to provide
multilingual background for information extraction and retrieval
tasks. The main goal of the Tacat analyser is to provide a way of
obtaining large amounts of bracketed and parsed corpora, both general and domain-specific. Tacat uses context-free grammars and takes as input the categories of the Parole specification. The incremental
methodology that we use allows us to recognise different levels of
complexity in the analysis and to produce compatible outputs from all
the grammars.