6 research outputs found
A framework for lexical representation
In this paper we present a unification-based lexical platform designed for
highly inflected languages (such as the Romance languages). A formalism is proposed for
encoding a lemma-based lexical source, well suited for linguistic
generalizations. From this source, we automatically generate an
allomorph-indexed dictionary, adequate for efficient processing. A set of software tools
has been implemented around this formalism: access libraries, morphological
processors, etc.
Comment: 9 pages
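The generation step described above, going from a lemma-based source to an allomorph-indexed dictionary, can be pictured with a minimal Python sketch. The entry layout, the feature names (`stress`, `num`) and the function `build_allomorph_index` are assumptions made for illustration; the paper's actual formalism is not shown in the abstract.

```python
from collections import defaultdict

# Hypothetical lemma-based source: each lemma lists its stem allomorphs
# together with the (illustrative) features that select each one.
LEMMA_SOURCE = {
    "contar": {"cat": "v", "stems": [("cont", {"stress": "ending"}),
                                     ("cuent", {"stress": "stem"})]},
    "lápiz": {"cat": "n", "stems": [("lápiz", {"num": "sg"}),
                                    ("lápic", {"num": "pl"})]},
}

def build_allomorph_index(source):
    """Invert a lemma-keyed lexicon into an allomorph-keyed dictionary."""
    index = defaultdict(list)
    for lemma, entry in source.items():
        for form, feats in entry["stems"]:
            # Each allomorph entry carries its lemma, category and features.
            index[form].append({"lemma": lemma, "cat": entry["cat"], **feats})
    return dict(index)

ALLOMORPH_INDEX = build_allomorph_index(LEMMA_SOURCE)
```

Under these assumptions the lemma-keyed source stays compact and easy to maintain, while a processor can look up a surface stem such as `cuent` directly in `ALLOMORPH_INDEX`.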
GRAMPAL: A Morphological Processor for Spanish implemented in Prolog
A model for the full treatment of Spanish inflection for verbs, nouns and
adjectives is presented. This model is based on feature unification and it
relies upon a lexicon of allomorphs both for stems and morphemes. Word forms
are built by the concatenation of allomorphs by means of special contextual
features. We make use of standard Definite Clause Grammars (DCG) included in
most Prolog implementations, instead of the typical finite-state approach. This
allows us to take advantage of the declarativity and bidirectionality of Logic
Programming for NLP.
The most salient feature of this approach is simplicity: both the rule
component and the lexical component are really straightforward. We have developed a very simple
model for complex phenomena.
Declarativity, bidirectionality, consistency and completeness of the model
are discussed: all and only correct word forms are analysed or generated, including
alternative forms, and gaps in paradigms are preserved. A Prolog implementation
has been developed for both analysis and generation of Spanish word forms. It
consists of only six DCG rules, thanks to our lexicalist approach, i.e.
most information is in the dictionary. Although it is quite efficient, the
current implementation could be improved for analysis by using the non-logical
features of Prolog, especially in word segmentation and dictionary access.
Comment: 11 pages
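The idea of building word forms by concatenating allomorphs whose features unify can be sketched as follows. This is a Python illustration, not the Prolog DCG implementation the abstract describes; the tiny lexicon, the contextual feature `stress`, and the functions `unify`, `word_forms`, `analyse` and `generate` are all assumptions introduced for the example.

```python
def unify(a, b):
    """Unify two flat feature dicts; return None on a value clash."""
    out = dict(a)
    for key, value in b.items():
        if key in out and out[key] != value:
            return None
        out[key] = value
    return out

# Hypothetical allomorph lexicon: (surface form, features).
# 'stress' is an illustrative contextual feature linking stems to endings.
STEMS = [
    ("cont",  {"lemma": "contar", "stress": "ending"}),
    ("cuent", {"lemma": "contar", "stress": "stem"}),
]
ENDINGS = [
    ("o",    {"stress": "stem",   "person": 1, "number": "sg"}),
    ("as",   {"stress": "stem",   "person": 2, "number": "sg"}),
    ("amos", {"stress": "ending", "person": 1, "number": "pl"}),
]

def word_forms():
    """Enumerate every stem+ending pair whose features unify."""
    for stem, stem_feats in STEMS:
        for ending, ending_feats in ENDINGS:
            feats = unify(stem_feats, ending_feats)
            if feats is not None:
                yield stem + ending, feats

def analyse(word):
    """Surface form -> list of feature structures (empty if ill-formed)."""
    return [feats for form, feats in word_forms() if form == word]

def generate(**wanted):
    """Feature values -> surface forms whose features include all of them."""
    return [form for form, feats in word_forms() if unify(feats, wanted) == feats]
```

Both directions share the single relation `word_forms`, which loosely mirrors the bidirectionality claimed for the declarative model: `generate(lemma="contar", person=1, number="sg")` yields `cuento`, while `analyse("conto")` returns an empty list because the stem and ending clash on the contextual feature.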
The ARIES toolbox: a continuing R+D effort
The effort behind the ARIES toolbox spans the last six years. The core of the toolbox is its lexical platform, including a large Spanish lexicon, lexical maintenance and access tools, and a morphological analyser/generator. On top of this platform a set of tools has been implemented, including tokenizers, a spell checker, a unification-based parser and grammar, and stochastic and neural morphosyntactic taggers. On the applications side, current work is oriented towards offering networked linguistic services for the publishing industry.
A proposal and a tagger for morphosyntactic encoding of Spanish reference corpora
This work presents a proposal for the morphosyntactic encoding of Spanish reference corpora, based on the standards of the Text Encoding Initiative (TEI), The Network of European Reference Corpora (NERC) and The Expert Advisory Group on Language Engineering Standards (EAGLES), as presented in (Martín de Santa Olalla, 1994). We also present the work on building a morphosyntactic tagger that uses the tag set contained in this proposal.
Unsupervised Language Acquisition
This thesis presents a computational theory of unsupervised language
acquisition, precisely defining procedures for learning language from ordinary
spoken or written utterances, with no explicit help from a teacher. The theory
is based heavily on concepts borrowed from machine learning and statistical
estimation. In particular, learning takes place by fitting a stochastic,
generative model of language to the evidence. Much of the thesis is devoted to
explaining conditions that must hold for this general learning strategy to
arrive at linguistically desirable grammars. The thesis introduces a variety of
technical innovations, among them a common representation for evidence and
grammars, and a learning strategy that separates the "content" of linguistic
parameters from their representation. Algorithms based on it suffer from few of
the search problems that have plagued other computational approaches to
language acquisition.
The theory has been tested on problems of learning vocabularies and grammars
from unsegmented text and continuous speech, and mappings between sound and
representations of meaning. It performs extremely well on various objective
criteria, acquiring knowledge that causes it to assign almost exactly the same
structure to utterances as humans do. This work has application to data
compression, language modeling, speech recognition, machine translation,
information retrieval, and other tasks that rely on either structural or
stochastic descriptions of language.
Comment: PhD thesis, 133 pages
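As a rough illustration of what fitting a stochastic, generative model to unsegmented text involves, the sketch below finds the most probable segmentation of a character string under a fixed unigram word model. It is not the thesis's algorithm: the toy lexicon `LEXICON`, its probabilities and the function `segment` are assumptions, and a real learner would also have to estimate the lexicon itself rather than take it as given.

```python
import math

def segment(text, word_probs):
    """Most probable split of `text` into lexicon words (Viterbi search)."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    max_len = max(map(len, word_probs))
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the lexicon cannot generate this text
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Toy unigram lexicon (hypothetical probabilities summing to 1).
LEXICON = {"the": 0.3, "dog": 0.2, "do": 0.1, "g": 0.05,
           "bark": 0.15, "barks": 0.15, "s": 0.05}
```

With this lexicon, `segment("thedogbarks", LEXICON)` prefers `the dog barks` over splits such as `the do g barks`, because fewer, more probable words give a higher joint probability.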