528 research outputs found
D4.1. Technologies and tools for corpus creation, normalization and annotation
The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of monolingual and bilingual language resources (LRs) required in the PANACEA context. Therefore, the CAA subsystem includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data and iii) a text processing component (TPC) which consists of NLP tools including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition
An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities
We describe an extension of Earley's parser for stochastic context-free
grammars that computes the following quantities given a stochastic context-free
grammar and an input string: a) probabilities of successive prefixes being
generated by the grammar; b) probabilities of substrings being generated by the
nonterminals, including the entire string being generated by the grammar; c)
most likely (Viterbi) parse of the string; d) posterior expected number of
applications of each grammar production, as required for reestimating rule
probabilities. (a) and (b) are computed incrementally in a single left-to-right
pass over the input. Our algorithm compares favorably to standard bottom-up
parsing methods for SCFGs in that it works efficiently on sparse grammars by
making use of Earley's top-down control structure. It can process any
context-free rule format without conversion to some normal form, and combines
computations for (a) through (d) in a single algorithm. Finally, the algorithm
has simple extensions for processing partially bracketed inputs, and for
finding partial parses and their likelihoods on ungrammatical inputs.Comment: 45 pages. Slightly shortened version to appear in Computational
Linguistics 2
Proceedings
Proceedings of the Ninth International Workshop
on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15891
D6.1: Technologies and Tools for Lexical Acquisition
This report describes the technologies and tools to be used for Lexical Acquisition in PANACEA. It includes descriptions of existing technologies and tools which can be built on and improved within PANACEA, as well as of new technologies and tools to be developed and integrated in PANACEA platform. The report also specifies the Lexical Resources to be produced. Four main areas of lexical acquisition are included: Subcategorization frames (SCFs), Selectional Preferences (SPs), Lexical-semantic Classes (LCs), for both nouns and verbs, and Multi-Word Expressions (MWEs)
On Pseudorelatives and Human Sentence Parsing
The debate over whether universal parsing mechanisms are necessary to
explain sentence comprehension is clearly a fundamental one for cognitive science.
This dissertation focuses on the relation between syntactic ambiguity and
principles of economy in the parsing of ambiguous Pseudo Relative (PR)/ Relative
Clause (RC) strings. While the principles of locality would predict local
attachment in (exclusive) RC contexts, PR-first Hypothesis (Grillo & Costa, 2014)
predicts high attachment (corresponding to a PR parse) in ambiguous PR/RC
contexts.
We test the offline and online effects of PR availability in Spanish using
a variety of research methods (eye-tracking while reading, sentence completion
task, forced-choice questionnaire, acceptability judgement), while also looking at
the interaction with other factors such as aspectual properties of the embedded
predicate.
The results reported here are robust across studies and show an influence of
PRs on the parsing of RCs: when PRs are not a confound, and relevant factors are
controlled (e.g. length of the clauses), locality principles apply to RC attachment;
when PRs are available, attachment preferences shift toward the non-local option.
These results support the universality of parsing principles and suggest that crosslinguistic
variation in RC attachment is epiphenomenal and largely attributable
to the asymmetric availability of PRs across languages. This dissertation also
provides a detailed description on PR-licensing contexts that might be useful for
future research on RC attachment preferences to avoid the PR confound.O debate sobre se os mecanismos de análise universal são necessários para
explicar a compreensão de frases é claramente fundamental para a Ciência Cognitiva.
Esta dissertação centra-se na relação entre ambiguidade sintática e princípios
de economia na análise de estruturaspseudorelativas (PR)/ orações relativas (OR)
ambíguas. Enquanto os princípios de localidade prediriam a ligação local em contextos
(exclusivos) das OR, a PR-first Hypothesis (Grillo & Costa, 2014) prevê uma
alta ligação (correspondente a uma análise da PR) em contextos PR/OR ambíguos.
Nesta tese testamos os efeitos offline e online da disponibilidade das PRs
em Espanhol, utilizando uma variedade de métodos de investigação (técnica de
registo dos comportamentos oculares (eye-tracking) durante a leitura, tarefa de
preenchimento de frases, questionários, julgamento da aceitabilidade), ao mesmo
tempo que também analisamos a interação com as propriedades aspetuais do
predicado encaixado.
Os resultados obtidos nesta dissertação mostram uma influência das PRs na
análise das ORs: quando as PRs estão disponíveis e os fatores relevantes são controlados
(por exemplo, o comprimento das orações), os princípios da localidade
aplicam-se à adjunção das ORs; quando as PRs estão disponíveis, as preferências
de adjunção mudam para a opção não-local. Estes resultados apoiam a universalidade
dos princípios de análise e sugerem que a variação linguística na adjunção
da OR é epifenomenal e amplamente atribuível à disponibilidade assimétrica das
PRs entre línguas. Esta dissertação também fornece uma descrição detalhada dos
contextos de licenciamento da PR, que podem ser úteis para evitar a ambiguidade
PR/OR em futuras pesquisas sobre as preferências da ligação da OR
General methods for fine-grained morphological and syntactic disambiguation
We present methods for improved handling of morphologically
rich languages (MRLS) where we define
MRLS as languages that
are morphologically more complex than English. Standard
algorithms for language modeling, tagging and parsing have
problems with the productive nature of such
languages. Consider for example the possible forms of a
typical English verb like work that generally has four
four different
forms: work, works, working
and worked. Its Spanish counterpart trabajar
has 6 different forms in present
tense: trabajo, trabajas, trabaja, trabajamos, trabajáis
and trabajan and more than 50 different forms when
including the different tenses, moods (indicative,
subjunctive and imperative) and participles. Such a high
number of forms leads to sparsity issues: In a recent
Wikipedia dump of more than 400 million tokens we find that
20 of these forms occur only twice or less and that 10 forms
do not occur at all. This means that even if we only need
unlabeled data to estimate a model and even when looking at
a relatively common and frequent verb, we do not have enough
data to make reasonable estimates for some of its
forms. However, if we decompose an unseen form such
as trabajaréis `you will work', we find that it
is trabajar in future tense and second person
plural. This allows us to make the predictions that are
needed to decide on the grammaticality (language modeling)
or syntax (tagging and parsing) of a sentence.
In the first part of this thesis, we develop
a morphological language model. A language model
estimates the grammaticality and coherence of a
sentence. Most language models used today are word-based
n-gram models, which means that they estimate the
transitional probability of a word following a history, the
sequence of the (n - 1) preceding words. The probabilities
are estimated from the frequencies of the history and the
history followed by the target word in a huge text
corpus. If either of the sequences is unseen, the length of
the history has to be reduced. This leads to a less accurate
estimate as less context is taken into account.
Our morphological language model estimates an additional
probability from the morphological classes of the
words. These classes are built automatically by extracting
morphological features from the word forms. To this end, we
use unsupervised segmentation algorithms to find the
suffixes of word forms. Such an algorithm might for example
segment trabajaréis into trabaja
and réis and we can then estimate the properties
of trabajaréis from other word forms with the same or
similar morphological properties. The data-driven nature of
the segmentation algorithms allows them to not only find
inflectional suffixes (such as -réis), but also more
derivational phenomena such as the head nouns of compounds
or even endings such as -tec, which identify
technology oriented companies such
as Vortec, Memotec and Portec and would
not be regarded as a morphological suffix by traditional
linguistics. Additionally, we extract shape features such as
if a form contains digits or capital characters. This is
important because many rare or unseen forms are proper
names or numbers and often do not have meaningful
suffixes. Our class-based morphological model is then
interpolated with a word-based model to combine the
generalization capabilities of the first and the high
accuracy in case of sufficient data of the second.
We evaluate our model across 21 European languages and find
improvements between 3% and 11% in perplexity, a standard
language modeling evaluation measure. Improvements are
highest for languages with more productive and complex
morphology such as Finnish and Estonian, but also visible
for languages with a relatively simple morphology such as
English and Dutch. We conclude that a morphological
component yields consistent improvements for all the tested
languages and argue that it should be part of every language
model.
Dependency trees represent the syntactic structure of a
sentence by attaching each word to its syntactic head, the
word it is directly modifying. Dependency parsing
is usually tackled using heavily lexicalized (word-based)
models and a thorough morphological preprocessing is
important for optimal performance, especially for MRLS. We
investigate if the lack of morphological features can be
compensated by features induced using hidden Markov
models with latent annotations (HMM-LAs)
and find this to be the case for German. HMM-LAs were
proposed as a method to increase part-of-speech tagging
accuracy. The model splits the observed part-of-speech tags
(such as verb and noun) into subtags. An expectation
maximization algorithm is then used to fit the subtags to
different roles. A verb tag for example might be split into
an auxiliary verb and a full verb subtag. Such a split is
usually beneficial because these two verb classes have
different contexts. That is, a full verb might follow an
auxiliary verb, but usually not another full verb.
For German and English, we find that our model leads to
consistent improvements over a parser
not using subtag features. Looking at the labeled attachment
score (LAS), the number of words correctly attached to their head,
we observe an improvement from 90.34 to 90.75 for English
and from 87.92 to 88.24 for German. For German, we
additionally find that our model achieves almost the same
performance (88.24) as a model using tags annotated by a
supervised morphological tagger (LAS of 88.35). We also find
that the German latent tags correlate with
morphology. Articles for example are split by their
grammatical case.
We also investigate the part-of-speech tagging accuracies of
models using the traditional treebank tagset and models
using induced tagsets of the same size and find that the
latter outperform the former, but are in turn outperformed
by a discriminative tagger.
Furthermore, we present a method for fast and
accurate morphological tagging. While
part-of-speech tagging annotates tokens in context with
their respective word categories, morphological tagging
produces a complete annotation containing all the relevant
inflectional features such as case, gender and tense. A
complete reading is represented as a single tag. As a
reading might consist of several morphological features the
resulting tagset usually contains hundreds or even thousands
of tags. This is an issue for many decoding algorithms such
as Viterbi which have runtimes depending quadratically on
the number of tags. In the case of morphological tagging,
the problem can be avoided by using a morphological
analyzer. A morphological analyzer is a manually created
finite-state transducer that produces the possible
morphological readings of a word form. This analyzer can be
used to prune the tagging lattice and to allow for the
application of standard sequence labeling algorithms. The
downside of this approach is that such an analyzer is not
available for every language or might not have the coverage
required for the task. Additionally, the output tags of some
analyzers are not compatible with the annotations of the
treebanks, which might require some manual mapping of the
different annotations or even to reduce the complexity of
the annotation.
To avoid this problem we propose to use the posterior
probabilities of a conditional random field (CRF)
lattice to prune the space of possible
taggings. At the zero-order level the posterior
probabilities of a token can be calculated independently
from the other tokens of a sentence. The necessary
computations can thus be performed in linear time. The
features available to the model at this time are similar to
the features used by a morphological analyzer (essentially
the word form and features based on it), but also include
the immediate lexical context. As the ambiguity of word
types varies substantially, we just fix the average number of
readings after pruning by dynamically estimating a
probability threshold. Once we obtain the pruned lattice, we
can add tag transitions and convert it into a first-order
lattice. The quadratic forward-backward computations are now
executed on the remaining plausible readings and thus
efficient. We can now continue pruning and extending the
lattice order at a relatively low additional runtime cost
(depending on the pruning thresholds). The training of the
model can be implemented efficiently by applying stochastic
gradient descent (SGD). The CRF gradient can be calculated
from a lattice of any order as long as the correct reading
is still in the lattice. During training, we thus run the
lattice pruning until we either reach the maximal order or
until the correct reading is pruned. If the reading is
pruned we perform the gradient update with the highest order
lattice still containing the reading. This approach is
similar to early updating in the structured perceptron
literature and forces the model to learn how to keep the
correct readings in the lower order lattices. In practice,
we observe a high number of lower updates during the first
training epoch and almost exclusively higher order updates
during later epochs.
We evaluate our CRF tagger on six languages with different
morphological properties. We find that for languages with a
high word form ambiguity such as German, the pruning results
in a moderate drop in tagging accuracy while for languages
with less ambiguity such as Spanish and Hungarian the loss
due to pruning is negligible. However, our pruning strategy
allows us to train higher order models (order > 1), which give
substantial improvements for all languages and also
outperform unpruned first-order models. That is, the model
might lose some of the correct readings during pruning, but
is also able to solve more of the harder cases that require
more context. We also find our model to substantially and
significantly outperform a number of frequently used taggers
such as Morfette and SVMTool.
Based on our morphological tagger we develop a simple method
to increase the performance of a state-of-the-art
constituency parser. A constituency tree
describes the syntactic properties of a sentence by
assigning spans of text to a hierarchical bracket
structure. developed a
language-independent approach for the automatic annotation
of accurate and compact grammars. Their implementation --
known as the Berkeley parser -- gives state-of-the-art results
for many languages such as English and German. For some MRLS
such as Basque and Korean, however, the parser gives
unsatisfactory results because of its simple unknown word
model. This model maps unknown words to a small number of
signatures (similar to our morphological classes). These
signatures do not seem expressive enough for many of the
subtle distinctions made during parsing. We propose to
replace rare words by the morphological reading generated by
our tagger instead. The motivation is twofold. First, our
tagger has access to a number of lexical and sublexical
features not available during parsing. Second, we expect
the morphological readings to contain most of the
information required to make the correct parsing decision
even though we know that things such as the correct
attachment of prepositional phrases might require some
notion of lexical semantics.
In experiments on the SPMRL 2013 dataset
of nine MRLS we find our method to give improvements for all
languages except French for which we observe a minor drop in
the Parseval score of 0.06. For Hebrew, Hungarian and
Basque we find substantial absolute improvements of 5.65,
11.87 and 15.16, respectively.
We also performed an extensive evaluation on the utility of
word representations for morphological tagging. Our goal was
to reduce the drop in performance that is caused when a
model trained on a specific domain is applied to some other
domain. This problem is usually addressed by domain adaption
(DA). DA adapts a model towards a specific domain using a
small amount of labeled or a huge amount of unlabeled data
from that domain. However, this procedure requires us to
train a model for every target domain. Instead we are trying
to build a robust system that is trained on domain-specific
labeled and domain-independent or general unlabeled data. We
believe word representations to be key in the development of
such models because they allow us to leverage unlabeled
data efficiently. We compare data-driven representations to
manually created morphological analyzers. We understand
data-driven representations as models that cluster word
forms or map them to a vectorial representation. Examples
heavily used in the literature include Brown clusters,
Singular Value Decompositions of count
vectors and neural-network-based
embeddings. We create a test suite of
six languages consisting of in-domain and out-of-domain test
sets. To this end we converted annotations for Spanish and
Czech and annotated the German part of the Smultron
treebank with a morphological layer. In
our experiments on these data sets we find Brown clusters to
outperform the other data-driven representations. Regarding
the comparison with morphological analyzers, we find Brown
clusters to give slightly better performance in
part-of-speech tagging, but to be substantially outperformed
in morphological tagging
Memory limitations are hidden in grammar
[Abstract] The ability to produce and understand an unlimited number of different sentences is a hallmark of human language. Linguists have sought to define the essence of this generative capacity using formal grammars that describe the syntactic dependencies between constituents, independent of the computational limitations of the human brain. Here, we evaluate this independence assumption by sampling sentences uniformly from the space of possible syntactic structures. We find that the average dependency distance between syntactically related words, a proxy for memory limitations, is less than expected by chance in a collection of state-of-the-art classes of dependency grammars. Our findings indicate that memory limitations have permeated grammatical descriptions, suggesting that it may be impossible to build a parsimonious theory of human linguistic productivity independent
of non-linguistic cognitive constraints
Low Resources Machine Translation
METIS-II was a EU-FET MT project running from October 2004 to September
2007, which aimed at translating free text input without resorting to parallel corpora.
The idea was to use ‘basic’ linguistic tools and representations and to link them with
patterns and statistics from the monolingual target-language corpus. The METIS-II
project has four partners, translating from their ‘home’ languages Greek, Dutch,
German, and Spanish into English.
The paper outlines the basic ideas of the project, their implementation, the
resources used, and the results obtained. It also gives examples of how METIS-II
has continued beyond its lifetime and the original scope of the project. On the basis
of the results and experiences obtained, we believe that the approach is promising
and offers the potential for development in various directions
An Unsolicited Soliloquy on Dependency Parsing
Programa Oficial de Doutoramento en Computación . 5009V01[Abstract]
This thesis presents work on dependency parsing covering two distinct lines of research. The
first aims to develop efficient parsers so that they can be fast enough to parse large amounts
of data while still maintaining decent accuracy. We investigate two techniques to achieve
this. The first is a cognitively-inspired method and the second uses a model distillation
method. The first technique proved to be utterly dismal, while the second was somewhat of
a success.
The second line of research presented in this thesis evaluates parsers. This is also done in
two ways. We aim to evaluate what causes variation in parsing performance for different
algorithms and also different treebanks. This evaluation is grounded in dependency displacements
(the directed distance between a dependent and its head) and the subsequent
distributions associated with algorithms and the distributions found in treebanks. This work
sheds some light on the variation in performance for both different algorithms and different
treebanks. And the second part of this area focuses on the utility of part-of-speech tags
when used with parsing systems and questions the standard position of assuming that they
might help but they certainly won’t hurt.[Resumen]
Esta tesis presenta trabajo sobre análisis de dependencias que cubre dos líneas de investigación distintas. La primera tiene como objetivo desarrollar analizadores eficientes, de
modo que sean suficientemente rápidos como para analizar grandes volúmenes de datos y,
al mismo tiempo, sean suficientemente precisos. Investigamos dos métodos. El primero se
basa en teorías cognitivas y el segundo usa una técnica de destilación. La primera técnica
resultó un enorme fracaso, mientras que la segunda fue en cierto modo un ´éxito.
La otra línea evalúa los analizadores sintácticos. Esto también se hace de dos maneras. Evaluamos
la causa de la variación en el rendimiento de los analizadores para distintos algoritmos
y corpus. Esta evaluación utiliza la diferencia entre las distribuciones del desplazamiento
de arista (la distancia dirigida de las aristas) correspondientes a cada algoritmo y corpus.
También evalúa la diferencia entre las distribuciones del desplazamiento de arista en los
datos de entrenamiento y prueba. Este trabajo esclarece las variaciones en el rendimiento
para algoritmos y corpus diferentes. La segunda parte de esta línea investiga la utilidad de
las etiquetas gramaticales para los analizadores sintácticos.[Resumo]
Esta tese presenta traballo sobre análise sintáctica, cubrindo dúas liñas de investigación. A
primeira aspira a desenvolver analizadores eficientes, de maneira que sexan suficientemente
rápidos para procesar grandes volumes de datos e á vez sexan precisos. Investigamos dous
métodos. O primeiro baséase nunha teoría cognitiva, e o segundo usa unha técnica de
destilación. O primeiro método foi un enorme fracaso, mentres que o segundo foi en certo
modo un éxito.
A outra liña avalúa os analizadores sintácticos. Esto tamén se fai de dúas maneiras. Avaliamos
a causa da variación no rendemento dos analizadores para distintos algoritmos e corpus. Esta
avaliaci´on usa a diferencia entre as distribucións do desprazamento de arista (a distancia
dirixida das aristas) correspondentes aos algoritmos e aos corpus. Tamén avalía a diferencia
entre as distribucións do desprazamento de arista nos datos de adestramento e proba.
Este traballo esclarece as variacións no rendemento para algoritmos e corpus diferentes. A
segunda parte desta liña investiga a utilidade das etiquetas gramaticais para os analizadores
sintácticos.This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150) and from the Centro de Investigación de Galicia (CITIC) which is funded by the Xunta de Galicia and the European Union (ERDF - Galicia 2014-2020 Program) by grant ED431G 2019/01.Xunta de Galicia; ED431G 2019/0
- …