149 research outputs found
A Universal Part-of-Speech Tagset
To facilitate future research in unsupervised induction of syntactic
structure and to standardize best-practices, we propose a tagset that consists
of twelve universal part-of-speech categories. In addition to the tagset, we
develop a mapping from 25 different treebank tagsets to this universal set. As
a result, when combined with the original treebank data, this universal tagset
and mapping produce a dataset consisting of common parts-of-speech for 22
different languages. We highlight the use of this resource via two experiments,
including one that reports competitive accuracies for unsupervised grammar
induction without gold standard part-of-speech tags
Analysis of the Penn Korean Universal Dependency Treebank (PKT-UD): Manual Revision to Build Robust Parsing Model in Korean
In this paper, we first open on important issues regarding the Penn Korean
Universal Treebank (PKT-UD) and address these issues by revising the entire
corpus manually with the aim of producing cleaner UD annotations that are more
faithful to Korean grammar. For compatibility to the rest of UD corpora, we
follow the UDv2 guidelines, and extensively revise the part-of-speech tags and
the dependency relations to reflect morphological features and flexible
word-order aspects in Korean. The original and the revised versions of PKT-UD
are experimented with transformer-based parsing models using biaffine
attention. The parsing model trained on the revised corpus shows a significant
improvement of 3.0% in labeled attachment score over the model trained on the
previous corpus. Our error analysis demonstrates that this revision allows the
parsing model to learn relations more robustly, reducing several critical
errors that used to be made by the previous model.Comment: Accepted by The 16th International Conference on Parsing
Technologies, IWPT 202
Applying Occam's Razor to Transformer-Based Dependency Parsing: What Works, What Doesn't, and What is Really Necessary
The introduction of pre-trained transformer-based contextualized word
embeddings has led to considerable improvements in the accuracy of graph-based
parsers for frameworks such as Universal Dependencies (UD). However, previous
works differ in various dimensions, including their choice of pre-trained
language models and whether they use LSTM layers. With the aims of
disentangling the effects of these choices and identifying a simple yet widely
applicable architecture, we introduce STEPS, a new modular graph-based
dependency parser. Using STEPS, we perform a series of analyses on the UD
corpora of a diverse set of languages. We find that the choice of pre-trained
embeddings has by far the greatest impact on parser performance and identify
XLM-R as a robust choice across the languages in our study. Adding LSTM layers
provides no benefits when using transformer-based embeddings. A multi-task
training setup outputting additional UD features may contort results. Taking
these insights together, we propose a simple but widely applicable parser
architecture and configuration, achieving new state-of-the-art results (in
terms of LAS) for 10 out of 12 diverse languages.Comment: 14 pages, 1 figure; camera-ready version for IWPT 202
Cross-lingual transfer parsing for low-resourced languages: an Irish case study
We present a study of cross-lingual direct transfer parsing for the Irish language. Firstly we
discuss mapping of the annotation scheme of the Irish Dependency Treebank to a universal dependency scheme. We explain our dependency label mapping choices and the structural changes
required in the Irish Dependency Treebank. We then experiment with the universally annotated
treebanks of ten languages from four language family groups to assess which languages are the
most useful for cross-lingual parsing of Irish by using these treebanks to train delexicalised parsing models which are then applied to sentences from the Irish Dependency Treebank. The best
results are achieved when using Indonesian, a language from the Austronesian language family
Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages
International audienceThis paper reports on the first shared task on statistical parsing of morphologically rich lan- guages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the preparation of the data sets, on the proposed parsing scenarios, and on the eval- uation metrics for parsing MRLs given dif- ferent representation types. We present and analyze parsing results obtained by the task participants, and then provide an analysis and comparison of the parsers across languages and frameworks, reported for gold input as well as more realistic parsing scenarios
The optimality of syntactic dependency distances
It is often stated that human languages, as other biological systems, are
shaped by cost-cutting pressures but, to what extent? Attempts to quantify the
degree of optimality of languages by means of an optimality score have been
scarce and focused mostly on English. Here we recast the problem of the
optimality of the word order of a sentence as an optimization problem on a
spatial network where the vertices are words, arcs indicate syntactic
dependencies and the space is defined by the linear order of the words in the
sentence. We introduce a new score to quantify the cognitive pressure to reduce
the distance between linked words in a sentence. The analysis of sentences from
93 languages representing 19 linguistic families reveals that half of languages
are optimized to a 70% or more. The score indicates that distances are not
significantly reduced in a few languages and confirms two theoretical
predictions, i.e. that longer sentences are more optimized and that distances
are more likely to be longer than expected by chance in short sentences. We
present a new hierarchical ranking of languages by their degree of
optimization. The statistical advantages of the new score call for a
reevaluation of the evolution of dependency distance over time in languages as
well as the relationship between dependency distance and linguistic competence.
Finally, the principles behind the design of the score can be extended to
develop more powerful normalizations of topological distances or physical
distances in more dimensions
An Unsolicited Soliloquy on Dependency Parsing
Programa Oficial de Doutoramento en Computación . 5009V01[Abstract]
This thesis presents work on dependency parsing covering two distinct lines of research. The
first aims to develop efficient parsers so that they can be fast enough to parse large amounts
of data while still maintaining decent accuracy. We investigate two techniques to achieve
this. The first is a cognitively-inspired method and the second uses a model distillation
method. The first technique proved to be utterly dismal, while the second was somewhat of
a success.
The second line of research presented in this thesis evaluates parsers. This is also done in
two ways. We aim to evaluate what causes variation in parsing performance for different
algorithms and also different treebanks. This evaluation is grounded in dependency displacements
(the directed distance between a dependent and its head) and the subsequent
distributions associated with algorithms and the distributions found in treebanks. This work
sheds some light on the variation in performance for both different algorithms and different
treebanks. And the second part of this area focuses on the utility of part-of-speech tags
when used with parsing systems and questions the standard position of assuming that they
might help but they certainly won’t hurt.[Resumen]
Esta tesis presenta trabajo sobre análisis de dependencias que cubre dos líneas de investigación distintas. La primera tiene como objetivo desarrollar analizadores eficientes, de
modo que sean suficientemente rápidos como para analizar grandes volúmenes de datos y,
al mismo tiempo, sean suficientemente precisos. Investigamos dos métodos. El primero se
basa en teorías cognitivas y el segundo usa una técnica de destilación. La primera técnica
resultó un enorme fracaso, mientras que la segunda fue en cierto modo un ´éxito.
La otra línea evalúa los analizadores sintácticos. Esto también se hace de dos maneras. Evaluamos
la causa de la variación en el rendimiento de los analizadores para distintos algoritmos
y corpus. Esta evaluación utiliza la diferencia entre las distribuciones del desplazamiento
de arista (la distancia dirigida de las aristas) correspondientes a cada algoritmo y corpus.
También evalúa la diferencia entre las distribuciones del desplazamiento de arista en los
datos de entrenamiento y prueba. Este trabajo esclarece las variaciones en el rendimiento
para algoritmos y corpus diferentes. La segunda parte de esta línea investiga la utilidad de
las etiquetas gramaticales para los analizadores sintácticos.[Resumo]
Esta tese presenta traballo sobre análise sintáctica, cubrindo dúas liñas de investigación. A
primeira aspira a desenvolver analizadores eficientes, de maneira que sexan suficientemente
rápidos para procesar grandes volumes de datos e á vez sexan precisos. Investigamos dous
métodos. O primeiro baséase nunha teoría cognitiva, e o segundo usa unha técnica de
destilación. O primeiro método foi un enorme fracaso, mentres que o segundo foi en certo
modo un éxito.
A outra liña avalúa os analizadores sintácticos. Esto tamén se fai de dúas maneiras. Avaliamos
a causa da variación no rendemento dos analizadores para distintos algoritmos e corpus. Esta
avaliaci´on usa a diferencia entre as distribucións do desprazamento de arista (a distancia
dirixida das aristas) correspondentes aos algoritmos e aos corpus. Tamén avalía a diferencia
entre as distribucións do desprazamento de arista nos datos de adestramento e proba.
Este traballo esclarece as variacións no rendemento para algoritmos e corpus diferentes. A
segunda parte desta liña investiga a utilidade das etiquetas gramaticais para os analizadores
sintácticos.This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150) and from the Centro de Investigación de Galicia (CITIC) which is funded by the Xunta de Galicia and the European Union (ERDF - Galicia 2014-2020 Program) by grant ED431G 2019/01.Xunta de Galicia; ED431G 2019/0
- …