63 research outputs found
An Unsolicited Soliloquy on Dependency Parsing
Programa Oficial de Doutoramento en Computación . 5009V01[Abstract]
This thesis presents work on dependency parsing covering two distinct lines of research. The
first aims to develop efficient parsers so that they can be fast enough to parse large amounts
of data while still maintaining decent accuracy. We investigate two techniques to achieve
this. The first is a cognitively-inspired method and the second uses a model distillation
method. The first technique proved to be utterly dismal, while the second was somewhat of
a success.
The second line of research presented in this thesis evaluates parsers. This is also done in
two ways. We aim to evaluate what causes variation in parsing performance for different
algorithms and also different treebanks. This evaluation is grounded in dependency displacements
(the directed distance between a dependent and its head) and the subsequent
distributions associated with algorithms and the distributions found in treebanks. This work
sheds some light on the variation in performance for both different algorithms and different
treebanks. And the second part of this area focuses on the utility of part-of-speech tags
when used with parsing systems and questions the standard position of assuming that they
might help but they certainly won’t hurt.[Resumen]
Esta tesis presenta trabajo sobre análisis de dependencias que cubre dos lÃneas de investigación distintas. La primera tiene como objetivo desarrollar analizadores eficientes, de
modo que sean suficientemente rápidos como para analizar grandes volúmenes de datos y,
al mismo tiempo, sean suficientemente precisos. Investigamos dos métodos. El primero se
basa en teorÃas cognitivas y el segundo usa una técnica de destilación. La primera técnica
resultó un enorme fracaso, mientras que la segunda fue en cierto modo un ´éxito.
La otra lÃnea evalúa los analizadores sintácticos. Esto también se hace de dos maneras. Evaluamos
la causa de la variación en el rendimiento de los analizadores para distintos algoritmos
y corpus. Esta evaluación utiliza la diferencia entre las distribuciones del desplazamiento
de arista (la distancia dirigida de las aristas) correspondientes a cada algoritmo y corpus.
También evalúa la diferencia entre las distribuciones del desplazamiento de arista en los
datos de entrenamiento y prueba. Este trabajo esclarece las variaciones en el rendimiento
para algoritmos y corpus diferentes. La segunda parte de esta lÃnea investiga la utilidad de
las etiquetas gramaticales para los analizadores sintácticos.[Resumo]
Esta tese presenta traballo sobre análise sintáctica, cubrindo dúas liñas de investigación. A
primeira aspira a desenvolver analizadores eficientes, de maneira que sexan suficientemente
rápidos para procesar grandes volumes de datos e á vez sexan precisos. Investigamos dous
métodos. O primeiro baséase nunha teorÃa cognitiva, e o segundo usa unha técnica de
destilación. O primeiro método foi un enorme fracaso, mentres que o segundo foi en certo
modo un éxito.
A outra liña avalúa os analizadores sintácticos. Esto tamén se fai de dúas maneiras. Avaliamos
a causa da variación no rendemento dos analizadores para distintos algoritmos e corpus. Esta
avaliaci´on usa a diferencia entre as distribucións do desprazamento de arista (a distancia
dirixida das aristas) correspondentes aos algoritmos e aos corpus. Tamén avalÃa a diferencia
entre as distribucións do desprazamento de arista nos datos de adestramento e proba.
Este traballo esclarece as variacións no rendemento para algoritmos e corpus diferentes. A
segunda parte desta liña investiga a utilidade das etiquetas gramaticais para os analizadores
sintácticos.This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150) and from the Centro de Investigación de Galicia (CITIC) which is funded by the Xunta de Galicia and the European Union (ERDF - Galicia 2014-2020 Program) by grant ED431G 2019/01.Xunta de Galicia; ED431G 2019/0
Unsupervised grammar induction with Combinatory Categorial Grammars
Language is a highly structured medium for communication. An idea starts in the speaker's mind (semantics) and is transformed into a well formed, intelligible, sentence via the specific syntactic rules of a language. We aim to discover the fingerprints of this process in the choice and location of words used in the final utterance. What is unclear is how much of this latent process can be discovered from the linguistic signal alone and how much requires shared non-linguistic context, knowledge, or cues.
Unsupervised grammar induction is the task of analyzing strings in a language to discover the latent syntactic structure of the language without access to labeled training data. Successes in unsupervised grammar induction shed light on the amount of syntactic structure that is discoverable from raw or part-of-speech tagged text. In this thesis, we present a state-of-the-art grammar induction system based on Combinatory Categorial Grammars. Our choice of syntactic formalism enables the first labeled evaluation of an unsupervised system. This allows us to perform an in-depth analysis of the system’s linguistic strengths and weaknesses. In order to completely eliminate reliance on any supervised systems, we also examine how performance is affected when we use induced word clusters instead of gold-standard POS tags. Finally, we perform a semantic evaluation of induced grammars, providing unique insights into future directions for unsupervised grammar induction systems
- …