Analyzing, enhancing, optimizing and applying dependency analysis
Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Informática, Departamento de Ingeniería del Software e Inteligencia Artificial, defended on 19/12/2012.
Statistical dependency parsers have improved greatly in recent years. This has been made possible by machine-learning-based systems that achieve high accuracy. These systems allow parsers to be generated for any language with an adequate corpus, without demanding great effort from the end user. MaltParser is one such system. In this thesis we have used state-of-the-art systems to present a series of contributions closely related to natural language processing (NLP) and dependency parsing: (i) A study of the dependency parsing problem, demonstrating the homogeneity of accuracy and offering findings on sentence length, training corpus size and how we evaluate parsers. (ii) We have also studied ways of improving accuracy by modifying the parsing flow in two different ways: parsing some segments of the sentences separately, and modifying the internal behavior of the parsing algorithms. (iii) We have investigated automatic feature selection for machine learning in transition-based dependency parsers, which we consider an important problem that needs to be solved given the state of the art, since it can also help solve tasks related to dependency parsing more effectively. (iv) Finally, we have applied dependency parsing to some currently important NLP problems, such as text simplification and inferring the scope of negation cues.
Finally, the knowledge acquired in carrying out this thesis can be used to implement more robust dependency-parsing-based applications in NLP or in other related areas, as demonstrated throughout the thesis.
[ABSTRACT] Statistical dependency parsing accuracy has improved substantially in recent years. One of the main reasons is the adoption of data-driven (machine learning) methods. Machine learning allows parsers to be developed for any language that has an adequate training corpus, without requiring great effort. MaltParser is one such system. In the present thesis we have used state-of-the-art systems (mainly MaltParser) to present contributions in four areas closely related to natural language processing (NLP) and dependency parsing: (i) We studied the parsing problem, demonstrating the homogeneity of the performance and presenting findings about sentence length, corpus size and how parsers are normally evaluated. (ii) We tried ways of improving parsing accuracy by modifying the flow of analysis, parsing some segments of the sentences separately and then constructing a parsing combination problem. We also studied modifying the internal behavior of the parsers, focusing on the root of dependency structures, which is an important part of what a dependency parser parses. (iii) We researched automatic feature selection and parsing optimization for transition-based parsers, which we consider an important problem that needs to be addressed in order to solve parsing problems more successfully. (iv) We applied syntactic dependency structures and dependency parsing to NLP problems such as text simplification and inferring the scope of negation cues. Furthermore, the knowledge acquired in developing this thesis could be used to implement more robust dependency-parsing-based applications in different NLP (or related) areas, as we demonstrate in the present thesis.
Depto. de Ingeniería de Software e Inteligencia Artificial (ISIA), Fac. de Informática
Leveraging Text-to-Scene Generation for Language Elicitation and Documentation
Text-to-scene generation systems take input in the form of a natural language text and output a 3D scene illustrating the meaning of that text. A major benefit of text-to-scene generation is that it allows users to create custom 3D scenes without requiring them to have a background in 3D graphics or knowledge of specialized software packages. This contributes to making text-to-scene useful in scenarios from creative applications to education. The primary goal of this thesis is to explore how we can use text-to-scene generation in a new way: as a tool to facilitate the elicitation and formal documentation of language. In particular, we use text-to-scene generation (a) to assist field linguists studying endangered languages; (b) to provide a cross-linguistic framework for formally modeling spatial language; and (c) to collect language data using crowdsourcing. As a side effect of these goals, we also explore the problem of multilingual text-to-scene generation, that is, systems for generating 3D scenes from languages other than English.
The contributions of this thesis are the following. First, we develop a novel tool suite (the WordsEye Linguistics Tools, or WELT) that uses the WordsEye text-to-scene system to assist field linguists with eliciting and documenting endangered languages. WELT allows linguists to create custom elicitation materials and to document semantics in a formal way. We test WELT with two endangered languages, Nahuatl and Arrernte. Second, we explore the question of how to learn a syntactic parser for WELT. We show that an incremental learning method using a small number of annotated dependency structures can produce reasonably accurate results. We demonstrate that using a parser trained in this way can significantly decrease the time it takes an annotator to label a new sentence with dependency information. Third, we develop a framework that generates 3D scenes from spatial and graphical semantic primitives. We incorporate this system into the WELT tools for creating custom elicitation materials, allowing users to directly manipulate the underlying semantics of a generated scene. Fourth, we introduce a deep semantic representation of spatial relations and use this to create a new resource, SpatialNet, which formally declares the lexical semantics of spatial relations for a language. We demonstrate how SpatialNet can be used to support multilingual text-to-scene generation. Finally, we show how WordsEye and the semantic resources it provides can be used to facilitate elicitation of language using crowdsourcing.
Proceedings
Proceedings of the Ninth International Workshop
on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15891
Statistical Parsing by Machine Learning from a Classical Arabic Treebank
Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic.
Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state of the art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations.
A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic.
The Quran was chosen for annotation because a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision. A practical result of the annotation effort is the corpus website, http://corpus.quran.com, an educational resource with over two million users per year.
An Unsolicited Soliloquy on Dependency Parsing
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
This thesis presents work on dependency parsing covering two distinct lines of research. The
first aims to develop efficient parsers so that they can be fast enough to parse large amounts
of data while still maintaining decent accuracy. We investigate two techniques to achieve
this. The first is a cognitively-inspired method and the second uses a model distillation
method. The first technique proved to be utterly dismal, while the second was somewhat of
a success.
The second line of research presented in this thesis evaluates parsers. This is also done in
two ways. We aim to evaluate what causes variation in parsing performance for different
algorithms and also different treebanks. This evaluation is grounded in dependency displacements
(the directed distance between a dependent and its head) and the subsequent
distributions associated with algorithms and the distributions found in treebanks. This work
sheds some light on the variation in performance for both different algorithms and different
treebanks. The second part of this line of research focuses on the utility of part-of-speech tags
when used with parsing systems and questions the standard assumption that they
might help but certainly won’t hurt.
[Resumen]
This thesis presents work on dependency parsing covering two distinct lines of research. The first aims to develop efficient parsers that are fast enough to process large volumes of data while remaining sufficiently accurate. We investigate two methods. The first is based on cognitive theories and the second uses a distillation technique. The first technique was a resounding failure, while the second was something of a success.
The other line evaluates parsers. This is also done in two ways. We evaluate what causes the variation in parser performance across different algorithms and corpora. This evaluation uses the difference between the distributions of edge displacement (the directed distance of the edges) corresponding to each algorithm and corpus. It also evaluates the difference between the edge displacement distributions in the training and test data. This work sheds light on the variations in performance across different algorithms and corpora. The second part of this line investigates the usefulness of part-of-speech tags for parsers.
[Resumo]
This thesis presents work on syntactic parsing, covering two lines of research. The first aims to develop efficient parsers that are fast enough to process large volumes of data while also being accurate. We investigate two methods. The first is based on a cognitive theory, and the second uses a distillation technique. The first method was a resounding failure, while the second was something of a success.
The other line evaluates parsers. This is also done in two ways. We evaluate what causes the variation in parser performance across different algorithms and corpora. This evaluation uses the difference between the distributions of edge displacement (the directed distance of the edges) corresponding to the algorithms and the corpora. It also evaluates the difference between the edge displacement distributions in the training and test data. This work sheds light on the variations in performance across different algorithms and corpora. The second part of this line investigates the usefulness of part-of-speech tags for parsers.
This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150) and from the Centro de Investigación de Galicia (CITIC), which is funded by the Xunta de Galicia and the European Union (ERDF - Galicia 2014-2020 Program) by grant ED431G 2019/01.
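The abstract above defines a dependency displacement as the directed distance between a dependent and its head. A minimal sketch of how such a distribution might be computed for one sentence follows. The sign convention (head position minus dependent position), the 1-based head list with 0 for the root, and the function names are assumptions made for illustration, not details taken from the thesis.

```python
from collections import Counter


def displacements(heads):
    """Return the directed head-dependent distances for one sentence.

    heads[i] is the 1-based position of the head of token i+1;
    0 marks the root, which has no displacement (assumed encoding).
    """
    return [head - dep for dep, head in enumerate(heads, start=1) if head != 0]


def displacement_distribution(heads):
    """Relative frequency of each displacement value in one sentence."""
    values = displacements(heads)
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


# "She reads books": "reads" is the root, the other two tokens attach to it.
example_heads = [2, 0, 2]
print(displacements(example_heads))
print(displacement_distribution(example_heads))
```

Aggregating these counts over a whole treebank, or over a parser's predicted trees, would give the kind of distributions the abstract compares.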
Joint parsing of syntactic and semantic dependencies
Syntactic Dependency Parsing and Semantic Role Labeling (SRL) are two main problems in Natural Language Understanding. Both tasks are closely related and can be regarded as parsing on top of a given sequence. In the data-driven approach context, these tasks are typically addressed sequentially by a pipeline of classifiers. A syntactic parser is run in the first stage, and then given the predicates, the semantic roles are identified and classified (Gildea and Jurafsky, 2002).
An appealing and largely unexplored idea is to jointly process syntactic dependencies and semantic roles. A joint process could capture some interactions that pipeline systems are unable to model. We expect joint models to improve syntax based on semantic cues, and the reverse. Despite this potential advantage and the interest in joint processing stimulated by the CoNLL-2008 and 2009 Shared Tasks (Surdeanu et al., 2008; Hajic et al., 2009), very few joint models have been proposed to date, few have attracted attention and fewer still have obtained competitive results.
This thesis presents three contributions on this topic. The first contribution is to frame semantic role labeling as a linear assignment task. Under this framework we avoid assigning repeated roles to the arguments of a predicate. Our proposal follows previous work on enforcing constraints on the SRL analysis (Punyakanok et al., 2004; Surdeanu et al., 2007). But in our case, we enforce only a relevant subset of these constraints. We solve this problem with the efficient O(n^3) Hungarian algorithm. Our next contributions will rely on this assignment framework.
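To make the assignment framing concrete, here is a small sketch in which each argument of a predicate receives a distinct role so that the total score is maximal. It uses a standard O(n^3) Hungarian (Kuhn-Munkres) routine written in plain Python. The scores, the role inventory and the function names are hypothetical, and the thesis may set up its cost matrix differently.

```python
def hungarian(cost):
    """Minimum-cost assignment of every row to a distinct column.

    cost is an n x m matrix with n <= m. Returns a list whose i-th entry
    is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (1-based)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:        # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def assign_roles(arguments, roles, score):
    """Give each argument a distinct role, maximizing the summed score."""
    cost = [[-score[a][r] for r in roles] for a in arguments]
    return {a: roles[j] for a, j in zip(arguments, hungarian(cost))}


# Hypothetical classifier scores: both arguments prefer A0 on their own.
roles = ["A0", "A1", "AM-TMP"]
score = {
    "she": {"A0": 0.9, "A1": 0.3, "AM-TMP": 0.1},
    "the book": {"A0": 0.8, "A1": 0.7, "AM-TMP": 0.1},
}
print(assign_roles(["she", "the book"], roles, score))
```

Labeling each argument independently would pick A0 twice in this toy example; solving the assignment problem rules out the repeated role.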
The second contribution of this thesis is a joint model that combines syntactic parsing and SRL (Lluís et al., 2013). We solve it using dual-decomposition techniques. A strong point of our model is that it generates a joint solution relying on largely unmodified syntactic and SRL parsers. We train each component independently and the dual-decomposition method finds the optimal joint solution at decoding time. Our model has some optimality and efficiency guarantees. We show experiments comparing the pipeline and joint approaches on different test sets extracted from the CoNLL-2009 Shared Task. We observe some improvements in both syntax and semantics when our syntactic component is a first-order parser. Our results for English are competitive with other state-of-the-art joint proposals such as Henderson et al. (2013).
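The general idea of dual decomposition can be illustrated with a deliberately tiny example: two independently scored models must agree on one label, and Lagrange multipliers are adjusted by subgradient steps until they do. This is a generic textbook-style sketch, not the model of Lluís et al. (2013). The score tables, the 1/t step size and the function name are assumptions chosen for illustration.

```python
def dual_decomposition(f, g, max_iter=50):
    """Find a label y maximizing f[y] + g[y] by making two sub-models agree.

    Each sub-model is decoded on its own; the multipliers lam penalize the
    label chosen by the first model and reward the one chosen by the second
    until both pick the same label, which is then jointly optimal.
    """
    labels = list(f)
    lam = {y: 0.0 for y in labels}
    for t in range(1, max_iter + 1):
        y1 = max(labels, key=lambda y: f[y] + lam[y])  # first sub-problem
        y2 = max(labels, key=lambda y: g[y] - lam[y])  # second sub-problem
        if y1 == y2:
            return y1, t          # agreement certifies optimality
        step = 1.0 / t            # decreasing step size (assumed schedule)
        lam[y1] -= step           # subgradient step on the dual objective
        lam[y2] += step
    return None, max_iter         # no agreement within the iteration budget


# Hypothetical scores: the models disagree at first ("a" versus "c"),
# while "b" has the best combined score.
f = {"a": 3.0, "b": 2.0, "c": 0.0}
g = {"a": 0.0, "b": 2.0, "c": 2.5}
label, iterations = dual_decomposition(f, g)
print(label, iterations)
```

Neither sub-model is modified in this sketch; only the multipliers passed to each decoder change, in the spirit of the abstract's point about relying on largely unmodified component parsers.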
The third contribution of this thesis is a model that finds semantic roles together with syntactic paths linking predicates and arguments (Lluís et al., 2014). We frame SRL as a shortest-path problem. Instead of conditioning on complete syntactic paths, our method is based on the assumption that paths can be factorized. We rely on this factorization to solve our problem efficiently. The approach represents a novel way of exploiting syntactic variability in SRL. In experiments we observe improvements in the robustness of classifiers.
Syntactic dependency parsing and semantic role labeling are two main tasks in the field of Natural Language Processing. These two tasks are closely related and can be regarded generically as building an analysis from a given sequence. In the context of approaches based on large volumes of data, the two tasks are usually handled sequentially by a chain of classifiers. A syntactic parser is applied first and then, given the predicates, the semantic roles are identified and classified (Gildea and Jurafsky, 2002). Jointly processing syntactic dependencies and semantic roles is an appealing idea that has nevertheless been little explored. A joint process could capture some interactions that sequential systems are unable to model. In a joint model we expect semantics to provide cues that improve syntax, and improvements in the opposite direction as well.
Despite these potential advantages and the interest in joint models aroused by the shared tasks of the Conference on Computational Natural Language Learning (CoNLL) 2008 and 2009 (Surdeanu et al., 2008; Hajic et al., 2009), few joint models have been proposed to date, few of these have attracted wide attention and even fewer have presented competitive results. The thesis presents three contributions in this field. The first contribution is to model semantic role labeling as a linear assignment problem. Under this framework we avoid assigning repeated roles to the arguments of a predicate. This proposal follows previous work on applying constraints in semantic role labeling (Punyakanok et al., 2004; Surdeanu et al., 2007). In our case, however, we apply only a subset of the most relevant constraints presented in earlier work. We solve the assignment problem with the efficient O(n^3) Hungarian algorithm. The subsequent contributions of this thesis use this same assignment-based framework. The second contribution of the thesis is a joint model that combines syntactic parsing with semantic role labeling (Lluís et al., 2013). We solve this problem using the method known as dual decomposition. A notable point of our model is that it generates the joint solution relying on largely unmodified syntactic and semantic role parsers. We train each component separately and the dual decomposition method lets us obtain the optimal joint solution during the decoding phase. Our model offers some optimality and efficiency guarantees. We show experiments comparing the sequential and joint approaches on different data sets extracted from the CoNLL-2009 shared task. We observed some improvements in both syntax and semantics in the cases where our syntactic component is a first-order parser.
The results we obtain for English are competitive with other state-of-the-art joint systems such as Henderson et al. (2013). The third contribution of the thesis is a model that searches for semantic roles together with syntactic paths that relate predicates to their arguments (Lluís et al., 2014). We treat role labeling as a shortest-path problem. Instead of conditioning on complete syntactic paths, our method is based on the assumption that paths can be factorized. This factorization is what allows us to solve the problem efficiently. The approach represents a new way of exploiting syntactic variability during semantic role labeling. In the experiments we observe improvements in the robustness of the classifiers.
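The shortest-path framing of the third contribution above can be pictured with a generic sketch: if the cost of a syntactic path factorizes into a sum of per-edge costs, the cheapest path from a predicate to a candidate argument can be found with Dijkstra's algorithm. The graph, the integer edge costs and the function name below are hypothetical, and this is a standard shortest-path routine rather than the model of Lluís et al. (2014).

```python
import heapq


def shortest_path(edges, source, target):
    """Dijkstra's algorithm over non-negative edge costs.

    edges maps a node to a list of (neighbor, cost) pairs. Returns the
    cheapest path from source to target and its total cost, or
    (None, inf) when the target cannot be reached.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue                      # stale queue entry
        for neighbor, cost in edges.get(node, []):
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical per-edge costs between tokens of "gave the book to Mary".
edges = {
    "gave": [("book", 2), ("to", 1), ("Mary", 5)],
    "to": [("Mary", 2)],
}
print(shortest_path(edges, "gave", "Mary"))
```

Because the total cost is a sum over edges, the search never has to enumerate complete paths, which is the efficiency benefit a factorization of this kind provides.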
D7.4 Third evaluation report. Evaluation of PANACEA v3 and produced resources
D7.4 reports on the evaluation of the different components integrated in the PANACEA third cycle of development as well as the final validation of the platform itself. All validation and evaluation experiments follow the evaluation criteria already described in D7.1. The main goal of the WP7 tasks was to test the (technical) functionalities and capabilities of the middleware that allows the integration of the various resource-creation components into an interoperable distributed environment (WP3) and to evaluate the quality of the components developed in WP5 and WP6. The content of this deliverable is thus complementary to D8.2 and D8.3, which address advantages and usability in industrial scenarios. It should be noted that the PANACEA third cycle of development addressed many components that are still under research. The main goal of this evaluation cycle is thus to assess the methods experimented with and their potential for becoming actual production tools that can be exploited outside research labs. For most of the technologies, an attempt was made to re-interpret standard evaluation measures, usually expressed in terms of accuracy, precision and recall, as measures of cost reduction (time and human resources) relative to current practices based on the manual production of resources. To do so, the different tools had to be tuned and adapted to maximize precision, and for some tools an attempt was made to offer confidence measures that could separate out the resources still needing manual revision. Furthermore, the extension to languages other than English, also a PANACEA objective, has been evaluated. The main facts about the evaluation results are now summarized.