
    Analyzing, enhancing, optimizing and applying dependency analysis

    Unpublished doctoral thesis, Universidad Complutense de Madrid, Facultad de Informática, Departamento de Ingeniería del Software e Inteligencia Artificial, defended 19/12/2012. [ABSTRACT] Statistical dependency parsing accuracy has improved substantially in recent years. One of the main reasons is the adoption of data-driven (machine learning) methods, which allow the development of a parser for any language that has an adequate training corpus, without requiring great effort from the end user. MaltParser is one such system. In the present thesis we use state-of-the-art systems (mainly MaltParser) to make contributions in four areas inherently related to natural language processing (NLP) and dependency parsing: (i) We study the parsing problem, demonstrating the homogeneity of parser performance and presenting findings on sentence length, training-corpus size and how parsers are normally evaluated. (ii) We try several ways of improving parsing accuracy by modifying the flow of analysis: parsing some segments of the sentences separately, which ultimately yields a parser-combination problem, and modifying the internal behavior of the parsers, focusing on the root of dependency structures, an important part of what a dependency parser produces and worth studying in its own right. (iii) We research automatic feature selection and parsing optimization for transition-based parsers, which we consider an important open problem that needs to be solved in dependency parsing in order to address parsing tasks more successfully. (iv) Finally, we apply syntactic dependency structures and dependency parsing to current NLP problems such as text simplification and inferring the scope of negation cues. Furthermore, the knowledge acquired in developing this thesis can be used to implement more robust dependency-parsing-based applications in NLP and related areas, as demonstrated throughout the thesis.
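    As a quick illustration of the transition-based paradigm that MaltParser and contribution (iii) build on, below is a minimal sketch of an arc-standard transition system. The oracle interface and the toy run are hypothetical simplifications for exposition, not the thesis's actual configuration or feature model.

```python
# Minimal sketch of an arc-standard transition system, the family of
# algorithms MaltParser implements. Hypothetical simplification: in a
# real system `oracle` is a trained classifier over a feature model
# (the object optimized by automatic feature selection).

def parse(words, oracle):
    """Parse a sentence with SHIFT / LEFT-ARC / RIGHT-ARC transitions.

    `oracle` is any callable mapping (stack, buffer) to a transition
    name. Token 0 is an artificial root node.
    """
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), []
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and len(stack) > 1 and stack[-2] != 0:
            dep = stack.pop(-2)            # second-topmost gets a head
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC" and len(stack) > 1:
            dep = stack.pop()              # topmost gets a head
            arcs.append((stack[-1], dep))
        elif buffer:                       # fallback on invalid actions
            stack.append(buffer.pop(0))
        else:
            break
    return arcs  # list of (head, dependent) index pairs

# Toy run: always SHIFT until the buffer is empty, then RIGHT-ARC,
# which yields a right-branching chain rooted at the artificial node.
chain = parse(["She", "saw", "stars"],
              lambda s, b: "SHIFT" if b else "RIGHT-ARC")
print(chain)  # [(2, 3), (1, 2), (0, 1)]
```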

    Proceedings

    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 268 pages. © 2010 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/15891

    Statistical Parsing by Machine Learning from a Classical Arabic Treebank

    Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state of the art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations. A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic. The Quran was chosen for annotation because a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision. A practical result of the annotation effort is the corpus website, http://corpus.quran.com, an educational resource with over two million users per year.
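    For context on the reported scores (87.47% vs. 89.03%), a minimal, generic sketch of how a parsing F1-score is computed over sets of predicted and gold structural units follows. The choice of scoring unit here (labeled arcs or constituent spans) is an assumption for illustration and may differ from the thesis's exact metric.

```python
# Generic parsing F1 sketch: precision and recall over the sets of
# predicted vs. gold structural units. The unit type is an assumption;
# e.g. (head, dep, label) arcs or (start, end, category) spans.

def parse_f1(predicted, gold):
    """F1 over two collections of hashable scoring units."""
    pred, gold = set(predicted), set(gold)
    if not pred or not gold:
        return 0.0
    correct = len(pred & gold)
    precision = correct / len(pred)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example with (head, dependent, label) arcs.
gold_arcs = [(0, 2, "root"), (2, 1, "nsubj"), (2, 3, "obj")]
pred_arcs = [(0, 2, "root"), (2, 1, "nsubj"), (1, 3, "obj")]
print(f"{parse_f1(pred_arcs, gold_arcs):.4f}")  # 0.6667
```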

    An Unsolicited Soliloquy on Dependency Parsing

    Programa Oficial de Doutoramento en Computación. 5009V01. [Abstract] This thesis presents work on dependency parsing covering two distinct lines of research. The first aims to develop parsers efficient enough to process large amounts of data quickly while still maintaining decent accuracy. We investigate two techniques to achieve this: a cognitively inspired method and a model-distillation method. The first technique proved utterly dismal, while the second was somewhat of a success. The second line of research evaluates parsers, again in two ways. First, we evaluate what causes variation in parsing performance across different algorithms and different treebanks. This evaluation is grounded in dependency displacements (the directed distance between a dependent and its head): we compare the displacement distributions associated with algorithms with those found in treebanks, and also the distributions found in training versus test data. This work sheds some light on the variation in performance across both algorithms and treebanks. The second part of this line focuses on the utility of part-of-speech tags when used with parsing systems, and questions the standard position of assuming that they might help but certainly won't hurt. This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150) and from the Centro de Investigación de Galicia (CITIC), which is funded by the Xunta de Galicia and the European Union (ERDF, Galicia 2014-2020 Program) under grant ED431G 2019/01.
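    A minimal sketch of the displacement-based evaluation idea described above: for each token we record the signed distance from dependent to head and compare the resulting relative-frequency distributions across algorithms or treebanks. The sign convention and the CoNLL-style head-column input format are illustrative assumptions.

```python
# Sketch: dependency displacement is the directed distance from a
# dependent to its head; the distribution of displacements is the
# object compared across algorithms and treebanks.
from collections import Counter

def displacement_distribution(sentences):
    """`sentences` is an iterable of head lists, where heads[i] is the
    1-based head index of token i+1 (0 marks the root). Returns the
    relative frequency of each signed dependent-to-head distance."""
    counts = Counter()
    for heads in sentences:
        for dep, head in enumerate(heads, start=1):
            if head != 0:                 # skip root attachments
                counts[head - dep] += 1   # positive = head to the right
    total = sum(counts.values())
    return {d: n / total for d, n in sorted(counts.items())}

# Toy treebank of two sentences given as CoNLL-style head columns.
print(displacement_distribution([[2, 0, 2], [0, 1]]))
# {-1: 0.666..., 1: 0.333...}
```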

    Joint parsing of syntactic and semantic dependencies

    Syntactic Dependency Parsing and Semantic Role Labeling (SRL) are two main problems in Natural Language Understanding. Both tasks are closely related and can be regarded as parsing on top of a given sequence. In the data-driven setting, these tasks are typically addressed sequentially by a pipeline of classifiers: a syntactic parser runs in the first stage, and then, given the predicates, the semantic roles are identified and classified (Gildea and Jurafsky, 2002). An appealing and largely unexplored idea is to jointly process syntactic dependencies and semantic roles. A joint process could capture interactions that pipeline systems are unable to model; we expect joint models to improve syntax based on semantic cues, and vice versa. Despite this potential advantage and the interest in joint processing stimulated by the CoNLL-2008 and 2009 Shared Tasks (Surdeanu et al., 2008; Hajic et al., 2009), very few joint models have been proposed to date, few have attracted attention and fewer have obtained competitive results. This thesis presents three contributions on this topic. The first contribution is to frame semantic role labeling as a linear assignment task. Under this framework we avoid assigning repeated roles to the arguments of a predicate. Our proposal follows previous work on enforcing constraints on the SRL analysis (Punyakanok et al., 2004; Surdeanu et al., 2007), but in our case we enforce only a relevant subset of these constraints. We solve this problem with the efficient O(n^3) Hungarian algorithm, and our subsequent contributions rely on this assignment framework. The second contribution is a joint model that combines syntactic parsing and SRL (Lluís et al., 2013), solved using dual-decomposition techniques. A strong point of our model is that it generates a joint solution relying on largely unmodified syntactic and SRL parsers: we train each component independently, and the dual-decomposition method finds the optimal joint solution at decoding time. The model comes with optimality and efficiency guarantees. We report experiments comparing the pipeline and joint approaches on different test sets extracted from the CoNLL-2009 Shared Task, observing improvements in both syntax and semantics when our syntactic component is a first-order parser. Our results for English are competitive with other state-of-the-art joint proposals such as Henderson et al. (2013). The third contribution is a model that finds semantic roles together with the syntactic paths linking predicates and arguments (Lluís et al., 2014), framing SRL as a shortest-path problem. Instead of conditioning on complete syntactic paths, our method is based on the assumption that paths can be factorized, and we rely on this factorization to solve the problem efficiently. The approach represents a novel way of exploiting syntactic variability in SRL, and in experiments we observe improvements in the robustness of classifiers.
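    To make the linear-assignment framing of the first contribution concrete, here is a hedged sketch using SciPy's implementation of the Hungarian method. The role inventory and score matrix are invented for illustration and do not reproduce the thesis's models or constraint set.

```python
# Sketch of SRL as linear assignment: each argument of a predicate
# receives at most one role and no role is repeated, which is exactly
# what the assignment solver enforces. Scores are made up.
import numpy as np
from scipy.optimize import linear_sum_assignment

# scores[i, j]: hypothetical classifier score for argument i taking
# role j (a PropBank-style subset, chosen only for illustration).
roles = ["A0", "A1", "A2", "AM-TMP"]
scores = np.array([
    [4.0, 1.0, 0.2, 0.1],   # "the cat"
    [0.5, 3.5, 0.8, 0.0],   # "the mouse"
    [0.1, 0.4, 0.3, 2.2],   # "yesterday"
])

# Maximize total score by minimizing negated scores; the Hungarian
# method runs in O(n^3), matching the complexity quoted above.
arg_idx, role_idx = linear_sum_assignment(-scores)
for i, j in zip(arg_idx, role_idx):
    print(f"argument {i} -> {roles[j]}")
# argument 0 -> A0, argument 1 -> A1, argument 2 -> AM-TMP
```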

    Nodalida 2005 - proceedings of the 15th NODALIDA conference


    D7.4 Third evaluation report. Evaluation of PANACEA v3 and produced resources

    D7.4 reports on the evaluation of the different components integrated in the PANACEA third cycle of development, as well as the final validation of the platform itself. All validation and evaluation experiments follow the evaluation criteria already described in D7.1. The main goal of the WP7 tasks was to test the (technical) functionalities and capabilities of the middleware that allows the integration of the various resource-creation components into an interoperable distributed environment (WP3), and to evaluate the quality of the components developed in WP5 and WP6. The content of this deliverable is thus complementary to D8.2 and D8.3, which tackle advantages and usability in industrial scenarios. It has to be noted that the PANACEA third cycle of development addressed many components that are still at the research stage. The main goal for this evaluation cycle is therefore to assess the methods experimented with and their potential to become actual production tools exploitable outside research labs. For most of the technologies, an attempt was made to reinterpret standard evaluation measures, usually expressed in terms of accuracy, precision and recall, as measures related to a reduction of costs (time and human resources) with respect to current practices based on the manual production of resources. To do so, the different tools had to be tuned and adapted to maximize precision, and for some tools we attempted to provide confidence measures that could separate out the resources still needing manual revision. Furthermore, the extension to languages other than English, also a PANACEA objective, was evaluated. The main facts about the evaluation results are now summarized.
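    As an illustration of the tuning strategy described above (maximizing precision and using confidence measures to separate out resources that still need manual revision), here is a small hypothetical sketch; the data, function names and thresholds are invented and are not PANACEA's actual code.

```python
# Hypothetical sketch: raising a confidence threshold increases the
# precision of auto-accepted outputs at the cost of routing a larger
# share of the output to manual revision.

def split_by_confidence(items, threshold):
    """`items` are (prediction_is_correct, confidence) pairs; returns
    precision over the auto-accepted part and the manual-review share."""
    accepted = [ok for ok, conf in items if conf >= threshold]
    precision = sum(accepted) / len(accepted) if accepted else 1.0
    review_share = 1 - len(accepted) / len(items)
    return precision, review_share

# Invented evaluation data for four tool outputs.
items = [(True, 0.95), (True, 0.80), (False, 0.55), (True, 0.40)]
for t in (0.0, 0.6, 0.9):
    p, r = split_by_confidence(items, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  manual share={r:.0%}")
```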