
    An Unsolicited Soliloquy on Dependency Parsing

    Programa Oficial de Doutoramento en Computación. 5009V01

    [Abstract] This thesis presents work on dependency parsing covering two distinct lines of research. The first aims to develop efficient parsers that are fast enough to parse large amounts of data while still maintaining decent accuracy. We investigate two techniques to achieve this: the first is a cognitively inspired method and the second uses model distillation. The first technique proved to be utterly dismal, while the second was somewhat of a success. The second line of research evaluates parsers, also in two ways. First, we evaluate what causes variation in parsing performance across different algorithms and different treebanks. This evaluation is grounded in dependency displacements (the directed distance between a dependent and its head): we compare the displacement distributions associated with parsing algorithms and those found in treebanks, as well as the distributions found in training and test data. This work sheds some light on the variation in performance across both algorithms and treebanks. Second, we examine the utility of part-of-speech tags in parsing systems and question the standard assumption that they might help but certainly won't hurt.

    This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150) and from the Centro de Investigación de Galicia (CITIC), which is funded by the Xunta de Galicia and the European Union (ERDF, Galicia 2014-2020 Program) through grant ED431G 2019/01.
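    As a rough illustration of the displacement measure this evaluation is grounded in, here is a minimal Python sketch (not code from the thesis; the sign convention, head position minus dependent position, is an assumption) that computes dependency displacements from CoNLL-U-style head indices and collects their distribution.

        # Minimal sketch: dependency displacement as the directed distance
        # between a dependent and its head. The sign convention used here is
        # an assumption, not taken from the thesis.
        from collections import Counter

        def displacements(heads):
            """heads[i] is the 1-based head index of token i+1; 0 marks the root."""
            return [h - (i + 1) for i, h in enumerate(heads) if h != 0]

        # Toy sentence "the dog barked": the -> dog, dog -> barked, barked = root.
        dist = Counter(displacements([2, 3, 0]))
        print(dist)  # Counter({1: 2}): both dependents sit one position left of their heads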

    Towards a machine-learning architecture for Lexical Functional Grammar parsing

    Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and to find solutions which will generalize robustly across multiple languages. The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks, we can reduce the amount of manual specification and improve robustness, accuracy and domain- and language-independence for LFG parsing systems. Function labels can often be mapped relatively straightforwardly to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing. In a lexicalized grammatical formalism such as LFG, a large amount of syntactically relevant information comes from lexical entries. It is therefore important to be able to perform morphological analysis accurately and robustly for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text, obtaining competitive or improved results on a range of typologically diverse languages.
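    The abstract does not spell out the method, but a common fully data-driven formulation treats lemmatization as classification over suffix-edit rules extracted from (form, lemma) pairs, which a tagger can then predict jointly with morphological features. A hedged sketch of that general idea (illustrative only, not the thesis's actual system):

        # Derive and apply suffix-replacement rules mapping word forms to lemmas.
        def edit_rule(form, lemma):
            """Return (suffix_to_strip, suffix_to_append) turning form into lemma."""
            i = 0
            while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
                i += 1
            return (form[i:], lemma[i:])

        def apply_rule(form, rule):
            strip, append = rule
            base = form[: len(form) - len(strip)] if strip else form
            return base + append

        rule = edit_rule("walking", "walk")  # ('ing', '')
        print(apply_rule("talking", rule))   # 'talk'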

    Sequence Labeling Parsing by Learning Across Representations

    We use parsing as sequence labeling as a common framework to learn across constituency and dependency syntactic abstractions. To do so, we cast the problem as multitask learning (MTL). First, we show that adding a parsing paradigm as an auxiliary loss consistently improves performance on the other paradigm. Second, we explore an MTL sequence labeling model that parses both representations, at almost no cost in terms of performance and speed. The results across the board show that, on average, MTL models with auxiliary losses outperform single-task ones by 1.14 F1 points for constituency parsing and by 0.62 UAS points for dependency parsing.

    Comment: Proc. of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Revised version after fixing an evaluation bug.
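    A hedged sketch of the general setup (not the authors' exact model; the dimensions, label counts, and the 0.5 auxiliary-loss weight are illustrative assumptions): a shared encoder with one sequence-labeling head per syntactic paradigm, where the auxiliary paradigm contributes a weighted loss term.

        # Multitask sequence labeling: shared BiLSTM encoder, two label heads.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MTLTagger(nn.Module):
            def __init__(self, vocab, n_const, n_dep, dim=100):
                super().__init__()
                self.emb = nn.Embedding(vocab, dim)
                self.enc = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
                self.const_head = nn.Linear(2 * dim, n_const)  # constituency labels
                self.dep_head = nn.Linear(2 * dim, n_dep)      # dependency labels

            def forward(self, tokens):
                h, _ = self.enc(self.emb(tokens))
                return self.const_head(h), self.dep_head(h)

        model = MTLTagger(vocab=1000, n_const=50, n_dep=40)
        tokens = torch.randint(0, 1000, (2, 7))  # batch of 2 sentences, 7 tokens each
        const_gold = torch.randint(0, 50, (2, 7))
        dep_gold = torch.randint(0, 40, (2, 7))
        const_logits, dep_logits = model(tokens)
        # Main-task loss plus a weighted auxiliary loss from the other paradigm.
        loss = F.cross_entropy(const_logits.reshape(-1, 50), const_gold.reshape(-1)) \
             + 0.5 * F.cross_entropy(dep_logits.reshape(-1, 40), dep_gold.reshape(-1))
        loss.backward()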

    Apports des analyses syntaxiques pour la détection automatique de mentions dans un corpus de français oral

    We present three experiments in detecting entity mentions in ANCOR, a corpus of spoken French, using publicly available parsing tools and state-of-the-art mention detection techniques from coreference resolution, anaphora resolution and Entity Detection and Tracking systems. While the tools we use are not specifically designed to deal with spoken French, our results are comparable to those of state-of-the-art end-to-end systems for other languages. We also outline several ways to improve our results in future work towards an end-to-end coreference resolution system for French, for which these experiments could serve as a mention detection baseline.

    Translation universals: a usage-based approach

    The language used in translated texts is said to differ from the language used in other communicative contexts. Translation-specific linguistic behaviour (translation universals) has been shown to explain those differences at the levels of syntax, lexicon, discourse, and semantics. Scholars seem to disagree as to the roots of this behaviour: some turn to socio-cultural and economic factors such as risk avoidance, while others argue that cognitive processing inherent in translation, and unique to it, affects the linguistic choices made by translators. The aim of this thesis is to shed new light on translation universals from a usage-based perspective. The plausibility of universal translational behaviour is assessed with reference to what we know about implicit and explicit linguistic knowledge: how it is acquired and how it affects language use. I argue that there is little support for the idea that the process of translation constrains the linguistic choices of translators. Instead, I show that the differences between translated and non-translated texts observed in many studies, which have been attributed to translation universals, are likely to result from differences between the content of the translated and non-translated components of comparable corpora. This hypothesis is supported with corpus and experimental evidence showing that differences in the use of modality and aspect in translated and non-translated Polish texts can be explained by frequency effects: the two corpora contain different verbs, whose frequency of occurrence affects translators' and authors' aspectual choices, resulting in the observed differences. The thesis has important methodological and theoretical implications for Translation Studies. First, it shows the importance of examining the comparability of comparable corpora before turning to translation universals to explain the linguistic choices made in translation. Second, it casts doubt on the plausibility of translation universals as a factor in linguistic decision-making in translation, thereby simplifying the theoretical account needed to explain translational choices.

    Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2). 29 November 2012, Lisbon, Portugal

    Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), held in Lisbon, Portugal, on 29 November 2012.

    A dynamic network approach to the study of syntax

    Usage-based linguists and psychologists have produced a large body of empirical results suggesting that linguistic structure is derived from language use. However, while researchers agree that these results characterize grammar as an emergent phenomenon, there is no consensus among usage-based scholars as to how the various results can be explained and integrated into an explicit theory or model. Building on network theory, the current paper outlines a structured network approach to the study of grammar in which the core concepts of syntax are analyzed by a set of relations that specify associations between different aspects of a speaker's linguistic knowledge. These associations are shaped by domain-general processes that can give rise to new structures and meanings in language acquisition and language change. Combining research from linguistics and psychology, the paper proposes specific network analyses for the following phenomena: argument structure, word classes, constituent structure, constructions and construction families, and grammatical categories such as voice, case and number. The article builds on data and analyses presented in Diessel (2019; The Grammar Network: How Linguistic Structure is Shaped by Language Use) but approaches the topic from a different perspective.