497 research outputs found

    Automatic acquisition of Spanish LFG resources from the Cast3LB treebank

    In this paper, we describe the automatic annotation of the Cast3LB Treebank with LFG f-structures for the subsequent extraction of Spanish probabilistic grammar and lexical resources. We adapt the approach and methodology developed for English by Cahill et al. (2004), O’Donovan et al. (2004) and others to Spanish and the Cast3LB treebank encoding. We report on the quality and coverage of the automatic f-structure annotation. Following the pipeline and integrated models of Cahill et al. (2004), we extract wide-coverage probabilistic LFG approximations and parse unseen Spanish text into f-structures. We also extend Bikel’s (2002) Multilingual Parse Engine to include a Spanish language module. Using the retrained Bikel parser in the pipeline model gives the best results against a manually constructed gold standard (73.20% preds-only f-score). We also extract Spanish lexical resources: 4090 semantic form types with 98 frame types. Subcategorised prepositions and particles are included in the frames.

    Automatic annotation of the Penn-treebank with LFG f-structure information

    Lexical-Functional Grammar f-structures are abstract syntactic representations approximating basic predicate-argument structure. Treebanks annotated with f-structure information are required as training resources for stochastic versions of unification and constraint-based grammars and for the automatic extraction of such resources. A number of papers (Frank, 2000; Sadler, van Genabith and Way, 2000) have developed methods for automatically annotating treebank resources with f-structure information. However, to date, these methods have only been applied to treebank fragments of the order of a few hundred trees. In the present paper we present a new method that scales and has been applied to a complete treebank, in our case the WSJ section of Penn-II (Marcus et al., 1994), with more than 1,000,000 words in about 50,000 sentences.

    Content Differences in Syntactic and Semantic Representations

    Syntactic analysis plays an important role in semantic parsing, but the nature of this role remains a topic of ongoing debate. The debate has been constrained by the scarcity of empirical comparative studies between syntactic and semantic schemes, which hinders the development of parsing methods informed by the details of target schemes and constructions. We target this gap, and take Universal Dependencies (UD) and UCCA as a test case. After abstracting away from differences of convention or formalism, we find that most content divergences can be ascribed to: (1) UCCA's distinction between a Scene and a non-Scene; (2) UCCA's distinction between primary relations, secondary ones and participants; (3) different treatment of multi-word expressions, and (4) different treatment of inter-clause linkage. We further discuss the long tail of cases where the two schemes take markedly different approaches. Finally, we show that the proposed comparison methodology can be used for fine-grained evaluation of UCCA parsing, highlighting both challenges and potential sources for improvement. The substantial differences between the schemes suggest that semantic parsers are likely to benefit downstream text understanding applications beyond their syntactic counterparts. Comment: NAACL-HLT 2019 camera ready

    Wide-coverage deep statistical parsing using automatic dependency structure annotation

    A number of researchers (Lin 1995; Carroll, Briscoe, and Sanfilippo 1998; Carroll et al. 2002; Clark and Hockenmaier 2002; King et al. 2003; Preiss 2003; Kaplan et al. 2004; Miyao and Tsujii 2004) have convincingly argued for the use of dependency (rather than CFG-tree) representations for parser evaluation. Preiss (2003) and Kaplan et al. (2004) conducted a number of experiments comparing “deep” hand-crafted wide-coverage parsers with “shallow” treebank- and machine-learning-based parsers at the level of dependencies, using simple and automatic methods to convert tree output generated by the shallow parsers into dependencies. In this article, we revisit the experiments in Preiss (2003) and Kaplan et al. (2004), this time using the sophisticated automatic LFG f-structure annotation methodologies of Cahill et al. (2002b, 2004) and Burke (2006), with surprising results. We compare various PCFG and history-based parsers (based on Collins, 1999; Charniak, 2000; Bikel, 2002) to find a baseline parsing system that fits best into our automatic dependency structure annotation technique. This combined system of syntactic parser and dependency structure annotation is compared to two hand-crafted, deep constraint-based parsers (Carroll and Briscoe 2002; Riezler et al. 2002). We evaluate using dependency-based gold standards (DCU 105, PARC 700, CBS 500 and dependencies for WSJ Section 22) and use the Approximate Randomization Test (Noreen 1989) to test the statistical significance of the results. Our experiments show that machine-learning-based shallow grammars augmented with sophisticated automatic dependency annotation technology outperform hand-crafted, deep, wide-coverage constraint grammars. Currently our best system achieves an f-score of 82.73% against the PARC 700 Dependency Bank (King et al. 2003), a statistically significant improvement of 2.18% over the most recent results of 80.55% for the hand-crafted LFG grammar and XLE parsing system of Riezler et al. (2002), and an f-score of 80.23% against the CBS 500 Dependency Bank (Carroll, Briscoe, and Sanfilippo 1998), a statistically significant 3.66% improvement over the 76.57% achieved by the hand-crafted RASP grammar and parsing system of Carroll and Briscoe (2002).
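The abstract above names the Approximate Randomization Test as its significance test. As a rough illustration only, the following Python sketch shows one common paired form of that test. It assumes per-sentence scores for two systems and compares their means, which is a simplification: the article reports corpus-level f-scores, and its exact procedure is not given here. The function name, parameters, and the add-one smoothing of the p-value are choices made for this sketch, not details taken from the article.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test (illustrative sketch).

    scores_a and scores_b hold one score per test sentence for systems A
    and B (an assumption of this sketch). Returns an estimated p-value for
    the observed difference in mean score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n

    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap the pair of outputs with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

A small p-value suggests that the observed gap between the two systems is unlikely to arise from random reassignment of outputs alone.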

    Universal dependencies for Irish

    Language resources that enable cross-lingual studies have become increasingly valuable for lesser-resourced languages such as Irish, as they allow for easier sharing of resources, thus overcoming the problem of data scarcity. The Universal Dependencies (UD) Project is an initiative aimed at cross-lingual studies of treebanks, linguistic structures and parsing. Its goal is to create a set of multilingual harmonised treebanks that are designed according to a universal annotation scheme. In this paper, we report on the conversion of the Irish Dependency Treebank (IDT) (Lynn, 2016) to a UD version of the treebank which we term the Irish Universal Dependency Treebank (IUDT). We report on the mapping of the IDT labelling scheme to the UD scheme, along with a clear description of the structural changes required in this conversion.

    Designing a Dependency Representation and Grammar Definition Corpus for Finnish

    We outline the design and creation of a syntactically and morphologically annotated corpus of Finnish for use by the research community. We motivate a definitional, systematic “grammar definition corpus” as a first step in a three-year annotation effort to help create higher-quality, better-documented extensive parsebanks at a later stage. The syntactic representation, consisting of a dependency structure and a basic set of dependency functions, is outlined with examples. Reference is made to double-blind annotation experiments to measure the applicability of the new grammar definition corpus methodology. Peer reviewed

    Improving a Strong Neural Parser with Conjunction-Specific Features

    While dependency parsers reach very high overall accuracy, some dependency relations are much harder than others. In particular, dependency parsers perform poorly on coordination constructions (i.e., correctly attaching the "conj" relation). We extend a state-of-the-art dependency parser with conjunction-specific features, focusing on the similarity between the conjuncts' head words. Training the extended parser yields an improvement in "conj" attachment as well as in overall dependency parsing accuracy on the Stanford dependency conversion of the Penn TreeBank.

    Statistical Parsing by Machine Learning from a Classical Arabic Treebank

    Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state-of-the-art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations. A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic. The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision. A practical result of the annotation effort is the corpus website: http://corpus.quran.com, an educational resource with over two million users per year.