2,548 research outputs found

    Morpheus: Automated Safety Verification of Data-Dependent Parser Combinator Programs

    Get PDF
    Parser combinators are a well-known mechanism used for the compositional construction of parsers, and have shown to be particularly useful in writing parsers for rich grammars with data-dependencies and global state. Verifying applications written using them, however, has proven to be challenging in large part because of the inherently effectful nature of the parsers being composed and the difficulty in reasoning about the arbitrarily rich data-dependent semantic actions that can be associated with parsing actions. In this paper, we address these challenges by defining a parser combinator framework called Morpheus equipped with abstractions for defining composable effects tailored for parsing and semantic actions, and a rich specification language used to define safety properties over the constituent parsers comprising a program. Even though its abstractions yield many of the same expressivity benefits as other parser combinator systems, Morpheus is carefully engineered to yield a substantially more tractable automated verification pathway. We demonstrate its utility in verifying a number of realistic, challenging parsing applications, including several cases that involve non-trivial data-dependent relations

    Discontinuous grammar as a foreign language

    Get PDF
    [Abstract] In order to achieve deep natural language understanding, syntactic constituent parsing is a vital step, highly demanded by many artificial intelligence systems to process both text and speech. One of the most recent proposals is the use of standard sequence-to-sequence models to perform constituent parsing as a machine translation task, instead of applying task-specific parsers. While they show a competitive performance, these text-to-parse transducers are still lagging behind classic techniques in terms of accuracy, coverage and speed. To close the gap, we here extend the framework of sequence-to-sequence models for constituent parsing, not only by providing a more powerful neural architecture for improving their performance, but also by enlarging their coverage to handle the most complex syntactic phenomena: discontinuous structures. To that end, we design several novel linearizations that can fully produce discontinuities and, for the first time, we test a sequence-to-sequence model on the main discontinuous benchmarks, obtaining competitive results on par with task-specific discontinuous constituent parsers and achieving state-of-the-art scores on the (discontinuous) English Penn Treebank.Xunta de Galicia; ED431G 2019/01Xunta de Galicia; ED431C 2020/11We acknowledge the European Research Council (ERC), which has funded this research under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150) and the Horizon Europe research and innovation programme (SALSA, grant agreement No 101100615), ERDF/ MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), Xunta de Galicia (ED431C 2020/11), and Centro de Investigación de Galicia ‘‘CITIC”, funded by Xunta de Galicia and the European Union (ERDF - Galicia 2014–2020 Program), by grant ED431G 2019/01. Funding for open access charge: Universidade da Coruña/CISUG

    Promocijas darbs

    Get PDF
    Elektroniskā versija nesatur pielikumusPromocijas darbs veltīts hibrīda latviešu valodas gramatikas modeļa izstrādei un transformēšanai uz Universālo atkarību (Universal Dependencies, UD) modeli. Promocijas darbā ir aizsākts jauns latviešu valodas izpētes virziens – sintaktiski marķētos tekstos balstīti pētījumi. Darba rezultātā ir izstrādāts un aprobēts fundamentāls, latviešu valodai iepriekš nebijis valodas resurss – mašīnlasāms sintaktiski marķēts korpuss 17 tūkstošu teikumu apmērā. Teikumi ir marķēti atbilstoši diviem dažādiem sintaktiskās marķēšanas modeļiem – darbā radītajam frāžu struktūru un atkarību gramatikas hibrīdam un starptautiski aprobētajam UD modelim. Izveidotais valodas resurss publiski pieejams gan lejuplādei, gan tiešsaistes meklēšanai abos iepriekš minētajos marķējuma veidos. Pētījuma laikā radīta rīku kopa un latviešu valodas sintaktiski marķētā korpusa veidošanai vajadzīgā infrastruktūra. Tajā skaitā tika definēti plašam valodas pārklājumam nepieciešamie LU MII eksperimentālā hibrīdā gramatikas modeļa paplašinājumi. Tāpat tika analizētas iespējas atbilstoši hibrīdmodelim marķētus datus pārveidot uz atkarību modeli, un tika radīts atvasināts UD korpuss. Izveidotais sintaktiski marķētais korpuss ir kalpojis par pamatu, lai varētu radīt augstas precizitātes (91%) parsētājus latviešu valodai. Savukārt dalība UD iniciatīvā ir veicinājusi latviešu valodas un arī citu fleksīvu valodu resursu starptautisko atpazīstamību un fleksīvām valodām piemērotāku rīku izveidi datorlingvistikā – pētniecības jomā, kuras vēsturiskā izcelsme pamatā meklējama darbā ar analītiskajām valodām. Atslēgvārdi: sintakses korpuss, Universal Dependencies, valodu tehnoloģijasThe given doctoral thesis describes the creation of a hybrid grammar model for the Latvian language, as well as its subsequent conversion to a Universal Dependencies (UD) grammar model. The thesis also lays the groundwork for Latvian language research through syntactically annotated texts. In this work, a fundamental Latvian language resource was developed and evaluated for the first time – a machine-readable treebank of 17 thousand syntactically annotated sentences. The sentences are annotated according to two syntactic annotation models: the hybrid grammar model developed in the thesis, and the internationally recognised UD model. Both annotated versions of the treebank are publicly available for downloading or querying online. Over the course of the study, a set of tools and infrastructure necessary for treebank creation and maintenance were developed. The language coverage of the IMCS UL experimental hybrid model was extended, and the possibilities were defined for converting data annotated according to the hybrid grammar model to the dependency grammar model. Based on this work, a derived UD treebank was created. The resulting treebank has served as a basis for the development of high accuracy (91%) Latvian language parsers. Furthermore, the participation in the UD initiative has promoted the international recognition of Latvian and other inflective languages and the development of better-fitted tools for inflective language processing in computational linguistics, which historically has been more oriented towards analytic languages. Keywords: treebank, Universal Dependencies, language technologie

    Workshop Proceedings of the 12th edition of the KONVENS conference

    Get PDF
    The 2014 issue of KONVENS is even more a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies such interaction, cooperation and integrated views can produce. This topic at the crossroads of different research traditions which deal with natural language as a container of knowledge, and with methods to extract and manage knowledge that is linguistically represented is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute’s research topics, and it has received even more attention over the last few years

    Advances in automatic terminology processing: methodology and applications in focus

    Get PDF
    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.The information and knowledge era, in which we are living, creates challenges in many fields, and terminology is not an exception. The challenges include an exponential growth in the number of specialised documents that are available, in which terms are presented, and the number of newly introduced concepts and terms, which are already beyond our (manual) capacity. A promising solution to this ‘information overload’ would be to employ automatic or semi-automatic procedures to enable individuals and/or small groups to efficiently build high quality terminologies from their own resources which closely reflect their individual objectives and viewpoints. Automatic terminology processing (ATP) techniques have already proved to be quite reliable, and can save human time in terminology processing. However, they are not without weaknesses, one of which is that these techniques often consider terms to be independent lexical units satisfying some criteria, when terms are, in fact, integral parts of a coherent system (a terminology). This observation is supported by the discussion of the notion of terms and terminology and the review of existing approaches in ATP presented in this thesis. In order to overcome the aforementioned weakness, we propose a novel methodology in ATP which is able to extract a terminology as a whole. The proposed methodology is based on knowledge patterns automatically extracted from glossaries, which we considered to be valuable, but overlooked resources. These automatically identified knowledge patterns are used to extract terms, their relations and descriptions from corpora. The extracted information can facilitate the construction of a terminology as a coherent system. The study also aims to discuss applications of ATP, and describes an experiment in which ATP is integrated into a new NLP application: multiplechoice test item generation. The successful integration of the system shows that ATP is a viable technology, and should be exploited more by other NLP applications

    Grammatical Triples Extraction for the Distant Reading of Textual Corpora

    Get PDF
    Grammatical triples extraction has become increasingly important for the analysis of large, textual corpora. By providing insight into the sentence-level linguistic features of a corpus, extracted triples have supported interpretations of some of the most relevant problems of our time. The growing importance of triples extraction for analyzing large corpora has put the quality of extracted triples under new scrutiny, however. Triples outputs are known to have large amounts of erroneous triples. The extraction of erroneous triples poses a risk for understanding a textual corpus because erroneous triples can be nonfactual and even analogous to misinformation. Disciplines such as the social sciences, history, and literature rely on accurate representations of events. In some cases, misrepresentations of language can be as problematic as describing a historical event that never occurred. The present research proposes a method of triples extraction that has been designed to meet the increasing need for high-accuracy triples outputs for the analysis of text. We propose a solution aimed at reducing errors related to: a) ungrammatical extractions; b) double counting; and c) the missed detection of triples. To improve the accuracy of triples extraction, we implement a series of 12 linguistic rules that leverage syntactic dependency parsing. For its case studies, this dissertation draws upon three data sets: a) Wikipedia; b) the 19th-century British Parliamentary debates, also known as Hansard; and c) half a year of online news articles (Aug. 2021 - Dec. 2021) from FOX News and NPR. In its final chapter, this dissertation offers a pedagogical piece that applies triples extraction to teach concepts related to data analysis. Extracted triples are thus evaluated through two means: a) in Chapter 1, precision and recall is used to vet the accuracy of the present method and b) in chapters 2 and 3, we use human observation to show how the present method of triples extraction can give an accurate and insightful perspective into textual corpora that rivals and, in some cases, exceeds existing methods

    Recent Advances in Research on Island Phenomena

    Get PDF
    In natural languages, filler-gap dependencies can straddle across an unbounded distance. Since the 1960s, the term “island” has been used to describe syntactic structures from which extraction is impossible or impeded. While examples from English are ubiquitous, attested counterexamples in the Mainland Scandinavian languages have continuously been dismissed as illusory and alternative accounts for the underlying structure of such cases have been proposed. However, since such extractions are pervasive in spoken Mainland Scandinavian, these languages may not have been given the attention that they deserve in the syntax literature. In addition, recent research suggests that extraction from certain types of island structures in English might not be as unacceptable as previously assumed either. These findings break new empirical ground, question perceived knowledge, and may indeed have substantial ramifications for syntactic theory. This volume provides an overview of state-of-the-art research on island phenomena primarily in English and the Scandinavian languages, focusing on how languages compare to English, with the aim to shed new light on the nature of island constraints from different theoretical perspectives

    Multiple input parsing and lexical analysis

    Get PDF

    Functional programming applications

    Get PDF
    Functional programming languages have distinct advantages over imperative lan guages. These include ease of reasoning, formally or informally, about programs and the concise and elegant expression of complex algorithms. We use several large programming tasks to investigate various aspects of the pro duction of a complete system in a functional language. These include the overall development of an implementation, the implementation of the core algorithms and the implementation of the external interface. We do not attempt verifications of the implementations but instead adopt a specification style that aids informal reasoning about the corresponding implementation. We employ the language Miranda1 l as our example functional programming lan guage
    corecore