
    Left Recursion in Parsing Expression Grammars

    Parsing Expression Grammars (PEGs) are a formalism that can describe all deterministic context-free languages through a set of rules that specify a top-down parser for some language. PEGs are easy to use, and there are efficient implementations of PEG libraries in several programming languages. A frequently missed feature of PEGs is left recursion, which is commonly used in Context-Free Grammars (CFGs) to encode left-associative operations. We present a simple conservative extension to the semantics of PEGs that gives useful meaning to direct and indirect left-recursive rules, and show that our extensions make it easy to express left-recursive idioms from CFGs in PEGs, with similar results. We prove the conservativeness of these extensions, and also prove that they work with any left-recursive PEG. PEGs can also be compiled to programs in a low-level parsing machine. We present an extension to the semantics of the operations of this parsing machine that lets it interpret left-recursive PEGs, and prove that this extension is correct with respect to our semantics for left-recursive PEGs. (Extended version of the paper "Left Recursion in Parsing Expression Grammars", published at the 2012 Brazilian Symposium on Programming Languages.)
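
    The paper defines its extension formally, as a conservative change to PEG semantics; as a rough, illustrative companion, the sketch below shows the well-known "seed-growing" way of making a packrat-style parser accept a directly left-recursive rule (plant a failing seed for the recursive call, then re-evaluate the body until the match stops growing). The grammar and names are illustrative, not the paper's definitions.

        FAIL = (None, -1)

        class Parser:
            def __init__(self, text):
                self.text = text
                self.memo = {}                 # (rule, pos) -> (value, next_pos)

            def apply(self, rule, pos):
                key = (rule, pos)
                if key in self.memo:
                    return self.memo[key]
                self.memo[key] = FAIL          # failing seed halts the descent
                while True:                    # grow the seed until it stalls
                    result = rule(self, pos)
                    if result[1] <= self.memo[key][1]:
                        return self.memo[key]
                    self.memo[key] = result

        def num(p, pos):
            if pos < len(p.text) and p.text[pos].isdigit():
                return (int(p.text[pos]), pos + 1)
            return FAIL

        def expr(p, pos):
            # expr <- expr '-' num / num  (left recursion = left associativity)
            val, nxt = p.apply(expr, pos)
            if 0 <= nxt < len(p.text) and p.text[nxt] == '-':
                val2, nxt2 = num(p, nxt + 1)
                if nxt2 >= 0:
                    return (val - val2, nxt2)
            return num(p, pos)

        print(Parser("8-3-2").apply(expr, 0))  # (3, 5): computed as (8-3)-2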

    Parallel parsing made practical

    The property of local parsability allows an input to be parsed by inspecting only a bounded-length window around the current token. This in turn enables the construction of a scalable, data-parallel parsing algorithm, which is presented in this work. Such an algorithm is easily amenable to automatic generation via a parser generator tool, which we have realized and also present here. Furthermore, to complete the framework of parallel input analysis, a parallel scanner can be combined with the parser. To demonstrate the practicality of parallel lexing and parsing, we report the results of adapting JSON and Lua to a form fit for parallel parsing (i.e. an operator-precedence grammar) through simple grammar changes and scanning transformations. The approach is validated with performance figures from both high-performance and embedded multicore platforms, obtained by analyzing real-world inputs as a test bench. The results show that our approach matches or exceeds the performance of production-grade LR parsers in sequential execution, and achieves significant speedups and good scaling on multicore machines. The work concludes with a broad, critical survey of past work on parallel parsing and of future directions for integration with semantic analysis and incremental parsing.
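
    The "bounded window" behind local parsability is easy to see in a Floyd operator-precedence recognizer: every shift/reduce decision compares only the topmost stack terminal with the lookahead token, never any global left context. The minimal sequential sketch below, for toy arithmetic and assuming well-formed input, illustrates that local-decision property; the paper's contribution is the data-parallel algorithm and generator built on it, not this code.

        TERMINALS = set('+*()n$')       # 'n' stands for any number token

        def rel(a, b):
            # Precedence relation between stack-top terminal a and lookahead b.
            if a in ('n', ')'):         # a finished operand always reduces
                return '>'
            if a == '(' and b == ')':
                return '='              # matching parentheses shift together
            if a in ('(', '$') or b in ('(', 'n'):
                return '<'
            if b in (')', '$'):
                return '>'
            order = {'+': 1, '*': 2}
            return '<' if order[b] > order[a] else '>'

        def parse(tokens):              # assumes a well-formed token list
            stack, i = ['$'], 0
            while True:
                top = next(t for t in reversed(stack) if t in TERMINALS)
                if top == '$' and tokens[i] == '$':
                    return True                          # accept
                if rel(top, tokens[i]) in ('<', '='):
                    stack.append(tokens[i]); i += 1      # shift
                else:                                    # '>': reduce handle
                    while True:
                        sym = stack.pop()
                        if sym in TERMINALS:
                            below = next(t for t in reversed(stack)
                                         if t in TERMINALS)
                            if rel(below, sym) == '<':
                                break
                    if stack[-1] == 'E':  # handle may start with a nonterminal
                        stack.pop()
                    stack.append('E')

        print(parse(list('n*(n+n)') + ['$']))  # True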

    Island Grammar-based Parsing using GLL and Tom

    Extending a language by embedding another language within it presents significant parsing challenges, especially if the embedding is recursive. The composite grammar is likely to be nondeterministic as a result of tokens that are valid in both the host and the embedded language. In this paper we examine the challenges of embedding the Tom language into a variety of general-purpose high-level languages. Tom provides syntax and semantics for advanced pattern matching and tree rewriting facilities. Embedded Tom constructs are translated into the host language by a preprocessor, the output of which is a composite program written purely in the host language. Tom implementations exist for Java, C, C#, Python and Caml. The current parser is complex and difficult to maintain. We describe how Tom can be parsed using island grammars implemented with the Generalised LL (GLL) parsing algorithm. The grammar is, as might be expected, ambiguous; extracting the correct derivation relies on our disambiguation strategy, which is based on pattern matching within the parse forest. We describe different classes of ambiguity and propose patterns for resolving them.
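
    The idea of an island grammar is to specify the embedded constructs (the islands) precisely and treat the remaining host text as unparsed "water". The toy scanner below conveys that split; it is not the paper's GLL-based implementation, and the `%match { ... }` island syntax is a simplification of Tom's constructs, assumed here for illustration.

        def split_islands(src, opener='%match'):
            parts, i = [], 0
            while True:
                j = src.find(opener, i)
                if j < 0:
                    parts.append(('water', src[i:]))
                    return parts
                parts.append(('water', src[i:j]))
                k = src.index('{', j) + 1        # island body: balanced braces
                depth = 1
                while depth:
                    if src[k] == '{':
                        depth += 1
                    elif src[k] == '}':
                        depth -= 1
                    k += 1
                parts.append(('island', src[j:k]))
                i = k

        host = 'int f(Term t) { %match(t) { a() -> { return 1; } } return 0; }'
        for kind, text in split_islands(host):
            print(kind, repr(text))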

    Neural Combinatory Constituency Parsing

    Doctoral thesis, Tokyo Metropolitan University (東京都立大学), Doctor of Information Science.

    Incremental parsing algorithms for speech-editing mathematics and computer code

    The provision of speech control for editing plain-language text has existed for a long time, but does not extend to structured content such as mathematics. The requirements of a user interface for a spoken mathematics editor are explored through the lens of an intuitive natural user interface (NUI) for speech control, the desired properties of which are based on a combination of the existing literature on NUIs and on intuitive user interfaces. An important aspect of an intuitive NUI is timely update of the displayed content in response to editing actions. This is not feasible using batch parsing alone, and the issue is more serious for larger documents such as computer program code. The solution is an incremental parser designed to work with operator-precedence (OP) grammars. The contribution to knowledge provided by this thesis is to improve the processing-time efficiency of the OP incremental parsing algorithm developed by Heeman, and to extend it to handle the distfix (mixfix) operators described by Attanayake to model brackets and mathematical functions. This is implemented successfully for the TalkMaths system and shows a greatly reduced response time compared with batch scanning and parsing alone. The author is not aware of any other incremental OP parser that handles such operators. Furthermore, a proposal is made for modifications to the data structures produced by Attanayake's parser, along with appropriate adjustments to the incremental parser, that will in future facilitate the application of OP grammars to program code or other structured content by changing the definition of the content language.
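
    As a purely illustrative aside on why incrementality cuts response time (this is not Heeman's OP algorithm, just the general caching idea): a parser can memoize subtree results by input position and, after an edit, discard only the cache entries whose span covers the edited region, so unchanged parts of the document are never re-examined. The toy below assumes same-length edits to keep positions stable; all names are hypothetical.

        class Inc:
            def __init__(self, text):
                self.text = list(text)
                self.memo = {}                # (rule, start) -> (value, end)

            def cached(self, name, start, compute):
                if (name, start) not in self.memo:
                    self.memo[(name, start)] = compute(start)
                return self.memo[(name, start)]

            def term(self, i):                # term := digit
                return self.cached('term', i, lambda i:
                    (int(self.text[i]), i + 1) if self.text[i].isdigit()
                    else (None, i))

            def expr(self, i):                # expr := term ('+' term)*
                def compute(i):
                    val, j = self.term(i)
                    while (val is not None and j < len(self.text)
                           and self.text[j] == '+'):
                        v2, j2 = self.term(j + 1)
                        if v2 is None:
                            break
                        val, j = val + v2, j2
                    return (val, j)
                return self.cached('expr', i, compute)

            def edit(self, pos, ch):
                self.text[pos] = ch           # same-length edit: positions stable
                self.memo = {k: v for k, v in self.memo.items()
                             if not (k[1] <= pos < v[1])}   # drop covering spans

        p = Inc("1+2+3")
        print(p.expr(0))   # (6, 5)
        p.edit(4, '9')     # only ('term', 4) and ('expr', 0) are recomputed
        print(p.expr(0))   # (12, 5)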

    Formal Language Recognition with the Java Type Checker

    This paper is a theoretical study of a practical problem: the automatic generation of Java fluent APIs from their specification. We explain why the problem's core lies with the expressive power of Java generics. Our main result is that automatic generation is possible whenever the specification is a deterministic context-free language, a class which contains most "practical" languages. Other contributions include a collection of techniques and idioms for the limited meta-programming possible with Java generics, and an empirical measurement demonstrating that the runtime of Java's "javac" compiler may be exponential in the program's length, even for programs composed of a handful of lines that do not rely on overly complex use of generics.
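
    The paper's construction lives in Java generics, where nested generic types act as a stack. A loose analogue of that encoding, written here in Python type hints so a static checker such as mypy can enforce it: each parser state is a class, nesting is tracked by a generic parameter, and only call chains in the (deterministic context-free) language of balanced open/close reach done(). This illustrates the encoding idea only; it is not the paper's generator.

        from typing import Generic, TypeVar

        T = TypeVar('T')

        class Done:
            def done(self) -> str:
                return 'accepted'

        class Open(Generic[T]):                    # T remembers the outer state
            def __init__(self, outer: T) -> None:
                self._outer = outer
            def open(self) -> 'Open[Open[T]]':     # push one nesting level
                return Open(self)
            def close(self) -> T:                  # pop one nesting level
                return self._outer

        class Start:
            def open(self) -> Open[Done]:
                return Open(Done())

        print(Start().open().open().close().close().done())  # balanced: OK
        # Start().open().close().close()  # unbalanced: Done has no close(),
        #                                 # so mypy rejects the chain statically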

    Protecting Systems From Exploits Using Language-Theoretic Security

    Any computer program processing input from the user or network must validate that input. Input-handling vulnerabilities occur when the software component responsible for filtering malicious input, the parser, does not perform validation adequately. Consequently, parsers are among the most targeted components, since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar and building a recognizer for that grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers faithfully represent the data format, programmers often rely on parser generator or parser combinator tools. This thesis advances several sub-fields of LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, it comprises five parts that tackle various tenets of LangSec.

    First, I categorize input-handling vulnerabilities and exploits using two frameworks: the mismorphisms framework, which helps us reason about the root causes leading to vulnerabilities, and a categorization framework built from LangSec anti-patterns such as parser differentials and insufficient input validation. A catalog of more than 30 popular vulnerabilities demonstrates both frameworks.

    Second, I built parsers, using parser combinator libraries, for several Internet of Things and power-grid network protocols and for the iccMAX file format. The power-grid parsers were deployed and tested on substation networks as an intrusion detection tool; the iccMAX parser led to several corrections and modifications to the iccMAX specification and reference implementation.

    Third, I present SPARTA, a novel tool I built that generates Rust code to type-check Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations, and has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and to various open-source PDF tools. We also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files.

    Fourth, I present ParseSmith, a tool for building verified parsers for real-world data formats. Most parsing tools available for data formats either cannot handle practical formats or have not been verified for correctness. I built a verified parsing tool in Dafny that combines ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle constructs commonly seen in network formats, and I prove that the resulting parsers run in linear time and always terminate for well-formed grammars.

    Finally, I provide the first systematic comparison of data description languages (DDLs), which are used to describe and parse common data formats such as image formats, and of their parser generation tools. I conducted an expert-elicitation qualitative study to derive comparison metrics, and systematically compared the DDLs on the sample data descriptions shipped with each, checking for correctness and resilience.
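
    The LangSec recipe itself fits in a few lines: specify the input language as a formal grammar, and run a full recognizer over the raw input before any other code acts on it. The sketch below does this for a made-up key=value message format (a regular language, so a regex is a legitimate recognizer); the format and names are hypothetical.

        import re

        # The whole input must be in the language: no partial, best-effort parse.
        MESSAGE = re.compile(r'(?:[a-z]+=[a-zA-Z0-9]+;)+\Z')

        def handle(raw: str) -> dict:
            if not MESSAGE.match(raw):           # recognize first ...
                raise ValueError('rejected: input not in the message language')
            pairs = (item.split('=') for item in raw.rstrip(';').split(';'))
            return {k: v for k, v in pairs}      # ... only then act on it

        print(handle('user=alice;role=admin;'))
        # handle('user=alice;role=admin;%00')  -> ValueError, never processed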

    The design and implementation of Object Grammars

    An Object Grammar is a variation on traditional BNF grammars, where the notation is extended to support declarative bidirectional mappings between text and object graphs. The two directions for interpreting Object Grammars are parsing and formatting. Parsing transforms text into an object graph by recognizing syntactic features and creating the corresponding object structure. In the reverse direction, formatting recognizes object graph features and generates an appropriate textual presentation. The key to Object Grammars is the expressive power of the mapping, which decouples the syntactic structure from the graph structure. To handle graphs, Object Grammars support declarative annotations for resolving textual names that refer to arbitrary objects in the graph structure. Predicates on the semantic structure provide additional control over the mapping. Furthermore, Object Grammars are compositional, so that languages may be defined in a modular fashion. We have implemented our approach to Object Grammars as one of the foundations of the Ensō system and illustrate the utility of our approach by showing how it enables the definition and composition of domain-specific languages (DSLs).
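
    A toy, single-production illustration of the bidirectional mapping (Ensō's notation, name resolution and predicates are far richer): one declarative description, literal text interleaved with field bindings, drives both parsing and formatting.

        import re

        class Point:
            def __init__(self, x, y):
                self.x, self.y = x, y

        # One description: literal strings interleaved with (field, type) bindings.
        SYNTAX = ['(', ('x', int), ',', ('y', int), ')']

        def parse(text, cls=Point):
            fields, pos = {}, 0
            for part in SYNTAX:
                if isinstance(part, str):          # literal: must be present
                    assert text.startswith(part, pos)
                    pos += len(part)
                else:                              # binding: capture a value
                    name, typ = part
                    m = re.match(r'-?\d+', text[pos:])   # assumes valid input
                    fields[name] = typ(m.group())
                    pos += m.end()
            return cls(**fields)

        def unparse(obj):
            return ''.join(p if isinstance(p, str) else str(getattr(obj, p[0]))
                           for p in SYNTAX)

        p = parse('(3,-4)')
        print(p.x, p.y)      # 3 -4
        print(unparse(p))    # (3,-4): round-trips through the same description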

    An Abstract Machine for Unification Grammars

    This work describes the design and implementation of an abstract machine, Amalia, for the linguistic formalism ALE, which is based on typed feature structures. This formalism is one of the most widely accepted in computational linguistics and has been used for designing grammars in various linguistic theories, most notably HPSG. Amalia is composed of data structures and a set of instructions, augmented by a compiler from the grammatical formalism to the abstract instructions, and a (portable) interpreter of the abstract instructions. The effect of each instruction is defined using a low-level language that can be executed on ordinary hardware. The advantages of the abstract-machine approach are twofold. From a theoretical point of view, the abstract machine gives a well-defined operational semantics to the grammatical formalism, ensuring that grammars specified using our system are endowed with well-defined meaning. It makes it possible, for example, to formally verify the correctness of a compiler for HPSG, given an independent definition. From a practical point of view, Amalia is the first system that employs a direct compilation scheme for unification grammars based on typed feature structures, and its use results in much improved performance over existing systems. To test the machine on a realistic application, we have developed a small-scale, HPSG-based grammar for a fragment of the Hebrew language, using Amalia as the development platform. This is the first application of HPSG to a Semitic language. (Doctoral thesis, 96 pages.)
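
    The operation at the heart of such machines is unification of feature structures. A minimal untyped sketch conveys it; real typed feature structures add a type hierarchy and structure sharing, which Amalia compiles into abstract-machine instructions.

        def unify(a, b):
            # Unify two feature structures (nested dicts); None marks failure.
            if not isinstance(a, dict) or not isinstance(b, dict):
                return a if a == b else None     # atoms unify only if equal
            out = dict(a)
            for feat, val in b.items():
                if feat in out:
                    sub = unify(out[feat], val)
                    if sub is None:
                        return None              # feature clash: fail
                    out[feat] = sub
                else:
                    out[feat] = val
            return out

        np = {'cat': 'np', 'agr': {'num': 'sg'}}
        subj = {'agr': {'num': 'sg', 'per': '3'}}
        print(unify(np, subj))  # {'cat': 'np', 'agr': {'num': 'sg', 'per': '3'}}
        print(unify(np, {'agr': {'num': 'pl'}}))  # None: number clash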