
219 research outputs found

Left Recursion in Parsing Expression Grammars

Author: Sérgio Medeiros
Fabio Mascarenhas
Roberto Ierusalimschy
Publication venue: 'Elsevier BV'
Publication date: 13/02/2014

Parsing Expression Grammars (PEGs) are a formalism that can describe all deterministic context-free languages through a set of rules that specify a top-down parser for some language. PEGs are easy to use, and there are efficient implementations of PEG libraries in several programming languages. A frequently missed feature of PEGs is left recursion, which is commonly used in Context-Free Grammars (CFGs) to encode left-associative operations. We present a simple conservative extension to the semantics of PEGs that gives useful meaning to direct and indirect left-recursive rules, and show that our extensions make it easy to express left-recursive idioms from CFGs in PEGs, with similar results. We prove the conservativeness of these extensions, and also prove that they work with any left-recursive PEG. PEGs can also be compiled to programs in a low-level parsing machine. We present an extension to the semantics of the operations of this parsing machine that lets it interpret left-recursive PEGs, and prove that this extension is correct with regard to our semantics for left-recursive PEGs.

Comment: Extended version of the paper "Left Recursion in Parsing Expression Grammars", published at the 2012 Brazilian Symposium on Programming Languages.
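To make the left-recursion problem concrete, here is a minimal sketch of one common way to give a directly left-recursive rule a useful meaning in a top-down recognizer: seed the rule with failure, then re-evaluate it at the same position for as long as the match keeps growing. This is an illustration only, not the formal semantics or the parsing machine from the paper. The grammar `E <- E '+' D / D` with `D <- [0-9]` and the names `parse_expr`, `expr`, `expr_body`, and `digit` are assumptions chosen for the example.

```python
def parse_expr(text):
    """Match E <- E '+' D / D with D <- [0-9] at position 0.

    Returns (tree, end_position) for the longest prefix matched by E,
    or None if E does not match at all.
    """
    memo = {}  # start position -> (tree, end) or None (failure)

    def digit(pos):
        # D <- [0-9]
        if pos < len(text) and text[pos].isdigit():
            return (int(text[pos]), pos + 1)
        return None

    def expr_body(pos):
        # First alternative: E '+' D. The recursive call to expr() at the
        # same position is answered from the memo table, so it cannot loop.
        left = expr(pos)
        if left is not None:
            tree, p = left
            if p < len(text) and text[p] == "+":
                d = digit(p + 1)
                if d is not None:
                    return (("+", tree, d[0]), d[1])
        # Second alternative (ordered choice): D
        return digit(pos)

    def expr(pos):
        if pos in memo:
            return memo[pos]
        memo[pos] = None  # seed: the left-recursive call initially fails
        while True:
            result = expr_body(pos)
            # Stop when the rule fails or no longer consumes more input.
            if result is None:
                break
            if memo[pos] is not None and result[1] <= memo[pos][1]:
                break
            memo[pos] = result  # grow the seed and try again
        return memo[pos]

    return expr(0)
```

For the input `1+2+3` the sketch returns a left-nested tree, which is the left-associative reading that a CFG with the same rule would give. A plain recursive-descent version of the same rule would recurse forever on the first alternative.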

arXiv.org e-Print Archive

Adaptation of LR parsing to production system interpretation

Author: Slothouber Louis Paul
Publication venue: W&M ScholarWorks
Publication date: 01/01/1989

This thesis presents a new production system architecture, called a palimpsest parser, that adapts LR parsing technology to the process of controlled production system interpretation. Two unique characteristics of this architecture facilitate the construction and execution of large production systems: the rate at which productions fire is independent of production system size, and the modularity inherent in production systems is preserved and enhanced. In addition, individual productions may be evaluated in either a forward or backward direction, production systems can be integrated with other production systems and procedural programs, and production system modules can be compiled into libraries and used by other production systems. Controlled production systems are compiled into palimpsest parsers as follows. Initially, the palimpsest transformation is applied to all productions to transform them into context-free grammar rules with associated disambiguation predicates and semantics. This grammar and the control grammar are then concatenated and compiled into modified LR(0) parse tables using conventional parser generation techniques. The resulting parse tables, disambiguation predicates, and semantics, in conjunction with a modified LR(0) parsing algorithm, constitute a palimpsest parser. When executed, this palimpsest parser correctly interprets the original controlled production system. Moreover, on any given cycle, the palimpsest parser only attempts to instantiate those productions that are allowed to fire by the control language grammar. Tests conducted with simulated production systems have consistently exhibited firing rates in excess of 1000 productions per second on a conventional microcomputer.

College of William & Mary: W&M Publish

Neural Combinatory Constituency Parsing

Author: CHEN Zhousi
チンチュウシ
陳宙斯
Publication venue
Publication date: 25/03/2023

Tokyo Metropolitan University, Doctor of Philosophy (Information Science), doctoral thesis

Tokyo Metropolitan University Institutional Repository Miyako-Dori / 首都大学東京機関リポジトリ

Incremental parsing algorithms for speech-editing mathematics and computer code

Author: Isaac Marina Jennifer
Publication venue: Kingston University
Publication date

The provision of speech control for editing plain language text has existed for a long time, but does not extend to structured content such as mathematics. The requirements of a user interface for a spoken mathematics editor are explored through the lens of an intuitive natural user interface (NUI) for speech control, the desired properties of which are based on a combination of existing literature on NUIs and intuitive user interfaces. An important aspect of an intuitive NUI is timely update of the displayed content in response to editing actions. This is not feasible using batch parsing alone, and the issue becomes more serious for larger documents such as computer program code. The solution is an incremental parser designed to work with operator precedence (OP) grammars. The contribution to knowledge provided by this thesis is to improve the processing-time efficiency of the OP incremental parsing algorithm developed by Heeman, and to extend it to handle the distfix (mixfix) operators described by Attanayake to model brackets and mathematical functions. This is implemented successfully for the TalkMaths system and shows a greatly reduced response time compared with using batch scanning and parsing alone. The author is not aware of any other incremental OP parser that handles such operators. Furthermore, a proposal is made for modifications to the data structures produced by Attanayake's parser, along with appropriate adjustments to the incremental parser, that will in the future facilitate application of OP grammars to program code or other structured content by changing the definition of the content language.


Draco 1.3 users manual

Author: Arango Guillermo
Leite Julio C.
Neighbors James M.
Publication venue: eScholarship, University of California
Publication date: 30/10/1984

eScholarship - University of California

A parser generator system to handle complete syntax

Author: Ossher Harold Leon
Publication venue: Faculty of Science, Computer Science
Publication date: 01/01/1982

To define a language completely, it is necessary to define both its syntax and semantics. If these definitions are in a suitable form, the parser and code-generator of a compiler, respectively, can be generated from them. This thesis addresses the problem of syntax definition and automatic parser generation.

Rhodes Repository (SEALS)

Formal Language Recognition with the Java Type Checker

Author: Gil Yossi
Levy Tomer
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th European Conference on Object-Oriented Programming (ECOOP 2016)
Publication date: 01/01/2016

This paper is a theoretical study of a practical problem: the automatic generation of Java Fluent APIs from their specification. We explain why the problem's core lies with the expressive power of Java generics. Our main result is that automatic generation is possible whenever the specification is an instance of the set of deterministic context-free languages, a set which contains most "practical" languages. Other contributions include a collection of techniques and idioms of the limited meta-programming possible with Java generics, and an empirical measurement demonstrating that the runtime of the "javac" compiler of Java may be exponential in the program's length, even for programs composed of a handful of lines and which do not rely on overly complex use of generics.

Dagstuhl Research Online Publication Server

Theory and practice in the construction of efficient interpreters

Author: Chapman Nigel Paul
Publication venue
Publication date: 01/10/1980

Various characteristics of a programming language, or of the hardware on which it is to be implemented, may make interpretation a more attractive implementation technique than compilation into machine instructions. Many interpretive techniques can be employed; this thesis is mainly concerned with an efficient and flexible technique using a form of interpretive code known as indirect threaded code (ITC). An extended example of its use is given by the Setl-s implementation of Setl, a programming language based on mathematical set theory. The ITC format, in which pointers to system routines are embedded in the code, is described, along with its extension to cope with polymorphic operators. The operand formats and some of the system routines are described in detail to illustrate the effect of the language design on the interpreter. Setl must be compiled into indirect threaded code, and its elaborate syntax demands the use of a sophisticated parser. In Setl-s an LR(1) parser is implemented as a data structure which is interpreted in a way resembling that in which ITC is interpreted at runtime. Qualitative and quantitative aspects of the compiler, interpreter and system as a whole are discussed. The semantics of a language can be defined mathematically using denotational semantics. By setting up a suitable domain structure, it is possible to devise a semantic definition which embodies the essential features of ITC. This definition can be related, on the one hand, to the standard semantics of the language and, on the other, to its implementation as an ITC-based interpreter. This is done for a simple language known as X10. Finally, an indication is given of how this approach could be extended to describe Setl-s, and of the insight gained from such a description. Some possible applications of the theoretical analysis in the building of ITC-based interpreters are suggested.

Structured Learning with Inexact Search: Advances in Shift-Reduce CCG Parsing

Author: Xu Wenduan
Publication venue: University of Cambridge
Publication date: 07/12/2017

Statistical shift-reduce parsing involves the interplay of representation learning, structured learning, and inexact search. This dissertation considers approaches that tightly integrate these three elements and explores three novel models for shift-reduce CCG parsing. First, I develop a dependency model, in which the selection of shift-reduce action sequences producing a dependency structure is treated as a hidden variable; the key components of the model are a dependency oracle and a learning algorithm that integrates the dependency oracle, the structured perceptron, and beam search. Second, I present expected F-measure training and show how to derive a globally normalized RNN model, in which beam search is naturally incorporated and used in conjunction with the objective to learn shift-reduce action sequences optimized for the final evaluation metric. Finally, I describe an LSTM model that is able to construct parser state representations incrementally by following the shift-reduce syntactic derivation process; I show that expected F-measure training, which is agnostic to the underlying neural network, can be applied in this setting to obtain globally normalized greedy and beam-search LSTM shift-reduce parsers.

The Carnegie Trust for the Universities of Scotland; The Cambridge Trust

Protecting Systems From Exploits Using Language-Theoretic Security

Author: Anantharaman Prashant
Publication venue: Dartmouth Digital Commons
Publication date: 03/05/2022

Any computer program processing input from the user or network must validate the input. Input-handling vulnerabilities occur in programs when the software component responsible for filtering malicious input (the parser) does not perform validation adequately. Consequently, parsers are among the most targeted components since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers represent the data format, programmers often rely on parser generators or parser combinator tools to build the parsers. This thesis propels several sub-fields in LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, this thesis comprises five parts that tackle various tenets of LangSec. First, I categorize various input-handling vulnerabilities and exploits using two frameworks. First, I use the mismorphisms framework to reason about vulnerabilities. This framework helps us reason about the root causes leading to various vulnerabilities. Next, we built a categorization framework using various LangSec anti-patterns, such as parser differentials and insufficient input validation. Finally, we built a catalog of more than 30 popular vulnerabilities to demonstrate the categorization frameworks. Second, I built parsers for various Internet of Things and power grid network protocols and the iccMAX file format using parser combinator libraries. The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specifications and reference implementations. Third, I present SPARTA, a novel tool I built that generates Rust code that type checks Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files. Fourth, I present ParseSmith, a tool to build verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for their correctness. I built a verified parsing tool in Dafny that builds on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle various constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars. Finally, I provide the earliest systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. Next, I conducted an expert elicitation qualitative study to derive various metrics that I use to compare the DDLs. I also systematically compare these DDLs based on sample data descriptions available with the DDLs, checking for correctness and resilience.
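The LangSec idea of recognizing the whole input against a grammar before any other code acts on it can be sketched in a few lines. The example below is only an illustration under stated assumptions: the message format `message <- length ':' payload` (a decimal length of at most four digits followed by exactly that many payload bytes) is hypothetical, and the names `recognize_message` and `handle` are invented for the sketch. It does not reproduce any tool from the thesis.

```python
def recognize_message(data: bytes):
    """Accept exactly: message <- length ':' payload (hypothetical format).

    length  is 1 to 4 ASCII decimal digits,
    payload is exactly `length` bytes, with nothing trailing.
    Returns the payload on success, or None if the input is not in the language.
    """
    sep = data.find(b":")
    if sep <= 0:
        return None  # no separator, or an empty length field
    length_field = data[:sep]
    if len(length_field) > 4 or not length_field.isdigit():
        return None  # length field is too long or not purely decimal digits
    payload = data[sep + 1:]
    if len(payload) != int(length_field):
        return None  # data-dependent check: declared and actual lengths differ
    return payload


def handle(data: bytes) -> str:
    # The rest of the program only ever sees input the recognizer accepted.
    payload = recognize_message(data)
    if payload is None:
        return "rejected"
    return "accepted: " + payload.decode("ascii", errors="replace")
```

The length check is a small instance of a data-dependent constraint: whether the payload is valid depends on a value parsed earlier in the same input. Rejecting on any mismatch, including trailing bytes, keeps the recognizer from silently accepting inputs outside the specified format.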