219 research outputs found

    Left Recursion in Parsing Expression Grammars

    Full text link
    Parsing Expression Grammars (PEGs) are a formalism that can describe all deterministic context-free languages through a set of rules that specify a top-down parser for some language. PEGs are easy to use, and there are efficient implementations of PEG libraries in several programming languages. A frequently missed feature of PEGs is left recursion, which is commonly used in Context-Free Grammars (CFGs) to encode left-associative operations. We present a simple conservative extension to the semantics of PEGs that gives useful meaning to direct and indirect left-recursive rules, and show that our extensions make it easy to express left-recursive idioms from CFGs in PEGs, with similar results. We prove the conservativeness of these extensions, and also prove that they work with any left-recursive PEG. PEGs can also be compiled to programs in a low-level parsing machine. We present an extension to the semantics of the operations of this parsing machine that let it interpret left-recursive PEGs, and prove that this extension is correct with regards to our semantics for left-recursive PEGs.Comment: Extended version of the paper "Left Recursion in Parsing Expression Grammars", that was published on 2012 Brazilian Symposium on Programming Language

    Adaptation of LR parsing to production system interpretation

    Get PDF
    This thesis presents such a new production system architecture, called a palimpsest parser, that adapts LR parsing technology to the process of controlled production system interpretation. Two unique characteristics of this architecture facilitate the construction and execution of large production systems: the rate at which productions fire is independent of production system size, and the modularity inherent in production systems is preserved and enhanced. In addition, individual productions may be evaluated in either a forward or backward direction, production systems can be integrated with other production systems and procedural programs, and production system modules can be compiled into libraries and used by other production systems.;Controlled production systems are compiled into palimpsest parsers as follows. Initially, the palimpsest transformation is applied to all productions to transform them into context-free grammar rules with associated disambiguation predicates and semantics. This grammar and the control grammar are then concatenated and compiled into modified LR(0) parse tables using conventional parser generation techniques. the resulting parse tables, disambiguation predicates, and semantics, in conjunction with a modified LR(0) parsing algorithm, constitute a palimpsest parser. When executed, this palimpsest parser correctly interprets the original controlled production system. Moreover, on any given cycle, the palimpsest parser only attempts to instantiate those productions that are allowed to fire by the control language grammar. Tests conducted with simulated production systems have consistently exhibited firing rates in excess of 1000 productions per second on a conventional microcomputer

    Neural Combinatory Constituency Parsing

    Get PDF
    東京都立大学Tokyo Metropolitan University博士(情報科学)doctoral thesi

    Incremental parsing algorithms for speech-editing mathematics and computer code

    Get PDF
    The provision of speech control for editing plain language text has existed for a long time, but does not extend to structured content such as mathematics. The requirements of a user interface for a spoken mathematics editor are explored through the lens of an intuitive natural user interface (NUI) for speech control, the desired properties of which are based on a combination of existing literature on NUIs and intuitive user interfaces. An important aspect of an intuitive NUI is timely update of display of the content in response to editing actions. This is not feasible using batch parsing alone, and this issue will be more serious for larger documents such as computer program code. The solution is an incremental parser designed to work with operator precedence (OP) grammars. The contribution to knowledge provided by this thesis is to improve the efficiency in terms of processing time, of the OP incremental parsing algorithm developed by Heeman, and extend it to handle the distfix (mixfix) operators described by Attanayake to model brackets and mathematical functions. This is implemented successfully for the TalkMaths system and shows a greatly reduced response time compared with using batch scanning and parsing alone. The author is not aware of any other incremental OP parser that handles such operators. Furthermore, a proposal is made for modifications to the data structures produced by Attanayake's parser, along with appropriate adjustments to the incremental parser, that will in the future, facilitate application of OP grammar to program code or other structured content by changing the definition of its content language

    A parser generator system to handle complete syntax

    Get PDF
    To define a language completely, it is necessary to define both its syntax and semantics. If these definitions are in a suitable form, the parser and code-generator of a compiler, respectively, can be generated from them. This thesis addresses the problem of syntax definition and automatic parser generation

    Formal Language Recognition with the Java Type Checker

    Get PDF
    This paper is a theoretical study of a practical problem: the automatic generation of Java Fluent APIs from their specification. We explain why the problem\u27s core lies with the expressive power of Java generics. Our main result is that automatic generation is possible whenever the specification is an instance of the set of deterministic context-free languages, a set which contains most "practical" languages. Other contributions include a collection of techniques and idioms of the limited meta-programming possible with Java generics, and an empirical measurement demonstrating that the runtime of the "javac" compiler of Java may be exponential in the program\u27s length, even for programs composed of a handful of lines and which do not rely on overly complex use of generics

    Theory and practice in the construction of efficient interpreters

    Get PDF
    Various characteristics of a programming language, or of the hardware on which it is to be implemented, may make interpretation a more attractive implementation technique than compilation into machine instructions. Many interpretive techniques can be employed; this thesis is mainly concerned with an efficient and flexible technique using a form of interpretive code known as indirect threaded code (ITC). An extended example of its use is given by the Setl-s implementation of Setl, a programming language based on mathematical set theory. The ITC format, in which pointers to system routines are embedded in the code, is described and its extension to cope with polymorphic operators. The operand formats and some of the system routines are described in detail to illustrate the effect of the language design on the interpreter. Setl must be compiled into indirect threaded code and its elaborate syntax demands the use of a sophisticated parser. In Setl-s an LR(1) parser is implemented as a data structure which is interpreted in a way resembling that in which ITC is interpreted at runtime. Qualitative and quantitative aspects of the compiler, interpreter and system as a whole are discussed. The semantics of a language can be defined mathematically using denotational semantics. By setting up a suitable domain structure, it is possible to devise a semantic definition which embodies the essential features of ITC. This definition can be related, on the one hand to the standard semantics of the language, and on the other to its implementation as an ITC-based interpreter. This is done for a simple language known as X10. Finally, an indication is given of how this approach could be extended to describe Setl-s, and of the insight gained from such a description. Some possible applications of the theoretical analysis in the building of ITC-based interpreters are suggested

    Structured Learning with Inexact Search: Advances in Shift-Reduce CCG Parsing

    Get PDF
    Statistical shift-reduce parsing involves the interplay of representation learning, structured learning, and inexact search. This dissertation considers approaches that tightly integrate these three elements and explores three novel models for shift-reduce CCG parsing. First, I develop a dependency model, in which the selection of shift-reduce action sequences producing a dependency structure is treated as a hidden variable; the key components of the model are a dependency oracle and a learning algorithm that integrates the dependency oracle, the structured perceptron, and beam search. Second, I present expected F-measure training and show how to derive a globally normalized RNN model, in which beam search is naturally incorporated and used in conjunction with the objective to learn shift-reduce action sequences optimized for the final evaluation metric. Finally, I describe an LSTM model that is able to construct parser state representations incrementally by following the shift-reduce syntactic derivation process; I show expected F-measure training, which is agnostic to the underlying neural network, can be applied in this setting to obtain globally normalized greedy and beam-search LSTM shift-reduce parsers.The Carnegie Trust for the Universities of Scotland; The Cambridge Trus

    Protecting Systems From Exploits Using Language-Theoretic Security

    Get PDF
    Any computer program processing input from the user or network must validate the input. Input-handling vulnerabilities occur in programs when the software component responsible for filtering malicious input---the parser---does not perform validation adequately. Consequently, parsers are among the most targeted components since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers represent the data format, programmers often rely on parser generators or parser combinators tools to build the parsers. This thesis propels several sub-fields in LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, this thesis comprises five parts that tackle various tenets of LangSec. First, I categorize various input-handling vulnerabilities and exploits using two frameworks. First, I use the mismorphisms framework to reason about vulnerabilities. This framework helps us reason about the root causes leading to various vulnerabilities. Next, we built a categorization framework using various LangSec anti-patterns, such as parser differentials and insufficient input validation. Finally, we built a catalog of more than 30 popular vulnerabilities to demonstrate the categorization frameworks. Second, I built parsers for various Internet of Things and power grid network protocols and the iccMAX file format using parser combinator libraries. The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specifications and reference implementations. Third, I present SPARTA, a novel tool I built that generates Rust code that type checks Portable Data Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files. Fourth, I present ParseSmith, a tool to build verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for their correctness. I built a verified parsing tool in Dafny that builds on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle various constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars. Finally, I provide the earliest systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. Next, I conducted an expert elicitation qualitative study to derive various metrics that I use to compare the DDLs. I also systematically compare these DDLs based on sample data descriptions available with the DDLs---checking for correctness and resilience
    corecore