
    Practical LR Parser Generation

    Parsing is a fundamental building block in modern compilers, and for industrial programming languages, it is a surprisingly involved task. There are known approaches to generate parsers automatically, but the prevailing consensus is that automatic parser generation is not practical for real programming languages: LR/LALR parsers are considered far too restrictive in the grammars they support, and LR parsers are often considered too inefficient in practice. As a result, virtually all modern languages use recursive-descent parsers written by hand, a lengthy and error-prone process that dramatically increases the barrier to new programming language development. In this work we demonstrate that, contrary to the prevailing consensus, we can have the best of both worlds: for a very general, practical class of grammars -- a strict superset of Knuth's canonical LR -- we can generate parsers automatically, and the resulting parser code, as well as the generation procedure itself, is highly efficient. This advance relies on several new ideas, including novel automata optimization procedures; a new grammar transformation ("CPS"); per-symbol attributes; recursive-descent actions; and an extension of canonical LR parsing, called XLR, which endows shift/reduce parsers with the power of bounded nondeterministic choice. With these ingredients, we can automatically generate efficient parsers for virtually all programming languages that are intuitively easy to parse -- a claim we support experimentally by implementing the new algorithms in a new software tool called langcc, and running them on syntax specifications for Golang 1.17.8 and Python 3.9.12. The tool handles both languages automatically, and the generated code, when run on standard codebases, is 1.2x faster than the corresponding hand-written parser for Golang, and 4.3x faster than the CPython parser.
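
    langcc itself and its XLR construction are not reproduced here, but the shift/reduce skeleton that every LR-family parser executes is easy to sketch. The following OCaml sketch drives a hand-written action/goto table for the toy grammar S -> S '+' 'a' | 'a'; the table, grammar, and names are illustrative assumptions, not langcc output, and XLR's bounded nondeterministic choice is not modeled.

```ocaml
(* Minimal shift/reduce (LR) driver: a stack of automaton states, an
   action table indexed by (state, lookahead), and a goto table for
   nonterminals. Toy grammar: S -> S '+' 'a' | 'a'; '$' marks end of input. *)

type action =
  | Shift of int            (* push a state, consume the lookahead *)
  | Reduce of int * string  (* pop |rhs| states, then goto on the lhs *)
  | Accept
  | Error

let action state tok =
  match (state, tok) with
  | 0, 'a' -> Shift 2
  | 1, '+' -> Shift 3
  | 1, '$' -> Accept
  | 2, ('+' | '$') -> Reduce (1, "S")   (* S -> 'a' *)
  | 3, 'a' -> Shift 4
  | 4, ('+' | '$') -> Reduce (3, "S")   (* S -> S '+' 'a' *)
  | _ -> Error

let goto state nt =
  match (state, nt) with
  | 0, "S" -> 1
  | _ -> failwith "no goto entry"

let parse (tokens : char list) =
  let rec drop k l = if k = 0 then l else drop (k - 1) (List.tl l) in
  let rec loop stack input =
    match action (List.hd stack) (List.hd input) with
    | Shift s -> loop (s :: stack) (List.tl input)
    | Reduce (n, lhs) ->
        (* Pop n states; the exposed state and the lhs select the goto. *)
        let stack' = drop n stack in
        loop (goto (List.hd stack') lhs :: stack') input
    | Accept -> true
    | Error -> false
  in
  loop [ 0 ] (tokens @ [ '$' ])

let () =
  assert (parse [ 'a'; '+'; 'a' ]);
  assert (not (parse [ '+'; 'a' ]));
  print_endline "ok"
```

    A generated parser differs mainly in how the tables are computed; per the abstract, XLR additionally lets this loop make bounded nondeterministic choices where a single lookahead token does not determine the action.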

    Pattern matching in compilers

    In this thesis we develop tools for effective and flexible pattern matching. We introduce a new pattern matching system called Amethyst. Amethyst is not only a generator of parsers for programming languages, but can also serve as an alternative to tools for matching regular expressions. Our framework also produces dynamic parsers, intended for use in IDEs (accurate syntax highlighting and on-the-fly error detection). Amethyst offers pattern matching on general data structures, which makes it a useful tool for implementing compiler optimizations such as constant folding, instruction scheduling, and dataflow analysis in general. The parsers produced are essentially top-down parsers. Linear time complexity is obtained by introducing the novel notions of structured grammars and regularized regular expressions. Amethyst uses techniques known from compiler optimizations to produce effective parsers.
    Comment: master thesis
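
    As a concrete instance of the compiler-optimization use the abstract mentions, here is a small OCaml sketch (not Amethyst code) of constant folding implemented by pattern matching over an expression AST; the type and the rewrite rules are invented for illustration.

```ocaml
(* Constant folding as pattern matching over an AST: fold children
   bottom-up, then match on the folded results. *)

type expr =
  | Const of int
  | Add of expr * expr
  | Mul of expr * expr

let rec fold e =
  match e with
  | Const _ -> e
  | Add (a, b) -> (
      match (fold a, fold b) with
      | Const x, Const y -> Const (x + y)      (* 2 + 3  ~>  5 *)
      | Const 0, e' | e', Const 0 -> e'        (* x + 0  ~>  x *)
      | a', b' -> Add (a', b'))
  | Mul (a, b) -> (
      match (fold a, fold b) with
      | Const x, Const y -> Const (x * y)
      | Const 1, e' | e', Const 1 -> e'        (* x * 1  ~>  x *)
      | a', b' -> Mul (a', b'))

let () =
  match fold (Add (Const 2, Mul (Const 3, Const 4))) with
  | Const 14 -> print_endline "folded to Const 14"
  | _ -> print_endline "unexpected result"
```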

    Happy-GLL: modular, reusable and complete top-down parsers for parameterized nonterminals

    Parser generators and parser combinator libraries are the most popular tools for producing parsers. Parser combinators use the host language to provide reusable components in the form of higher-order functions with parsers as parameters. Very few parser generators support this kind of reuse through abstraction, and even fewer generate parsers that are as modular and reusable as the parts of the grammar for which they are produced. This paper presents a strategy for generating modular, reusable and complete top-down parsers from syntax descriptions with parameterized nonterminals, based on the FUN-GLL variant of the GLL algorithm. The strategy is discussed and demonstrated as a novel back-end for the Happy parser generator. Happy grammars can contain 'parameterized nonterminals' in which parameters abstract over grammar symbols, granting an abstraction mechanism to define reusable grammar operators. However, the existing Happy back-ends do not deliver on the full potential of parameterized nonterminals, as parameterized nonterminals cannot be reused across grammars. Moreover, the parser generation process may fail to terminate or may result in exponentially large parsers generated in an exponential amount of time. The GLL back-end presented in this paper implements parameterized nonterminals successfully by generating higher-order functions that resemble parser combinators, inheriting all the advantages of top-down parsing. The back-end is capable of generating parsers for the full class of context-free grammars, generates parsers in linear time, and generates parsers that find all derivations of the input string. To our knowledge, the presented GLL back-end makes Happy the first parser generator that combines all these features. This paper describes the translation procedure of the GLL back-end and compares it to the LALR and GLR back-ends of Happy in several experiments.
    Comment: 15 pages
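
    Happy and its GLL back-end are Haskell tools; the OCaml sketch below (OCaml is used for all examples on this page) only illustrates the central idea that a parameterized nonterminal such as sepBy(p, sep) can be compiled to a higher-order function over parsers. The parser type and primitives are hypothetical.

```ocaml
(* A parser consumes a list of characters and returns a result plus the
   remaining input, or None on failure. *)
type 'a parser = char list -> ('a * char list) option

let sym (c : char) : char parser = function
  | x :: rest when x = c -> Some (c, rest)
  | _ -> None

(* The "parameterized nonterminal": a grammar operator abstracted over the
   element parser p and the separator parser sep. *)
let rec sep_by (p : 'a parser) (sep : 'b parser) : 'a list parser =
 fun input ->
  match p input with
  | None -> Some ([], input)                       (* zero occurrences *)
  | Some (x, rest) -> (
      match sep rest with
      | None -> Some ([ x ], rest)
      | Some (_, rest') -> (
          match sep_by p sep rest' with
          | Some ([], _) -> Some ([ x ], rest)     (* trailing separator: stop before it *)
          | Some (xs, rest'') -> Some (x :: xs, rest'')
          | None -> Some ([ x ], rest)))

let () =
  match sep_by (sym 'a') (sym ',') [ 'a'; ','; 'a'; ','; 'a' ] with
  | Some (xs, []) -> Printf.printf "parsed %d elements\n" (List.length xs)
  | _ -> print_endline "parse failed"
```

    Because sep_by is an ordinary function value, it can be reused across grammars, which is the kind of reuse the paper says the existing LALR and GLR back-ends cannot provide for parameterized nonterminals.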

    Jparsec - a parser combinator for Javascript

    Parser combinators have been a popular parsing approach in recent years. Compared with traditional parsers, a parser combinator has advantages in both readability and maintenance. This project constructs a lightweight parser combinator library for JavaScript called Jparsec. Reflecting the modular nature of parser combinators, the implementation is built from higher-order functions; JavaScript provides a friendly and simple way to use higher-order functions, so the main construction method of this project is JavaScript's lambda functions. In practical applications, a parser combinator is mainly used as a tool, for example to parse JSON files. To verify the utility of parser combinators, this project uses the library to parse a partial Lua grammar. Lua is a widely used programming language, serving as a good test case for my parser combinator.
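
    Jparsec itself is written in JavaScript; the sketch below (in OCaml, like the other examples here) shows the combinator style the abstract describes: primitives are ordinary values, and sequencing and choice are higher-order functions over them. All names are made up for the example.

```ocaml
type 'a p = char list -> ('a * char list) option

(* Primitive: match one expected character. *)
let token (c : char) : char p = function
  | x :: rest when x = c -> Some (c, rest)
  | _ -> None

(* Sequence two parsers, pairing their results. *)
let seq (p : 'a p) (q : 'b p) : ('a * 'b) p =
 fun input ->
  match p input with
  | Some (a, rest) -> (
      match q rest with
      | Some (b, rest') -> Some ((a, b), rest')
      | None -> None)
  | None -> None

(* Ordered choice: try p; on failure, retry q on the same input. *)
let alt (p : 'a p) (q : 'a p) : 'a p =
 fun input -> match p input with Some _ as r -> r | None -> q input

let () =
  let ab_or_ac = alt (seq (token 'a') (token 'b')) (seq (token 'a') (token 'c')) in
  match ab_or_ac [ 'a'; 'c' ] with
  | Some ((x, y), []) -> Printf.printf "matched %c%c\n" x y
  | _ -> print_endline "no match"
```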

    A Variant of Earley Parsing

    The Earley algorithm is a widely used parsing method in natural language processing applications. We introduce a variant of Earley parsing that is based on a "delayed" recognition of constituents. This allows us to start the recognition of a constituent only when all of its subconstituents have been found within the input string. This is particularly advantageous in cases in which partial analysis of a constituent cannot be completed, and in general whenever productions share some suffix of their right-hand sides (even for different left-hand side nonterminals). Although the two algorithms have the same asymptotic time and space complexity, from a practical perspective our algorithm improves on the time and space requirements of the original method, as shown by the reported experimental results.
    Comment: 12 pages, 1 PostScript figure, uses psfig.tex and llncs.sty
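
    For orientation, here is a compact OCaml sketch of the classical Earley recognizer that the paper's variant modifies; the delayed-recognition scheme itself is not reproduced. The grammar is a toy (S -> S '+' S | 'a', with no nullable rules), and the list-based chart is chosen for brevity rather than speed.

```ocaml
type symbol = T of char | N of string
type rule = { lhs : string; rhs : symbol list }

(* An Earley item: a rule, a dot position in its rhs, and the input
   position (origin) where recognition of the rule started. *)
type item = { rule : rule; dot : int; origin : int }

let grammar =
  [ { lhs = "S"; rhs = [ N "S"; T '+'; N "S" ] };
    { lhs = "S"; rhs = [ T 'a' ] } ]

let next_sym it =
  if it.dot < List.length it.rule.rhs then Some (List.nth it.rule.rhs it.dot)
  else None

let recognize input =
  let n = String.length input in
  let chart = Array.make (n + 1) [] in   (* chart.(i): items ending at position i *)
  let add i it = if not (List.mem it chart.(i)) then chart.(i) <- chart.(i) @ [ it ] in
  List.iter (fun r -> if r.lhs = "S" then add 0 { rule = r; dot = 0; origin = 0 }) grammar;
  for i = 0 to n do
    let j = ref 0 in
    while !j < List.length chart.(i) do
      let it = List.nth chart.(i) !j in
      (match next_sym it with
       | Some (N nt) ->
           (* Predict: start every rule for the nonterminal after the dot. *)
           List.iter (fun r -> if r.lhs = nt then add i { rule = r; dot = 0; origin = i }) grammar
       | Some (T c) ->
           (* Scan: advance the dot over a matching input character. *)
           if i < n && input.[i] = c then add (i + 1) { it with dot = it.dot + 1 }
       | None ->
           (* Complete: advance every item that was waiting on this lhs. *)
           List.iter
             (fun w ->
                match next_sym w with
                | Some (N nt) when nt = it.rule.lhs -> add i { w with dot = w.dot + 1 }
                | _ -> ())
             chart.(it.origin));
      incr j
    done
  done;
  List.exists (fun it -> it.rule.lhs = "S" && next_sym it = None && it.origin = 0) chart.(n)

let () =
  assert (recognize "a+a");
  assert (recognize "a");
  assert (not (recognize "a+"));
  print_endline "ok"
```

    The delayed variant changes when items for a constituent are introduced; in this classical formulation, prediction eagerly creates an item for every candidate rule, even when its subconstituents never materialize.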

    flap: A Deterministic Parser with Fused Lexing

    Lexers and parsers are typically defined separately and connected by a token stream. This separate definition is important for modularity and reduces the potential for parsing ambiguity. However, materializing tokens as data structures and case-switching on tokens comes at a cost. We show how to fuse separately-defined lexers and parsers, drastically improving performance without compromising modularity or increasing ambiguity. We propose a deterministic variant of Greibach Normal Form that ensures deterministic parsing with a single token of lookahead and makes fusion strikingly simple, and we prove that normalizing context-free expressions into the deterministic normal form is semantics-preserving. Our staged parser combinator library, flap, provides a standard interface, but generates specialized token-free code that runs two to six times faster than ocamlyacc on a range of benchmarks.
    Comment: PLDI 2023, with appendix
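
    flap's staged code generation is not shown here; the OCaml sketch below only illustrates what fusion removes. The token-based recognizer materializes a token list and case-switches on it, while the fused recognizer runs the same grammar directly over characters, with whitespace skipping inlined and no token values ever constructed. The s-expression grammar and all names are illustrative.

```ocaml
(* sexp ::= atom | '(' sexp* ')' *)
type token = LParen | RParen | Atom

(* Separately-defined lexer: tokens are materialized as data. *)
let lex (s : string) : token list =
  List.filter_map
    (function '(' -> Some LParen | ')' -> Some RParen | ' ' -> None | _ -> Some Atom)
    (List.init (String.length s) (String.get s))

(* Token-based recognizer: returns the unconsumed tokens on success. *)
let rec sexp = function
  | Atom :: rest -> Some rest
  | LParen :: rest -> body rest
  | _ -> None

and body = function
  | RParen :: rest -> Some rest
  | toks -> (match sexp toks with Some rest -> body rest | None -> None)

(* Fused recognizer: the same automaton driven by characters; lexing
   decisions (including whitespace) are folded into the parser's branches. *)
let rec fsexp s i =
  if i >= String.length s then None
  else
    match s.[i] with
    | ' ' -> fsexp s (i + 1)
    | '(' -> fbody s (i + 1)
    | ')' -> None
    | _ -> Some (i + 1)

and fbody s i =
  if i >= String.length s then None
  else
    match s.[i] with
    | ' ' -> fbody s (i + 1)
    | ')' -> Some (i + 1)
    | _ -> (match fsexp s i with Some j -> fbody s j | None -> None)

let () =
  assert (sexp (lex "(a (b c))") = Some []);
  assert (fsexp "(a (b c))" 0 = Some 9);
  print_endline "ok"
```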