158 research outputs found
Practical LR Parser Generation
Parsing is a fundamental building block in modern compilers, and for
industrial programming languages, it is a surprisingly involved task. There are
known approaches to generate parsers automatically, but the prevailing
consensus is that automatic parser generation is not practical for real
programming languages: LR/LALR parsers are considered to be far too restrictive
in the grammars they support, and LR parsers are often considered too
inefficient in practice. As a result, virtually all modern languages use
recursive-descent parsers written by hand, a lengthy and error-prone process
that dramatically increases the barrier to new programming language
development.
In this work we demonstrate that, contrary to the prevailing consensus, we
can have the best of both worlds: for a very general, practical class of
grammars -- a strict superset of Knuth's canonical LR -- we can generate
parsers automatically, and the resulting parser code, as well as the generation
procedure itself, is highly efficient. This advance relies on several new
ideas, including novel automata optimization procedures; a new grammar
transformation ("CPS"); per-symbol attributes; recursive-descent actions; and
an extension of canonical LR parsing, which we refer to as XLR, which endows
shift/reduce parsers with the power of bounded nondeterministic choice.
With these ingredients, we can automatically generate efficient parsers for
virtually all programming languages that are intuitively easy to parse -- a
claim we support experimentally, by implementing the new algorithms in a new
software tool called langcc, and running them on syntax specifications for
Golang 1.17.8 and Python 3.9.12. The tool handles both languages automatically,
and the generated code, when run on standard codebases, is 1.2x faster than the
corresponding hand-written parser for Golang, and 4.3x faster than the CPython
parser, respectively
Pattern matching in compilers
In this thesis we develop tools for effective and flexible pattern matching.
We introduce a new pattern matching system called amethyst. Amethyst is not
only a generator of parsers of programming languages, but can also serve as an
alternative to tools for matching regular expressions.
Our framework also produces dynamic parsers. Its intended use is in the
context of IDE (accurate syntax highlighting and error detection on the fly).
Amethyst offers pattern matching of general data structures. This makes it a
useful tool for implementing compiler optimizations such as constant folding,
instruction scheduling, and dataflow analysis in general.
The parsers produced are essentially top-down parsers. Linear time complexity
is obtained by introducing the novel notion of structured grammars and
regularized regular expressions. Amethyst uses techniques known from compiler
optimizations to produce effective parsers.Comment: master thesi
Happy-GLL: modular, reusable and complete top-down parsers for parameterized nonterminals
Parser generators and parser combinator libraries are the most popular tools
for producing parsers. Parser combinators use the host language to provide
reusable components in the form of higher-order functions with parsers as
parameters. Very few parser generators support this kind of reuse through
abstraction and even fewer generate parsers that are as modular and reusable as
the parts of the grammar for which they are produced. This paper presents a
strategy for generating modular, reusable and complete top-down parsers from
syntax descriptions with parameterized nonterminals, based on the FUN-GLL
variant of the GLL algorithm.
The strategy is discussed and demonstrated as a novel back-end for the Happy
parser generator. Happy grammars can contain `parameterized nonterminals' in
which parameters abstract over grammar symbols, granting an abstraction
mechanism to define reusable grammar operators. However, the existing Happy
back-ends do not deliver on the full potential of parameterized nonterminals as
parameterized nonterminals cannot be reused across grammars. Moreover, the
parser generation process may fail to terminate or may result in exponentially
large parsers generated in an exponential amount of time.
The GLL back-end presented in this paper implements parameterized
nonterminals successfully by generating higher-order functions that resemble
parser combinators, inheriting all the advantages of top-down parsing. The
back-end is capable of generating parsers for the full class of context-free
grammars, generates parsers in linear time and generates parsers that find all
derivations of the input string. To our knowledge, the presented GLL back-end
makes Happy the first parser generator that combines all these features.
This paper describes the translation procedure of the GLL back-end and
compares it to the LALR and GLR back-ends of Happy in several experiments.Comment: 15 page
Jparsec - a parser combinator for Javascript
Parser combinators have been a popular parsing approach in recent years. Compared with traditional parsers, a parser combinator has both readability and maintenance advantages.
This project aims to construct a lightweight parser construct library for Javascript called Jparsec. Based on the modular nature of a parser combinator, the implementation uses higher-order functions. JavaScript provides a friendly and simple way to use higher-order functions, so the main construction method of this project will use JavaScript\u27s lambda functions. In practical applications, a parser combinator is mainly used as a tool, such as parsing JSON files.
In order to verify the utility of parser combinators, this project uses a parser combinator to parse a partial Lua grammar. Lua is a widely used programming language, serving as a good test case for my parser combinator
A Variant of Earley Parsing
The Earley algorithm is a widely used parsing method in natural language
processing applications. We introduce a variant of Earley parsing that is based
on a ``delayed'' recognition of constituents. This allows us to start the
recognition of a constituent only in cases in which all of its subconstituents
have been found within the input string. This is particularly advantageous in
several cases in which partial analysis of a constituent cannot be completed
and in general in all cases of productions sharing some suffix of their
right-hand sides (even for different left-hand side nonterminals). Although the
two algorithms result in the same asymptotic time and space complexity, from a
practical perspective our algorithm improves the time and space requirements of
the original method, as shown by reported experimental results.Comment: 12 pages, 1 Postscript figure, uses psfig.tex and llncs.st
flap: A Deterministic Parser with Fused Lexing
Lexers and parsers are typically defined separately and connected by a token
stream. This separate definition is important for modularity and reduces the
potential for parsing ambiguity. However, materializing tokens as data
structures and case-switching on tokens comes with a cost. We show how to fuse
separately-defined lexers and parsers, drastically improving performance
without compromising modularity or increasing ambiguity. We propose a
deterministic variant of Greibach Normal Form that ensures deterministic
parsing with a single token of lookahead and makes fusion strikingly simple,
and prove that normalizing context free expressions into the deterministic
normal form is semantics-preserving. Our staged parser combinator library,
flap, provides a standard interface, but generates specialized token-free code
that runs two to six times faster than ocamlyacc on a range of benchmarks.Comment: PLDI 2023 with appendi
- …