An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities
We describe an extension of Earley's parser for stochastic context-free
grammars that computes the following quantities given a stochastic context-free
grammar and an input string: a) probabilities of successive prefixes being
generated by the grammar; b) probabilities of substrings being generated by the
nonterminals, including the entire string being generated by the grammar; c)
most likely (Viterbi) parse of the string; d) posterior expected number of
applications of each grammar production, as required for reestimating rule
probabilities. (a) and (b) are computed incrementally in a single left-to-right
pass over the input. Our algorithm compares favorably to standard bottom-up
parsing methods for SCFGs in that it works efficiently on sparse grammars by
making use of Earley's top-down control structure. It can process any
context-free rule format without conversion to some normal form, and combines
computations for (a) through (d) in a single algorithm. Finally, the algorithm
has simple extensions for processing partially bracketed inputs, and for
finding partial parses and their likelihoods on ungrammatical inputs.
Comment: 45 pages. Slightly shortened version to appear in Computational Linguistics 2
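To make quantity (a) concrete, the following sketch computes a prefix probability by brute force for a toy grammar. It is not the Earley-based algorithm from the abstract; the grammar S -> 'a' S [p] | 'b' [1 - p], the value p = 0.6, and the helper names are all assumptions chosen for illustration.

```python
def string_prob(s, p=0.6):
    """P(S =>* s) for the toy grammar S -> 'a' S [p] | 'b' [1 - p]."""
    # The grammar generates exactly the strings a^n b, each with probability p^n (1 - p).
    if s.endswith("b") and set(s[:-1]) <= {"a"}:
        return p ** (len(s) - 1) * (1 - p)
    return 0.0

def prefix_prob(prefix, p=0.6, max_len=200):
    """Sum P(s) over generated strings s that start with prefix (series truncated at max_len)."""
    total = 0.0
    for n in range(max_len):
        s = "a" * n + "b"
        if s.startswith(prefix):
            total += string_prob(s, p)
    return total

# For this grammar the prefix probability of a^k is p^k, so "aa" gives about 0.36.
print(prefix_prob("aa"))
# "ab" is a complete string and no longer string extends it, so its
# prefix probability equals its string probability, about 0.24.
print(prefix_prob("ab"))
```

Enumeration only works here because the toy language is trivially listable; the point of an incremental parser is to obtain the same quantity without enumerating strings.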
On the Relation between Context-Free Grammars and Parsing Expression Grammars
Context-Free Grammars (CFGs) and Parsing Expression Grammars (PEGs) have
several similarities and a few differences in both their syntax and semantics,
but they are usually presented through formalisms that hinder a proper
comparison. In this paper we present a new formalism for CFGs that highlights
the similarities and differences between them. The new formalism borrows from
PEGs the use of parsing expressions and the recognition-based semantics. We
show how one way of removing non-determinism from this formalism yields a
formalism with the semantics of PEGs. We also prove, based on these new
formalisms, how LL(1) grammars define the same language whether interpreted as
CFGs or as PEGs, and also show how strong-LL(k), right-linear, and LL-regular
grammars have simple language-preserving translations from CFGs to PEGs
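The difference between the two semantics can be sketched with one two-alternative rule read both ways. This is an illustrative toy, not the formalism of the paper: the function names are hypothetical, and acceptance is taken here to mean that the whole input is consumed.

```python
def lit(s):
    """Match the literal s; return the set of possible end positions."""
    def parse(text, pos):
        return {pos + len(s)} if text.startswith(s, pos) else set()
    return parse

def cfg_choice(*alts):
    """Unordered (CFG-style) choice: keep the end positions of all alternatives."""
    def parse(text, pos):
        ends = set()
        for alt in alts:
            ends |= alt(text, pos)
        return ends
    return parse

def peg_choice(*alts):
    """Ordered (PEG-style) choice: commit to the first alternative that succeeds."""
    def parse(text, pos):
        for alt in alts:
            ends = alt(text, pos)
            if ends:
                return ends
        return set()
    return parse

def accepts(parser, text):
    """Accept when some parse consumes the whole input (an assumption of this sketch)."""
    return len(text) in parser(text, 0)

# The rule S -> "a" | "ab", interpreted as a CFG and as a PEG.
cfg_s = cfg_choice(lit("a"), lit("ab"))
peg_s = peg_choice(lit("a"), lit("ab"))

print(accepts(cfg_s, "ab"))  # True
print(accepts(peg_s, "ab"))  # False: the first alternative "a" wins and "b" is left over
```

The rule used here is not LL(1), since both alternatives start with the same symbol, which is why the two readings can disagree.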
Pattern matching in compilers
In this thesis we develop tools for effective and flexible pattern matching.
We introduce a new pattern matching system called amethyst. Amethyst is not
only a generator of parsers of programming languages, but can also serve as an
alternative to tools for matching regular expressions.
Our framework also produces dynamic parsers. Its intended use is in the
context of IDE (accurate syntax highlighting and error detection on the fly).
Amethyst offers pattern matching of general data structures. This makes it a
useful tool for implementing compiler optimizations such as constant folding,
instruction scheduling, and dataflow analysis in general.
The parsers produced are essentially top-down parsers. Linear time complexity
is obtained by introducing the novel notion of structured grammars and
regularized regular expressions. Amethyst uses techniques known from compiler
optimizations to produce effective parsers.
Comment: master thesis
Jparsec - a parser combinator for Javascript
Parser combinators have been a popular parsing approach in recent years. Compared with traditional parsers, a parser combinator has both readability and maintenance advantages.
This project aims to construct a lightweight parser construction library for JavaScript called Jparsec. Because parser combinators are modular by nature, the implementation is built on higher-order functions. JavaScript makes higher-order functions simple to use, so the library is constructed mainly from JavaScript's lambda functions. In practical applications, a parser combinator is mainly used as a tool, for example to parse JSON files.
In order to verify the utility of parser combinators, this project uses a parser combinator to parse a partial Lua grammar. Lua is a widely used programming language, serving as a good test case for my parser combinator
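The idea of building parsers from higher-order functions can be sketched as follows. This is a minimal illustration written in Python rather than JavaScript, and it is not Jparsec's API: the names `char`, `many1`, `seq`, and `fmap` are hypothetical.

```python
# A parser is a function (text, pos) -> (value, new_pos), or None on failure.

def char(pred):
    """Parser that consumes one character satisfying pred."""
    def parse(text, pos):
        if pos < len(text) and pred(text[pos]):
            return text[pos], pos + 1
        return None
    return parse

def many1(p):
    """Apply p one or more times and collect the results in a list."""
    def parse(text, pos):
        values = []
        result = p(text, pos)
        while result is not None:
            value, pos = result
            values.append(value)
            result = p(text, pos)
        return (values, pos) if values else None
    return parse

def seq(*parsers):
    """Run the parsers one after another; fail if any of them fails."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def fmap(f, p):
    """Transform the value produced by p with the function f."""
    def parse(text, pos):
        result = p(text, pos)
        if result is None:
            return None
        value, new_pos = result
        return f(value), new_pos
    return parse

# Small parsers are combined into larger ones.
number = fmap(lambda digits: int("".join(digits)), many1(char(str.isdigit)))
plus = char(lambda c: c == "+")
addition = fmap(lambda parts: parts[0] + parts[2], seq(number, plus, number))

print(addition("12+30", 0))  # (42, 5)
```

Each combinator returns a new parser, so a grammar rule maps directly onto a function composition.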
LLLR Parsing: a Combination of LL and LR Parsing
A new parsing method called LLLR parsing is defined and a method for producing LLLR parsers is described. An LLLR parser uses an LL parser as its backbone and parses as much of its input string using LL parsing as possible. To resolve LL conflicts it triggers small embedded LR parsers. An embedded LR parser starts parsing the remaining input and once the LL conflict is resolved, the LR parser produces the left parse of the substring it has just parsed and passes the control back to the backbone LL parser. The LLLR(k) parser can be constructed for any LR(k) grammar. It produces the left parse of the input string without any backtracking and, if used for a syntax-directed translation, it evaluates semantic actions using the top-down strategy just like the canonical LL(k) parser. An LLLR(k) parser is appropriate for grammars where the LL(k) conflicting nonterminals either appear relatively close to the bottom of the derivation trees or produce short substrings. In such cases an LLLR parser can perform significantly better error recovery than an LR parser since most of the input string is parsed with the backbone LL parser. LLLR parsing is similar to LL(*) parsing except that it (a) uses LR(k) parsers instead of finite automata to resolve the LL(k) conflicts and (b) does not perform any backtracking
Syntactic analysis of LR(k) languages
PhD Thesis
A method of syntactic analysis, termed LA(m)LR(k), is discussed
theoretically. Knuth's LR(k) algorithm is included as the special
case m = k. A simpler variant, SLA(m)LR(k) is also described, which
in the case SLA(k)LR(0) is equivalent to the SLR(k) algorithm as
defined by DeRemer. Both variants have the LR(k) property of
immediate detection of syntactic errors.
The case m = 1, k = 0 is examined in detail, when the methods
provide a practical parsing technique of greater generality than
precedence methods in current use. A formal comparison is made with
the weak precedence algorithm.
The implementation of an SLA(1)LR(0) parser (SLR) is described,
involving numerous space and time optimisations. Of importance is a
technique for bypassing unnecessary steps in a syntactic derivation.
Direct comparisons are made, primarily with the simple precedence
parser of the highly efficient Stanford AlgolW compiler, and confirm
the practical feasibility of the SLR parser.
The Science Research Council
Practical LR Parser Generation
Parsing is a fundamental building block in modern compilers, and for
industrial programming languages, it is a surprisingly involved task. There are
known approaches to generate parsers automatically, but the prevailing
consensus is that automatic parser generation is not practical for real
programming languages: LR/LALR parsers are considered to be far too restrictive
in the grammars they support, and LR parsers are often considered too
inefficient in practice. As a result, virtually all modern languages use
recursive-descent parsers written by hand, a lengthy and error-prone process
that dramatically increases the barrier to new programming language
development.
In this work we demonstrate that, contrary to the prevailing consensus, we
can have the best of both worlds: for a very general, practical class of
grammars -- a strict superset of Knuth's canonical LR -- we can generate
parsers automatically, and the resulting parser code, as well as the generation
procedure itself, is highly efficient. This advance relies on several new
ideas, including novel automata optimization procedures; a new grammar
transformation ("CPS"); per-symbol attributes; recursive-descent actions; and
an extension of canonical LR parsing, which we refer to as XLR, which endows
shift/reduce parsers with the power of bounded nondeterministic choice.
With these ingredients, we can automatically generate efficient parsers for
virtually all programming languages that are intuitively easy to parse -- a
claim we support experimentally, by implementing the new algorithms in a new
software tool called langcc, and running them on syntax specifications for
Golang 1.17.8 and Python 3.9.12. The tool handles both languages automatically,
and the generated code, when run on standard codebases, is 1.2x faster than the
corresponding hand-written parser for Golang, and 4.3x faster than the CPython
parser, respectively
Stream Processing using Grammars and Regular Expressions
In this dissertation we study regular expression based parsing and the use of
grammatical specifications for the synthesis of fast, streaming
string-processing programs.
In the first part we develop two linear-time algorithms for regular
expression based parsing with Perl-style greedy disambiguation. The first
algorithm operates in two passes in a semi-streaming fashion, using a constant
amount of working memory and an auxiliary tape storage which is written in the
first pass and consumed by the second. The second algorithm is a single-pass
and optimally streaming algorithm which outputs as much of the parse tree as is
semantically possible based on the input prefix read so far, and resorts to
buffering as many symbols as is required to resolve the next choice. Optimality
is obtained by performing a PSPACE-complete pre-analysis on the regular
expression.
In the second part we present Kleenex, a language for expressing
high-performance streaming string processing programs as regular grammars with
embedded semantic actions, and its compilation to streaming string transducers
with worst-case linear-time performance. Its underlying theory is based on
transducer decomposition into oracle and action machines, and a finite-state
specialization of the streaming parsing algorithm presented in the first part.
In the second part we also develop a new linear-time streaming parsing
algorithm for parsing expression grammars (PEG) which generalizes the regular
grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm
reformulated using least fixed points and evaluated using an instance of the
chaotic iteration scheme by Cousot and Cousot
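The Perl-style greedy disambiguation mentioned in the first part can be observed with Python's standard `re` module, which follows the same policy. This only illustrates which parse is selected; `re` is a backtracking matcher, not the linear-time streaming algorithms of the dissertation.

```python
import re

# Alternatives are tried left to right, so the first alternative that
# leads to an overall match wins, even if a later one would match more.
print(re.match(r"a|ab", "ab").group())          # a

# The choice made for the first group determines what is left for the second.
print(re.match(r"(a|ab)(b?)", "ab").groups())   # ('a', 'b')

# Greedy repetition takes as much as possible before later parts get a turn.
print(re.match(r"(a*)(a*)", "aaa").groups())    # ('aaa', '')
```

Under a leftmost-longest (POSIX-style) policy the first pattern would select `ab` instead, which is why the disambiguation strategy has to be fixed before a regular expression can be used as a parser.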