35,304 research outputs found
A Fuzzy Approach to Erroneous Inputs in Context-Free Language Recognition
Using fuzzy context-free grammars one can easily describe a finite number of ways to derive incorrect strings together with their degree of correctness. However, in general there is an infinite number of ways to perform a certain task wrongly. In this paper we introduce a generalization of fuzzy context-free grammars, the so-called fuzzy context-free -grammars, to model the situation of making a finite choice out of an infinity of possible grammatical errors during each context-free derivation step. Under minor assumptions on the parameter this model happens to be a very general framework to describe correctly as well as erroneously derived sentences by a single generating mechanism.
Our first result characterizes the generating capacity of these fuzzy context-free -grammars. As consequences we obtain: (i) bounds on modeling grammatical errors within the framework of fuzzy context-free grammars, and (ii) the fact that the family of languages generated by fuzzy context-free -grammars shares closure properties very similar to those of the family of ordinary context-free languages.
The second part of the paper is devoted to a few algorithms to recognize fuzzy context-free languages: viz. a variant of a functional version of Cocke-Younger- Kasami's algorithm and some recursive descent algorithms. These algorithms turn out to be robust in some very elementary sense and they can easily be extended to corresponding parsing algorithms
LL(1) Parsing with Derivatives and Zippers
In this paper, we present an efficient, functional, and formally verified
parsing algorithm for LL(1) context-free expressions based on the concept of
derivatives of formal languages. Parsing with derivatives is an elegant parsing
technique, which, in the general case, suffers from cubic worst-case time
complexity and slow performance in practice. We specialise the parsing with
derivatives algorithm to LL(1) context-free expressions, where alternatives can
be chosen given a single token of lookahead. We formalise the notion of LL(1)
expressions and show how to efficiently check the LL(1) property. Next, we
present a novel linear-time parsing with derivatives algorithm for LL(1)
expressions operating on a zipper-inspired data structure. We prove the
algorithm correct in Coq and present an implementation as a parser combinators
framework in Scala, with enumeration and pretty printing capabilities.Comment: Appeared at PLDI'20 under the title "Zippy LL(1) Parsing with
Derivatives
Operator precedence for data-dependent grammars
Constructing parsers based on declarative specification of operator precedence is a very old research topic, and there are various existing approaches. However, these approaches are either tied to a particular parsing technique, or cannot deal with all corner cases found in programming languages. In this paper we present an implementation of declarative specification of operator precedence for general parsing that (1) is independent of the underlying parsing algorithm, (2) does not require any grammar transformation that increases the size of the grammar, (3) preserves the shape of parse trees of the original, natural grammar, and (4) can deal with intricate cases of operator precedence found in functional programming languages such as OCaml. Our new approach to operator precedence is formulated using data-dependent grammars, which extend context-free grammars with arbitrary computation, variable binding and constraints. We implemented our approach using Iguana, a data-dependent parsing framework, and evaluated it by parsing Java and OCaml source files. The results show that our approach is practical for parsing programming languages with complicated operator precedence rules
Ambiguity Detection: Scaling to Scannerless
Static ambiguity detection would be an important aspect of language
workbenches for textual software languages. However, the challenge is
that automatic ambiguity detection in context-free grammars is undecidable
in general. Sophisticated approximations and optimizations do exist,
but these do not scale to grammars for so-called ``scannerless parsers'', as of yet.
We extend previous work on ambiguity detection for context-free grammars to
cover disambiguation techniques that are typical for scannerless parsing,
such as longest match and reserved keywords.
This paper contributes a new algorithm for ambiguity detection in
character-level grammars, a prototype implementation of this algorithm and
validation on several real grammars. The total run-time of ambiguity
detection for character-level grammars for languages such as C and Java is
significantly reduced, without loss of precision.
The result is that efficient ambiguity detection in realistic grammars is
possible and may therefore become a tool in language workbenches
Ambiguity Detection: Scaling to Scannerless
Static ambiguity detection would be an important aspect of language
workbenches for textual software languages. However, the challenge is
that automatic ambiguity detection in context-free grammars is undecidable
in general. Sophisticated approximations and optimizations do exist,
but these do not scale to grammars for so-called ``scannerless parsers'', as of yet.
We extend previous work on ambiguity detection for context-free grammars to
cover disambiguation techniques that are typical for scannerless parsing,
such as longest match and reserved keywords.
This paper contributes a new algorithm for ambiguity detection in
character-level grammars, a prototype implementation of this algorithm and
validation on several real grammars. The total run-time of ambiguity
detection for character-level grammars for languages such as C and Java is
significantly reduced, without loss of precision.
The result is that efficient ambiguity detection in realistic grammars is
possible and may therefore become a tool in language workbenches
Constrained Decoding for Code Language Models via Efficient Left and Right Quotienting of Context-Sensitive Grammars
Large Language Models are powerful tools for program synthesis and advanced
auto-completion, but come with no guarantee that their output code is
syntactically correct. This paper contributes an incremental parser that allows
early rejection of syntactically incorrect code, as well as efficient detection
of complete programs for fill-in-the-middle (FItM) tasks. We develop
Earley-style parsers that operate over left and right quotients of arbitrary
context-free grammars, and we extend our incremental parsing and quotient
operations to several context-sensitive features present in the grammars of
many common programming languages. The result of these contributions is an
efficient, general, and well-grounded method for left and right quotient
parsing.
To validate our theoretical contributions -- and the practical effectiveness
of certain design decisions -- we evaluate our method on the particularly
difficult case of FItM completion for Python 3. Our results demonstrate that
constrained generation can significantly reduce the incidence of syntax errors
in recommended code.Comment: 20 pages, Code available at
https://github.com/amazon-science/incremental-parsin
Towards Zero-Overhead Disambiguation of Deep Priority Conflicts
**Context** Context-free grammars are widely used for language prototyping
and implementation. They allow formalizing the syntax of domain-specific or
general-purpose programming languages concisely and declaratively. However, the
natural and concise way of writing a context-free grammar is often ambiguous.
Therefore, grammar formalisms support extensions in the form of *declarative
disambiguation rules* to specify operator precedence and associativity, solving
ambiguities that are caused by the subset of the grammar that corresponds to
expressions.
**Inquiry** Implementing support for declarative disambiguation within a
parser typically comes with one or more of the following limitations in
practice: a lack of parsing performance, or a lack of modularity (i.e.,
disallowing the composition of grammar fragments of potentially different
languages). The latter subject is generally addressed by scannerless
generalized parsers. We aim to equip scannerless generalized parsers with novel
disambiguation methods that are inherently performant, without compromising the
concerns of modularity and language composition.
**Approach** In this paper, we present a novel low-overhead implementation
technique for disambiguating deep associativity and priority conflicts in
scannerless generalized parsers with lightweight data-dependency.
**Knowledge** Ambiguities with respect to operator precedence and
associativity arise from combining the various operators of a language. While
*shallow conflicts* can be resolved efficiently by one-level tree patterns,
*deep conflicts* require more elaborate techniques, because they can occur
arbitrarily nested in a tree. Current state-of-the-art approaches to solving
deep priority conflicts come with a severe performance overhead.
**Grounding** We evaluated our new approach against state-of-the-art
declarative disambiguation mechanisms. By parsing a corpus of popular
open-source repositories written in Java and OCaml, we found that our approach
yields speedups of up to 1.73x over a grammar rewriting technique when parsing
programs with deep priority conflicts--with a modest overhead of 1-2 % when
parsing programs without deep conflicts.
**Importance** A recent empirical study shows that deep priority conflicts
are indeed wide-spread in real-world programs. The study shows that in a corpus
of popular OCaml projects on Github, up to 17 % of the source files contain
deep priority conflicts. However, there is no solution in the literature that
addresses efficient disambiguation of deep priority conflicts, with support for
modular and composable syntax definitions
- …