Reachability and Error Diagnosis in LR(1) Parsers
Given an LR(1) automaton, what are the states in which an error can be detected? For each such "error state", what is a minimal input sentence that causes an error in this state? We propose an algorithm that answers these questions. This allows building a collection of pairs of an erroneous input sentence and a (handwritten) diagnostic message, ensuring that this collection covers every error state, and maintaining this property as the grammar evolves. We report on an application of this technique to the CompCert ISO C99 parser, and discuss its strengths and limitations.
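The coverage idea described above can be sketched as follows. The state numbers, sentences, and messages here are invented for illustration; the actual technique works on a real LR(1) automaton and its reachable error states:

```python
# Sketch: pair each reachable error state with a handwritten diagnostic,
# and check that the collection covers every error state.
# All states, sentences, and messages below are invented for illustration.

# Error states the analysis claims are reachable, each with a minimal
# erroneous sentence that provokes an error in that state.
reachable_error_states = {
    12: "int f ( int",   # e.g. end of input while reading a parameter list
    27: "return ;",      # hypothetical
}

# Handwritten diagnostics, keyed by error state.
messages = {
    12: "Unclosed parameter list: expected ')' or a parameter declaration.",
}

def uncovered(states, msgs):
    """Error states that still lack a diagnostic message."""
    return sorted(set(states) - set(msgs))

print(uncovered(reachable_error_states, messages))  # [27]: state 27 has no message yet
```

Re-running such a coverage check after each grammar change is what lets the collection of messages be maintained as the grammar evolves.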
Faster reachability analysis for LR(1) parsers
We present a novel algorithm for reachability in an LR(1) automaton. For each transition in the automaton, the problem is to determine under what conditions this transition can be taken, that is, which (minimal) input fragment and which lookahead symbol allow taking this transition. Our algorithm outperforms Pottier's algorithm (2016) by up to three orders of magnitude on real-world grammars. Among other applications, this vastly improves the scalability of Jeffery's error reporting technique (2003), where a mapping of (reachable) error states to messages must be created and maintained.
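The shortest-path core of the reachability question can be illustrated on a toy graph. This is only a sketch: real LR(1) reachability must additionally track lookahead symbols and the effect of reductions, which is precisely what makes the problem hard. The automaton below is invented:

```python
# Sketch of computing, for each state of a toy automaton, a minimal input
# sentence reaching it. Real LR(1) reachability must also track lookaheads
# and reductions; this shows only the shortest-path core of the idea.
import heapq

# Toy automaton: state -> {terminal: successor state}. Invented for illustration.
transitions = {
    0: {"INT": 1, "ID": 2},
    1: {"ID": 3},
    3: {"LPAREN": 4, "SEMI": 5},
}

def minimal_sentences(start=0):
    """Dijkstra on sentence length: a shortest terminal sequence to each state."""
    best = {start: []}
    queue = [(0, start)]
    while queue:
        n, s = heapq.heappop(queue)
        if n > len(best[s]):
            continue  # stale queue entry
        for tok, t in transitions.get(s, {}).items():
            cand = best[s] + [tok]
            if t not in best or len(cand) < len(best[t]):
                best[t] = cand
                heapq.heappush(queue, (len(cand), t))
    return best

print(minimal_sentences()[4])  # ['INT', 'ID', 'LPAREN']
```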
Reachability and error diagnosis in LR(1) automata
Given an LR(1) automaton, what are the states in which an error can be detected? For each such "error state", what is a minimal input sentence that causes an error in this state? We propose an algorithm that answers these questions. Such an algorithm allows building a collection of pairs of an erroneous input sentence and a diagnostic message, ensuring that this collection covers every error state, and maintaining this property as the grammar evolves. We report on an application of this technique to the CompCert ISO C99 parser, and discuss its strengths and limitations.
Coverage directed algorithms for test suite construction from LR-automata
Thesis (MSc)--Stellenbosch University, 2022. ENGLISH ABSTRACT: Bugs in software can have disastrous results in terms of both economic cost and human lives. Parsers can have bugs, like any other type of software,
and must therefore be thoroughly tested in order to ensure that a parser recognizes its intended language accurately. However, parsers often need to recognize many different variations and combinations of grammar structures
which can make it time consuming and difficult to construct test suites by hand. We therefore require automated methods of test suite construction for these systems.
Currently, the majority of test suite construction algorithms focus on the grammar describing the language to be recognized by the parser. In this thesis we take a different approach. We consider the LR-automaton that
recognizes the target language and use the context information encoded in the automaton. Specifically, we define a new class of algorithms and coverage criteria over a variant of the LR-automaton, which we call an LR-graph.
We define methods of constructing positive test suites, using paths over this LR-graph, as well as mutations on valid paths to construct negative test suites.
We evaluate the performance of our new algorithms against other state-of-the-art algorithms. We do this by comparing the coverage achieved over various systems: some smaller systems used in a university-level compilers course, and other larger, real-world systems. We find that our algorithms perform well on these systems when compared to algorithms that produce test suites of equivalent size.
Our evaluation has uncovered a problem in grammar-based testing algorithms that we call bias. Bias can lead to significant variation in coverage
achieved over a system, which can in turn lead to a flawed comparison of two
algorithms or unrealized performance when a test suite is used in practice.
We therefore define bias and measure it for all grammar-based test suite construction algorithms we use in this thesis.
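The two test-suite ideas above can be sketched on a toy automaton-like graph: positive tests as accepted paths, and negative tests as single-token mutations of a valid path that the automaton rejects. The graph, token names, and length bound are invented; the thesis's LR-graphs carry more context than this:

```python
# Sketch: positive tests as paths over an automaton-like graph; negative
# tests as one-token mutations of a valid path. All names are invented.
graph = {
    "start": {"NUM": "expr", "LPAREN": "open"},
    "open": {"NUM": "inner"},
    "inner": {"RPAREN": "expr"},
    "expr": {"PLUS": "start"},
}
accepting = {"expr"}
alphabet = {"NUM", "LPAREN", "RPAREN", "PLUS"}

def positive_paths(state="start", path=(), limit=4):
    """Enumerate accepted token sequences up to a length bound (path coverage)."""
    if state in accepting:
        yield list(path)
    if len(path) < limit:
        for tok, nxt in graph.get(state, {}).items():
            yield from positive_paths(nxt, path + (tok,), limit)

def accepts(tokens):
    state = "start"
    for tok in tokens:
        state = graph.get(state, {}).get(tok)
        if state is None:
            return False
    return state in accepting

def negatives_from(path):
    """Mutate one token at a time; keep the mutants the automaton rejects."""
    out = []
    for i, tok in enumerate(path):
        for other in sorted(alphabet - {tok}):
            mutant = path[:i] + [other] + path[i + 1:]
            if not accepts(mutant):
                out.append(mutant)
    return out

print(next(positive_paths()))        # ['NUM']
print(negatives_from(["NUM"]))       # [['LPAREN'], ['PLUS'], ['RPAREN']]
```

Mutants that the automaton still accepts are discarded, since they are not negative tests at all; this filtering step is what makes mutation-based negative suites sound.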
Having Fun With 31.521 Shell Scripts
Statically parsing shell scripts is a challenge, due to various peculiarities of the shell language. One of the difficulties is that the shell language is designed to be executed by intertwining reading chunks of syntax with semantic actions. We have analyzed a corpus of 31.521 POSIX shell scripts occurring as maintainer scripts in the Debian GNU/Linux distribution. Our parser, which makes use of recent developments in parser generation technology, succeeds on 99.9% of the corpus. The architecture of our tool allows us to easily plug in various statistical analyzers on the syntax trees constructed from the shell scripts. The statistics obtained by our tool are the basis for the definition of a model which we plan to use in the future for the formal verification of scripts.
A Simple, Possibly Correct LR Parser for C11
The syntax of the C programming language is described in the C11 standard by an ambiguous context-free grammar, accompanied with English prose that describes the concept of "scope" and indicates how certain ambiguous code fragments should be interpreted. Based on these elements, the problem of implementing a compliant C11 parser is not entirely trivial. We review the main sources of difficulty and describe a relatively simple solution to the problem. Our solution employs the well-known technique of combining an LALR(1) parser with a "lexical feedback" mechanism. It draws on folklore knowledge and adds several original aspects, including: a twist on lexical feedback that allows a smooth interaction with lookahead; a simplified and powerful treatment of scopes; and a few amendments in the grammar. Although not formally verified, our parser avoids several pitfalls that other implementations have fallen prey to. We believe that its simplicity, its mostly-declarative nature, and its high similarity with the C11 grammar are strong informal arguments in favor of its correctness. Our parser is accompanied with a small suite of "tricky" C11 programs. We hope that it may serve as a reference or a starting point in the implementation of compilers and analysis tools.
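The classic core of lexical feedback (the "lexer hack") can be sketched as follows: the lexer consults a scope-aware symbol table to decide whether an identifier should be presented to the parser as a typedef name or as an ordinary identifier. This is a heavily simplified illustration, not the paper's parser; real implementations must also handle the interaction with parser lookahead, which is where the paper's twist comes in:

```python
# Sketch of "lexical feedback" for C: the lexer asks a scope-aware symbol
# table whether an identifier names a type, so the parser sees distinct
# TYPEDEF_NAME and IDENT tokens. Simplified for illustration.
import re

class Scopes:
    def __init__(self):
        self.stack = [set()]          # innermost scope last
    def enter(self):
        self.stack.append(set())
    def leave(self):
        self.stack.pop()
    def add_typedef(self, name):
        self.stack[-1].add(name)
    def is_typedef(self, name):
        return any(name in s for s in reversed(self.stack))

def tokenize(src, scopes):
    """Classify each word as TYPEDEF_NAME or IDENT using the symbol table."""
    for word in re.findall(r"[A-Za-z_]\w*", src):
        kind = "TYPEDEF_NAME" if scopes.is_typedef(word) else "IDENT"
        yield (kind, word)

scopes = Scopes()
scopes.add_typedef("T")               # as if "typedef int T;" had been seen
print(list(tokenize("T x", scopes)))
# [('TYPEDEF_NAME', 'T'), ('IDENT', 'x')]
```

The scope stack matters because C allows an inner scope to shadow a typedef name with an ordinary variable, so the classification of an identifier can change as scopes are entered and left.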
Security-Policy Analysis with eXtended Unix Tools
During our fieldwork with real-world organizations---including those in Public Key Infrastructure (PKI), network configuration management, and the electrical power grid---we repeatedly noticed that security policies and related security artifacts are hard to manage. We observed three core limitations of security policy analysis that contribute to this difficulty. First, there is a gap between policy languages and the tools available to practitioners. Traditional Unix text-processing tools are useful, but practitioners cannot use these tools to operate on the high-level languages in which security policies are expressed and implemented. Second, practitioners cannot process policy at multiple levels of abstraction but they need this capability because many high-level languages encode hierarchical object models. Finally, practitioners need feedback to be able to measure how security policies and policy artifacts that implement those policies change over time. We designed and built our eXtended Unix tools (XUTools) to address these limitations of security policy analysis. First, our XUTools operate upon context-free languages so that they can operate upon the hierarchical object models of high-level policy languages. Second, our XUTools operate on parse trees so that practitioners can process and analyze texts at multiple levels of abstraction. Finally, our XUTools enable new computational experiments on multi-versioned structured texts and our tools allow practitioners to measure security policies and how they change over time. Just as programmers use high-level languages to program more efficiently, so can practitioners use these tools to analyze texts relative to a high-level language. Throughout the historical transmission of text, people have identified meaningful substrings of text and categorized them into groups such as sentences, pages, lines, function blocks, and books to name a few. 
Our research interprets these useful structures as different context-free languages by which we can analyze text. XUTools are already in demand by practitioners in a variety of domains, and articles on our research have been featured in news outlets including ComputerWorld, CIO Magazine, Communications of the ACM, and Slashdot.
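The idea of matching within language-level structures rather than raw lines can be sketched with a toy structure-aware grep. The file format, helper names, and behavior below are invented for illustration, in the spirit of the tools described, and do not reproduce XUTools itself:

```python
# Sketch: match within whole "blocks" (here, ini-style sections) instead of
# within single lines, illustrating structure-aware search. The format and
# helper names are invented for illustration.
import re

def sections(text):
    """Split text into (header, body) blocks at lines like '[name]'."""
    parts = re.split(r"^\[(.+)\]$", text, flags=re.M)
    it = iter(parts[1:])              # drop any preamble before the first header
    return [(header, body) for header, body in zip(it, it)]

def structure_grep(pattern, text):
    """Return headers of the whole sections whose body matches the pattern."""
    return [h for h, body in sections(text) if re.search(pattern, body)]

config = """[server]
port = 443
tls = on
[client]
port = 8080
"""
print(structure_grep(r"tls", config))   # ['server']
print(structure_grep(r"port", config))  # ['server', 'client']
```

A line-oriented grep would report only the matching line; the structure-aware version reports the enclosing block, which is the level of abstraction a practitioner auditing a policy file usually cares about.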