
    Reachability and Error Diagnosis in LR(1) Parsers

    Given an LR(1) automaton, what are the states in which an error can be detected? For each such "error state", what is a minimal input sentence that causes an error in this state? We propose an algorithm that answers these questions. This allows building a collection of pairs of an erroneous input sentence and a (handwritten) diagnostic message, ensuring that this collection covers every error state, and maintaining this property as the grammar evolves. We report on an application of this technique to the CompCert ISO C99 parser, and discuss its strengths and limitations.
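    A minimal Python sketch of the enumeration problem this abstract poses. The ACTION table, state numbers, and grammar below are invented, and the search follows shift transitions only; the paper's algorithm must also simulate reductions, which is the genuinely hard part of LR(1) reachability.

        from collections import deque

        def error_states(action, start, terminals):
            """For each state reachable by shifts alone, record a shortest
            token path to it and the set of lookaheads on which it errors."""
            shortest = {start: []}
            queue = deque([start])
            while queue:
                s = queue.popleft()
                for tok, (kind, target) in action.get(s, {}).items():
                    if kind == "shift" and target not in shortest:
                        shortest[target] = shortest[s] + [tok]
                        queue.append(target)
            report = {}
            for s, path in shortest.items():
                errs = {t for t in terminals if t not in action.get(s, {})}
                if errs:
                    report[s] = (path, errs)
            return report

        # Toy automaton accepting the single sentence "a b".
        ACTION = {0: {"a": ("shift", 1)}, 1: {"b": ("shift", 2)}, 2: {}}
        print(error_states(ACTION, 0, {"a", "b"}))
        # State 1 errors on lookahead "a": "a a" is a minimal erroneous input.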

    Faster reachability analysis for LR(1) parsers

    We present a novel algorithm for reachability in an LR(1) automaton. For each transition in the automaton, the problem is to determine under what conditions this transition can be taken, that is, which (minimal) input fragment and which lookahead symbol allow taking it. Our algorithm outperforms Pottier's algorithm (2016) by up to three orders of magnitude on real-world grammars. Among other applications, this vastly improves the scalability of Jeffery's error reporting technique (2003), where a mapping of (reachable) error states to messages must be created and maintained.
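    For the error reporting side, a sketch of Jeffery-style diagnosis in Python. The state numbers and messages are invented; the point of the reachability analysis above is to tell us exactly which error states exist, so that no state is left without a handwritten entry as the grammar evolves.

        # Map each reachable error state to a handwritten diagnostic.
        MESSAGES = {
            1: "expected an expression after '('",
            7: "expected ';' at the end of a declaration",
        }

        def diagnose(error_state):
            # States missing from MESSAGES fall back to a generic message;
            # the reachability analysis flags them as still needing one.
            return MESSAGES.get(error_state, "syntax error")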

    Reachability and error diagnosis in LR(1) automata

    Given an LR(1) automaton, what are the states in which an error can be detected? For each such "error state", what is a minimal input sentence that causes an error in this state? We propose an algorithm that answers these questions. Such an algorithm allows building a collection of pairs of an erroneous input sentence and a diagnostic message, ensuring that this collection covers every error state, and maintaining this property as the grammar evolves. We report on an application of this technique to the CompCert ISO C99 parser, and discuss its strengths and limitations.

    Coverage directed algorithms for test suite construction from LR-automata

    Thesis (MSc), Stellenbosch University, 2022. Bugs in software can have disastrous results in terms of both economic cost and human lives. Parsers can have bugs, like any other type of software, and must therefore be thoroughly tested to ensure that a parser recognizes its intended language accurately. However, parsers often need to recognize many different variations and combinations of grammar structures, which can make it time-consuming and difficult to construct test suites by hand. We therefore require automated methods of test suite construction for these systems. Currently, the majority of test suite construction algorithms focus on the grammar describing the language to be recognized by the parser. In this thesis we take a different approach: we consider the LR-automaton that recognizes the target language and use the context information encoded in the automaton. Specifically, we define a new class of algorithms and coverage criteria over a variant of the LR-automaton that we define, called an LR-graph. We define methods of constructing positive test suites, using paths over this LR-graph, as well as mutations on valid paths to construct negative test suites. We evaluate the performance of our new algorithms against other state-of-the-art algorithms by comparing the coverage achieved over various systems: some smaller systems used in a university-level compilers course, and other larger, real-world systems. We find good performance of our algorithms over these systems when compared to algorithms that produce test suites of equivalent size. Our evaluation also uncovered a problem in grammar-based testing algorithms that we call bias. Bias can lead to significant variation in the coverage achieved over a system, which can in turn lead to a flawed comparison of two algorithms, or to unrealized performance when a test suite is used in practice. We therefore define bias and measure it for all grammar-based test suite construction algorithms used in this thesis.
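    A rough Python sketch of the two constructions described above, over a hypothetical LR-graph reduced to labelled edges. Positive tests are token paths through the graph (in the thesis, paths ending in an accepting configuration; this sketch keeps every path for brevity), and negative tests are mutations of valid paths.

        import random

        EDGES = {0: [("a", 1)], 1: [("b", 2), ("a", 1)], 2: []}  # toy LR-graph

        def positive_suite(start, max_len):
            """Enumerate token paths up to max_len as candidate positive tests."""
            suite, stack = [], [(start, [])]
            while stack:
                state, path = stack.pop()
                if path:
                    suite.append(path)
                if len(path) < max_len:
                    for tok, nxt in EDGES.get(state, []):
                        stack.append((nxt, path + [tok]))
            return suite

        def mutate(path, alphabet):
            """Turn one valid path into a (likely) invalid negative test.
            alphabet: set of terminals; needs >= 2 symbols for 'replace'."""
            s, i = list(path), random.randrange(len(path))
            op = random.choice(["delete", "replace", "duplicate"])
            if op == "delete":
                del s[i]
            elif op == "replace":
                s[i] = random.choice(sorted(alphabet - {s[i]}))
            else:
                s.insert(i, s[i])
            return s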

    Having Fun With 31.521 Shell Scripts

    Statically parsing shell scripts is a challenge, due to various peculiarities of the shell language. One of the difficulties is that the shell language is designed to be executed by intertwining the reading of chunks of syntax with semantic actions. We have analyzed a corpus of 31.521 POSIX shell scripts occurring as maintainer scripts in the Debian GNU/Linux distribution. Our parser, which makes use of recent developments in parser generation technology, succeeds on 99.9% of the corpus. The architecture of our tool allows us to easily plug in various statistical analyzers that run on the syntax trees constructed from the shell scripts. The statistics obtained by our tool are the basis for the definition of a model which we plan to use in the future for the formal verification of scripts.
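    A Python sketch of the pluggable-analyzer architecture this abstract describes, with an invented syntax-tree shape (the real tool's trees and API are not given in the abstract): analyzers register themselves and are run over every script's tree, accumulating corpus-wide statistics.

        ANALYZERS = []

        def analyzer(fn):
            """Register a statistical analyzer to run on every syntax tree."""
            ANALYZERS.append(fn)
            return fn

        def walk(node):                      # node = (kind, *children)
            yield node
            for child in node[1:]:
                if isinstance(child, tuple):
                    yield from walk(child)

        @analyzer
        def count_pipelines(tree, stats):
            stats["pipelines"] = stats.get("pipelines", 0) + sum(
                1 for n in walk(tree) if n[0] == "pipeline")

        def analyze_corpus(trees):
            stats = {}
            for tree in trees:
                for run in ANALYZERS:
                    run(tree, stats)
            return stats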

    A Simple, Possibly Correct LR Parser for C11

    The syntax of the C programming language is described in the C11 standard by an ambiguous context-free grammar, accompanied by English prose that describes the concept of "scope" and indicates how certain ambiguous code fragments should be interpreted. Based on these elements, the problem of implementing a compliant C11 parser is not entirely trivial. We review the main sources of difficulty and describe a relatively simple solution to the problem. Our solution employs the well-known technique of combining an LALR(1) parser with a "lexical feedback" mechanism. It draws on folklore knowledge and adds several original aspects, including: a twist on lexical feedback that allows a smooth interaction with lookahead; a simplified and powerful treatment of scopes; and a few amendments to the grammar. Although not formally verified, our parser avoids several pitfalls that other implementations have fallen prey to. We believe that its simplicity, its mostly declarative nature, and its high similarity to the C11 grammar are strong informal arguments in favor of its correctness. Our parser is accompanied by a small suite of "tricky" C11 programs. We hope that it may serve as a reference or a starting point in the implementation of compilers and analysis tools.
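    The "lexical feedback" mechanism mentioned above is the classic interplay between the lexer and a parser-maintained table of typedef names. The Python sketch below (with invented names) shows only the basic mechanism; the paper's contribution includes making it interact correctly with LR lookahead and with scoping rules, which this sketch does not model.

        class Scopes:
            """Typedef names currently in scope, maintained by the parser."""
            def __init__(self):
                self.stack = [set()]
            def enter(self):
                self.stack.append(set())
            def leave(self):
                self.stack.pop()
            def declare_typedef(self, name):   # on 'typedef ... name;'
                self.stack[-1].add(name)
            def is_typedef(self, name):
                return any(name in scope for scope in self.stack)

        def classify(identifier, scopes):
            """Called by the lexer: is this token a type name or a variable?"""
            return "TYPEDEF_NAME" if scopes.is_typedef(identifier) else "IDENTIFIER"

        scopes = Scopes()
        scopes.declare_typedef("T")
        assert classify("T", scopes) == "TYPEDEF_NAME"  # so "T x;" is a declaration
        assert classify("x", scopes) == "IDENTIFIER"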

    Security-Policy Analysis with eXtended Unix Tools

    During our fieldwork with real-world organizations, including those in Public Key Infrastructure (PKI), network configuration management, and the electrical power grid, we repeatedly noticed that security policies and related security artifacts are hard to manage. We observed three core limitations of security policy analysis that contribute to this difficulty. First, there is a gap between policy languages and the tools available to practitioners: traditional Unix text-processing tools are useful, but practitioners cannot use them to operate on the high-level languages in which security policies are expressed and implemented. Second, practitioners cannot process policy at multiple levels of abstraction, but they need this capability because many high-level languages encode hierarchical object models. Finally, practitioners need feedback in order to measure how security policies, and the policy artifacts that implement them, change over time. We designed and built our eXtended Unix tools (XUTools) to address these limitations. First, XUTools operate upon context-free languages, so that they can handle the hierarchical object models of high-level policy languages. Second, XUTools operate on parse trees, so that practitioners can process and analyze texts at multiple levels of abstraction. Finally, XUTools enable new computational experiments on multi-versioned structured texts, allowing practitioners to measure security policies and how they change over time. Just as programmers use high-level languages to program more efficiently, so can practitioners use these tools to analyze texts relative to a high-level language. Throughout the historical transmission of text, people have identified meaningful substrings of text and categorized them into groups such as sentences, pages, lines, function blocks, and books, to name a few. Our research interprets these useful structures as different context-free languages by which we can analyze text. XUTools are already in demand by practitioners in a variety of domains, and articles on our research have been featured in news outlets including ComputerWorld, CIO Magazine, Communications of the ACM, and Slashdot.
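    As an illustration of "grep at a chosen level of abstraction", here is a Python sketch in the spirit of the XUTools approach. The tree shape and function signature are invented, not the tool's actual API; the idea is that a match is reported against a whole named subtree (a function block, a configuration stanza) instead of a raw line.

        import re

        def xugrep(tree, kind, pattern):
            """Yield subtrees of the given kind whose source text matches.
            A tree node is (kind, source_text, children); shape is hypothetical."""
            node_kind, text, children = tree
            if node_kind == kind and re.search(pattern, text):
                yield tree
            for child in children:
                yield from xugrep(child, kind, pattern)

        # Example: find configuration stanzas mentioning "password".
        config = ("file", "...", [
            ("stanza", "line vty\n password secret\n", []),
            ("stanza", "interface eth0\n ip address ...\n", []),
        ])
        for hit in xugrep(config, "stanza", r"password"):
            print(hit[1])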