4 research outputs found

    LL(1) Parsing with Derivatives and Zippers

    In this paper, we present an efficient, functional, and formally verified parsing algorithm for LL(1) context-free expressions based on the concept of derivatives of formal languages. Parsing with derivatives is an elegant parsing technique, which, in the general case, suffers from cubic worst-case time complexity and slow performance in practice. We specialise the parsing with derivatives algorithm to LL(1) context-free expressions, where alternatives can be chosen given a single token of lookahead. We formalise the notion of LL(1) expressions and show how to efficiently check the LL(1) property. Next, we present a novel linear-time parsing with derivatives algorithm for LL(1) expressions operating on a zipper-inspired data structure. We prove the algorithm correct in Coq and present an implementation as a parser combinators framework in Scala, with enumeration and pretty printing capabilities. Comment: Appeared at PLDI'20 under the title "Zippy LL(1) Parsing with Derivatives".
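The derivative idea at the heart of this paper can be sketched for plain regular expressions, the setting where Brzozowski originally defined derivatives; the paper itself extends the technique to LL(1) context-free expressions with a zipper structure. The class and function names below are illustrative, not the paper's Scala API.

```python
# Minimal sketch of parsing with derivatives, shown for regular expressions
# rather than the paper's LL(1) context-free expressions. Names are illustrative.

class Empty:                        # matches no string
    pass

class Eps:                          # matches only the empty string
    pass

class Char:
    def __init__(self, c): self.c = c

class Alt:                          # union:  l | r
    def __init__(self, l, r): self.l, self.r = l, r

class Seq:                          # concatenation:  l r
    def __init__(self, l, r): self.l, self.r = l, r

class Star:                         # Kleene star:  r*
    def __init__(self, r): self.r = r

def nullable(r):
    """Does r accept the empty string?"""
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, (Empty, Char)): return False
    if isinstance(r, Alt): return nullable(r.l) or nullable(r.r)
    if isinstance(r, Seq): return nullable(r.l) and nullable(r.r)

def deriv(r, c):
    """Brzozowski derivative: suffixes of words in r that begin with c."""
    if isinstance(r, (Empty, Eps)): return Empty()
    if isinstance(r, Char): return Eps() if r.c == c else Empty()
    if isinstance(r, Alt): return Alt(deriv(r.l, c), deriv(r.r, c))
    if isinstance(r, Seq):
        d = Seq(deriv(r.l, c), r.r)
        # If the left side can be empty, c may also start a word of the right side.
        return Alt(d, deriv(r.r, c)) if nullable(r.l) else d
    if isinstance(r, Star): return Seq(deriv(r.r, c), r)

def matches(r, s):
    """Parse by repeatedly deriving, then check nullability."""
    for c in s:
        r = deriv(r, c)
    return nullable(r)

re = Star(Seq(Char('a'), Char('b')))   # (ab)*
print(matches(re, "abab"))             # True
print(matches(re, "aba"))              # False
```

The paper's contribution is to avoid the cubic blow-up this naive scheme suffers on context-free grammars, by restricting to LL(1) expressions and navigating the expression with a zipper instead of rebuilding it.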

    Protecting Systems From Exploits Using Language-Theoretic Security

    Any computer program processing input from the user or network must validate the input. Input-handling vulnerabilities occur in programs when the software component responsible for filtering malicious input---the parser---does not perform validation adequately. Consequently, parsers are among the most targeted components, since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers represent the data format, programmers often rely on parser generator or parser combinator tools to build the parsers. This thesis propels several sub-fields in LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, this thesis comprises five parts that tackle various tenets of LangSec. First, I categorize various input-handling vulnerabilities and exploits using two frameworks. The first, the mismorphisms framework, helps us reason about the root causes leading to various vulnerabilities. The second is a categorization framework built from various LangSec anti-patterns, such as parser differentials and insufficient input validation. Finally, we built a catalog of more than 30 popular vulnerabilities to demonstrate the two categorization frameworks. Second, I built parsers for various Internet of Things and power grid network protocols and the iccMAX file format using parser combinator libraries.
The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specifications and reference implementations. Third, I present SPARTA, a novel tool I built that generates Rust code to type check Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files. Fourth, I present ParseSmith, a tool to build verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for their correctness. I built a verified parsing tool in Dafny that builds on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle various constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars. Finally, I provide the earliest systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. I conducted an expert elicitation qualitative study to derive various metrics that I use to compare the DDLs. I also systematically compare these DDLs based on sample data descriptions available with the DDLs---checking for correctness and resilience.
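The core LangSec pattern the thesis builds on---specify the input format as a formal grammar, then recognize the whole input before any other code acts on it---can be sketched with tiny parser combinators. The combinator names and the toy message format below are invented for illustration; they are not the thesis's tools or protocols.

```python
# Minimal parser-combinator sketch of the LangSec recognize-before-process
# pattern. A parser takes (input, pos) and returns the new pos, or None on failure.

def char(pred):
    def p(s, i):
        return i + 1 if i < len(s) and pred(s[i]) else None
    return p

def lit(c):
    return char(lambda x: x == c)

def seq(*parsers):
    def p(s, i):
        for q in parsers:
            i = q(s, i)
            if i is None:
                return None
        return i
    return p

def many1(q):                      # one or more repetitions of q
    def p(s, i):
        i = q(s, i)
        if i is None:
            return None
        while True:
            j = q(s, i)
            if j is None:
                return i
            i = j
    return p

digit = char(str.isdigit)

# Grammar for a toy message format:  MSG := "v=" digits ";"
message = seq(lit('v'), lit('='), many1(digit), lit(';'))

def recognize(s):
    """Accept only inputs matched completely by the grammar."""
    return message(s, 0) == len(s)

def handle(s):
    # LangSec discipline: validate the full input *before* processing it.
    if not recognize(s):
        raise ValueError("rejected malformed input")
    return int(s[2:-1])

print(handle("v=42;"))             # 42
```

The point of the pattern is that `handle` never touches bytes the recognizer has not already accepted, which rules out the "shotgun parsing" anti-pattern where validation and processing are interleaved.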

    Network operator intent : a basis for user-friendly network configuration and analysis

    Two important network management activities are configuration (making the network behave in a desirable way) and analysis (querying the network’s state). A challenge common to these activities is specifying operator intent. Seemingly simple configurations such as “no network user should exceed their allocated bandwidth” or questions like “how many network devices are in the library?” are difficult to formulate in practice, e.g. they may require multiple tools (like access control lists, firewalls, databases, or accounting software) and a detailed knowledge of the network. This requires a high degree of expertise and experience, and even then, mistakes are common. An understanding of the core concepts that network operators manipulate and analyse is needed so that more effective, efficient, and user-friendly tools and processes can be created. To address this, we create a taxonomy of languages for configuring networks, and use it to evaluate three such languages to learn how operators can express their intent. We identify factors such as language features, testing, state modeling, documentation, and tool support. Then, we interview network operators to understand what they want to express. We analyse the interviews and identify nine orthogonal dimensions which frequently appear in expressions of operator intent. We use these concepts, and our taxonomy, as the basis for a language for querying both business- and network-domain data. We evaluate our language and find that it reduces the number and complexity of queries needed to answer questions about networks. We also conduct a user study, and find that our language reduces novices’ cognitive load while increasing their accuracy and efficiency. With our language, users better understand how to approach questions, can more easily express themselves, and make fewer mistakes when interpreting data. 
Overall, we find that operator intent can, at one extreme, be expressed directly, as primitives like flow rules, packet counters, or CLI commands, and at the other extreme as human-readable statements which are automatically translated and implemented. The former gives operators precise control, but the latter may be easier to use. We also find that there is more to expressing intent than syntax and semantics, as usability, redundancy, state manipulation, and ecosystems all play a role. Our findings also show the importance of incorporating business-domain concepts in network management tools. By understanding operator intent we can reduce errors, improve both human-human and human-computer communication, create more usable tools, and make network operators more effective.
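The two extremes of expressing intent described above---low-level primitives versus human-readable statements that are automatically translated---can be contrasted with a small sketch. The intent phrasing, rule fields, and function names here are invented for illustration and are not the query language developed in the thesis.

```python
# Illustrative sketch: translating a human-readable operator intent into
# low-level primitives. All names and formats are hypothetical.

# Low-level extreme: the operator writes flow-rule-like primitives directly.
def rate_limit_rule(user, mbps):
    return {"match": {"src_user": user}, "action": "rate_limit", "limit_mbps": mbps}

# High-level extreme: one declarative statement, plus business-domain data
# (per-user allocations), is compiled into one primitive per user.
def compile_intent(intent, allocations):
    if intent == "no user exceeds allocated bandwidth":
        return [rate_limit_rule(u, mbps) for u, mbps in allocations.items()]
    raise ValueError("unrecognized intent")

allocations = {"alice": 100, "bob": 50}   # business-domain data, in Mbps
for rule in compile_intent("no user exceeds allocated bandwidth", allocations):
    print(rule)
```

The sketch shows why the thesis argues business-domain data belongs in network tools: the high-level statement is only implementable because the compiler can consult the allocation table.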

    College of Arts and Sciences

    Cornell University Courses of Study Vol. 94 2002/200