Search CORE

30 research outputs found

Symbolic Lookaheads for Bottom-up Parsing

Author: Quaglia Paola
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016)
Publication date: 01/01/2016

We present algorithms for the construction of LALR(1) parsing tables, and of LR(1) parsing tables of reduced size. We first define specialized characteristic automata whose states are parametric w.r.t. variables symbolically representing lookahead-sets. The propagation flow of lookaheads is kept in the form of a system of recursive equations, which is resolved to obtain the concrete LALR(1) table. By inspection of the LALR(1) automaton and of its lookahead propagation flow, we decide whether the grammar is LR(1) or not. In the positive case, an LR(1) parsing table of reduced size is computed by refinement of the LALR(1) table.

Dagstuhl Research Online Publication Server
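
The abstract above describes lookahead propagation as a system of recursive equations over set-valued variables. As a rough illustration only, and not the paper's construction, such a system can be resolved by iterating to a least fixed point. The equation encoding, the variable names x0, x1, x2 and the helper `solve_lookaheads` are all assumptions made up for this sketch.

```python
def solve_lookaheads(equations):
    """Least fixed point of equations of the form  x = constants ∪ x1 ∪ x2 ...

    `equations` maps a variable name to (set_of_terminals, list_of_variables).
    This encoding is an assumption for the sketch, not the paper's format.
    """
    # Start every variable at its constant part.
    solution = {var: set(consts) for var, (consts, _) in equations.items()}
    changed = True
    while changed:
        changed = False
        for var, (_, deps) in equations.items():
            for dep in deps:
                before = len(solution[var])
                solution[var] |= solution[dep]
                if len(solution[var]) != before:
                    changed = True
    return solution


# Hypothetical system: x0 = {$},  x1 = {a} ∪ x0 ∪ x2,  x2 = {b} ∪ x1
example = {
    "x0": ({"$"}, []),
    "x1": ({"a"}, ["x0", "x2"]),
    "x2": ({"b"}, ["x1"]),
}
result = solve_lookaheads(example)
```

Because x1 and x2 depend on each other, both end up with the same concrete set, which is the kind of cyclic flow that makes a fixed-point iteration necessary.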

An Investigation into Performance-related Issues of Regular Expression Matching

Author: van Litsenborgh Pieter Steyn
Publication date: 24/11/2022
Field of study

Computer Science

Stellenbosch University SUNScholar Repository

Practical Dynamic Symbolic Execution for JavaScript

Author: Loring Blake
Publication date: 16/03/2021

Royal Holloway - Pure

On Sentence Analysis Techniques in Speech Translation

Author: Tomita Masaru
Publication venue: Kyoto University
Publication date: 23/07/1994

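The full-text data is a PDF converted from image files created under the National Diet Library's fiscal year 2010 (Heisei 22) digitization of doctoral dissertations. Kyoto University, 0048, new system, doctorate by dissertation, Doctor of Engineering, degree no. 乙第8652号, 論工博第2893号, call number 新制||工||968 (University Library), UT51-94-R411. Examiners: (chief) Professor Makoto Nagao, Professor Shuji Doshita, Professor Katsuo Ikeda. Conferred under Article 4, Paragraph 2 of the Degree Regulations. Doctor of Engineering, Kyoto University. DFA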

Kyoto University Research Information Repository

A text pattern-matching tool based on Parsing Expression Grammars

Author: Aho
Clarke
Ford
Griswold
Hagen
Hutton
Ierusalimschy
Ierusalimschy
Knuth
Laurikari
Thompson
Wall
Publication venue: 'Wiley'

Crossref

Symbolic Model Learning: New Algorithms and Applications

Author: Argyros Georgios
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2019

In this thesis, we study algorithms which can be used to extract, or learn, formal mathematical models from software systems and then use these models to test whether the given software systems satisfy certain security properties such as robustness against code injection attacks. Specifically, we focus on learning algorithms for automata and transducers and the symbolic extensions of these models, namely symbolic finite automata (SFAs). At a high level, this thesis contributes the following results: 1. In the first part of the thesis, we present a unified treatment of many common variations of the seminal L* algorithm for learning deterministic finite automata (DFAs) as a congruence learning algorithm for the underlying Nerode congruence which forms the basis of automata theory. Under this formulation the basic data structures used by different variations are unified as different ways to implement the Nerode congruence using queries. 2. Next, building on the new formulation of L*-style algorithms, we proceed to develop new algorithms for learning transducer models. Firstly, we present the first algorithm for learning deterministic partial transducers. Furthermore, we extend our algorithm to non-deterministic models by introducing a novel, generalized congruence relation over string transformations which is able to capture a subclass of string transformations with regular lookahead. We demonstrate that this class is able to capture many practical string transformations from the domain of string sanitizers in Web applications. 3. Classical learning algorithms for automata and transducers operate over finite alphabets and have a query complexity that scales linearly with the size of the alphabet. However, in practice, this dependence on the alphabet size hinders the performance of the algorithms. To address this issue, we develop the MAT* algorithm for learning symbolic finite state automata (SFAs) which operate over infinite alphabets. In practice, the MAT* learning algorithm allows us to plug in custom transition learning algorithms which will efficiently infer the predicates in the transitions of the SFA without querying the whole alphabet set. 4. Finally, we use our learning algorithm toolbox as the basis for the development of a set of black-box testing algorithms. More specifically, we present Grammar Oriented Filter Auditing (GOFA), a novel technique which allows one to utilize our learning algorithms to evaluate the robustness of a string sanitizer or filter against a set of attack strings given as a context-free grammar. Furthermore, because such grammars are often unavailable, we developed sfadiff, a differential testing technique based on symbolic automata learning which can be used to perform differential testing of two different parser implementations using SFA learning algorithms, and we demonstrate how our algorithm can be used to develop program fingerprints. We evaluate our algorithms against state-of-the-art Web Application Firewalls and discover over 15 previously unknown vulnerabilities which result in evading the firewalls and performing code injection attacks in the backend Web application. Finally, we show how our learning algorithms can uncover vulnerabilities which are missed by other black-box methods such as fuzzing and grammar-based testing.

Columbia University Academic Commons
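
The abstract above speaks of implementing the Nerode congruence using queries. A minimal sketch of that idea, under assumptions of our own: two prefixes are treated as equivalent when membership queries agree on every suffix in a chosen test set, as in the rows of an L*-style observation table. The helpers `row` and `group_by_row` and the target language `even_as` are hypothetical and are not taken from the thesis.

```python
def row(prefix, suffixes, member):
    """Answers of the membership oracle for prefix + s, for each test suffix s."""
    return tuple(member(prefix + s) for s in suffixes)


def group_by_row(prefixes, suffixes, member):
    """Group prefixes that no test suffix can tell apart.

    This approximates the Nerode congruence: a larger suffix set can only
    split classes further, never merge them.
    """
    classes = {}
    for p in prefixes:
        classes.setdefault(row(p, suffixes, member), []).append(p)
    return list(classes.values())


# Hypothetical target language: strings with an even number of 'a'.
def even_as(word):
    return word.count("a") % 2 == 0


classes = group_by_row(["", "a", "b", "aa", "ab"], [""], even_as)
```

Here the single test suffix (the empty string) already separates the prefixes into two groups, matching the two states of the minimal DFA for this toy language.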

Efficient algorithms for hard problems in nondeterministic tree automata

Author: Almeida Ricardo Manuel de Oliveira
Publication venue: The University of Edinburgh
Publication date: 30/11/2017

We present PTIME language-preserving techniques for the reduction of non-deterministic tree automata, both for the case of finite trees and for infinite trees. Our techniques are based on new transition removing and state merging results, which rely on binary relations that compare the downward and upward behaviours of states in the automaton. We use downward/upward simulation preorders and the more general but EXPTIME-complete trace inclusion relations, for which we introduce good under-approximations computable in polynomial time. We provide a complete picture of combinations of downward and upward simulation/trace inclusions which can be used in our reduction techniques. We define an algorithm that puts together all the reduction results found for finite trees, and implement it under the name minotaut, a tool built on top of the well-known tree automata library libvata. We tested minotaut on large collections of automata originating from program verification, as well as on different classes of randomly generated automata. Our algorithm yields substantially smaller and sparser automata than all previously known reduction techniques, and it is still fast enough to handle large instances. Taking reduction of automata on finite trees one step further, we then introduce saturation, a technique that consists of adding new transitions to an automaton while preserving its language. We implemented this technique in minotaut and we show how it can make subsequent state-merge and transition-removal operations more effective. Thus we obtain a PTIME algorithm that reduces the number of states of tree automata even more than before. Additionally, we explore how minotaut alone can play an important role when performing hard operations like complementation, making it possible to obtain smaller complement automata at lower computation times overall. We then show how saturation can extend this contribution even further. An overview of the tool, highlighting some of its implementation features, is presented as well.

Edinburgh Research Archive

Protecting Systems From Exploits Using Language-Theoretic Security

Author: Anantharaman Prashant
Publication venue: Dartmouth Digital Commons
Publication date: 03/05/2022

Any computer program processing input from the user or network must validate the input. Input-handling vulnerabilities occur in programs when the software component responsible for filtering malicious input, the parser, does not perform validation adequately. Consequently, parsers are among the most targeted components since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers represent the data format, programmers often rely on parser generators or parser combinator tools to build the parsers. This thesis propels several sub-fields in LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, this thesis comprises five parts that tackle various tenets of LangSec. First, I categorize various input-handling vulnerabilities and exploits using two frameworks. First, I use the mismorphisms framework to reason about vulnerabilities. This framework helps us reason about the root causes leading to various vulnerabilities. Next, we built a categorization framework using various LangSec anti-patterns, such as parser differentials and insufficient input validation. Finally, we built a catalog of more than 30 popular vulnerabilities to demonstrate the categorization frameworks. Second, I built parsers for various Internet of Things and power grid network protocols and the iccMAX file format using parser combinator libraries. The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specifications and reference implementations. Third, I present SPARTA, a novel tool I built that generates Rust code that type checks Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files. Fourth, I present ParseSmith, a tool to build verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for their correctness. I built a verified parsing tool in Dafny that builds on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle various constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars. Finally, I provide the earliest systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. Next, I conducted an expert elicitation qualitative study to derive various metrics that I use to compare the DDLs. I also systematically compare these DDLs based on sample data descriptions available with the DDLs, checking for correctness and resilience.

Dartmouth Digital Commons (Dartmouth College)
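
The abstract above states the LangSec principle of running a recognizer over the whole input before the rest of the program acts on it. The sketch below illustrates that ordering on an invented toy format (one length byte followed by exactly that many ASCII digits). The format and the functions `recognize` and `handle` are assumptions for illustration and do not come from the thesis or its tools.

```python
def recognize(data: bytes) -> bool:
    """Accept exactly: one length byte N, then exactly N ASCII digit bytes.

    The format is a made-up example, not one of the thesis's protocols.
    """
    if len(data) < 1:
        return False
    declared = data[0]
    payload = data[1:]
    # Reject both short and over-long payloads, and any non-digit byte.
    return len(payload) == declared and all(0x30 <= b <= 0x39 for b in payload)


def handle(data: bytes) -> int:
    """Act on the input only after the recognizer has accepted all of it."""
    if not recognize(data):
        raise ValueError("input rejected by recognizer")
    return int(data[1:].decode("ascii") or "0")


value = handle(b"\x03123")
```

The point of the design is that `handle` never inspects bytes that `recognize` has not already validated, so a mismatch between the declared and actual length is rejected up front instead of being discovered mid-processing.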

Verified Propagation Redundancy and Compositional UNSAT Checking in CakeML

Author: Heule Marijn J.H.
Myreen Magnus
Tan Yong Kiam
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2023

Modern SAT solvers can emit independently-checkable proof certificates to validate their results. The state-of-the-art proof system that allows for compact proof certificates is propagation redundancy (PR). However, the only existing method to validate proofs in this system with a formally verified tool requires a transformation to a weaker proof system, which can result in a significant blowup in the size of the proof and increased proof validation time. This article describes the first approach to formally verify PR proofs on a succinct representation. We present (i) a new Linear PR (LPR) proof format, (ii) an extension of the DPR-trim tool to efficiently convert PR proofs into LPR format, and (iii) cake_lpr, a verified LPR proof checker developed in CakeML. We also enhance these tools with (iv) a new compositional proof format designed to enable separate (parallel) proof checking. The LPR format is backwards compatible with the existing LRAT format, but extends LRAT with support for the addition of PR clauses. Moreover, cake_lpr is verified using CakeML's binary code extraction toolchain, which yields correctness guarantees for its machine code (binary) implementation. This further distinguishes our clausal proof checker from existing checkers because unverified extraction and compilation tools are removed from its trusted computing base. We experimentally show that: LPR provides efficiency gains over existing proof formats; cake_lpr's strong correctness guarantees are obtained without significant sacrifice in its performance; and the compositional proof format enables scalable parallel proof checking for large proofs.

Chalmers Research
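
The abstract above concerns checking clausal proofs. The sketch below shows only the simpler reverse unit propagation (RUP) test that clausal proof checking commonly builds on. It is not the PR check, the LPR format, or cake_lpr. Clauses are lists of non-zero integers in the usual DIMACS style, and the function names are assumptions for this example.

```python
def unit_propagate(clauses, assignment):
    """Return True if unit propagation from `assignment` reaches a conflict."""
    assignment = set(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assignment for lit in clause):
                continue  # clause already satisfied
            # Literals whose negation has not been assigned are still open.
            open_lits = [lit for lit in clause if -lit not in assignment]
            if not open_lits:
                return True  # every literal is falsified: conflict
            if len(open_lits) == 1:
                assignment.add(open_lits[0])  # forced (unit) literal
                changed = True
    return False


def is_rup(clauses, candidate):
    """A clause is RUP if assuming its negation propagates to a conflict."""
    return unit_propagate(clauses, [-lit for lit in candidate])


formula = [[1, 2], [-1, 2]]
derivable = is_rup(formula, [2])
not_derivable = is_rup(formula, [1])
```

For this toy formula the clause [2] follows by propagation alone, while [1] does not, since assigning variable 1 to false and variable 2 to true satisfies both clauses.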